CN105468744B - Big data platform for realizing tax public opinion analysis and full text retrieval - Google Patents

Info

Publication number
CN105468744B
Authority
CN
China
Prior art keywords
big data
layer
search
information
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510831784.7A
Other languages
Chinese (zh)
Other versions
CN105468744A (en)
Inventor
唐旋
王传超
左少标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Original Assignee
Shandong ICity Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-11-25
Filing date
2015-11-25
Publication date
2019-12-10
Application filed by Shandong ICity Information Technology Co., Ltd.
Priority to CN201510831784.7A
Publication of CN105468744A
Application granted
Publication of CN105468744B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big data platform for tax public opinion analysis and full-text retrieval. From bottom to top, the platform comprises a basic data layer, a big data acquisition layer, a big data storage layer, a big data analysis layer, a big data index layer, a big data search layer and a big data application layer. Compared with the prior art, the platform improves the tax authority's capacity to monitor and respond to public opinion; it is highly practical, widely applicable and easy to popularize.

Description

Big data platform for realizing tax public opinion analysis and full text retrieval
Technical Field
The invention relates to the technical field of big data, in particular to a big data platform which has strong practicability and can realize tax public opinion analysis and full text retrieval.
Background
To make full use of the tax public opinion resources available on the internet and to build an "internet plus tax public opinion" big data platform, users need a complete public opinion big data platform architecture that covers the collection, processing, analysis, storage, indexing and searching of public opinion big data.
Disclosure of Invention
The technical task of the invention is to address the above shortcomings by providing a highly practical big data platform for tax public opinion analysis and full-text retrieval.
A big data platform for tax public opinion analysis and full-text retrieval comprises, from bottom to top, a basic data layer, a big data acquisition layer, a big data storage layer, a big data analysis layer, a big data index layer, a big data search layer and a big data application layer; wherein the basic data layer holds the basic data sources related to public opinion in the tax industry; the big data acquisition layer provides the user with public opinion information acquisition modes and completes information acquisition for the basic data layer; the big data analysis layer processes, analyzes and mines the big data collected by the big data acquisition layer and then sends it to the big data storage layer; the big data storage layer stores the public opinion webpage big data and the basic data; the big data index layer builds inverted indexes over the data stored in the big data storage layer and provides a fast search data source for the big data search layer; the big data search layer provides a full-text search engine over the big data; and the big data application layer makes effective use of the retrieved tax public opinion big data.
The basic data sources built into the basic data layer comprise a taxpayer list library, a tax authority unit library, a policy and regulation library and a tax type library; these sources give the whole platform a basis for the targeted collection of online tax public opinion information.
The acquisition modes provided by the big data acquisition layer comprise directional acquisition, a web crawler and an acquisition rule configuration module, wherein directional acquisition is the targeted acquisition of basic data; the web crawler captures information from public opinion websites and is a deep web crawler that crawls from seed websites; and the acquisition rule configuration module provides basic configuration of webpage acquisition rules.
The big data analysis layer comprises three parts: preprocessing, word segmentation and sentiment analysis, wherein preprocessing removes tags, extracts text and eliminates noise from the acquired webpages; word segmentation performs word segmentation, stop-word removal, entity recognition and word vector construction on the text; and sentiment analysis analyzes the text through sentiment information extraction and sentiment information classification and judges the sentiment tendency of the text as positive, negative or neutral.
The big data storage layer comprises a public opinion webpage information base and a basic resource base, wherein the public opinion webpage information base stores unstructured public opinion information, including collected original webpages, processed webpages, pictures, videos, stylesheet files and extracted webpage content; the basic resource base supplements the four basic data sources of the basic data layer, and the supplementary information includes taxpayers' official websites and news, tax authority units' official websites and news, original-text webpages and documents of policies and regulations, and online resources such as tax encyclopedias.
The big data index layer comprises incremental indexing, full indexing, index addition and deletion, and index updating, wherein incremental indexing builds the index incrementally for a data source; full indexing rebuilds the entire index for a data source; index addition and deletion adds entries to and deletes entries from the index; and index updating updates an established index.
The big data search layer comprises a full-text search module, a synonym search module, a grouping statistics module and a fuzzy matching module, wherein the full-text search module matches queries against every character and every word of the indexed content and returns the query results; synonym search also queries the synonyms of a word when that word is queried; grouping statistics groups and counts the search results by type or by another specified classification; and fuzzy matching segments the search phrase into words and then runs the query.
The big data application layer comprises classified public opinion, negative public opinion, public opinion search and public opinion reports, wherein classified public opinion combines the four basic types of public opinion information (taxpayers, authority units, policies and regulations, and tax types) with the positive, negative and neutral subclasses; negative public opinion provides real-time monitoring and tracking of negative public opinion; public opinion search offers users multi-condition search, including single search, combined search and advanced search, among others; and public opinion reports provide statistical analysis reports by month, week and day.
The big data platform for realizing tax public opinion analysis and full text retrieval has the following advantages:
The platform fully integrates internal tax data with internet tax public opinion big data, and enables tax departments to learn in advance of online public opinion concerning taxpayers, tax authority units, policies and regulations, the various tax types and so on, thereby improving the tax authorities' capacity to monitor and respond to public opinion; the platform is highly practical, widely applicable and easy to popularize.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
As shown in FIG. 1, the invention provides a big data platform for tax public opinion analysis and full-text retrieval which comprises, from bottom to top, a basic data layer, a big data acquisition layer, a big data storage layer, a big data analysis layer, a big data index layer, a big data search layer and a big data application layer; wherein the basic data layer holds the basic data sources related to public opinion in the tax industry; the big data acquisition layer provides the user with public opinion information acquisition modes and completes information acquisition for the basic data layer; the big data analysis layer processes, analyzes and mines the big data collected by the big data acquisition layer and then sends it to the big data storage layer; the big data storage layer stores the public opinion webpage big data and the basic data; the big data index layer builds inverted indexes over the data stored in the big data storage layer and provides a fast search data source for the big data search layer; the big data search layer provides a full-text search engine over the big data; and the big data application layer makes effective use of the retrieved tax public opinion big data.
The basic data sources built into the basic data layer comprise a taxpayer list library, a tax authority unit library, a policy and regulation library and a tax type library; these sources give the whole platform a basis for the targeted collection of online tax public opinion information.
The acquisition modes provided by the big data acquisition layer comprise directional acquisition, a web crawler and an acquisition rule configuration module, wherein directional acquisition is the targeted acquisition of basic data; the web crawler captures information from public opinion websites and is a deep web crawler that crawls from seed websites; and the acquisition rule configuration module provides basic configuration of webpage acquisition rules.
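The interplay between seed websites, the crawler and the acquisition rules can be illustrated with a minimal sketch. It assumes a breadth-first crawl limited by depth and by an allowed-host list; the patent does not specify the crawl strategy, and the names `CrawlRule`, `crawl`, `fetch_links` and all URLs are hypothetical. Link extraction is injected as a function so that the sketch runs without network access.

```python
from collections import deque
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class CrawlRule:
    """Hypothetical acquisition rule: which hosts to follow and how deep."""
    allowed_hosts: frozenset
    max_depth: int = 2


def crawl(seeds, fetch_links, rule):
    """Breadth-first crawl from the seed URLs; returns URLs in visit order.

    fetch_links(url) returns the outgoing links of a page. It is passed in
    so that a real HTTP fetcher or a test stub can be used.
    """
    seen = set()
    order = []
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        # Skip pages already visited or outside the configured hosts.
        if url in seen or urlparse(url).hostname not in rule.allowed_hosts:
            continue
        seen.add(url)
        order.append(url)
        # Only expand links while the depth limit has not been reached.
        if depth < rule.max_depth:
            for link in fetch_links(url):
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages (all URLs are made up).
LINKS = {
    "http://tax.example/": ["http://tax.example/news", "http://other.example/"],
    "http://tax.example/news": ["http://tax.example/news/1", "http://tax.example/"],
    "http://tax.example/news/1": ["http://tax.example/news/1/comments"],
}


def fetch_links(url):
    return LINKS.get(url, [])


rule = CrawlRule(allowed_hosts=frozenset({"tax.example"}), max_depth=2)
visited = crawl(["http://tax.example/"], fetch_links, rule)
```

In this sketch the rule object plays the part of the acquisition rule configuration module: changing `allowed_hosts` or `max_depth` changes what is collected without touching the crawler itself.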
The big data analysis layer comprises three parts: preprocessing, word segmentation and sentiment analysis, wherein preprocessing removes tags, extracts text and eliminates noise from the acquired webpages; word segmentation performs word segmentation, stop-word removal, entity recognition and word vector construction on the text; and sentiment analysis analyzes the text through sentiment information extraction and sentiment information classification and judges the sentiment tendency of the text as positive, negative or neutral.
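The three analysis steps can be sketched end to end as follows. This is only an illustration under simplifying assumptions: a regular-expression tokenizer for English stands in for a Chinese word segmenter, a tiny hand-made word list stands in for a trained classifier, and entity recognition and word vector construction are omitted. The word lists and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> content (noise)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def preprocess(html):
    """Preprocessing: remove tags, extract text, normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical word lists, for illustration only.
STOP_WORDS = {"the", "a", "is", "of", "and", "to"}
POSITIVE = {"praised", "efficient", "helpful"}
NEGATIVE = {"complaint", "delay", "unfair"}


def tokenize(text):
    """Word segmentation followed by stop-word removal."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def sentiment(tokens):
    """Lexicon-based tendency: positive, negative or neutral."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


page = ("<html><style>p{color:red}</style>"
        "<p>Taxpayers praised the efficient refund service.</p></html>")
text = preprocess(page)
tokens = tokenize(text)
label = sentiment(tokens)
```

A production system would replace `tokenize` and `sentiment` with a proper segmenter and a trained model; the point here is only the order of the steps and the three-way label at the end.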
The big data storage layer comprises a public opinion webpage information base and a basic resource base, wherein the public opinion webpage information base stores unstructured public opinion information, including collected original webpages, processed webpages, pictures, videos, stylesheet files and extracted webpage content; the basic resource base supplements the four basic data sources of the basic data layer, and the supplementary information includes taxpayers' official websites and news, tax authority units' official websites and news, original-text webpages and documents of policies and regulations, and online resources such as tax encyclopedias.
The big data index layer comprises incremental indexing, full indexing, index addition and deletion, and index updating, wherein incremental indexing builds the index incrementally for a data source; full indexing rebuilds the entire index for a data source; index addition and deletion adds entries to and deletes entries from the index; and index updating updates an established index.
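The four index operations can be made concrete with a toy in-memory inverted index. This is a sketch under stated assumptions, not the platform's index: terms are obtained by whitespace splitting, documents are identified by integer ids, and the class and method names are hypothetical.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        """Index addition: register the document under each of its terms."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Index deletion: drop the document from every posting list."""
        text = self.docs.pop(doc_id, "")
        for term in text.lower().split():
            ids = self.postings.get(term)
            if ids is not None:
                ids.discard(doc_id)
                if not ids:
                    del self.postings[term]

    def update(self, doc_id, text):
        """Index update: replace the indexed content of one document."""
        self.delete(doc_id)
        self.add(doc_id, text)

    def incremental(self, source):
        """Incremental indexing: index only unseen documents; returns the count."""
        new = {d: t for d, t in source.items() if d not in self.docs}
        for d, t in new.items():
            self.add(d, t)
        return len(new)

    def full_rebuild(self, source):
        """Full indexing: discard everything and rebuild from the data source."""
        self.postings.clear()
        self.docs.clear()
        for d, t in source.items():
            self.add(d, t)

    def lookup(self, term):
        return sorted(self.postings.get(term.lower(), ()))


index = InvertedIndex()
source = {1: "tax refund delay", 2: "new tax policy"}
index.full_rebuild(source)
source[3] = "refund policy praised"
added = index.incremental(source)  # only document 3 is new
index.update(1, "tax refund completed")
index.delete(2)
```

The inverted structure is what makes the search layer fast: a query term maps directly to the documents that contain it, without scanning the stored pages.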
The big data search layer comprises a full-text search module, a synonym search module, a grouping statistics module and a fuzzy matching module, wherein the full-text search module matches queries against every character and every word of the indexed content and returns the query results; synonym search also queries the synonyms of a word when that word is queried; grouping statistics groups and counts the search results by type or by another specified classification; and fuzzy matching segments the search phrase into words and then runs the query.
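The synonym search, grouping statistics and fuzzy matching functions can be sketched over a handful of in-memory documents. The documents, the synonym table and the function names below are hypothetical, and whitespace splitting again stands in for real word segmentation.

```python
from collections import Counter

# Hypothetical indexed documents with a "type" field used for grouping.
DOCS = {
    1: {"type": "taxpayer", "text": "company fined for tax evasion"},
    2: {"type": "policy", "text": "new levy rules announced"},
    3: {"type": "policy", "text": "tax rules simplified"},
}
# Hypothetical synonym table.
SYNONYMS = {"tax": {"levy"}, "levy": {"tax"}}


def search_term(term, synonyms=True):
    """Match one term; with synonyms=True its synonyms are queried as well."""
    terms = {term} | (SYNONYMS.get(term, set()) if synonyms else set())
    return sorted(d for d, doc in DOCS.items() if terms & set(doc["text"].split()))


def fuzzy_search(phrase):
    """Segment the phrase into words, then return documents matching any word."""
    hits = set()
    for term in phrase.lower().split():
        hits.update(search_term(term, synonyms=False))
    return sorted(hits)


def group_counts(doc_ids):
    """Grouping statistics: number of results per document type."""
    return dict(Counter(DOCS[d]["type"] for d in doc_ids))
```

For example, `search_term("tax")` also returns the document that only mentions "levy", and `group_counts(search_term("tax"))` reports how the hits are distributed across types.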
The big data application layer comprises classified public opinion, negative public opinion, public opinion search and public opinion reports, wherein classified public opinion combines the four basic types of public opinion information (taxpayers, authority units, policies and regulations, and tax types) with the positive, negative and neutral subclasses; negative public opinion provides real-time monitoring and tracking of negative public opinion; public opinion search offers users multi-condition search, including single search, combined search and advanced search, among others; and public opinion reports provide statistical analysis reports by month, week and day.
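The classification scheme and the periodic reports can be illustrated as follows. The sketch assumes that each labeled item is a (date, basic type, sentiment) tuple and that weeks follow the ISO calendar; both are assumptions, as are all names and the sample data.

```python
from collections import Counter
from datetime import date
from itertools import product

BASE_TYPES = ("taxpayer", "authority", "policy", "tax_type")
SENTIMENTS = ("positive", "negative", "neutral")
# Four basic types combined with three sentiment subclasses: 12 classes.
CATEGORIES = list(product(BASE_TYPES, SENTIMENTS))


def period_key(day, period):
    """Map a date to its reporting bucket for the given period."""
    if period == "day":
        return day.isoformat()
    if period == "week":
        year, week, _ = day.isocalendar()
        return f"{year}-W{week:02d}"
    if period == "month":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown period: {period}")


def report(items, period):
    """Count items per (period bucket, basic type, sentiment)."""
    return Counter((period_key(d, period), t, s) for d, t, s in items)


# Made-up labeled items.
ITEMS = [
    (date(2015, 11, 23), "policy", "negative"),
    (date(2015, 11, 25), "policy", "negative"),
    (date(2015, 11, 30), "taxpayer", "positive"),
]
weekly = report(ITEMS, "week")
monthly = report(ITEMS, "month")
```

Filtering the same counters to the "negative" subclass gives the raw material for the negative public opinion monitoring view.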
In an actual design, secondary development and integration are carried out on Nutch, a distributed internet data acquisition platform. Public opinion webpages are collected in two modes: targeted, associated acquisition driven by the tax authority's taxpayer list library, policy and regulation library and tax type library, and crawler-based whole-network search and matching of tax public opinion webpages. The comprehensive tax public opinion big data collected in this way is stored in an HBase distributed database cluster. A Hadoop platform then performs batch text preprocessing on the webpage document data set, after which open-source natural language processing tools integrated in Java, such as OpenNLP, FudanNLP, LingPipe, IKAnalyzer and word2vec, together with the Mahout algorithm library, complete text word segmentation and sentiment analysis and label each piece of public opinion webpage information as positive, neutral or negative. Finally, the SolrCloud distributed full-text search engine builds the index and provides full-text search for users.
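As an illustration of the final retrieval step, the sketch below builds a request URL for Solr's standard `/select` handler using the common query parameters `q`, `fq`, `rows`, `facet`, `facet.field` and `wt`. The host, the collection name `tax_opinion` and the field names `content`, `sentiment` and `category` are hypothetical; the patent does not describe the index schema, and the sketch does not escape special characters in keywords.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, keywords, sentiment=None, rows=10):
    """Build a Solr /select URL (schema and collection names are assumed)."""
    params = [
        ("q", "content:(%s)" % " AND ".join(keywords)),  # full-text query
        ("rows", str(rows)),
        ("facet", "true"),             # grouping statistics via faceting
        ("facet.field", "category"),
        ("wt", "json"),
    ]
    if sentiment:
        # Filter query restricting results to one sentiment label.
        params.append(("fq", f"sentiment:{sentiment}"))
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr", "tax_opinion", ["tax", "refund"],
    sentiment="negative",
)
```

The facet parameters correspond to the grouping statistics function of the search layer, and the filter query corresponds to searching within one sentiment class.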
The above embodiments are only specific cases of the present invention. The protection scope of the present invention includes but is not limited to the above embodiments; any suitable change or substitution made by a person skilled in the art that is consistent with the claims of this big data platform for tax public opinion analysis and full-text retrieval shall fall within the protection scope of the present invention.

Claims (1)

1. A big data platform for tax public opinion analysis and full-text retrieval, characterized in that it comprises, from bottom to top, a basic data layer, a big data acquisition layer, a big data storage layer, a big data analysis layer, a big data index layer, a big data search layer and a big data application layer; wherein:
The basic data layer holds the basic data sources related to public opinion in the tax industry; the basic data sources built into the basic data layer comprise a taxpayer list library, a tax authority unit library, a policy and regulation library and a tax type library, and these sources give the whole platform a basis for the targeted collection of online tax public opinion information;
The big data acquisition layer provides the user with public opinion information acquisition modes and completes information acquisition for the basic data layer; the acquisition modes comprise directional acquisition, a web crawler and an acquisition rule configuration module, wherein directional acquisition is the targeted acquisition of basic data; the web crawler captures information from public opinion websites and is a deep web crawler that crawls from seed websites; and the acquisition rule configuration module provides basic configuration of webpage acquisition rules;
The big data analysis layer processes, analyzes and mines the big data collected by the big data acquisition layer and sends it to the big data storage layer; the big data analysis layer comprises three parts: preprocessing, word segmentation and sentiment analysis, wherein preprocessing removes tags, extracts text and eliminates noise from the collected webpages; word segmentation performs word segmentation, stop-word removal, entity recognition and word vector construction on the text; and sentiment analysis analyzes the text through sentiment information extraction and sentiment information classification and judges the sentiment tendency of the text as positive, negative or neutral;
The big data storage layer stores the public opinion webpage big data and the basic data, and comprises a public opinion webpage information base and a basic resource base, wherein the public opinion webpage information base stores unstructured public opinion information, including collected original webpages, processed webpages, pictures, videos, stylesheet files and extracted webpage content; the basic resource base supplements the four basic data sources of the basic data layer, and the supplementary information includes taxpayers' official websites and news, tax authority units' official websites and news, original-text webpages and documents of policies and regulations, and online resources such as tax encyclopedias;
The big data index layer builds inverted indexes over the data stored in the big data storage layer and provides a fast search data source for the big data search layer; the big data index layer comprises incremental indexing, full indexing, index addition and deletion, and index updating, wherein incremental indexing builds the index incrementally for a data source; full indexing rebuilds the entire index for a data source; index addition and deletion adds entries to and deletes entries from the index; and index updating updates an established index;
The big data search layer provides a full-text search engine over the big data, and comprises a full-text search module, a synonym search module, a grouping statistics module and a fuzzy matching module, wherein the full-text search module matches queries against each word of the indexed content and returns the query results; synonym search also queries the synonyms of a word when that word is queried; grouping statistics groups and counts the search results by type; and fuzzy matching segments the search phrase into words and then runs the query;
The big data application layer comprises classified public opinion, negative public opinion, public opinion search and public opinion reports, wherein classified public opinion combines the four basic types of public opinion information (taxpayers, authority units, policies and regulations, and tax types) with the positive, negative and neutral subclasses; negative public opinion provides real-time monitoring and tracking of negative public opinion; public opinion search offers users multi-condition search, including single search, combined search and advanced search; and public opinion reports provide statistical analysis reports by month, week and day;
The platform carries out secondary development and integration on Nutch, a distributed internet data acquisition platform; it collects public opinion webpages both by targeted, associated acquisition driven by the tax authority's taxpayer list library, policy and regulation library and tax type library, and by crawler-based whole-network search and matching of tax public opinion webpages; it stores the comprehensive tax public opinion big data collected in this way in an HBase distributed database cluster; it then uses a Hadoop platform to perform batch text preprocessing on the webpage document data set, uses the Java open-source natural language processing tools OpenNLP, FudanNLP, LingPipe, IKAnalyzer and word2vec, together with the Mahout algorithm library, to complete text word segmentation and sentiment analysis, and labels each piece of public opinion webpage information as positive, neutral or negative; and finally it uses the SolrCloud distributed full-text search engine to build the index and provide full-text search for users.
CN201510831784.7A 2015-11-25 2015-11-25 Big data platform for realizing tax public opinion analysis and full text retrieval Active CN105468744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510831784.7A CN105468744B (en) 2015-11-25 2015-11-25 Big data platform for realizing tax public opinion analysis and full text retrieval

Publications (2)

Publication Number Publication Date
CN105468744A (en) 2016-04-06
CN105468744B (en) 2019-12-10

Family

ID=55606445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510831784.7A Active CN105468744B (en) 2015-11-25 2015-11-25 Big data platform for realizing tax public opinion analysis and full text retrieval

Country Status (1)

Country Link
CN (1) CN105468744B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815307A (en) * 2016-12-16 2017-06-09 中国科学院自动化研究所 Public Culture knowledge mapping platform and its use method
CN106649223A (en) * 2016-12-23 2017-05-10 北京文因互联科技有限公司 Financial report automatic generation method based on natural language processing
CN106649773A (en) * 2016-12-27 2017-05-10 北京大数有容科技有限公司 Big data collaborative analysis tool platform
CN106874016A (en) * 2017-03-07 2017-06-20 长江大学 A kind of new customizable big data platform architecture method
CN107220367A (en) * 2017-06-09 2017-09-29 成都布林特信息技术有限公司 Internet data full-text search method
CN109408805A (en) * 2018-09-07 2019-03-01 青海大学 A kind of Tibetan language sentiment analysis method and system based on interacting depth study
CN109544003A (en) * 2018-11-23 2019-03-29 北京国信宏数科技有限责任公司 Index of economic development evaluation method based on internet big data
CN110472119A (en) * 2019-07-17 2019-11-19 广东鼎义互联科技股份有限公司 One kind being applied to government affairs the analysis of public opinion platform
CN112836046A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 Four-risk one-gold-field policy and regulation text entity identification method
CN113536133B (en) * 2021-07-30 2023-04-11 西安康奈网络科技有限公司 Internet data processing method based on single public opinion event
CN116861058B (en) * 2023-09-04 2024-04-12 浪潮软件股份有限公司 Public opinion monitoring system and method applied to government affair field
CN117312634B (en) * 2023-11-29 2024-02-20 大文传媒集团(山东)有限公司 Artificial intelligence data integration and propagation processing system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
US9092789B2 (en) * 2008-04-03 2015-07-28 Infosys Limited Method and system for semantic analysis of unstructured data
CN104899268A (en) * 2015-05-25 2015-09-09 浪潮集团有限公司 Distributed enterprise information vertical search method

Also Published As

Publication number Publication date
CN105468744A (en) 2016-04-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20191114

Address after: 250100 Room 3110, S01 Building, Tidal Building, 1036 Tidal Road, Jinan High-tech Zone, Shandong Province

Applicant after: Shandong Aicheng Network Information Technology Co., Ltd.

Address before: 250100 Ji'nan science and Technology Development Zone, Shandong Branch Road No. 2877

Applicant before: Wave Software Group Co., Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200812

Address after: 214029 No. 999 Gaolang East Road, Binhu District, Wuxi City, Jiangsu Province (Software Development Building) 707

Patentee after: Chaozhou Zhuoshu Big Data Industry Development Co.,Ltd.

Address before: 250100 Room 3110, S01 Building, Tidal Building, 1036 Tidal Road, Jinan High-tech Zone, Shandong Province

Patentee before: Shandong Aicheng Network Information Technology Co.,Ltd.