CN112100535A - Network public opinion analysis system and method based on DFA algorithm - Google Patents

Network public opinion analysis system and method based on DFA algorithm


Publication number
CN112100535A
Authority
CN
China
Prior art keywords
data
analysis
layer
content
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010971747.7A
Other languages
Chinese (zh)
Inventor
卢宪政
左赋斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhishuyun Information Technology Co ltd
Original Assignee
Nanjing Zhishuyun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhishuyun Information Technology Co ltd filed Critical Nanjing Zhishuyun Information Technology Co ltd
Priority to CN202010971747.7A priority Critical patent/CN112100535A/en
Publication of CN112100535A publication Critical patent/CN112100535A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/242: Query formulation
    • G06F16/243: Natural language query formulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a system and a method for network public opinion analysis based on a DFA algorithm. The system comprises a data capturing layer, an original data storage layer, a data analysis layer, an analysis result layer, a control layer and a front-end display layer. The data capturing layer is used for capturing content from the source data to be monitored according to a preset rule and sending the captured content to the original data storage layer; the original data storage layer is used for storing the received data and comprises a relational database and a distributed file system; the data analysis layer is used for analyzing the data stored in the original data storage layer according to a preset DFA algorithm and sending the analysis results to the analysis result layer; the analysis result layer is used for storing the received analysis results. The invention has a clear and concise architecture, supports targeted monitoring according to user requirements, supports dynamic configuration of keywords and has high recognition efficiency.

Description

Network public opinion analysis system and method based on DFA algorithm
Technical Field
The invention relates to a system and a method for network public opinion analysis based on a DFA algorithm, belonging to the technical field of data analysis.
Background
With the rapid spread of computer information technology, the channels for disseminating information keep multiplying. The Internet now gives netizens an open platform for expressing opinions, and online public opinion can form quickly around netizens' thoughts and views on major current affairs at home and abroad. Its considerable influence has drawn the attention of the relevant departments and institutions, and the shortcomings of existing network public opinion monitoring systems have gradually become apparent.
Public opinion monitoring means that a network monitoring system classifies and organizes information on the Internet, screens out hot topics and trend data on sensitive topics, and presents the analysis results visually, for example as charts, so that changes in public opinion on a website can be determined.
Many public opinion analysis systems exist, but most of them monitor and analyze the whole network, and achieving such comprehensive coverage makes their architecture relatively complex. For targeted monitoring, such as watching only certain local forums and local websites in order to follow developments concerning a locality or its residents, the existing systems are poorly suited because of their complex architecture and low recognition efficiency. There is therefore an urgent need for a public opinion analysis system that has a clear and concise architecture, supports targeted monitoring and has high recognition efficiency.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a system and a method for network public opinion analysis based on a DFA algorithm. The architecture is clear and concise, targeted monitoring can be configured according to user requirements, keywords support dynamic configuration, and recognition efficiency is high.
In order to achieve the purpose, the invention adopts the following technical scheme: a network public opinion analysis system based on DFA algorithm includes:
the data capturing layer is used for capturing content of source data to be monitored according to a preset rule and sending the captured content to the original data storage layer;
the original data storage layer is used for storing the received data and comprises a relational database and a distributed file system;
the data analysis layer is used for carrying out data analysis on the data stored in the original data storage layer according to a preset DFA algorithm and sending an analysis result to the analysis result layer;
the analysis result layer is used for storing the received analysis results;
the control layer is used for controlling the access authority of the data warehouse and the related business functions of the analysis result layer;
the front-end display layer is used for displaying public opinion analysis results and providing API for calling and inquiring;
and data transmission is sequentially carried out among the data capturing layer, the original data storage layer, the data analysis layer, the analysis result layer and the front end display layer.
In the system for network public opinion analysis based on the DFA algorithm, the source data comprises news from portal websites, forum discussion posts, blog content, microblogs and official-account content.
In the network public opinion analysis system based on the DFA algorithm, the data capture layer downloads, pre-cleans and analyzes source data by executing a designed script through a timing task, preprocesses effective data obtained after analysis and stores the effective data in an original data storage layer; the whole data grabbing layer comprises:
the source management module is used for managing and maintaining list information of the data source websites needing to be monitored;
the capturing rule module is used for configuring capturing rules matched with the internal pages of different data source websites;
the content analysis script module is used for configuring corresponding analysis strategies according to the webpage characteristics and the source code elements of different data source websites, and the script is configured by using xpath;
the timing task module is used for setting an execution plan of a grabbing task and an analyzing task and executing related tasks regularly according to preset time and period;
the downloader is used for downloading the page content from the Internet and transmitting the downloaded content to the pre-cleaning module;
the pre-cleaning module is used for pre-cleaning the received content and passing the pre-cleaned data to the parser for processing;
the parser is used for parsing the pre-cleaned data according to the analysis script and extracting useful information; the results generated by the parser are output and stored through an output pipeline, which supports output to files and databases;
the scheduler is used for managing the list of URLs to be downloaded, de-duplicating the URLs and calling the downloader to download the corresponding content; specifically, the URL list is stored and managed using Redis as a message queue, the URLs are processed one by one in first-in, first-out order, and the downloader is called to download the corresponding content.
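The scheduler's two duties, de-duplication and first-in, first-out dispatch, can be illustrated with a minimal Java sketch. It is only an illustration under stated assumptions: the class name `UrlScheduler` and its methods are hypothetical, and an in-memory `ArrayDeque` plus `HashSet` stand in for the Redis-backed queue named in the text.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical in-memory stand-in for the Redis-backed URL scheduler. */
class UrlScheduler {
    private final Queue<String> pending = new ArrayDeque<>(); // FIFO queue of URLs to download
    private final Set<String> seen = new HashSet<>();         // every URL accepted so far, for de-duplication

    /** Enqueues a URL unless it was submitted before; returns true if accepted. */
    boolean submit(String url) {
        if (url == null || url.isEmpty()) return false;
        if (!seen.add(url)) return false; // Set.add returns false for a duplicate
        pending.add(url);
        return true;
    }

    /** Returns the oldest pending URL, or null when nothing is waiting. */
    String next() {
        return pending.poll();
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        UrlScheduler scheduler = new UrlScheduler();
        scheduler.submit("https://example.com/a");
        scheduler.submit("https://example.com/b");
        scheduler.submit("https://example.com/a"); // duplicate, ignored
        String url;
        while ((url = scheduler.next()) != null) {
            // A real system would hand the URL to the downloader here.
            System.out.println("download " + url);
        }
    }
}
```

In a Redis-backed deployment the same two roles would presumably be played by a list used as a queue and a set used for de-duplication, but the text does not specify those details.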
In the system for analyzing network public sentiment based on DFA algorithm, the data analysis layer comprises:
the preset keyword module is used for managing and maintaining a keyword list needing to be monitored;
the timer is used for executing the data analysis task at fixed time and setting the execution frequency of the fixed time task by combining the size of the data volume;
the data loader is used for loading text contents from the original database and the file system, acquiring a content list to be analyzed in an SQL (structured query language) statement or file reading mode, and filtering processed data according to file names and data identifications;
the word frequency analyzer is used for performing word frequency analysis and statistics on the captured original content by utilizing a DFA algorithm and combining preset keywords;
and the result output module is used for outputting the analysis and statistical results to a data warehouse or a file system and storing the analysis and statistical results according to different subject libraries.
In the network public opinion analysis system based on the DFA algorithm, data storage in the whole system comprises an original data table, a public opinion sensitive word table, a sensitive data table and a public opinion data table, wherein the original data table is used for storing captured original data, the public opinion sensitive word table is used for storing public opinion sensitive words, and the sensitive data table is used for storing sensitive data automatically analyzed by the system; the public opinion data table is used for storing public opinion data.
A method for network public opinion analysis based on DFA algorithm, the method comprising:
data capture: capturing contents from a data source website to be monitored according to a preset rule, and storing the captured contents into an original data storage layer;
data analysis: performing keyword analysis on the data stored in the original data storage layer according to a preset DFA algorithm, outputting the analysis results to a data warehouse or a file system, and storing them in separate subject libraries;
data display: judging whether unanalyzed content remains; if so, returning to the data analysis step to continue the analysis; if not, displaying the analysis results as required.
In the method for performing network public opinion analysis based on the DFA algorithm, the data capturing and data analysis specifically comprises the following steps:
capturing content from a data source website, removing html tags from the content, only keeping original characters, and storing the original characters in an original data table;
acquiring the content to be analyzed from the original data table, analyzing the content according to the configured keywords, acquiring sensitive data containing sensitive words, storing the sensitive data into the sensitive data table, and updating the state of the original data into an analyzed state;
an administrator retrieves the data to be processed from the sensitive data table through a page and manually judges whether to transfer it to public opinion handling; data that is transferred is stored in the public opinion data table and its state is updated to transferred; data that is not transferred is marked as ignored.
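The state changes in the flow above can be sketched in Java as follows. This is a hedged illustration, not the patented implementation: the class, enum and method names are hypothetical, in-memory lists stand in for the database tables, and `String.contains` is used as a simple placeholder for the DFA keyword match.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the record states described in the flow above. */
class SensitiveFlow {
    enum RawStatus { PENDING, ANALYZED }
    enum ReviewStatus { PENDING, TRANSFERRED, IGNORED }

    static class RawRecord {
        final String text;
        RawStatus status = RawStatus.PENDING;
        RawRecord(String text) { this.text = text; }
    }

    static class SensitiveRecord {
        final RawRecord source;
        ReviewStatus status = ReviewStatus.PENDING;
        SensitiveRecord(RawRecord source) { this.source = source; }
    }

    /** Scans unanalyzed raw records, collects those containing a keyword, and marks each scanned record as analyzed. */
    static List<SensitiveRecord> analyze(List<RawRecord> rawTable, List<String> keywords) {
        List<SensitiveRecord> sensitiveTable = new ArrayList<>();
        for (RawRecord raw : rawTable) {
            if (raw.status == RawStatus.ANALYZED) continue; // skip records handled earlier
            for (String keyword : keywords) {
                if (raw.text.contains(keyword)) {           // placeholder for the DFA match
                    sensitiveTable.add(new SensitiveRecord(raw));
                    break;
                }
            }
            raw.status = RawStatus.ANALYZED;
        }
        return sensitiveTable;
    }

    /** Applies the administrator's manual decision to one sensitive record. */
    static void review(SensitiveRecord record, boolean transfer, List<SensitiveRecord> publicOpinionTable) {
        if (transfer) {
            publicOpinionTable.add(record);
            record.status = ReviewStatus.TRANSFERRED;
        } else {
            record.status = ReviewStatus.IGNORED;
        }
    }
}
```

Marking every scanned record as analyzed, whether or not it matched, is what lets a later run skip it.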
In the method for performing network public opinion analysis based on the DFA algorithm, the data capturing specifically comprises:
downloading webpage content from a data source website to be monitored through a downloader, and sending the downloaded webpage content to a pre-cleaning module for processing;
the parser parses the pre-cleaned data according to the analysis script, extracts the useful information and sends the parsed content to an output pipeline; if the parser finds a new link during parsing, it passes the link to the scheduler that manages the list of URLs to be downloaded, and the scheduler de-duplicates the URL list and calls the downloader to download the corresponding content; the analysis script is written in XPath, and the parser supports the Jsoup parsing tool;
and the output pipeline outputs and stores the results generated by the parser, supporting output to files and databases.
In the method for performing network public opinion analysis based on the DFA algorithm, the data analysis specifically comprises:
loading text contents from an original database and a file system by using a data loader, acquiring a content list to be analyzed in an SQL (structured query language) statement or file reading mode, and filtering processed data according to a file name and a data identifier;
calling a word frequency analyzer, and performing word frequency analysis and statistics on the content to be analyzed by using a DFA algorithm and combining preset keywords;
and outputting the analysis and statistical results to a data warehouse or a file system, and storing according to different subject libraries.
In the method for network public opinion analysis based on the DFA algorithm, the data source websites comprise portal websites, forums, blogs, microblogs and official accounts.
Compared with the prior art, the method mainly comprises a data capturing layer, an original data storage layer, a data analysis layer, an analysis result layer, a control layer and a front end display layer, wherein the data capturing layer is used for capturing the content of source data to be monitored according to a preset rule and sending the captured content to the original data storage layer; the data analysis layer is used for carrying out data analysis on data stored in the original data storage layer according to a preset DFA algorithm and sending an analysis result to the analysis result layer, the whole framework is clear and concise, targeted monitoring can be achieved according to the requirements of users, the keywords support dynamic configuration, and the recognition efficiency is high.
Drawings
FIG. 1 is an overall architecture diagram of the present invention;
FIG. 2 is a flow chart of data capture and data analysis according to the present invention;
FIG. 3 is a flow chart of data capture according to the present invention;
FIG. 4 is a schematic diagram of the relationship between tables of the data store of the present invention;
FIG. 5 is a diagram of the character state transition machine of the present invention;
FIG. 6 is a diagram illustrating the HashMap data structure of the present invention.
Detailed Description
The technical solutions in the implementation of the present invention will be made clear and fully described below with reference to the accompanying drawings, and the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1 to fig. 3, the method for performing online public opinion analysis based on DFA algorithm provided by the present invention includes:
data capture: capturing content from the data source websites to be monitored according to a preset rule and storing the captured content in the original data storage layer; specifically, a timed task triggers the scheduled job, web page content is downloaded by URL and parsed to obtain the effective information, and that information is preprocessed and stored in the original database (that is, the original data storage layer); the data source websites can be set according to monitoring requirements, for example portal websites, forums, blogs, microblogs and official accounts;
data analysis: analyzing the original data according to the designed algorithm to obtain the expected result; specifically, performing keyword analysis on the data stored in the original data storage layer according to a preset DFA algorithm, outputting the analysis results to a data warehouse or a file system, and storing them in separate subject libraries;
data display: judging whether unanalyzed content remains; if so, returning to the data analysis step to continue the analysis; if not, displaying the analysis results as required.
In the method for network public opinion analysis based on the DFA algorithm, the specific flow of data capture and data analysis is as follows:
capturing content from a data source website, removing html tags from the content, only keeping original characters, and storing the original characters in an original data table;
acquiring the content to be analyzed from the original data table, analyzing the content according to the configured keywords, acquiring sensitive data containing sensitive words, storing the sensitive data into the sensitive data table, and updating the state of the original data into an analyzed state;
an administrator retrieves the data to be processed from the sensitive data table through a page and manually judges whether to transfer it to public opinion handling; data that is transferred is stored in the public opinion data table and its state is updated to transferred; data that is not transferred is marked as ignored.
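The first step of this flow, removing html tags and keeping only the text, can be approximated with a short Java sketch. It is an assumption-laden illustration: the class and method names are hypothetical, and a naive regular expression is used so that the example stays self-contained. A production crawler would more plausibly rely on an HTML parser such as the Jsoup tool named elsewhere in this description.

```java
/** Hypothetical, deliberately naive tag stripper; not a full HTML parser. */
class HtmlStripper {
    /** Removes script/style blocks and all tags, then collapses whitespace. */
    static String stripTags(String html) {
        // Drop <script>...</script> and <style>...</style> including their contents.
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Replace every remaining tag with a space so adjacent words stay separated.
        String noTags = noScripts.replaceAll("(?s)<[^>]+>", " ");
        // Collapse runs of whitespace and trim the ends.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(stripTags(page)); // prints "Title Hello world"
    }
}
```

The sketch does not decode character entities such as `&amp;` and can be confused by a `>` inside an attribute value, which is one reason a real parser is preferable.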
In the method for performing network public opinion analysis based on the DFA algorithm, the data capturing specifically comprises:
downloading webpage content from a data source website to be monitored through a downloader, and sending the downloaded webpage content to a pre-cleaning module for processing;
the parser parses the pre-cleaned data according to the analysis script, extracts the useful information and sends the parsed content to an output pipeline; if the parser finds a new link during parsing, it passes the link to the scheduler that manages the list of URLs to be downloaded, and the scheduler de-duplicates the URL list and calls the downloader to download the corresponding content; the analysis script is written in XPath, and the parser supports the Jsoup parsing tool;
and the output pipeline outputs and stores the results generated by the parser, supporting output to files and databases.
In the method for performing network public opinion analysis based on the DFA algorithm, the data analysis specifically comprises:
loading text contents from an original database and a file system by using a data loader, acquiring a content list to be analyzed in an SQL (structured query language) statement or file reading mode, and filtering processed data according to a file name and a data identifier;
calling a word frequency analyzer, and performing word frequency analysis and statistics on the content to be analyzed by using a DFA algorithm and combining preset keywords;
and outputting the analysis and statistical results to a data warehouse or a file system, and storing according to different subject libraries.
In the method for network public opinion analysis based on the DFA algorithm, the data source websites comprise portal websites, forums, blogs, microblogs and official accounts.
A network public opinion analysis system based on DFA algorithm includes:
the data capturing layer is used for capturing content of source data to be monitored according to a preset rule and sending the captured content to the original data storage layer; the source data comprises news from portal websites, forum discussion posts, blog content, microblogs and official-account content;
the original data storage layer is used for storing the received data and comprises a relational database and a distributed file system;
the data analysis layer is used for analyzing the data stored in the original data storage layer according to a preset DFA algorithm, for example counting how many times a given keyword occurs, how frequently it occurs and how its popularity is distributed, and sending the analysis results to the analysis result layer;
the analysis result layer is used for storing the received analysis results;
the control layer is used for controlling the access authority of the data warehouse and the related business functions of the analysis result layer;
the front-end display layer is used for displaying public opinion analysis results and providing API for calling and inquiring;
and data transmission is sequentially carried out among the data capturing layer, the original data storage layer, the data analysis layer, the analysis result layer and the front end display layer.
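In the system for network public opinion analysis based on the DFA algorithm, the source data comprises news from portal websites, forum discussion posts, blog content, microblogs and official-account content.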
In the network public opinion analysis system based on the DFA algorithm, the data capture layer downloads, pre-cleans and analyzes source data by executing a designed script through a timing task, preprocesses effective data obtained after analysis and stores the effective data in an original data storage layer; the whole data grabbing layer comprises:
the source management module is used for managing and maintaining list information of the data source websites needing to be monitored;
the capturing rule module is used for configuring capturing rules matched with the internal pages of different data source websites;
the content analysis script module is used for configuring corresponding analysis strategies according to the webpage characteristics and the source code elements of different data source websites, and the script is configured by using xpath;
the timing task module is used for setting an execution plan of a grabbing task and an analyzing task and executing related tasks regularly according to preset time and period;
the downloader is used for downloading the page content from the Internet and transmitting the downloaded content to the pre-cleaning module;
the pre-cleaning module is used for pre-cleaning the received content and passing the pre-cleaned data to the parser for processing;
the parser is used for parsing the pre-cleaned data according to the analysis script and extracting useful information; the results generated by the parser are output and stored through an output pipeline, which supports output to files and databases;
the scheduler is used for managing the list of URLs to be downloaded, de-duplicating the URLs and calling the downloader to download the corresponding content; specifically, the URL list is stored and managed using Redis as a message queue, the URLs are processed one by one in first-in, first-out order, and the downloader is called to download the corresponding content.
In the system for analyzing network public sentiment based on DFA algorithm, the data analysis layer comprises:
the preset keyword module is used for managing and maintaining a keyword list needing to be monitored;
the timer is used for executing the data analysis task at regular intervals; the execution frequency is set according to the data volume and is generally once per minute, with 10,000 records analyzed per run;
the data loader is used for loading text contents from the original database and the file system, acquiring a content list to be analyzed in an SQL (structured query language) statement or file reading mode, and filtering processed data according to file names and data identifications;
the word frequency analyzer is used for performing word frequency analysis and statistics on the captured original content by utilizing a DFA algorithm and combining preset keywords;
and the result output module is used for outputting the analysis and statistical results to a data warehouse or a file system and storing the analysis and statistical results according to different subject libraries.
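The timer's behaviour, one run per minute over a batch of at most 10,000 records, could be sketched in Java with the standard `ScheduledExecutorService`. The class name `AnalysisTimer`, the helper `takeBatch` and the in-memory list of pending records are assumptions made for illustration; the text does not say how the timer is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a once-per-minute analysis task with a fixed batch size. */
class AnalysisTimer {
    static final int BATCH_SIZE = 10_000; // records analyzed per run, as stated in the text

    /** Removes and returns up to batchSize items from the head of the pending list. */
    static List<String> takeBatch(List<String> pending, int batchSize) {
        int n = Math.min(batchSize, pending.size());
        List<String> batch = new ArrayList<>(pending.subList(0, n));
        pending.subList(0, n).clear(); // drop the taken items from the pending list
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> pending = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) pending.add("record-" + i);

        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // First run starts immediately, later runs follow at one-minute intervals.
        timer.scheduleAtFixedRate(() -> {
            List<String> batch = takeBatch(pending, BATCH_SIZE);
            // A real system would run the word frequency analyzer on the batch here.
            System.out.println("analyzing " + batch.size() + " records");
        }, 0, 1, TimeUnit.MINUTES);

        Thread.sleep(100);   // let the first run finish in this demo
        timer.shutdownNow(); // stop the timer so the demo exits
    }
}
```

Both the interval and the batch size are plain parameters here, which matches the statement that the frequency is set according to the data volume.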
As shown in fig. 4, in the network public opinion analysis system based on the DFA algorithm, the data storage in the whole system includes an original data table, a public opinion sensitive word table, a sensitive data table and a public opinion data table, the original data table is used for storing captured original data, the public opinion sensitive word table is used for storing public opinion sensitive words, and the sensitive data table is used for storing sensitive data automatically analyzed by the system; the public opinion data table is used for storing public opinion data.
The implementation manner of the word frequency analyzer is as follows:
The DFA algorithm is often used for sensitive word recognition and filtering because it is simple and efficient. The system mainly relies on a Java implementation of the algorithm to count the frequency and distribution of the preset keywords, thereby achieving the purpose of public opinion monitoring.
DFA stands for Deterministic Finite Automaton: through a series of events, a DFA moves from one state to another, that is, state -> event -> state. This belongs to the prior art and is not described in detail here.
Deterministic: the states and the events that trigger state transitions are fully determined, with no "surprises".
Finite: the number of states and events is finite.
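To make the state -> event -> state idea concrete, here is a minimal Java sketch of a deterministic finite automaton that accepts exactly one keyword. The class name `TinyDfa` and the choice of integer states are assumptions for illustration only; each input character plays the role of an event.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical DFA that accepts exactly one keyword; states are 0..keyword.length(). */
class TinyDfa {
    // transitions.get(state).get(event) gives the next state, if one is defined.
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    private final int acceptState;

    TinyDfa(String keyword) {
        for (int i = 0; i < keyword.length(); i++) {
            transitions.computeIfAbsent(i, k -> new HashMap<>()).put(keyword.charAt(i), i + 1);
        }
        acceptState = keyword.length();
    }

    /** Feeds the input one character at a time and reports whether the accept state is reached. */
    boolean accepts(String input) {
        int state = 0;
        for (char event : input.toCharArray()) {
            Map<Character, Integer> row = transitions.get(state);
            Integer next = (row == null) ? null : row.get(event);
            if (next == null) return false; // no transition defined for this event: reject
            state = next;
        }
        return state == acceptState;
    }

    public static void main(String[] args) {
        TinyDfa dfa = new TinyDfa("cat");
        System.out.println(dfa.accepts("cat"));  // true
        System.out.println(dfa.accepts("ca"));   // false
        System.out.println(dfa.accepts("cats")); // false
    }
}
```

Each (state, event) pair maps to at most one next state, which is the deterministic property, and the transition table has a fixed number of entries, which is the finite property.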
The key to implementing keyword recognition in Java is the implementation of the DFA algorithm. For example, suppose the following keywords, which in the Chinese original all begin with the character "中", need to be monitored: China, Chinese people, Chinese longevity, China, Chinese people, and Chinese civilization (these are distinct Chinese words, some of which share an English rendering). The keywords must be converted into the structure shown in FIG. 5.
Constructing the character state transition machine of FIG. 5 requires the HashMap data structure of the Java language; the specific process is as follows:
(1) Query the character "中" in the HashMap to see whether it exists. If it does not, no sensitive word beginning with "中" exists yet, so a tree beginning with "中" is constructed directly and the process jumps to (3).
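(2) If it is found in the HashMap, a keyword beginning with "中" already exists; set the current HashMap to the sub-map stored under that character, that is, HashMap.get("中"), and continue with the next character of the keyword.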
(3) Judge whether the character is the last character of the word. If it is, the sensitive word is complete and the flag bit isEnd is set to 1; otherwise the flag bit isEnd is set to 0.
Keywords such as China, Chinese people, Chinese longevity, China, Chinese people and Chinese civilization are inserted one after another by the above process, which yields the data structure shown in FIG. 6.
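The construction steps (1) to (3) can be sketched in Java roughly as follows. This is a hedged illustration rather than the code behind FIG. 6: the class `KeywordTree` and its `Node` type are hypothetical, a small `Node` class wraps the nested HashMap together with the isEnd flag, and ASCII keywords sharing a prefix ("net", "network", "news") replace the Chinese examples.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical nested-HashMap keyword tree following steps (1) to (3) above. */
class KeywordTree {
    static class Node {
        final Map<Character, Node> next = new HashMap<>(); // one sub-map per following character
        boolean isEnd;                                      // true corresponds to isEnd = 1 in the text
    }

    final Node root = new Node();

    /** Inserts one keyword, reusing branches that earlier keywords already created. */
    void add(String keyword) {
        Node current = root;
        for (int i = 0; i < keyword.length(); i++) {
            char c = keyword.charAt(i);
            Node child = current.next.get(c);   // step (1): look the character up
            if (child == null) {                 // not found: start a new branch
                child = new Node();
                current.next.put(c, child);
            }
            current = child;                     // step (2): descend into the sub-map
            if (i == keyword.length() - 1) {
                current.isEnd = true;            // step (3): last character closes the keyword
            }
        }
    }

    public static void main(String[] args) {
        KeywordTree tree = new KeywordTree();
        tree.add("net");
        tree.add("network");
        tree.add("news");
        // All three keywords share the branch n -> e, so the root has a single child.
        System.out.println(tree.root.next.size()); // prints 1
    }
}
```

One deliberate difference from the wording of step (3): the sketch only ever sets isEnd to true and never resets it, so inserting "network" after "net" does not erase the end marker on the shorter keyword.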
With this data structure, keywords can be retrieved conveniently from the original text. For example, suppose the original text reads as follows:
China is a multi-ethnic country that takes Huaxia civilization as its source and Chinese culture as its foundation, with the Han as its main ethnic group; it commonly uses the Chinese language and Chinese characters. The Han and the ethnic minorities are collectively called the Chinese nation, and they also call themselves the descendants of Yan and Huang and the descendants of the dragon.
The whole original text is matched in a loop, one character at a time (in the Chinese original: "中", "国", "是", "华", "夏" and so on), and each character is processed as follows:
(1) Query whether a map keyed by the current character exists in the HashMap. If it does, obtain the new map = HashMap.get("$current character$") and go to step (2); if not, exit the loop.
(2) Check whether isEnd in the map is 1. If it is not, return to step (1) and process the character after the current one, repeating until isEnd is 1; at that point a keyword has been matched, and its accumulated occurrence count is incremented by 1.
After the loop finishes, the keywords in the original content and their occurrence counts have been tallied, and the statistics are stored in the corresponding subject library of the data warehouse for subsequent query and display.
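The matching and counting loop can be sketched in Java as below. It is an illustration under assumptions: `KeywordCounter` is a hypothetical class, the keywords are ASCII stand-ins for the Chinese examples, and the sketch is a variant of the steps above in that it keeps walking after the first isEnd hit, so that longer keywords sharing a prefix (such as "net" and "network") are both counted. Stopping at the first hit, as the steps describe, would only require a `break` after the count is updated.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical keyword counter built on a nested-HashMap tree. */
class KeywordCounter {
    static class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isEnd; // true when a keyword ends at this node
    }

    private final Node root = new Node();

    /** Inserts one keyword into the tree. */
    void add(String keyword) {
        Node current = root;
        for (char c : keyword.toCharArray()) {
            current = current.next.computeIfAbsent(c, k -> new Node());
        }
        current.isEnd = true;
    }

    /** Counts keyword occurrences by trying a match from every start position in the text. */
    Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys give stable output
        for (int start = 0; start < text.length(); start++) {
            Node current = root;
            for (int i = start; i < text.length(); i++) {
                current = current.next.get(text.charAt(i)); // step (1): follow the current character
                if (current == null) break;                 // no such branch: leave the inner loop
                if (current.isEnd) {                        // step (2): a keyword ends here
                    counts.merge(text.substring(start, i + 1), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        KeywordCounter counter = new KeywordCounter();
        counter.add("net");
        counter.add("network");
        counter.add("news");
        System.out.println(counter.count("network news on the net"));
        // prints {net=2, network=1, news=1}
    }
}
```

The returned map corresponds to the per-keyword statistics that the result output module would then write to a subject library.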
In summary, the invention operates through the cooperation of the data capture layer, the original data storage layer, the data analysis layer, the analysis result layer, the control layer and the front-end display layer: the data capture layer captures content from the source data to be monitored according to preset rules and sends the captured content to the original data storage layer, and the data analysis layer analyzes the data stored in the original data storage layer according to a preset DFA algorithm and sends the analysis results to the analysis result layer.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that it may be embodied in other specific forms without departing from its spirit or essential attributes. Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted for clarity only. Those skilled in the art should take the description as a whole, and the technical solutions of the embodiments may be appropriately combined to form other embodiments understandable to those skilled in the art.

Claims (10)

1. A network public opinion analysis system based on a DFA algorithm, characterized by comprising:
a data capture layer, configured to capture content from the source data to be monitored according to preset rules and to send the captured content to the original data storage layer;
an original data storage layer, configured to store the received data and comprising a relational database and a distributed file system;
a data analysis layer, configured to analyze the data stored in the original data storage layer according to a preset DFA algorithm and to send the analysis results to the analysis result layer;
an analysis result layer, configured to store the received analysis results;
a control layer, configured to control access permissions for the data warehouse and the related business functions of the analysis result layer;
a front-end display layer, configured to display public opinion analysis results and to provide an API for invocation and query;
wherein data is transmitted in sequence among the data capture layer, the original data storage layer, the data analysis layer, the analysis result layer and the front-end display layer.
2. The network public opinion analysis system based on a DFA algorithm according to claim 1, wherein the source data comprises news, forum discussion posts, blog content, microblogs and official-account content of the various portal websites.
3. The network public opinion analysis system based on a DFA algorithm according to claim 1, wherein the data capture layer downloads, pre-cleans and parses the source data by executing prepared scripts through scheduled tasks, and stores the valid data obtained from parsing in the original data storage layer after preprocessing; the data capture layer as a whole comprises:
a source management module, configured to manage and maintain the list of data source websites to be monitored;
a capture rule module, configured to set capture rules matching the internal pages of the different data source websites;
a content parsing script module, configured to set parsing strategies according to the page characteristics and source code elements of the different data source websites, the scripts being written using xpath;
a scheduled task module, configured to set the execution plan of capture tasks and parsing tasks and to execute them periodically at the preset times and intervals;
a downloader, configured to download page content from the Internet and pass the downloaded content to the pre-cleaning module;
a pre-cleaning module, configured to pre-clean the received content and pass the pre-cleaned data to the parser;
a parser, configured to parse the pre-cleaned data according to the parsing script and extract useful information, the results produced by the parser being output and stored through an output pipeline that supports output to files and databases;
a scheduler, configured to manage the list of URLs to be downloaded, deduplicate the URLs and call the downloader to download the corresponding content; specifically, the URL list is stored and managed using Redis as a message queue, the URLs are processed one by one in first-in, first-out order, and the downloader is called to download the corresponding content.
4. The network public opinion analysis system based on a DFA algorithm according to claim 1, wherein the data analysis layer comprises:
a preset keyword module, configured to manage and maintain the list of keywords to be monitored;
a timer, configured to run the data analysis task on a schedule, the execution frequency of the scheduled task being set according to the data volume;
a data loader, configured to load text content from the original database and the file system, obtain the list of content to be analyzed through SQL statements or file reads, and filter out already processed data according to file names and data identifiers;
a word frequency analyzer, configured to perform word frequency analysis and statistics on the captured original content using the DFA algorithm in combination with the preset keywords;
and a result output module, configured to output the analysis and statistical results to a data warehouse or a file system and to store them in separate subject libraries.
5. The network public opinion analysis system based on a DFA algorithm according to claim 1, wherein the data storage of the whole system comprises an original data table, a public opinion sensitive word table, a sensitive data table and a public opinion data table; the original data table stores the captured original data; the public opinion sensitive word table stores public opinion sensitive words; the sensitive data table stores sensitive data identified automatically by the system's analysis; and the public opinion data table stores public opinion data.
6. A network public opinion analysis method based on a DFA algorithm, characterized in that the method comprises the following steps:
data capture: capturing content from the data source websites to be monitored according to preset rules, and storing the captured content in the original data storage layer;
data analysis: performing keyword analysis on the data stored in the original data storage layer according to a preset DFA algorithm, outputting the analysis results to a data warehouse or a file system, and storing them in separate subject libraries;
data display: determining whether unanalyzed content remains; if so, returning to the data analysis step to continue the analysis; if not, displaying the analysis results as required.
7. The method of claim 6, wherein the data capture and data analysis specifically comprise:
capturing content from the data source websites, removing the html tags from the content so that only the original text remains, and storing the text in the original data table;
obtaining the content to be analyzed from the original data table, analyzing it according to the configured keywords, obtaining the sensitive data that contains sensitive words, storing the sensitive data in the sensitive data table, and updating the status of the original data to analyzed;
an administrator obtaining the data to be processed from the sensitive data table through a page and manually deciding whether to transfer it to public opinion handling; data transferred to public opinion handling is stored in the public opinion information table and its status is updated to transferred to public opinion; data that is not transferred is marked as ignored.
8. The method of claim 6, wherein the data capture specifically comprises:
downloading web page content from the data source websites to be monitored through a downloader, and sending the downloaded content to a pre-cleaning module for processing;
the parser parsing the pre-cleaned data according to the parsing script, extracting the useful information and sending the parsed content to an output pipeline; if the parser discovers a new link during parsing, the newly discovered link is passed to the scheduler that manages the list of URLs to be downloaded, and the scheduler deduplicates the URL list and calls the downloader to download the corresponding content; the parsing script is written in xpath, and the parser supports the Jsoup parsing tool;
and the output pipeline outputting and storing the results produced by the parser, with support for output to files and databases.
9. The method of claim 8, wherein the data analysis specifically comprises:
loading text content from the original database and the file system using a data loader, obtaining the list of content to be analyzed through SQL statements or file reads, and filtering out already processed data according to file names and data identifiers;
calling a word frequency analyzer to perform word frequency analysis and statistics on the content to be analyzed using the DFA algorithm in combination with the preset keywords;
and outputting the analysis and statistical results to a data warehouse or a file system, and storing them in separate subject libraries.
10. The network public opinion analysis method based on a DFA algorithm according to claim 6, wherein the data source websites include portals, forums, blogs, microblogs and official accounts.
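The scheduler behaviour recited in claims 3 and 8 (URL deduplication, first-in first-out processing, calling the downloader) can be illustrated with a short Python sketch. It is an assumption-laden stand-in: an in-memory `collections.deque` replaces the Redis message queue named in the claims, and the class name `Scheduler`, the function `fake_downloader` and the example URLs are hypothetical.

```python
from collections import deque


class Scheduler:
    """Manage the list of URLs to download: deduplicate, then process FIFO."""

    def __init__(self, downloader):
        self._queue = deque()   # in-memory FIFO standing in for the Redis queue
        self._seen = set()      # every URL ever enqueued, used for deduplication
        self._downloader = downloader

    def add_url(self, url):
        """Enqueue a URL unless it was already seen; return True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def run(self):
        """Download queued URLs one by one in first-in, first-out order."""
        results = []
        while self._queue:
            url = self._queue.popleft()
            results.append(self._downloader(url))
        return results


def fake_downloader(url):
    # Hypothetical downloader: returns a placeholder instead of fetching the page.
    return f"content of {url}"


scheduler = Scheduler(fake_downloader)
for link in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    scheduler.add_url(link)
pages = scheduler.run()
```

In a deployment along the lines of the claims, the queue and the seen-set would live in Redis so that several crawler processes could share them; the in-memory version is used here only to keep the sketch self-contained.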

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010971747.7A CN112100535A (en) 2020-09-16 2020-09-16 Network public opinion analysis system and method based on DFA algorithm

Publications (1)

Publication Number Publication Date
CN112100535A true CN112100535A (en) 2020-12-18

Family

ID=73759224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010971747.7A Pending CN112100535A (en) 2020-09-16 2020-09-16 Network public opinion analysis system and method based on DFA algorithm

Country Status (1)

Country Link
CN (1) CN112100535A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination