CN111581480B - News information aggregation analysis method and system, terminal and storage medium - Google Patents

News information aggregation analysis method and system, terminal and storage medium

Info

Publication number
CN111581480B
Authority
CN
China
Prior art keywords
data
analysis
aggregation
interface
aggregator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010397390.6A
Other languages
Chinese (zh)
Other versions
CN111581480A (en)
Inventor
舒胜宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Fengyuan Technology Co ltd
Original Assignee
Hangzhou Fengyuan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Hangzhou Fengyuan Technology Co ltd
Priority to CN202010397390.6A
Publication of CN111581480A
Application granted
Publication of CN111581480B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a news information aggregation analysis method and system, a terminal, and a storage medium. The method mainly comprises the following steps: collecting original data based on a data collection configuration defined by a standard aggregation interface, and aggregating the data according to structuring rules defined by the standard aggregation interface; performing de-duplication preprocessing on the aggregated data according to the structuring rules defined by the interface of the aggregator and a Chinese word segmentation technique to obtain structured data; and classifying the data according to definitions of the structured data in different dimensions, and outputting a classification report. Because the data is de-duplicated and preprocessed after collection and aggregation and only then analyzed and classified, the application provides big data analysis with a sustainable analysis and processing pipeline that can readily be adapted to specific requirements, and addresses the collection problems caused by disorderly news sources, non-uniform content structure, and the sheer volume of news information.

Description

News information aggregation analysis method and system, terminal and storage medium
Technical Field
The application relates to the technical field of big data analysis, in particular to a news information aggregation analysis method and system, a terminal and a storage medium.
Background
With the growing popularity of the Internet, network data and news information have entered an age of explosive growth, and news aggregation analysis is needed in more and more settings, from public opinion analysis to reading information in bulk. Current common technical solutions mainly include RSS subscription, web scraping tools such as "Train Head", and open-source Python scraping scripts. Most of them have problems: RSS subscription is gradually being phased out; scraping tools such as "Train Head" cannot be built into a system that is continuously maintained and operated; and open-source scripts lack many functions and cannot meet practical application requirements. Meanwhile, because collection sources are continuously updated, an early-warning function is required so that the system can be maintained continuously and kept running normally. Finally, most solutions only collect and aggregate: they do not process the data, so scattered and mixed information piles up, which is a major obstacle to subsequent data processing.
Disclosure of Invention
The embodiments of the application provide a news information aggregation analysis method and system, a terminal, and a storage medium, in which data is de-duplicated and preprocessed after collection and aggregation and then analyzed and classified. This provides big data analysis with a sustainable analysis and processing pipeline that can readily be adapted to specific requirements, and addresses the problems of disorderly news sources, non-uniform content structure, and the sheer volume of news information.
A first aspect of the present application provides a news information aggregation analysis method, which may include:
collecting original data based on data collection configuration defined by a standard aggregation interface, and carrying out data aggregation according to a structuring rule defined by the standard aggregation interface;
performing de-duplication preprocessing on the aggregated data according to a structuring rule defined by an interface of an aggregator and a Chinese word segmentation technology to obtain structured data;
and classifying the data according to the definition of the structured data in different dimensions, and outputting a classification report.
Further, the method further comprises:
and storing the aggregated data to a database cluster.
Further, performing de-duplication preprocessing on the aggregated data according to the structuring rules defined by the interface of the aggregator and the Chinese word segmentation technique to obtain structured data includes:
performing structural analysis and intelligent semantic analysis on the aggregated data corresponding to each article according to the structuring rules defined by the interface of the aggregator and the Chinese word segmentation technique;
and obtaining keywords according to the analysis result, simulating how a human reader would infer the meaning the article expresses, and automatically obtaining the parsed structured data.
Further, the report template of the classification report is a custom content template.
Further, the aggregator includes a standard aggregator and an extensible custom aggregator.
Further, the method further comprises:
and adopting an extensible custom aggregator to carry out real-time alarm and outputting an acquisition report.
A second aspect of the present application provides a news information aggregation analysis system, which may include:
the acquisition and aggregation module is used for acquiring original data based on data acquisition configuration defined by a standard aggregation interface and carrying out data aggregation according to a structuring rule defined by the standard aggregation interface;
the de-duplication preprocessing module is used for performing de-duplication preprocessing on the aggregated data according to the structuring rules defined by the interface of the aggregator and the Chinese word segmentation technology to obtain structured data;
and the data classification module is used for classifying the data according to definitions of the structured data in different dimensions and outputting a classification report.
Further, the system further comprises:
and the data storage module is used for storing the aggregated data to the database cluster.
Further, the deduplication preprocessing module includes:
the data analysis unit is used for carrying out structural analysis and intelligent semantic analysis on the aggregation data corresponding to each article according to the structural rules and the Chinese word segmentation technology defined by the interface of the aggregator;
and the structural analysis unit is used for obtaining keywords according to the analysis result, simulating how a human reader would infer the meaning the article expresses, and automatically obtaining the parsed structured data.
Further, the report template of the classification report is a custom content template.
Further, aggregators include standard aggregators and extensible custom aggregators.
Further, the system further comprises:
and the real-time alarm module is used for carrying out real-time alarm by adopting the extensible custom aggregator and outputting an acquisition report.
A third aspect of the embodiments of the present application provides a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of:
collecting original data based on data collection configuration defined by a standard aggregation interface, and carrying out data aggregation according to a structuring rule defined by the standard aggregation interface;
performing de-duplication preprocessing on the aggregated data according to a structuring rule defined by an interface of an aggregator and a Chinese word segmentation technology to obtain structured data;
and classifying the data according to the definition of the structured data in different dimensions, and outputting a classification report.
A fourth aspect of an embodiment of the present application provides a terminal, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the steps of:
collecting original data based on data collection configuration defined by a standard aggregation interface, and carrying out data aggregation according to a structuring rule defined by the standard aggregation interface;
performing de-duplication preprocessing on the aggregated data according to a structuring rule defined by an interface of an aggregator and a Chinese word segmentation technology to obtain structured data;
and classifying the data according to the definition of the structured data in different dimensions, and outputting a classification report.
The application has the following beneficial effects: the aggregation analysis process for news information data is divided into three steps, namely collection and aggregation, de-duplication preprocessing, and automatic classification. Data aggregation uses aggregators that have a standardized interface and can be freely extended and customized, and data classification uses freely extensible classification strategies and report content templates. This largely solves the collection problems caused by disorderly news sources, non-uniform content structure, and the sheer volume of news information. Meanwhile, automatic de-duplication preprocessing removes noise and homogeneous content as far as possible, which greatly reduces the load on subsequent data processing; structured data analysis and storage provide a regular data source for further big data analysis; and real-time alarms facilitate later maintenance and updating of the system.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a news information aggregation analysis method according to an embodiment of the present application;
FIG. 2 is a block diagram of standard aggregation interface definition parameters provided by an embodiment of the present application;
FIG. 3 is a diagram of a deduplication preprocessing architecture provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a data classification structure according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a news information aggregation analysis system according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a deduplication preprocessing module according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "comprising" and "having" and any variations thereof in the description and claims of the application and in the foregoing drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Furthermore, the terms "mounted," "configured," "provided," "connected," "coupled," and "sleeved" are to be construed broadly. For example, it may be a fixed connection, a removable connection, or a unitary construction; may be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements, or components. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
The terminal related to the embodiment of the application can be a mainframe computer, a PC, a tablet personal computer, a palm computer, a Mobile Internet Device (MID) and other terminal devices with data processing capability.
As shown in FIG. 1, in this embodiment, the news information aggregation analysis method includes at least the following steps:
S101, collecting original data based on a data collection configuration defined by a standard aggregation interface, and aggregating the data according to structuring rules defined by the standard aggregation interface.
In a specific implementation, the system can build the data collection and aggregation part as interface plug-ins. The system not only defines common standard aggregators for large news sources (Internet news, Sina News, Phoenix News, Sohu News, Xinhuanet, People's Daily Online, and the like), but also allows custom aggregators to be added freely. Optionally, the system also provides functions such as real-time alarms and acquisition report notifications.
It should be noted that the standard aggregation interface essentially defines two things: the data collection configuration used when collecting original data, and the standardized structure generation rules used when aggregating data. Specifically, as shown in FIG. 2, the data collection configuration may include the collection source address, the collection frequency, and other collection information, and the structuring rules may include title rules, person rules, time rules, location rules, and other custom structuring rules.
In an alternative implementation, the system may store the aggregated data into a database cluster after collecting and aggregating the data.
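As an illustration of S101, the following is a minimal sketch of what a plug-in style aggregation interface could look like. It is not the patented implementation: the class names (`CollectionConfig`, `Aggregator`, `StaticAggregator`), the use of regular expressions as structuring rules, and the example URL are all assumptions made for this sketch.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
import re


@dataclass
class CollectionConfig:
    """Data collection configuration (FIG. 2): where and how often to collect."""
    source_url: str
    frequency_seconds: int
    extra: dict = field(default_factory=dict)  # other collection information


class Aggregator(ABC):
    """Hypothetical standard aggregation interface: a config plus structuring rules."""

    config: CollectionConfig
    # Structuring rules (assumed form): field name -> regex with one capture group.
    rules: dict

    @abstractmethod
    def fetch(self) -> list:
        """Return the raw documents (strings) collected from the source."""

    def aggregate(self) -> list:
        """Apply every structuring rule to every raw document."""
        records = []
        for raw in self.fetch():
            record = {"source": self.config.source_url}
            for name, pattern in self.rules.items():
                match = re.search(pattern, raw)
                record[name] = match.group(1).strip() if match else None
            records.append(record)
        return records


class StaticAggregator(Aggregator):
    """Custom aggregator that serves canned text instead of fetching over the network."""

    def __init__(self, config, rules, documents):
        self.config = config
        self.rules = rules
        self._documents = documents

    def fetch(self):
        return list(self._documents)


demo = StaticAggregator(
    CollectionConfig("https://news.example.com/feed", 600),
    {"title": r"Title: (.+)", "time": r"Time: (.+)", "location": r"Place: (.+)"},
    ["Title: Flood warning issued\nTime: 2020-05-12\nPlace: Hangzhou"],
)
print(demo.aggregate())
```

In this sketch a new source is added by subclassing `Aggregator` and overriding `fetch`, which mirrors the freely extensible custom aggregator described above; a real aggregator would fetch over the network and could hand its records to a database cluster.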
S102, performing de-duplication preprocessing on the aggregated data according to the structuring rules defined by the interface of the aggregator and a Chinese word segmentation technique to obtain structured data.
In a specific implementation, the first stage of the system's data analysis and processing is de-duplication preprocessing. As shown in FIG. 3, de-duplication preprocessing obtains structured data through rule analysis, Chinese word segmentation, and semantic word segmentation. Specifically, according to the structuring rules defined by the aggregator interface and automatic Chinese word segmentation, each article undergoes structural analysis and intelligent semantic analysis; keywords are obtained according to their weights; the system simulates how a human reader would infer the meaning the article expresses; and the parsed structured data is obtained automatically. Finally, based on the structured data, the de-duplication engine automatically de-duplicates articles whose structured data are close in multiple items.
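The de-duplication step can be sketched roughly as follows. The patent does not specify the segmenter, the keyword weighting, or the closeness measure, so everything here is an assumption: character bigrams stand in for a real Chinese word segmenter, term frequency stands in for keyword weight, and Jaccard similarity with a hypothetical threshold of 0.6 stands in for "close" structured data.

```python
from collections import Counter


def segment(text):
    """Stand-in for a Chinese word segmenter: overlapping character bigrams."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def top_keywords(text, k=5):
    """Weight tokens by frequency and keep the k heaviest as the article's keywords."""
    counts = Counter(segment(text))
    return {token for token, _ in counts.most_common(k)}


def similarity(a, b):
    """Jaccard similarity between two keyword sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(articles, threshold=0.6):
    """Keep an article only if its keywords are not close to an already kept one."""
    kept = []
    for article in articles:
        keywords = top_keywords(article["body"])
        if all(similarity(keywords, other["keywords"]) < threshold for other in kept):
            kept.append({**article, "keywords": keywords})
    return kept


articles = [
    {"id": 1, "body": "杭州发布暴雨预警"},
    {"id": 2, "body": "杭州发布暴雨预警"},  # same story republished by another source
    {"id": 3, "body": "股市今日大幅上涨"},
]
print([a["id"] for a in deduplicate(articles)])
```

A production system would replace `segment` with a proper segmentation library and would compare several structured fields (title, person, time, location) rather than keywords alone.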
S103, classifying the data according to the definition of the structured data in different dimensions, and outputting a classification report.
It will be appreciated that after de-duplication preprocessing, the system enters the automatic article classification stage. Specifically, as shown in FIG. 4, the system may define grouping methods according to each dimension of the obtained structured data, such as event category, person name, event name, time, and place, and also define report content templates; the report engine then performs automatic classification and outputs reports for review and real-time notification.
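The classification step can be illustrated with a small sketch that groups structured records along one dimension and renders each group through a custom report template. The record fields, the template wording, and the function names are hypothetical; the patent only states that dimensions and report content templates are definable.

```python
from collections import defaultdict
from string import Template

# Hypothetical custom report content template.
REPORT_TEMPLATE = Template("[$dimension = $value] $count article(s): $titles")


def classify(records, dimension):
    """Group structured records by the value they hold in one dimension."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(dimension) or "unknown"].append(record)
    return dict(groups)


def render_report(records, dimension, template=REPORT_TEMPLATE):
    """Render one report line per group, sorted by group value."""
    lines = []
    for value, members in sorted(classify(records, dimension).items()):
        lines.append(template.substitute(
            dimension=dimension,
            value=value,
            count=len(members),
            titles="; ".join(m["title"] for m in members),
        ))
    return "\n".join(lines)


records = [
    {"title": "Flood warning issued", "place": "Hangzhou", "category": "weather"},
    {"title": "Metro line opens", "place": "Hangzhou", "category": "transport"},
    {"title": "Typhoon approaches", "place": "Ningbo", "category": "weather"},
]
print(render_report(records, "place"))
print(render_report(records, "category"))
```

Because the dimension and the template are both parameters, the same records can be reported by place, by category, or by any other structured field without changing the engine.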
In the embodiment of the application, the aggregation analysis process for news information data is divided into three steps, namely collection and aggregation, de-duplication preprocessing, and automatic classification. Data aggregation uses aggregators that have a standardized interface and can be freely extended and customized, and data classification uses freely extensible classification strategies and report content templates. This largely solves the collection problems caused by disorderly news sources, non-uniform content structure, and the sheer volume of news information. Meanwhile, automatic de-duplication preprocessing removes noise and homogeneous content as far as possible, which greatly reduces the load on subsequent data processing; structured data analysis and storage provide a regular data source for further big data analysis; and real-time alarms facilitate later maintenance and updating of the system.
The news information aggregation analysis system according to the embodiment of the present application will be described in detail with reference to FIG. 5 and FIG. 6. It should be noted that the news information aggregation analysis system shown in FIG. 5 and FIG. 6 is used to execute the methods of the embodiments shown in FIG. 1 to FIG. 4. For convenience of explanation, only the portions relevant to the embodiments of the present application are shown; for specific technical details not described here, refer to the embodiments shown in FIG. 1 to FIG. 4 of the present application.
Referring to FIG. 5, a schematic structural diagram of a news information aggregation analysis system is provided in an embodiment of the present application. As shown in FIG. 5, the news information aggregation analysis system 10 according to an embodiment of the present application may include: an acquisition and aggregation module 101, a deduplication preprocessing module 102, a data classification module 103, a data storage module 104, and a real-time alarm module 105. The deduplication preprocessing module 102 includes a data analysis unit 1021 and a structural analysis unit 1022, as shown in FIG. 6.
The collection and aggregation module 101 is configured to collect raw data based on a data collection configuration defined by a standard aggregation interface, and aggregate the data according to a structuring rule defined by the standard aggregation interface.
The data storage module 104 is configured to store the aggregated data to a database cluster.
The deduplication preprocessing module 102 is configured to perform deduplication preprocessing on the aggregated data according to the structuring rules defined by the interface of the aggregator and the Chinese word segmentation technique to obtain structured data.
In a specific implementation, the aggregator includes a standard aggregator and an extensible custom aggregator.
In alternative embodiments, the deduplication preprocessing module 102 may include the following elements:
and the data analysis unit 1021 is used for carrying out structural analysis and intelligent semantic analysis on the aggregated data corresponding to each article according to the structural rules and Chinese word segmentation technology defined by the interface of the aggregator.
And the structural analysis unit 1022 is used for obtaining keywords according to the analysis result, simulating how a human reader would infer the meaning the article expresses, and automatically obtaining the parsed structured data.
The data classification module 103 is configured to classify data according to definitions of the structured data in different dimensions, and output a classification report.
It should be noted that the report template of the classification report is a custom content template.
The real-time alarm module 105 is configured to raise real-time alarms by using the extensible custom aggregator and to output an acquisition report.
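The patent does not describe how alarms are triggered, so the following is only one plausible sketch: each source is run in turn, an alarm callback fires as soon as a source raises an error or returns nothing, and a per-run acquisition report is returned. `AcquisitionReport`, `run_collection`, and the alarm callback signature are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class AcquisitionReport:
    """Per-run summary: records collected per source, and sources that failed."""
    collected: dict = field(default_factory=dict)
    failures: dict = field(default_factory=dict)


def run_collection(sources, alarm):
    """Run every source; call alarm(name, reason) at once on error or empty result."""
    report = AcquisitionReport()
    for name, fetch in sources.items():
        try:
            records = fetch()
        except Exception as exc:  # a changed or unreachable source must not stop the run
            report.failures[name] = repr(exc)
            alarm(name, repr(exc))
            continue
        report.collected[name] = len(records)
        if not records:
            report.failures[name] = "no records collected"
            alarm(name, "no records collected")
    return report


def broken_source():
    raise ValueError("page layout changed")


alerts = []
report = run_collection(
    {"ok": lambda: ["a", "b"], "empty": lambda: [], "broken": broken_source},
    alarm=lambda name, reason: alerts.append((name, reason)),
)
print(report)
print(alerts)
```

In a deployed system the callback would send a notification (for example by email or chat message) instead of appending to a list, so that a source whose layout has changed is noticed and its aggregator can be updated.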
It should be noted that, the detailed execution process of each unit and module in the above system may be referred to the description in the above method embodiment, and will not be repeated here.
In the embodiment of the application, the aggregation analysis process for news information data is divided into three steps, namely collection and aggregation, de-duplication preprocessing, and automatic classification. Data aggregation uses aggregators that have a standardized interface and can be freely extended and customized, and data classification uses freely extensible classification strategies and report content templates. This largely solves the collection problems caused by disorderly news sources, non-uniform content structure, and the sheer volume of news information. Meanwhile, automatic de-duplication preprocessing removes noise and homogeneous content as far as possible, which greatly reduces the load on subsequent data processing; structured data analysis and storage provide a regular data source for further big data analysis; and real-time alarms facilitate later maintenance and updating of the system.
The embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are adapted to be loaded by a processor and execute the method steps of the embodiment shown in fig. 1 to fig. 4, and the specific execution process may refer to the specific description of the embodiment shown in fig. 1 to fig. 4, which is not repeated herein.
Referring to FIG. 7, a schematic structural diagram of a terminal is provided in an embodiment of the present application. As shown in FIG. 7, the terminal 1000 may include: at least one processor 1001 (such as a CPU), at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the processor 1001. As shown in FIG. 7, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a news information aggregation analysis application.
In the terminal 1000 shown in FIG. 7, the user interface 1003 is mainly used to provide an input interface for the user and to acquire data input by the user; the network interface 1004 is used for data communication with a user terminal; and the processor 1001 may be configured to invoke the news information aggregation analysis application stored in the memory 1005 and specifically perform the following operations:
collecting original data based on data collection configuration defined by a standard aggregation interface, and carrying out data aggregation according to a structuring rule defined by the standard aggregation interface;
performing de-duplication preprocessing on the aggregated data according to a structuring rule defined by an interface of an aggregator and a Chinese word segmentation technology to obtain structured data;
and classifying the data according to the definition of the structured data in different dimensions, and outputting a classification report.
In some embodiments, the processor 1001 is further configured to:
and storing the aggregated data to a database cluster.
In some embodiments, when performing deduplication preprocessing on the aggregated data according to the structuring rules defined by the interface of the aggregator and the Chinese word segmentation technique to obtain structured data, the processor 1001 specifically performs the following operations:
performing structural analysis and intelligent semantic analysis on the aggregated data corresponding to each article according to the structuring rules defined by the interface of the aggregator and the Chinese word segmentation technique;
and obtaining keywords according to the analysis result, simulating how a human reader would infer the meaning the article expresses, and automatically obtaining the parsed structured data.
In some embodiments, the report template of the classification report is a custom content template.
In some embodiments, the aggregator includes a standard aggregator and an extensible custom aggregator.
In some embodiments, the processor 1001 is further configured to:
and adopting an extensible custom aggregator to carry out real-time alarm and outputting an acquisition report.
In the embodiment of the application, the aggregation analysis process for news information data is divided into three steps, namely collection and aggregation, de-duplication preprocessing, and automatic classification. Data aggregation uses aggregators that have a standardized interface and can be freely extended and customized, and data classification uses freely extensible classification strategies and report content templates. This largely solves the collection problems caused by disorderly news sources, non-uniform content structure, and the sheer volume of news information. Meanwhile, automatic de-duplication preprocessing removes noise and homogeneous content as far as possible, which greatly reduces the load on subsequent data processing; structured data analysis and storage provide a regular data source for further big data analysis; and real-time alarms facilitate later maintenance and updating of the system.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program stored in a computer-readable storage medium; when executed, the program may include the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (9)

1. A news information aggregation analysis method, comprising:
collecting original data based on data collection configuration defined by a standard aggregation interface, and carrying out data aggregation according to a structuring rule defined by the standard aggregation interface;
performing de-duplication preprocessing on the aggregated data according to a structuring rule defined by an interface of an aggregator and a Chinese word segmentation technology to obtain structured data;
classifying the data according to the definition of the structured data in different dimensions, outputting a classification report,
wherein performing de-duplication preprocessing on the aggregated data according to the structuring rule defined by the interface of the aggregator and the Chinese word segmentation technology to obtain structured data comprises:
performing structural analysis and intelligent semantic analysis on the aggregated data corresponding to each article according to the structuring rule defined by the interface of the aggregator and the Chinese word segmentation technology;
and obtaining keywords from the analysis results, inferring the meaning expressed by the article in a manner that simulates manual reading, automatically obtaining the parsed structured data, and automatically de-duplicating, according to the structured data, a plurality of articles whose structured data are similar.
2. The aggregation analysis method according to claim 1, wherein the method further comprises:
storing the aggregated data in a database cluster.
3. The aggregation analysis method according to claim 1, wherein
the report template of the classification report is a custom content template.
4. The aggregation analysis method according to claim 1, wherein
the aggregator includes a standard aggregator and an extensible custom aggregator.
5. The aggregation analysis method according to claim 4, wherein the method further comprises:
using the extensible custom aggregator to issue real-time alarms and output an acquisition report.
6. A news information aggregation analysis system, comprising:
the acquisition and aggregation module is used for acquiring original data based on data acquisition configuration defined by a standard aggregation interface and carrying out data aggregation according to a structuring rule defined by the standard aggregation interface;
the de-duplication preprocessing module is used for performing de-duplication preprocessing on the aggregated data according to the structuring rules defined by the interface of the aggregator and the Chinese word segmentation technology to obtain structured data;
the data classifying module is used for classifying the data according to the definition of the structured data in different dimensionalities and outputting a classifying report,
the deduplication preprocessing module comprises:
the data analysis unit is used for carrying out structural analysis and intelligent semantic analysis on the aggregation data corresponding to each article according to the structural rules and the Chinese word segmentation technology defined by the interface of the aggregator;
and the structural analysis unit is used for obtaining keywords from the analysis results, inferring the meaning expressed by the article in a manner that simulates manual reading, automatically obtaining the parsed structured data, and automatically de-duplicating, according to the structured data, a plurality of articles whose structured data are similar.
7. The aggregation analysis system according to claim 6, wherein the system further comprises:
a data storage module used for storing the aggregated data in a database cluster.
8. A terminal, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the steps of:
collecting original data and performing data aggregation according to a standardized structure generation rule of a standard aggregation interface;
performing de-duplication preprocessing on the aggregated data according to a structuring rule defined by an interface of an aggregator and a Chinese word segmentation technology to obtain structured data;
classifying the data according to the definition of the structured data in different dimensions, outputting a classification report,
wherein performing de-duplication preprocessing on the aggregated data according to the structuring rule defined by the interface of the aggregator and the Chinese word segmentation technology to obtain structured data comprises:
performing structural analysis and intelligent semantic analysis on the aggregated data corresponding to each article according to the structuring rule defined by the interface of the aggregator and the Chinese word segmentation technology;
and obtaining keywords from the analysis results, inferring the meaning expressed by the article in a manner that simulates manual reading, automatically obtaining the parsed structured data, and automatically de-duplicating, according to the structured data, a plurality of articles whose structured data are similar.
9. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of:
collecting original data and performing data aggregation according to a standardized structure generation rule of a standard aggregation interface;
performing de-duplication preprocessing on the aggregated data according to a structuring rule defined by an interface of an aggregator and a Chinese word segmentation technology to obtain structured data;
classifying the data according to the definition of the structured data in different dimensions, outputting a classification report,
wherein performing de-duplication preprocessing on the aggregated data according to the structuring rule defined by the interface of the aggregator and the Chinese word segmentation technology to obtain structured data comprises:
performing structural analysis and intelligent semantic analysis on the aggregated data corresponding to each article according to the structuring rule defined by the interface of the aggregator and the Chinese word segmentation technology;
and obtaining keywords from the analysis results, inferring the meaning expressed by the article in a manner that simulates manual reading, automatically obtaining the parsed structured data, and automatically de-duplicating, according to the structured data, a plurality of articles whose structured data are similar.
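For illustration only, the de-duplication step recited in the claims (extract keywords from each article, then drop articles whose keyword sets are close to one already kept) might be sketched as below. The claims do not specify a similarity measure or a segmenter, so everything here is an assumption: Jaccard similarity over keyword sets is a swapped-in technique, overlapping two-character grams stand in for a real Chinese word segmentation library, and the names `keywords`, `jaccard`, `deduplicate` and the threshold of 0.5 are hypothetical.

```python
import re


def keywords(text: str) -> set[str]:
    """Stand-in for Chinese word segmentation (an assumption): lowercase Latin
    words plus overlapping two-character grams of each run of CJK characters."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        if len(run) == 1:
            tokens.add(run)  # keep a lone character as its own token
        tokens.update(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(articles: list[str], threshold: float = 0.5) -> list[str]:
    """Keep an article only if it is below the threshold against every kept one."""
    kept: list[tuple[str, set[str]]] = []
    for text in articles:
        kw = keywords(text)
        if all(jaccard(kw, seen) < threshold for _, seen in kept):
            kept.append((text, kw))
    return [text for text, _ in kept]


docs = [
    "央行宣布下调利率",      # "Central bank announces rate cut"
    "央行今日宣布下调利率",  # same story with one added word ("today")
    "新款手机下月发布",      # unrelated story about a phone release
]
unique = deduplicate(docs)
```

The second headline shares six of ten distinct two-character grams with the first, so it is dropped as a near-duplicate, while the unrelated third headline is kept. The pairwise comparison is quadratic in the number of kept articles, which is acceptable for a sketch but would need an index for large volumes.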
CN202010397390.6A 2020-05-12 2020-05-12 News information aggregation analysis method and system, terminal and storage medium Active CN111581480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010397390.6A CN111581480B (en) 2020-05-12 2020-05-12 News information aggregation analysis method and system, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN111581480A CN111581480A (en) 2020-08-25
CN111581480B true CN111581480B (en) 2023-09-08

Family

ID=72118970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010397390.6A Active CN111581480B (en) 2020-05-12 2020-05-12 News information aggregation analysis method and system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111581480B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231661A (en) * 2008-02-19 2008-07-30 上海估家网络科技有限公司 Method and system for digging object grade knowledge
CN102546771A (en) * 2011-12-27 2012-07-04 西安博构电子信息科技有限公司 Cloud mining network public opinion monitoring system based on characteristic model
CN104933093A (en) * 2015-05-19 2015-09-23 武汉泰迪智慧科技有限公司 Regional public opinion monitoring and decision-making auxiliary system and method based on big data
CN106959944A (en) * 2017-02-14 2017-07-18 中国电子科技集团公司第二十八研究所 A kind of Event Distillation method and system based on Chinese syntax rule
CN107656995A (en) * 2017-09-20 2018-02-02 温州市鹿城区中津先进科技研究院 Towards the data management system of big data
CN109657068A (en) * 2018-11-30 2019-04-19 北京航空航天大学 Historical relic knowledge mapping towards wisdom museum generates and method for visualizing
KR20190047941A (en) * 2017-10-30 2019-05-09 한림대학교 산학협력단 Method and apparatus for integration of text data collection and analysis
CN109800350A (en) * 2018-12-21 2019-05-24 中国电子科技集团公司信息科学研究院 A kind of Personalize News recommended method and system, storage medium
WO2019133157A1 (en) * 2017-12-28 2019-07-04 Microsoft Technology Licensing, Llc Enhanced data aggregation techniques for anomaly detection and analysis
CN110147439A (en) * 2018-07-18 2019-08-20 中山大学 A kind of news event detecting method and system based on big data processing technique
CN110289095A (en) * 2019-06-28 2019-09-27 青岛百洋智能科技股份有限公司 A kind of fracture of neck of femur clinic intelligence aided decision method and system
CN110674296A (en) * 2019-09-17 2020-01-10 上海仪电(集团)有限公司中央研究院 Information abstract extraction method and system based on keywords

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9165085B2 (en) * 2009-11-06 2015-10-20 Kipcast Corporation System and method for publishing aggregated content on mobile devices

Also Published As

Publication number Publication date
CN111581480A (en) 2020-08-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant