CN112836110B

CN112836110B - Hotspot information mining method and device, computer equipment and storage medium

Info

Publication number: CN112836110B
Application number: CN202110169266.9A
Authority: CN
Inventors: 高登科; 徐桢虎; 李少博; 陈涵宇; 余伟
Original assignee: Sichuan Cover Media Co ltd
Current assignee: Sichuan Cover Media Co ltd
Priority date: 2021-02-07
Filing date: 2021-02-07
Publication date: 2022-09-16
Anticipated expiration: 2041-02-07
Also published as: CN112836110A

Abstract

The invention relates to the technical field of data mining and discloses a hotspot information mining method and device, computer equipment and a storage medium. Using only publicly available network data, the method crawls multi-source hot-list topics and news information in real time, screens and filters that information with bad-information auditing and deep deduplication techniques, and finally discovers hot topics through hot spot fusion and constructs a hot topic news library. The mining result is therefore highly precise, the amount of data the mining process depends on is greatly reduced, and the method is reliable, timely, free of bad information and robust, so that it suits real-world scenarios well. In addition, real-time news from the whole network can be matched against hot topics in multiple modalities, such as text, pictures and videos, and the hot topic news library can be enriched directly from the matching results, which greatly increases the size and diversity of the news library under each hot topic.

Description

Hotspot information mining method and device, computer equipment and storage medium

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a hotspot information mining method and device, computer equipment and a storage medium.

Background

A hot spot is news or information that attracts relatively high public attention and popularity; the term also refers to a place or issue that draws notice during a certain period. Hot spots are characterized by high attention, high timeliness and wide propagation, and are an important type of information in today's society.

Hot spot mining, a key technology for discovering hot spot information, identifies hot spots and constructs a hot spot news library from hot spot information content and social group reactions, and has drawn increasing attention and application in recent years. At present, techniques that identify hot spots from information content alone have low accuracy, while techniques that fuse social group reactions with information propagation depend excessively on massive, high-dimensional and high-quality feedback and propagation data. Both kinds of technique are easily affected by false information, noisy data and bad information, and generalize poorly to real-world applications.

Disclosure of Invention

Existing hotspot information mining methods suffer from low precision, low usability of network data (false, noisy and bad information), high latency, and excessive dependence on massive hot content and propagation feedback data. To solve these problems, the invention aims to provide a hotspot information mining method and device, computer equipment and a storage medium. Using only publicly available network data, multi-source hot-list topics and news information can be crawled in real time, the news information is screened and filtered using bad-information auditing and deep deduplication techniques, and hot topics are finally discovered through hot spot fusion and a hot topic news library is constructed. This ensures that the mining result is highly precise, greatly reduces the amount of data the mining process depends on, provides robustness that is reliable, timely and free of bad information, and suits real-world applications well.

In a first aspect, the present invention provides a method for mining hotspot information, including:

screening a plurality of target news sources from the whole-network news sources corresponding to the whole-network hot topics;

for each target news source in the target news sources, crawling to obtain at least one topic and news information corresponding to the target news source in real time based on a web crawler technology, wherein the topic and news information comprises one topic and at least one topic news belonging to the topic;

performing bad audit processing including negative information filtering processing, sensitive information filtering processing and false information filtering processing on all the crawled topics and news information to obtain at least one piece of compliant topic and news information;

performing semantic coding on each topic and news information in a text dimension, a picture dimension and a video dimension respectively aiming at the at least one piece of compliant topic and news information to obtain topic text semantic vectors, topic picture semantic vectors and topic video semantic vectors of each topic and news information, and then performing de-duplication processing on the at least one piece of compliant topic and news information in a text similar dimension, a picture similar dimension and a video similar dimension according to the topic text semantic vectors, the topic picture semantic vectors and the topic video semantic vectors of all the topic and news information to obtain at least one piece of non-repetitive topic and news information;

aiming at the at least one non-repeated topic and news information, respectively obtaining topic mapping vectors and topic distribution of each topic and news information according to at least one topic news belonging to the topic, then performing semantic clustering fusion on the at least one non-repeated topic and news information according to the topic mapping vectors, the topic distribution, the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of all the topics and news information to obtain at least one fused topic, at least one topic heat weight value and a one-to-one correspondence relationship between the at least one fused topic and the at least one topic heat weight value, and finally determining at least one hot topic from the at least one fused topic according to the magnitude degree of the at least one topic heat weight value;

and aiming at each hot topic in the at least one hot topic, constructing a corresponding hot topic news library, wherein the hot topic news library comprises all topic news belonging to the hot topic in the at least one non-repeated topic and news information.

Based on the above, a hotspot information mining scheme based on whole-network news data is provided. Using only publicly available network data, hot-list topics and news information can be crawled in real time, the news information is screened and filtered using bad-information auditing and deep deduplication techniques, and hot topics are finally discovered through hot spot fusion and a hot topic news library is constructed. The mining result is therefore highly precise, the amount of data the mining process depends on is greatly reduced, the scheme is reliable, timely, free of bad information and robust, and it suits real-world applications well.

In one possible design, after constructing, for each of the at least one hot topic, a corresponding hot topic news library, the method further includes:

crawling at least one piece of real-time news of the whole network in real time based on a web crawler technology;

performing bad audit processing comprising negative information filtering processing, sensitive information filtering processing and false information filtering processing on the at least one piece of real-time news obtained by crawling to obtain at least one piece of compliant real-time news;

performing semantic coding on each piece of real-time news in a text dimension, a picture dimension and a video dimension respectively aiming at the at least one piece of compliant real-time news to obtain a news text semantic vector, a news picture semantic vector and a news video semantic vector of each piece of real-time news, and then performing deduplication processing on the at least one piece of compliant real-time news in a text similar dimension, a picture similar dimension and a video similar dimension according to the news text semantic vector, the news picture semantic vector and the news video semantic vector of all pieces of real-time news to obtain at least one piece of non-repetitive real-time news;

aiming at the at least one piece of non-repeated real-time news, the news text semantic vector, the news picture semantic vector and the news video semantic vector of all the real-time news and the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of the at least one hot topic are led into a deep learning matching detection model constructed based on a text similar dimension, a picture similar dimension, a video similar dimension and a full connection layer, and a matching detection result of each piece of real-time news and each hot topic is obtained;

and for each hot topic in the at least one hot topic, supplementing the matched real-time news in the matching detection result to a corresponding hot topic news library.

Based on this possible design, real-time news from the whole network can be matched against hot topics in multiple modalities, such as text, pictures and videos, and the hot topic news library can be enriched directly from the matching results. This greatly increases the size and diversity of the news library under each hot topic, so that hot topics and hot news libraries that are precise, reliable, timely and free of bad information can be mined and constructed efficiently in real time.

In one possible design, a plurality of target news sources are screened from the full-network news sources corresponding to the full-network hot topics, including:

performing quantitative analysis on a news quality dimension, a news timeliness dimension and a content enrichment dimension aiming at each news source in the whole network news sources to obtain a corresponding multi-dimensional quantitative analysis result;

and aiming at each news source in the whole network news sources, if the corresponding multidimensional quantitative analysis result meets a preset condition, taking the news source as the target news source.

Based on this possible design, hot-list news from the whole network is screened by source and crawled from multiple sources in real time, which ensures the diversity and high quality of the hot spot sources.

In one possible design, crawling, in real time, for each target news source in the plurality of target news sources, at least one topic and news information corresponding to the target news source based on a web crawler technology includes:

a dynamic real-time crawling algorithm based on topic heat dimension, updating frequency dimension, updating time dimension and news source dimension quantitative customization is applied to crawl at least one original topic and news information from the target news source;

and sequentially carrying out dirty data cleaning processing and duplicate removal processing aiming at the same news source on the at least one original topic and news information to obtain the at least one topic and news information corresponding to the target news source.

Based on this possible design, a dedicated crawling algorithm is customized according to the heat, update frequency, information dimension and credibility of each topic, which greatly improves the timeliness, credibility and content richness of the hot spots.

In one possible design, performing bad audit processing including negative information filtering processing, sensitive information filtering processing and false information filtering processing on all the crawled topics and news information to obtain at least one piece of compliant topic and news information, including:

aiming at all the topics and news information obtained by crawling, a first bert pre-training model is applied to carry out sentiment polarity judgment on each topic and news information respectively, and then information corresponding to a negative judgment result is filtered out to obtain at least one non-negative topic and news information;

aiming at the at least one piece of non-negative topic and news information, respectively carrying out sensitive word detection on each topic and news information by applying a sensitive information detection algorithm based on a dictionary, pinyin, special-shaped words and/or a deep learning model, and then filtering information containing sensitive words to obtain at least one piece of insensitive topic and news information;

and aiming at the at least one piece of insensitive topic and news information, respectively carrying out false information judgment on each topic and news information by applying a false information judgment algorithm based on rules and/or a deep learning model, and then filtering information corresponding to a false judgment result to obtain the at least one piece of compliant topic and news information.

Based on this possible design, screening the crawled topics and news information for negative, sensitive and false content removes a large amount of dirty data and greatly improves the quality and desensitization of the hot spot data.

In one possible design, semantic coding is respectively carried out on each topic and news information in a text dimension, a picture dimension and a video dimension aiming at the at least one piece of compliant topic and news information to obtain topic text semantic vectors, topic picture semantic vectors and topic video semantic vectors of each topic and news information, and then de-duplication processing is carried out on the at least one piece of compliant topic and news information in a text similar dimension, a picture similar dimension and a video similar dimension according to the topic text semantic vectors, the topic picture semantic vectors and the topic video semantic vectors of all the topic and news information to obtain at least one piece of non-repetitive topic and news information, wherein the method comprises the following steps:

aiming at the at least one piece of compliant topic and news information, a second bert pre-training model is applied to map at least one text in each piece of topic and news information into a topic text semantic vector respectively, wherein the text comprises a title, a text body and/or an abstract;

aiming at the at least one piece of compliant topic and news information, mapping at least one picture in each piece of topic and news information into the topic picture semantic vector by applying a first ResNet101 pre-training model;

aiming at the at least one piece of compliant topic and news information, respectively extracting at least one video key frame in each topic and news information by using a video clustering algorithm, and then respectively mapping at least one video key frame in each topic and news information into a topic video semantic vector by using a second ResNet101 pre-training model;

and importing the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of all topic and news information into a deep learning repeated detection model constructed based on a text similar dimension, a picture similar dimension, a video similar dimension and a full connection layer to obtain an information repeated detection result, and then performing de-duplication processing on the at least one piece of compliant topic and news information according to the information repeated detection result to obtain the at least one piece of non-repeated topic and news information.

Based on this possible design, semantic information is extracted from the filtered topics and news information in multiple dimensions, such as text, pictures and videos, using deep learning, and duplicates are removed at both the literal and the semantic level, which greatly reduces the redundancy of the hot spot data.
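The multi-dimensional deduplication described above can be sketched as follows. This is a minimal illustration only: the semantic vectors are assumed to have been produced already by the bert and ResNet101 encoders, the trained deep learning repeated detection model is replaced by a single logistic unit over three cosine similarities, and the weights, bias, threshold and field names (`text`, `picture`, `video`, `id`) are hypothetical values chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equally sized vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def duplicate_probability(a, b, weights=(4.0, 3.0, 3.0), bias=-8.0):
    """Combine text, picture and video similarity with one logistic unit.

    The weights and bias are hypothetical stand-ins for a trained
    fully connected layer.
    """
    sims = [cosine(a[key], b[key]) for key in ("text", "picture", "video")]
    z = bias + sum(w * s for w, s in zip(weights, sims))
    return 1.0 / (1.0 + math.exp(-z))

def deduplicate(items, threshold=0.5):
    """Keep an item only if it is not a likely duplicate of an item already kept."""
    kept = []
    for item in items:
        if all(duplicate_probability(item, other) < threshold for other in kept):
            kept.append(item)
    return kept

# Toy semantic vectors; real ones would come from the pre-trained encoders.
items = [
    {"id": "n1", "text": [1.0, 0.0], "picture": [0.0, 1.0], "video": [1.0, 1.0]},
    {"id": "n2", "text": [2.0, 0.0], "picture": [0.0, 3.0], "video": [2.0, 2.0]},
    {"id": "n3", "text": [0.0, 1.0], "picture": [1.0, 0.0], "video": [1.0, -1.0]},
]
unique_ids = [item["id"] for item in deduplicate(items)]
```

Here `n2` points in the same direction as `n1` in all three modalities and is dropped, while `n3` is orthogonal to `n1` in every modality and is kept. The greedy pairwise comparison is quadratic in the number of items, so a real system would probably need an approximate nearest-neighbour index.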

In one possible design, aiming at the at least one piece of non-repeated topic and news information, respectively obtaining topic mapping vectors and topic distribution of each topic and news information according to at least one topic news belonging to the topic, then carrying out semantic clustering fusion on the at least one piece of non-repeated topic and news information according to the topic mapping vectors, the topic distribution, the topic text semantic vectors, the topic picture semantic vectors and the topic video semantic vectors of all the topic and news information to obtain at least one fused topic, at least one topic heat weight value and a one-to-one corresponding relation between the at least one fused topic and the at least one topic heat weight value, and finally determining at least one hot topic from the at least one fused topic according to the magnitude degree of the at least one topic heat weight value, the method comprises the following steps:

aiming at the at least one non-repeated topic and news information, respectively extracting semantic vectors of at least one topic news belonging to the topic from each topic and news information by using a doc2vec paragraph vector method, and then respectively carrying out average weighting processing on all extracted semantic vectors of each topic and news information to obtain a topic mapping vector;

aiming at the at least one non-repeated topic and news information, respectively extracting topic information of at least one topic news belonging to the topic from each topic and news information by applying a plda topic model, and then respectively carrying out average weighting processing on all the extracted topic information of each topic and news information to obtain the topic distribution;

according to the topic mapping vectors, the topic distribution, the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of all the topics and news information, performing density clustering analysis on the at least one non-repetitive topic and news information by applying a dbscan clustering algorithm to obtain the at least one fused topic, the at least one topic heat weight value and the one-to-one correspondence relationship between the at least one fused topic and the at least one topic heat weight value;

and aiming at the at least one fused topic, sequencing from large to small according to the corresponding topic heat weight value, and taking the fused topic sequenced in the front as the hot topic.

Based on this possible design, the deduplicated topics and topic news are fused and clustered by combining multimodal information, namely a topic model, document vectors and deep semantic codes of text, pictures and videos. This realizes topic fusion and reduces redundancy. The heat weight of each topic can then be computed from the clustering result and the topics ranked by heat, which greatly improves the accuracy and the application range of the topic library.
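The averaging, density clustering and heat ranking steps above can be sketched as follows. This is a simplified illustration under several assumptions: the per-news vectors are taken as already computed (for example by doc2vec), only the averaged topic mapping vector is used as the clustering feature rather than the full concatenation of all five feature types, the heat weight of a fused topic is assumed to be the total number of topic news in its cluster, and topics that dbscan marks as noise are simply left out. The function names, field names and parameter values are hypothetical.

```python
import math

def average_vectors(vectors):
    """Element-wise mean of equally sized vectors (equal weights assumed)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one cluster label per point, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels

def rank_fused_topics(topics, eps=0.5, min_pts=2):
    """topics: list of {'title': str, 'news_vectors': [[float, ...], ...]}."""
    features = [average_vectors(t["news_vectors"]) for t in topics]
    labels = dbscan(features, eps, min_pts)
    fused = {}
    for topic, label in zip(topics, labels):
        if label == -1:
            continue  # assumption: unclustered topics are not fused
        entry = fused.setdefault(label, {"titles": [], "heat": 0})
        entry["titles"].append(topic["title"])
        entry["heat"] += len(topic["news_vectors"])  # assumed heat weight
    return sorted(fused.values(), key=lambda e: e["heat"], reverse=True)

topics = [
    {"title": "A", "news_vectors": [[0.0, 0.0], [0.2, 0.0]]},
    {"title": "B", "news_vectors": [[0.0, 0.2]]},
    {"title": "C", "news_vectors": [[5.0, 5.0], [5.0, 5.2], [5.2, 5.0], [5.0, 5.0]]},
    {"title": "D", "news_vectors": [[5.1, 5.0]]},
    {"title": "E", "news_vectors": [[9.0, 0.0]]},
]
ranked = rank_fused_topics(topics)
```

In this toy data, topics C and D fuse into the hottest topic with five news items, A and B fuse into a second topic with three, and E is treated as noise. Taking the first few entries of `ranked` corresponds to selecting the fused topics ranked at the front as hot topics.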

In a second aspect, the invention provides a hot spot information mining device, which comprises a news source screening module, a multi-source real-time crawling module, a bad auditing processing module, an information duplication removing processing module, a hot topic finding module and a hot topic news library constructing module which are sequentially in communication connection;

the news source screening module is used for screening a plurality of target news sources from the whole network news sources corresponding to the whole network hot topics;

the multi-source real-time crawling module is used for crawling at least one topic and news information corresponding to each target news source in the target news sources in real time based on a web crawler technology, wherein the topic and news information comprises one topic and at least one topic news belonging to the topic;

the bad audit processing module is used for performing bad audit processing comprising negative information filtering processing, sensitive information filtering processing and false information filtering processing on all the crawled topics and news information to obtain at least one piece of compliant topic and news information;

the information duplication elimination processing module is used for respectively carrying out semantic coding on each topic and news information on a text dimension, a picture dimension and a video dimension aiming at the at least one piece of compliance topic and news information to obtain a topic text semantic vector, a topic picture semantic vector and a topic video semantic vector of each topic and news information, and then carrying out duplication elimination processing on the at least one piece of compliance topic and news information on a text similar dimension, a picture similar dimension and a video similar dimension according to the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of all topic and news information to obtain at least one piece of non-repeated topic and news information;

the hot topic finding module is used for aiming at the at least one non-repeated topic and news information, respectively obtaining topic mapping vectors and topic distribution of each topic and news information according to at least one topic news belonging to the topic, then performing semantic clustering fusion on the at least one non-repeated topic and news information according to the topic mapping vectors, the topic distribution, the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of all the topics and news information to obtain at least one fused topic, at least one topic heat weight value and a one-to-one correspondence relationship between the at least one fused topic and the at least one topic heat weight value, and finally determining at least one hot topic from the at least one fused topic according to the magnitude degree of the at least one topic heat weight value;

the hot topic news library construction module is configured to construct a corresponding hot topic news library for each hot topic in the at least one hot topic, where the hot topic news library includes all topic news belonging to the hot topic in the at least one non-repeated topic and news information.

In a third aspect, the present invention provides a computer device, comprising a memory, a processor and a transceiver, which are sequentially connected in communication, wherein the memory is used for storing a computer program, the transceiver is used for transmitting and receiving information, and the processor is used for reading the computer program and executing the hotspot information mining method according to the first aspect or any possible design.

In a fourth aspect, the present invention provides a storage medium having stored thereon instructions for performing the method of mining hotspot information as set forth in the first aspect or any possible design thereof, when the instructions are run on a computer.

In a fifth aspect, the present invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the hotspot information mining method of the first aspect or any possible design.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a hotspot information mining method provided by the present invention.

Fig. 2 is a schematic structural diagram of a hotspot information mining device provided by the present invention.

Fig. 3 is a schematic structural diagram of a computer device provided by the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. Specific structural and functional details disclosed herein are merely representative of exemplary embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of exemplary embodiments of the present invention.

It should be understood that the term "and/or", as it may appear herein, merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, B exists alone, or A and B exist at the same time. The term "/and", as it may appear herein, describes another association between objects and indicates that two relationships may exist; for example, A/and B may mean: A exists alone, or A and B exist at the same time. In addition, the character "/", as it may appear herein, generally indicates that the associated objects before and after it are in an "or" relationship.

It will be understood that when an element is referred to herein as being "connected," "linked," or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Conversely, if an element is referred to herein as being "directly connected" or "directly coupled" to another element, no intervening elements are present. In addition, other words used to describe the relationship between elements should be interpreted in a similar manner (e.g., "between ..." versus "directly between ...", "adjacent" versus "directly adjacent", etc.).

It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used herein, specify the presence of stated features, quantities, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, quantities, steps, operations, elements, components, and/or groups thereof.

It should also be noted that in some alternative designs, the functions/acts noted may occur out of the order noted in the figures. For example, two steps shown in succession in a figure may, in fact, be executed substantially concurrently, or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

It should be understood that specific details are provided in the following description to facilitate a thorough understanding of example embodiments. However, it will be understood by those of ordinary skill in the art that the example embodiments may be practiced without these specific details. For example, systems may be shown in block diagrams in order not to obscure the examples in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring the example embodiments.

As shown in fig. 1, the hotspot information mining method provided in the first aspect of this embodiment may be, but is not limited to, executed by a computer device with certain computing resources, for example, a terminal device for collecting news/articles. The hotspot information mining method may include, but is not limited to, the following steps S101 to S106.

S101, screening a plurality of target news sources from the whole-network news sources corresponding to the whole-network hot topics.

In step S101, the whole-network hot topics are the known hot topics published on the news portals across the whole network, so the whole-network news sources are specifically those news portals. Screening a plurality of target news sources from the whole-network news sources corresponding to the whole-network hot topics may include, but is not limited to, the following steps S1011 to S1012.

S1011, performing quantitative analysis on a news quality dimension, a news timeliness dimension, a content enrichment dimension and the like for each news source in the whole-network news sources to obtain a corresponding multi-dimensional quantitative analysis result.

In step S1011, each news source may be, but is not limited to being, scored on the news quality dimension, the news timeliness dimension, the content enrichment dimension and the like, and the corresponding multi-dimensional quantitative analysis result is then obtained through conventional quantitative analysis.

S1012, aiming at each news source in the whole network news sources, if the corresponding multidimensional quantitative analysis result meets a preset condition, the news source is used as the target news source.
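Steps S1011 to S1012 can be sketched as follows. The text does not specify the scoring scale, how the dimensions are combined, or what the preset condition is, so this example assumes per-dimension scores in [0, 1], a weighted sum, and a fixed threshold; the weights, threshold and source names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SourceScore:
    name: str
    quality: float      # news quality dimension, assumed to lie in [0, 1]
    timeliness: float   # news timeliness dimension, assumed to lie in [0, 1]
    richness: float     # content enrichment dimension, assumed to lie in [0, 1]

def screen_sources(sources, weights=(0.4, 0.3, 0.3), threshold=0.7):
    """Return the names of sources whose weighted score meets the threshold.

    The weighted sum and the threshold stand in for the unspecified
    "multi-dimensional quantitative analysis result" and "preset condition".
    """
    selected = []
    for s in sources:
        score = (weights[0] * s.quality
                 + weights[1] * s.timeliness
                 + weights[2] * s.richness)
        if score >= threshold:
            selected.append(s.name)
    return selected

candidates = [
    SourceScore("portal_a", 0.9, 0.8, 0.7),
    SourceScore("portal_b", 0.5, 0.9, 0.4),
    SourceScore("portal_c", 0.8, 0.6, 0.9),
]
target_sources = screen_sources(candidates)
```

With these toy scores, `portal_a` and `portal_c` pass the threshold and `portal_b` does not. A preset condition could equally be a per-dimension minimum rather than a single combined score.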

S102, for each target news source in the target news sources, crawling to obtain at least one topic and news information corresponding to the target news source in real time based on a web crawler technology, wherein the topic and news information comprises one topic and at least one topic news belonging to the topic.

In the step S102, a specific multi-source crawling method may include, but is not limited to, the following steps S1021 to S1022.

S1021, applying a dynamic real-time crawling algorithm quantitatively customized based on a topic heat dimension, an updating frequency dimension, an updating time dimension and a news source dimension, and crawling at least one original topic and news information from the target news source.

S1022, sequentially carrying out dirty data cleaning processing and duplicate removal processing aiming at the same news source on the at least one original topic and news information to obtain the at least one topic and news information corresponding to the target news source.

After step S102, all the crawled topic and news information may be structured into full fields and then stored in a database.
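Step S1022 can be sketched as follows. The text does not define what counts as dirty data or how same-source duplicates are detected, so this example assumes that cleaning means stripping HTML tags, collapsing whitespace and dropping records with an empty title or body, and that duplicates are records from one source whose cleaned title and body match exactly, ignoring case. The record fields `topic`, `title` and `body` are hypothetical.

```python
import hashlib
import re

def clean_text(text):
    """Strip HTML tags and collapse whitespace (one assumed form of cleaning)."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_and_dedup(raw_items):
    """raw_items: records crawled from ONE news source, as dicts with
    'topic', 'title' and 'body' keys."""
    seen = set()
    result = []
    for item in raw_items:
        title = clean_text(item.get("title", ""))
        body = clean_text(item.get("body", ""))
        if not title or not body:
            continue  # treat records with missing fields as dirty data
        fingerprint = hashlib.md5(
            (title.lower() + "|" + body.lower()).encode("utf-8")
        ).hexdigest()
        if fingerprint in seen:
            continue  # exact duplicate within the same source
        seen.add(fingerprint)
        result.append({"topic": item.get("topic", ""), "title": title, "body": body})
    return result

raw = [
    {"topic": "t1", "title": "<b>Hello</b>  World", "body": "Body  text"},
    {"topic": "t1", "title": "hello world", "body": "body text"},
    {"topic": "t1", "title": "", "body": "orphan body"},
]
cleaned = clean_and_dedup(raw)
```

The second record is dropped as a duplicate of the first, and the third is dropped for its empty title. This exact-match check only catches literal repeats within a source; near-duplicates across sources are left to the later semantic deduplication step.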

S103, performing bad audit processing including negative information filtering processing, sensitive information filtering processing and false information filtering processing on all the crawled topics and news information to obtain at least one piece of compliant topic and news information.

In step S103, a specific manner of the bad audit processing may include, but is not limited to, the following steps S1031 to S1033.

S1031, for all the crawled topic and news information, applying a first BERT pre-training model to judge the emotional polarity of each piece of topic and news information respectively, and then filtering out the information corresponding to a negative judgment result to obtain at least one piece of non-negative topic and news information.

In the step S1031, BERT stands for Bidirectional Encoder Representations from Transformers, a pre-training model proposed by Google in 2018, that is, the encoder of a bidirectional Transformer (the decoder is not used because it cannot access the information to be predicted). The main innovation of the model lies in its pre-training method, which uses Masked LM and Next Sentence Prediction to capture word-level and sentence-level representations respectively. The specific manner of filtering out the information corresponding to the negative judgment result may be, but is not limited to: if a certain topic corresponds to a negative judgment result, filtering out the whole topic and news information; if a certain topic news corresponds to a negative judgment result, filtering out the topic news; if all the topic news of a certain topic correspond to negative judgment results, filtering out the whole topic and news information.

S1032, for the at least one piece of non-negative topic and news information, applying a sensitive information detection algorithm based on dictionaries, pinyin, variant characters and/or deep learning models and the like to perform sensitive word detection on each piece of topic and news information respectively, and then filtering out the information containing sensitive words to obtain at least one piece of insensitive topic and news information.

In the step S1032, the sensitive word may be, but is not limited to, a pornography-related word, a violence-related word or an advertising word. The specific manner of filtering out the information containing the sensitive words may be, but is not limited to: if a certain topic contains a sensitive word, filtering out the whole topic and news information; if a certain topic news contains a sensitive word, filtering out the topic news; if all the topic news of a certain topic contain sensitive words, filtering out the whole topic and news information.
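The dictionary and variant-character parts of step S1032 can be sketched as follows. The word list and the character map are hypothetical placeholders, and the pinyin and deep learning branches mentioned in the step are not covered by this example.

```python
import unicodedata

# Hypothetical dictionary and variant-character map, for illustration only.
SENSITIVE_WORDS = {"casino", "lottery"}
VARIANT_MAP = str.maketrans({"0": "o", "1": "i", "@": "a", "$": "s"})


def normalize(text):
    """Fold common obfuscations before dictionary matching."""
    # NFKC folds full-width and compatibility forms into plain characters.
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(VARIANT_MAP)
    # Drop separators inserted to evade matching, e.g. "c-a-s-i-n-o".
    return "".join(ch for ch in text if ch.isalnum())


def find_sensitive_words(text):
    """Return the dictionary words found in the normalized text, sorted."""
    norm = normalize(text)
    return sorted(w for w in SENSITIVE_WORDS if w in norm)
```

Because spaces are removed before matching, this sketch can also match a word that only appears across a word boundary; a production detector would need to weigh that false-positive risk.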

S1033, aiming at the at least one piece of insensitive topic and news information, false information discrimination is respectively carried out on each topic and news information by applying a false information discrimination algorithm based on rules and/or a deep learning model and the like, and then information corresponding to a false discrimination result is filtered out to obtain the at least one piece of compliant topic and news information.

In the step S1033, the specific manner of filtering out the information corresponding to the false judgment result may be, but is not limited to: if a certain topic corresponds to a false judgment result, filtering out the whole topic and news information; if a certain topic news corresponds to a false judgment result, filtering out the topic news; if all the topic news of a certain topic correspond to false judgment results, filtering out the whole topic and news information.
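Steps S1031 to S1033 share the same three-case filtering rule, which can be sketched once and reused. In the sketch below, each piece of topic and news information is assumed to be a dictionary with hypothetical `topic` and `news` fields, and `is_flagged` stands in for whichever classifier (negative, sensitive or false) is being applied.

```python
def filter_topic_bundle(bundle, is_flagged):
    """Apply the three-case rule to one piece of topic and news information.

    bundle: {"topic": str, "news": [str, ...]}
    is_flagged: callable returning True for text that must be removed.
    Returns the filtered bundle, or None if the whole bundle is dropped.
    """
    # Case 1: the topic itself is flagged, so everything under it goes.
    if is_flagged(bundle["topic"]):
        return None
    # Case 2: individual flagged topic news are removed.
    kept = [n for n in bundle["news"] if not is_flagged(n)]
    # Case 3: every topic news was flagged, so the bundle is dropped.
    if not kept:
        return None
    return {"topic": bundle["topic"], "news": kept}


def audit(bundles, checks):
    """Run the checks in sequence, as S1031 -> S1032 -> S1033 do."""
    for check in checks:
        filtered = (filter_topic_bundle(b, check) for b in bundles)
        bundles = [b for b in filtered if b is not None]
    return bundles
```

For example, `audit(bundles, [is_negative, is_sensitive, is_false])` would chain three hypothetical classifiers in the order given by the steps.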

S104, performing semantic coding on each topic and news information in a text dimension, a picture dimension and a video dimension aiming at the at least one piece of compliant topic and news information respectively to obtain topic text semantic vectors, topic picture semantic vectors and topic video semantic vectors of each topic and news information, and then performing de-duplication processing on the at least one piece of compliant topic and news information in a text similar dimension, a picture similar dimension and a video similar dimension according to the topic text semantic vectors, the topic picture semantic vectors and the topic video semantic vectors of all topic and news information to obtain at least one piece of non-duplicated topic and news information.

In the step S104, a specific manner of the deduplication processing may include, but is not limited to, the following steps S1041 to S1044.

S1041, for the at least one piece of compliant topic and news information, applying a second BERT pre-training model to map at least one text in each piece of topic and news information into the topic text semantic vector, wherein the text comprises a title, a body text and/or an abstract.

S1042, for the at least one piece of compliant topic and news information, applying a first ResNet101 pre-training model (that is, a residual network ResNet with 101 layers) to map at least one picture in each piece of topic and news information into the topic picture semantic vector.

S1043, aiming at the at least one piece of compliant topic and news information, firstly, a video clustering algorithm is applied to respectively extract at least one video key frame in each topic and news information, and then, a second ResNet101 pre-training model is applied to respectively map the at least one video key frame in each topic and news information into the topic video semantic vector.
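The source does not name the video clustering algorithm used in step S1043. As a stand-in, the sketch below uses a simple sequential clustering over per-frame feature vectors: consecutive frames are grouped while they stay close to the running centroid, and the frame nearest each centroid is taken as the key frame. The feature vectors, the distance threshold and the function names are assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def extract_key_frames(features, threshold):
    """Return the index of one representative frame per cluster."""
    clusters = []  # each cluster is a list of consecutive frame indices
    for idx, feat in enumerate(features):
        if clusters:
            current = centroid([features[i] for i in clusters[-1]])
            if math.dist(current, feat) <= threshold:
                clusters[-1].append(idx)
                continue
        clusters.append([idx])
    key_frames = []
    for members in clusters:
        c = centroid([features[i] for i in members])
        key_frames.append(min(members, key=lambda i: math.dist(features[i], c)))
    return key_frames
```

The selected key frames would then be passed to the second ResNet101 pre-training model to obtain the topic video semantic vector.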

S1044, importing the topic text semantic vectors, the topic picture semantic vectors and the topic video semantic vectors of all the topic and news information into a deep learning duplicate detection model constructed based on a text similar dimension, a picture similar dimension, a video similar dimension and a fully connected layer to obtain an information duplicate detection result, and then performing deduplication processing on the at least one piece of compliant topic and news information according to the information duplicate detection result to obtain the at least one piece of non-repeated topic and news information.
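The duplicate detection in step S1044 relies on a trained deep learning model that is not reproduced here. As a simplified stand-in, the sketch below computes a cosine similarity per modality and combines the three values with fixed weights in place of the fully connected layer. The weights, the threshold and the field names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical weights standing in for a trained fully connected layer.
MODALITY_WEIGHTS = {"text": 0.6, "picture": 0.2, "video": 0.2}
DUP_THRESHOLD = 0.9


def duplicate_score(a, b):
    """Weighted sum of the text, picture and video similarities."""
    return sum(w * cosine(a[m], b[m]) for m, w in MODALITY_WEIGHTS.items())


def deduplicate(items):
    """Keep an item only if it is not a duplicate of any item kept so far."""
    kept = []
    for item in items:
        if all(duplicate_score(item, k) < DUP_THRESHOLD for k in kept):
            kept.append(item)
    return kept
```

The pairwise comparison is quadratic in the number of items; for large volumes an approximate nearest-neighbor index would normally be placed in front of it.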

S105, aiming at the at least one non-repeated topic and news information, respectively obtaining topic mapping vectors and topic distribution of each topic and news information according to at least one topic news belonging to the topic, and then performing semantic clustering fusion on the at least one non-repeated topic and news information according to the topic mapping vectors, the topic distribution, the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of all the topics and news information to obtain at least one fused topic, at least one topic heat weight value and a one-to-one correspondence relationship between the at least one fused topic and the at least one topic heat weight value, and finally determining at least one hot topic from the at least one fused topic according to the magnitude degree of the at least one topic heat weight value.

In the step S105, the specific topic fusion method may include, but is not limited to, the following steps S1051 to S1054.

S1051, for the at least one piece of non-repeated topic and news information, applying the doc2vec paragraph vector method (an unsupervised algorithm that learns fixed-length feature representations from variable-length text and is trained to predict words in a document, so that each document can be represented by a single dense vector) to extract a semantic vector of each of the at least one topic news belonging to the topic in each piece of topic and news information, and then performing average weighting processing on all the extracted semantic vectors of each piece of topic and news information to obtain the topic mapping vector.

S1052, for the at least one piece of non-repeated topic and news information, applying a PLDA topic model (Partially Labeled Dirichlet Allocation, a model proposed by Ramage et al.) to extract topic information of each of the at least one topic news belonging to the topic in each piece of topic and news information, and then performing average weighting processing on all the extracted topic information of each piece of topic and news information to obtain the topic distribution.
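Both steps S1051 and S1052 end with the same aggregation: the per-news vectors under one topic are averaged into a single vector for the topic. A minimal sketch, assuming that the average weighting processing means an equal-weight mean and that the doc2vec vectors and PLDA topic distributions have already been computed elsewhere:

```python
def mean_vector(vectors):
    """Equal-weight average of same-length vectors, component by component."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


# Hypothetical per-news outputs for one topic with two topic news.
doc_vectors = [[1.0, 3.0], [3.0, 5.0]]          # e.g. doc2vec vectors
topic_distributions = [[0.5, 0.5], [1.0, 0.0]]  # e.g. PLDA distributions

topic_mapping_vector = mean_vector(doc_vectors)
topic_distribution = mean_vector(topic_distributions)
```

Averaging probability distributions with equal weights yields another distribution, so the resulting topic distribution still sums to one.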

S1053, according to the topic mapping vectors, the topic distributions, the topic text semantic vectors, the topic picture semantic vectors and the topic video semantic vectors of all the topic and news information, applying the DBSCAN clustering algorithm (Density-Based Spatial Clustering of Applications with Noise, a typical density clustering algorithm) to perform density clustering analysis on the at least one piece of non-repeated topic and news information to obtain the at least one fused topic, the at least one topic heat weight value and the one-to-one correspondence between the at least one fused topic and the at least one topic heat weight value.

S1054, sorting the at least one fused topic in descending order of the corresponding topic heat weight value, and taking the top-ranked fused topics as the hot topics.
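Steps S1053 and S1054 can be sketched with a small pure-Python DBSCAN. Each point is assumed to be the concatenation of the five vectors of one piece of topic and news information, and the topic heat weight value is assumed here to be the cluster size; the source does not state how the heat weight is derived from the clustering, nor the values of `eps` and `min_pts`.

```python
import math
from collections import Counter

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


def heat_weights(labels):
    """Assumed heat weight: the number of members in each fused topic."""
    return dict(Counter(label for label in labels if label != NOISE))


def top_hot_topics(weights, k):
    """Fused-topic labels sorted by heat weight, largest first, top k."""
    return sorted(weights, key=weights.get, reverse=True)[:k]
```

With this stand-in, a fused topic is identified by its cluster label, and the one-to-one correspondence between fused topics and heat weights is the dictionary returned by `heat_weights`.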

S106, aiming at each hot topic in the at least one hot topic, a corresponding hot topic news library is constructed, wherein the hot topic news library comprises all topic news which belongs to the hot topic in the at least one non-repeated topic and news information.
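Step S106 amounts to grouping topic news by hot topic. The sketch below assumes that each piece of non-repeated topic and news information carries the label of the fused topic it was assigned to, and that the hot topics are given as a collection of such labels; all names are hypothetical.

```python
def build_libraries(bundles, labels, hot_labels):
    """Collect the topic news of every bundle that belongs to a hot topic.

    bundles: [{"topic": str, "news": [str, ...]}, ...]
    labels: fused-topic label of each bundle, in the same order.
    hot_labels: labels of the fused topics selected as hot topics.
    """
    libraries = {label: [] for label in hot_labels}
    for bundle, label in zip(bundles, labels):
        if label in libraries:
            libraries[label].extend(bundle["news"])
    return libraries
```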

Therefore, the hotspot information mining method described in detail in the foregoing steps S101 to S106 provides a hotspot information mining scheme based on whole-network news data: multi-source hot-list topics and news information can be crawled in real time from publicly available network data alone, the news information is screened and filtered by bad auditing and deep deduplication, and hot topics are finally discovered by hot spot fusion and a hot topic news library is constructed. The mining result is therefore highly precise, the amount of data relied on during mining can be greatly reduced, and the method offers high reliability, high timeliness and good robustness, so that it can well meet the needs of real application scenarios. In addition, screening the whole-network hot-list news sources and crawling them in a multi-source, real-time manner ensures the diversity and high quality of the hot spot sources; meanwhile, a dedicated crawling algorithm is customized for each source in combination with topic heat, update frequency, information dimensions and credibility, which greatly improves the timeliness, credibility and content richness of the hot spots. For the crawled topic and news information, screening out negative, sensitive and false content cleans away a large amount of dirty data and greatly improves the quality and desensitization of the hot spot data. For the filtered topic and news information, semantic information is extracted from multiple dimensions such as text, pictures and videos by means of deep learning, and duplicates are removed in both the literal and semantic dimensions, which greatly reduces the redundancy of the hot spot data. For the deduplicated topics and topic news, fusion clustering is performed by combining multi-modal information such as text, pictures and videos with a topic model, document vectors and deep semantic coding, so that topics are fused and redundancy is reduced; the heat weight of each topic can also be computed from the clustering result and used for heat ranking, which greatly improves the accuracy and application range of the topic library.

On the basis of the technical solution of the first aspect, the present embodiment further specifically provides a possible design for enriching a hot topic news library, that is, after a corresponding hot topic news library is constructed for each hot topic in the at least one hot topic, the method further includes, but is not limited to, the following steps S201 to S205.

S201, crawling at least one piece of real-time news of the whole network in real time based on a web crawler technology.

S202, performing bad auditing processing, including negative information filtering processing, sensitive information filtering processing and false information filtering processing, on the at least one piece of crawled real-time news to obtain at least one piece of compliant real-time news.

In the step S202, for the specific manner of performing the bad auditing processing on the real-time news, reference may be made to the foregoing steps S1031 to S1033, which is not described herein again.

S203, aiming at the at least one piece of compliant real-time news, performing semantic coding on each piece of real-time news on a text dimension, a picture dimension and a video dimension respectively to obtain a news text semantic vector, a news picture semantic vector and a news video semantic vector of each piece of real-time news, and then performing de-duplication processing on the at least one piece of compliant real-time news on a text similar dimension, a picture similar dimension and a video similar dimension according to the news text semantic vector, the news picture semantic vector and the news video semantic vector of all real-time news to obtain at least one piece of non-repetitive real-time news.

In step S203, the specific manner of performing deduplication processing on the real-time news may refer to the foregoing steps S1041 to S1044, which are not described herein again.

S204, for the at least one piece of non-repeated real-time news, importing the news text semantic vectors, the news picture semantic vectors and the news video semantic vectors of all the real-time news, together with the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of the at least one hot topic, into a deep learning matching detection model constructed based on a text similar dimension, a picture similar dimension, a video similar dimension and a fully connected layer, to obtain a matching detection result between each piece of real-time news and each hot topic.

S205, aiming at each hot topic in the at least one hot topic, supplementing the real-time news matched in the matching detection result to a corresponding hot topic news library.
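Steps S204 and S205 can be sketched in the same simplified way as the duplicate detection: a per-modality cosine similarity combined with fixed weights stands in for the trained deep learning matching detection model. The weights, the threshold and the field names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical weights standing in for the trained matching model.
MATCH_WEIGHTS = {"text": 0.6, "picture": 0.2, "video": 0.2}
MATCH_THRESHOLD = 0.8


def match_score(news, topic):
    """Weighted similarity between one real-time news item and one topic."""
    return sum(w * cosine(news[m], topic[m]) for m, w in MATCH_WEIGHTS.items())


def enrich_libraries(libraries, topics, news_items):
    """Append every matching real-time news title to the topic's library."""
    for news in news_items:
        for topic in topics:
            if match_score(news, topic) >= MATCH_THRESHOLD:
                libraries.setdefault(topic["name"], []).append(news["title"])
    return libraries
```

A news item may match more than one hot topic in this sketch; whether the method allows that is not stated in the source.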

Therefore, based on the first possible design described in detail in the foregoing steps S201 to S205, multi-modal matching between news and hot topics can be performed on whole-network real-time news across multiple dimensions such as text, pictures and videos, and the hot topic news library is directly enriched according to the matching result, so that the size and diversity of the news library under each hot topic can be greatly improved, and finally hot topics and hot news libraries that are highly precise, reliable, timely and up to date can be efficiently mined and constructed in real time.

As shown in fig. 2, a second aspect of this embodiment provides a virtual device for implementing the hotspot information mining method of the first aspect or the first possible design, the virtual device including a news source screening module, a multi-source real-time crawling module, a bad auditing processing module, an information duplication removal processing module, a hot topic finding module and a hot topic news library construction module, which are communicatively connected in sequence;

the information duplication removal processing module is used for respectively carrying out semantic coding on each topic and news information in a text dimension, a picture dimension and a video dimension aiming at the at least one piece of compliant topic and news information to obtain a topic text semantic vector, a topic picture semantic vector and a topic video semantic vector of each topic and news information, and then carrying out duplication removal processing on the at least one piece of compliant topic and news information in a text similar dimension, a picture similar dimension and a video similar dimension according to the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of all the topic and news information to obtain at least one piece of unrepeated topic and news information;

the hot topic finding module is used for aiming at the at least one non-repeated topic and news information, respectively obtaining topic mapping vectors and topic distribution of each topic and news information according to at least one topic news belonging to the topic, then carrying out semantic clustering fusion on the at least one non-repeated topic and news information according to the topic mapping vectors, the topic distribution, the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of all topics and news information to obtain at least one fused topic, at least one topic heat weight value and a one-to-one correspondence relationship between the at least one fused topic and the at least one topic heat weight value, and finally determining at least one hot topic from the at least one fused topic according to the magnitude degree of the at least one topic heat weight value;

The working process, working details and technical effects of the foregoing apparatus provided in the second aspect of this embodiment may refer to the hotspot information mining method described in the first aspect or the first possible design, which are not described herein again.
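The sequential connection of the six modules can be pictured as a pipeline in which each module consumes the output of the previous one. The sketch below uses trivial stand-in functions for the modules; their bodies are placeholders and do not reflect the actual processing of each module.

```python
from functools import reduce


def run_pipeline(stages, state):
    """Feed the output of each stage into the next, in order."""
    return reduce(lambda acc, stage: stage(acc), stages, state)


# Placeholder stand-ins for the six modules; each takes and returns a dict.
def screen_sources(state):
    return {**state, "sources": [s for s in state["sources"] if s["score"] >= 0.6]}


def crawl(state):
    # Stand-in for real crawling: one fixed headline per target source.
    return {**state, "items": [s["name"] + " headline" for s in state["sources"]]}


def audit(state):
    return {**state, "items": [i for i in state["items"] if "rumor" not in i]}


def deduplicate(state):
    return {**state, "items": list(dict.fromkeys(state["items"]))}


def find_hot_topics(state):
    return {**state, "hot_topics": state["items"][:1]}


def build_libraries(state):
    return {**state, "libraries": {t: [t] for t in state["hot_topics"]}}


STAGES = [screen_sources, crawl, audit, deduplicate, find_hot_topics, build_libraries]
```

Calling `run_pipeline(STAGES, {"sources": [...]})` then runs the stand-ins in the order in which the modules are connected.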

As shown in fig. 3, a third aspect of the present embodiment provides a computer device for executing the hotspot information mining method of the first aspect or the first possible design. The computer device includes a memory, a processor and a transceiver, which are communicatively connected in sequence, where the memory is used to store a computer program, the transceiver is used to transmit and receive information, and the processor is used to read the computer program and execute the hotspot information mining method of the first aspect or the first possible design. For example, the memory may include, but is not limited to, a random-access memory (RAM), a read-only memory (ROM), a flash memory, a first-in first-out (FIFO) memory and/or a first-in last-out (FILO) memory, and the like; the transceiver may be, but is not limited to, a WiFi (wireless fidelity) wireless transceiver, a Bluetooth wireless transceiver, a GPRS (General Packet Radio Service) wireless transceiver and/or a ZigBee (a low-power local area network protocol based on the IEEE 802.15.4 standard) wireless transceiver, and the like; the processor may be, but is not limited to, a microprocessor of the STM32F105 series. In addition, the computer device may also include, but is not limited to, a power module, a display screen and other necessary components.

For the working process, working details, and technical effects of the foregoing computer device provided in the third aspect of this embodiment, reference may be made to the first aspect or the hot spot information mining method described in the first possible design, which is not described herein again.

A fourth aspect of this embodiment provides a storage medium storing instructions for the hotspot information mining method of the first aspect or the first possible design, that is, instructions are stored on the storage medium, and when the instructions are run on a computer, the hotspot information mining method of the first aspect or the first possible design is executed. The storage medium refers to a carrier for storing data, and may include, but is not limited to, a floppy disk, an optical disk, a hard disk, a flash memory, a USB flash drive and/or a Memory Stick, and the like; the computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device.

For the working process, the working details, and the technical effects of the foregoing storage medium provided in the fourth aspect of this embodiment, reference may be made to the first aspect or the hot spot information mining method described in the first possible design, which is not described herein again.

A fifth aspect of the present embodiment provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the hotspot information mining method of the first aspect or the first possible design. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device.

The embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate; components displayed as units may or may not be physical units, and may be located in one place or distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above examples are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that modifications may be made to the embodiments described above, or equivalents may be substituted for some of the features described, and that such modifications or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Finally, it should be noted that the present invention is not limited to the above optional embodiments, and anyone may derive products in various other forms in light of the present invention. The above detailed description should not be construed as limiting the scope of protection of the invention, which is defined by the claims; the description may be used to interpret the claims.

Claims

1. A hotspot information mining method is characterized by comprising the following steps:

performing bad auditing processing including negative information filtering processing, sensitive information filtering processing and false information filtering processing on all the topics and news information obtained by crawling to obtain at least one piece of compliant topic and news information;

2. The method of mining hot spot information of claim 1, wherein after constructing, for each of the at least one hot spot topic, a corresponding hot spot topic news library, the method further comprises:

for the at least one piece of non-repeated real-time news, importing the news text semantic vectors, the news picture semantic vectors and the news video semantic vectors of all the real-time news, together with the topic text semantic vector, the topic picture semantic vector and the topic video semantic vector of the at least one hot topic, into a deep learning matching detection model constructed based on a text similar dimension, a picture similar dimension, a video similar dimension and a fully connected layer, to obtain a matching detection result between each piece of real-time news and each hot topic;

3. The method for mining hot spot information according to claim 1, wherein the step of screening out a plurality of target news sources from the whole network news sources corresponding to the whole network hot topics comprises the steps of:

4. The method for mining hot spot information according to claim 1, wherein crawling, in real time, at least one topic and news information corresponding to each of the target news sources based on a web crawler technology comprises:

5. The method for mining hot spot information according to claim 1, wherein performing a bad audit process including a negative information filtering process, a sensitive information filtering process, and a false information filtering process on all the crawled topic and news information to obtain at least one piece of compliant topic and news information includes:

6. The method for mining hot spot information according to claim 1, wherein, for the at least one piece of compliant topic and news information, semantic coding is performed on each topic and news information in a text dimension, a picture dimension and a video dimension respectively to obtain topic text semantic vectors, topic picture semantic vectors and topic video semantic vectors of each topic and news information, and then, according to the topic text semantic vectors, the topic picture semantic vectors and the topic video semantic vectors of all topic and news information, deduplication processing is performed on the at least one piece of compliant topic and news information in a text similar dimension, a picture similar dimension and a video similar dimension to obtain at least one piece of non-repetitive topic and news information, including:

7. The method for mining hot spot information according to claim 1, wherein for the at least one non-repeated topic and news information, a topic mapping vector and a topic distribution of each topic and news information are respectively obtained according to at least one topic news belonging to a topic, then semantic clustering fusion is performed on the at least one non-repeated topic and news information according to the topic mapping vectors, the topic distribution, the topic text semantic vectors, the topic picture semantic vectors and the topic video semantic vectors of all topic and news information to obtain at least one fused topic, at least one topic popularity weight value and a one-to-one correspondence between the at least one fused topic and the at least one topic popularity weight value, and finally according to the magnitude of the at least one topic popularity weight value, determining at least one hot topic from the at least one fused topic, including:

aiming at the at least one non-repeated topic and news information, respectively extracting semantic vectors of at least one topic news belonging to the topic in each topic and news information by using a doc2vec paragraph vector method, and then respectively carrying out average weighting processing on all extracted semantic vectors of each topic and news information to obtain a topic mapping vector;

8. A hot spot information mining device is characterized by comprising a news source screening module, a multi-source real-time crawling module, a bad auditing processing module, an information duplication removing processing module, a hot topic finding module and a hot topic news library construction module which are sequentially in communication connection;

9. A computer device, comprising a memory, a processor and a transceiver, which are connected in sequence in a communication manner, wherein the memory is used for storing a computer program, the transceiver is used for transmitting and receiving information, and the processor is used for reading the computer program and executing the hotspot information mining method according to any one of claims 1 to 7.

10. A storage medium having stored thereon instructions for performing the hotspot information mining method according to any one of claims 1 to 7 when the instructions are run on a computer.