CN111460153B - Hot topic extraction method, device, terminal equipment and storage medium - Google Patents

Hot topic extraction method, device, terminal equipment and storage medium

Info

Publication number
CN111460153B
CN111460153B (application number CN202010231954.9A)
Authority
CN
China
Prior art keywords
cluster
news
text
news text
sentence
Prior art date
Legal status
Active
Application number
CN202010231954.9A
Other languages
Chinese (zh)
Other versions
CN111460153A (en
Inventor
赵洋
包荣鑫
王宇
魏世胜
朱继刚
Current Assignee
Shenzhen Valueonline Technology Co ltd
Original Assignee
Shenzhen Valueonline Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Valueonline Technology Co ltd filed Critical Shenzhen Valueonline Technology Co ltd
Priority to CN202010231954.9A priority Critical patent/CN111460153B/en
Publication of CN111460153A publication Critical patent/CN111460153A/en
Application granted granted Critical
Publication of CN111460153B publication Critical patent/CN111460153B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/353: Information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F 18/23: Pattern recognition; clustering techniques
    • G06F 18/2433: Classification techniques; single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

This application belongs to the field of information technology and provides a hot topic extraction method, an apparatus, a terminal device and a storage medium. The method comprises the following steps: collecting a plurality of news texts; for any one of the news texts, extracting a plurality of feature words from it; generating a sentence vector for the news text from the feature words; clustering the news texts based on their respective sentence vectors to obtain a plurality of clusters; and extracting hot topics from the plurality of clusters. With this method, the accuracy and real-time performance of hot topic extraction can be improved.

Description

Hot topic extraction method, device, terminal equipment and storage medium
Technical Field
This application belongs to the field of information technology, and in particular relates to a hot topic extraction method, an apparatus, a terminal device and a storage medium.
Background
Advances in internet technology have greatly driven the development of news media and web portals. The way people acquire information has shifted from traditional channels such as television and newspapers to reading news on the network anytime and anywhere via computers and mobile phones.
Given the endless stream of news content, extracting hot topics from the news makes it possible to introduce users to the content that is currently popular or widely followed. For institutions, hot topics can help analyze public opinion and provide advice for government policy; for enterprises, hot topics can help decision makers grasp development trends and make sound decisions; for individuals, hot topics help people keep up with social events and broaden their knowledge. Therefore, how to analyze and extract real-time hot topics has important research value.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a hot topic extraction method, an apparatus, a terminal device and a storage medium, so as to solve the problems in the prior art that hot topic extraction has low accuracy and can hardly meet real-time requirements.
A first aspect of an embodiment of the present application provides a hot topic extraction method, including:
collecting a plurality of news texts;
for any one of the news texts, extracting a plurality of feature words from the news text;
generating sentence vectors corresponding to the news text according to the feature words;
clustering the news texts based on sentence vectors corresponding to the news texts respectively to obtain a plurality of clustering clusters;
hot topics are extracted from the plurality of clusters.
A second aspect of an embodiment of the present application provides a hot topic extraction apparatus, including:
the news text acquisition module is used for acquiring a plurality of news texts;
the feature word extraction module is used for extracting, for any one of the news texts, a plurality of feature words from the news text;
the sentence vector generation module is used for generating sentence vectors corresponding to the news text according to the plurality of feature words;
the news text clustering module is used for clustering the news texts based on sentence vectors corresponding to the news texts to obtain a plurality of clustering clusters;
and the hot topic extraction module is used for extracting hot topics from the plurality of cluster clusters.
A third aspect of an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the hot topic extraction method described in the first aspect when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, which when executed by a processor implements the hot topic extraction method described in the first aspect.
A fifth aspect of an embodiment of the present application provides a computer program product, which when run on a terminal device, causes the terminal device to perform the hot topic extraction method described in the first aspect above.
Compared with the prior art, the embodiment of the application has the following advantages:
according to the embodiment of the application, based on the improved SinglePass clustering algorithm, the outlier can be detected after each node addition, if the distance is too large, the outlier is removed from the current cluster, the representativeness of the clustering center and the accuracy of the clustering result are ensured, and secondly, the history hot spot recall algorithm provided by the embodiment can effectively judge the relation between the new hot spot and the history hot spot, and the hot spot and similar news of the same theme are combined, so that the accuracy of real-time pushing is ensured. Thirdly, in the embodiment, the word2vec and the TF-IDF are used for vectorizing the sentence, so that the global feature of the sentence vector can be more accurately represented, the interference of irrelevant words is eliminated, and meanwhile, the real-time increment processing is supported, so that the time requirement of practical application can be met. The hot topic extraction method provided by the embodiment of the application realizes the functions of news sentence vector representation, hot topic clustering, hot topic title screening, historical hot recall and the like, solves the problems that sentence vector representation is inaccurate and incremental clustering is not supported in the existing algorithm, does not need priori knowledge for large-scale dynamic news data, does not need obvious characteristics of news, and has good universality in the whole algorithm.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required in the embodiments or in the description of the prior art are briefly introduced below. It is evident that the drawings described below are only some embodiments of the present application, and that a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart illustrating a hot topic extraction method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of another hot topic extraction method according to an embodiment of the present application;
FIG. 3 is a flow chart of an improved SinglePass clustering algorithm in accordance with one embodiment of the present application;
FIG. 4 is a flow diagram of a historical hot topic recall algorithm in accordance with one embodiment of the present application;
FIG. 5 is a schematic diagram of a hot topic extraction apparatus in accordance with one embodiment of the application;
fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In view of the problems of existing topic extraction algorithms, the embodiments of the present application first clean and segment the news texts and then filter the segmentation results, so that only a representative part of the segmented words is used for feature extraction. A first language model (a word2vec model) is then trained on a large historical news corpus (e.g. 20GB). After the segmented words are mapped to vectors by the first language model, they are weighted by their TF-IDF values to generate the corresponding sentence vectors. The generated sentence vectors are then clustered by the improved SinglePass algorithm into a plurality of clusters. Finally, hotspots are generated according to the cluster sizes, and the center point of each cluster is selected as the final hot topic. This method accurately extracts the global features of news texts; experimental results show that hot topics extracted in this way achieve high accuracy and recall, that incremental hot topic extraction is supported, and that the requirements of online real-time hot topic extraction can be met.
The technical scheme of the application is described below through specific examples.
Referring to fig. 1, a schematic step flow diagram of a hot topic extraction method according to an embodiment of the present application may specifically include the following steps:
s101, collecting a plurality of news texts;
In the embodiment of the application, the plurality of news texts are used for clustering, and news information or news reports corresponding to hot news topics can be extracted according to the clustering result.
In particular implementations, news text may be crawled from various types of news websites, portals, etc. by web crawlers or other forms.
In general, in order to ensure timeliness of subsequent topic extraction, news may be captured within a specific period of time according to the release or online time of the news. For example, capturing news released in the past one hour or two hours.
S102, for any one of the news texts, extracting a plurality of feature words from the news text;
In the embodiment of the application, the collected news texts can be processed one by one, converting each news item into a format that can be fed into the subsequent models.
In a particular implementation, for any news text, a plurality of feature words of the text may be extracted first.
In general, the headline of a news report should be a summary of the content of the entire report; in addition, the first few paragraphs of a report often contain a brief introduction to the whole story. Therefore, feature words of a news text may be extracted mainly from the news headline and the leading paragraphs of the report.
In a specific implementation, the news headline and the entire body may be concatenated in order, i.e. in the form of "headline + body", and a plurality of feature words then extracted from the leading portion of the combined text. A feature word may be any word in that portion, or any word that remains after data cleaning deletes stop words and meaningless single characters from it; this embodiment does not limit the choice.
To keep the feature windows of all news texts comparable, the lengths of all news items should be the same, so the headline and body of each news text can be truncated into a string of the form "headline + first N characters of the body".
For example, the first 500 characters may be taken from "headline + body", and the feature words then extracted from those 500 characters. Alternatively, the first 500 characters are taken from the headline and body, stop words, numbers and meaningless single characters are deleted by data cleaning, and the remaining words are taken as the feature words.
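As an illustrative sketch of this truncation and cleaning step (the stop-word list and the whitespace tokenizer below are placeholder assumptions; an actual pipeline would use a Chinese segmenter such as jieba and a full stop-word list):

```python
# Sketch of "headline + first 500 characters of body" feature-word extraction.
# STOP_WORDS and the whitespace tokenizer are illustrative placeholders.
STOP_WORDS = {"the", "of", "and", "a", "an", "is"}

def extract_feature_words(title, body, max_chars=500):
    text = (title + " " + body)[:max_chars]   # truncate "headline + body"
    words = text.split()                      # placeholder for jieba.cut(text)
    # Drop stop words, pure numbers, and single characters.
    return [w for w in words
            if w not in STOP_WORDS and not w.isdigit() and len(w) > 1]

words = extract_feature_words("Company news focus:",
                              "solicitation for the 15 equity transfer of Gree Electric")
# "15", "the", "of" are filtered out; the content words remain.
```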
S103, generating sentence vectors corresponding to the news text according to the feature words;
since the clustering algorithm cannot calculate the input of the text, the news text needs to be vectorized before clustering.
In the embodiment of the present application, the feature words extracted in the preceding step may be combined into a sentence vector representing the news text.
Each collected news text can be processed in the above manner to obtain its corresponding sentence vector for subsequent clustering.
S104, clustering the news texts based on sentence vectors corresponding to the news texts to obtain a plurality of clustering clusters;
In the embodiment of the application, after all news texts are represented as vectors, the sentence vectors corresponding to the news texts can be used as the input data of the clustering algorithm, whose output is the plurality of clusters obtained by clustering.
In a specific implementation, the SinglePass text clustering algorithm may be used to cluster the vectorized news texts.
The SinglePass clustering algorithm is conceptually simple and runs fast. As its name suggests, the algorithm only needs to traverse all the data once; its result depends somewhat on the input order of the data, and its time complexity is O(n). During clustering, each cluster maintains a dynamically updated cluster center, which is the average of all its vectors and can serve as a global feature representing the cluster.
S105, extracting hot topics from the plurality of clusters.
In the embodiment of the application, the cluster center of each cluster obtained by clustering can be used as a global feature representing the cluster, so that news text corresponding to cluster center vectors of a plurality of clusters can be used as a final topic generation result.
In a specific implementation, the title of the news text corresponding to the cluster center vector can be directly used as a final hot topic; the distance between other vectors in the cluster and the cluster center vector can be calculated respectively based on the cluster center vector, and the title of the news text corresponding to the vector with the smallest distance is selected as the final hot topic; the title of the news text corresponding to the cluster center vector and the title of the news text corresponding to the vector with the smallest distance may be subjected to certain combination processing, and the content obtained after the combination processing may be used as a final hot topic, which is not limited in this embodiment.
In the embodiment of the application, for a plurality of collected news texts, a plurality of feature words of the news texts can be extracted for any news text, sentence vectors corresponding to the news texts are generated according to the plurality of feature words, and the plurality of news texts are clustered based on the sentence vectors corresponding to the plurality of news texts, so that a plurality of cluster clusters can be obtained, hot topics can be conveniently extracted from the plurality of cluster clusters, and the accuracy and instantaneity of hot topic extraction can be improved.
Referring to fig. 2, a schematic step flow diagram of another hot topic extraction method according to an embodiment of the present application may specifically include the following steps:
s201, collecting a plurality of news texts, and extracting a plurality of characteristic words of the news texts aiming at any news text;
in the embodiment of the application, the plurality of news texts can be used for clustering, and news information or news reports of corresponding news hot topics can be extracted according to the clustering result. The news text may be crawled from various news websites, portals, etc. by web crawlers or other forms.
In a specific implementation, the collected news texts may be processed one by one. For example, any news text may first be segmented into words; the jieba segmentation tool may be used, and the segmentation result saved in a list.
The segmentation result may contain interference information that is not conducive to representing the global features of a sentence. Therefore, non-target words such as stop words, pure numbers and single characters can be deleted from the segmentation result, because the word vectors of such tokens are not representative and would greatly affect the precision of the generated sentence vectors.
For example, for a news text "Company news focus on the 16th: solicitation for the 15% equity transfer of Gree Electric", the final word segmentation and processing yields "company/news/focus/Gree/Electric/equity/transfer".
For the target text obtained after word segmentation and deletion of non-target words, a number of words at a preset position of the target text can be extracted as feature words. The preset position may be the front of the target text, for example the first 100 words.
In the embodiment of the application, the "headline + first 500 characters of the body" can be intercepted first, the 500 characters then segmented, and the non-target words that might affect subsequent processing deleted, yielding the feature words. Alternatively, the whole headline and body are segmented first, the non-target words deleted, and a certain number of leading words, for example 100, extracted from the remaining words as feature words.
The plurality of feature words may be represented in the form of one sentence X, for example X = [x_0, x_1, ..., x_n].
S202, mapping each feature word into a dense vector of preset dimension according to a preset first language model, wherein the first language model is obtained by training on sample news texts with a skip-gram model;
In the embodiment of the application, the first language model may be trained based on the word2vec model. word2vec is a deep learning algorithm proposed by Mikolov in 2013. It builds on the language-model assumption that the meaning of a word can be inferred from its context, and turns words into dense vector representations learned from a corpus; it includes two word vectorization modes, the continuous bag-of-words model (CBOW) and Skip-Gram.
In a specific implementation, a certain amount of whole-web historical news, for example 20GB, may be used as the sample news texts and trained with the word2vec model. The training parameters may be chosen as follows: word vector dimension 100, window size 10, minimum word occurrence count 8, Skip-Gram as the model, 20 training epochs, and default values for the remaining parameters.
Finally, a first language model of 2.1GB is trained, and each word appearing more than 8 times in the corpus can be expressed as a 100-dimensional dense vector W(x) = [w_0, w_1, ..., w_99].
Thus, for each feature word, the trained first language model may be employed to map the feature word into a 100-dimensional dense vector.
S203, determining a weight value of each feature word according to a preset second language model, wherein the second language model is obtained by counting the inverse document frequency of each word in the sample news text;
in an embodiment of the present application, the second language model may refer to a Term Frequency-inverse document Frequency index (TF-IDF) model. TF refers to word frequency and IDF refers to inverse document frequency index, which in combination are commonly used to evaluate the importance of a term to the whole document. If a word appears more frequently in a certain text and rarely in other documents, the word has better distinguishing ability and is suitable for representing the global features of the text.
According to the embodiment of the application, the inverse document frequency of each word can be counted by combining the sample news text and the word segmentation result thereof to form a dictionary, and then the IDF is modeled and stored. The sample news text may be news text in a historical corpus.
In the embodiment of the present application, the calculation formula of the IDF may be expressed as:

IDF(x) = log(N / N(x))

where N represents the total number of documents and N(x) represents the number of documents containing the word x. The stored dictionary measures the importance of each term within an article and is used for the subsequent weighted generation of sentence vectors.
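A minimal sketch of building such an IDF dictionary from a segmented corpus (the counting and the log formula follow the definition above; the toy corpus is illustrative):

```python
import math

def build_idf(documents):
    """documents: list of token lists. IDF(x) = log(N / N(x)), where N is
    the total number of documents and N(x) the number containing word x."""
    total = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):                 # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(total / df) for w, df in doc_freq.items()}

docs = [["company", "news", "equity"],
        ["company", "news", "merger"],
        ["company", "equity", "transfer"]]
idf = build_idf(docs)
# "company" occurs in all 3 documents, so idf["company"] = log(3/3) = 0;
# "merger" occurs in 1 document, so idf["merger"] = log(3).
```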
S204, generating sentence vectors corresponding to the news texts according to the dense vectors of each feature word and the weight values;
since the clustering algorithm cannot calculate the input of the text, the news text needs to be vectorized before clustering.
In the embodiment of the application, the IDF value can be used to weight the word vectors: the more important a word, i.e. the higher its IDF value, the larger its weight. Using the first language model (the word2vec model) and the second language model (the TF-IDF model) obtained by the previous training, each news text can be represented as a 100-dimensional dense vector S(X) = [s_0, s_1, ..., s_99]. For each dimension of the sentence vector S(X), the value equals the sum, over all feature words, of that dimension of the word vector multiplied by the word's IDF value, averaged over the number of words contained in the sentence.
Therefore, in a specific implementation, for any feature word, the product of its dense vector and its weight value, namely its IDF value, is computed; these products are summed and divided by the number of feature words, giving the sentence vector of the current news text.
The above calculation process can be expressed as the following formula:

S(X) = (1/n) * Σ IDF(x_i) · W(x_i), for i = 1, ..., n

where n is the number of feature words.
It should be noted that rare words that exist in neither the first language model nor the second language model may simply be skipped without weighting, so that every news text can be represented as a 100-dimensional dense sentence vector serving as input data for the clustering algorithm.
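The weighted averaging just described can be sketched as follows (pure Python for clarity; the 2-dimensional toy vectors stand in for the 100-dimensional word2vec vectors, and out-of-vocabulary words are skipped as in the text):

```python
def sentence_vector(words, word_vecs, idf, dim):
    """S(X) = (1/n) * sum_i IDF(x_i) * W(x_i), skipping words absent
    from either the word-vector model or the IDF dictionary."""
    total = [0.0] * dim
    n = 0
    for w in words:
        if w in word_vecs and w in idf:       # skip rare / out-of-vocabulary words
            n += 1
            for i in range(dim):
                total[i] += word_vecs[w][i] * idf[w]
    return [t / n for t in total] if n else total

word_vecs = {"equity": [1.0, 0.0], "transfer": [0.0, 2.0]}
idf = {"equity": 1.0, "transfer": 0.5}
vec = sentence_vector(["equity", "transfer", "unseen"], word_vecs, idf, dim=2)
# "equity" contributes [1.0, 0.0], "transfer" contributes [0.0, 1.0];
# averaged over n = 2 words: [0.5, 0.5].
```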
S205, clustering the news texts based on sentence vectors corresponding to the news texts to obtain a plurality of clustering clusters;
In the embodiment of the application, clustering can be performed based on the improved SinglePass algorithm. The SinglePass algorithm produces clusters, each with a dynamically updated cluster center that is the average of all its vectors; the cluster center can serve as a global feature representing the cluster. The cluster to which a node belongs is determined by calculating distances, and in this embodiment the Euclidean distance is used as the measure of similarity between nodes.
In general, interference from outliers prevents some clustering algorithms from guaranteeing accuracy. To eliminate this interference, the present embodiment performs outlier detection after each new node is inserted, on top of SinglePass clustering, reducing the influence of outliers on the final clustering result.
Fig. 3 is a flowchart of the improved SinglePass clustering algorithm provided by an embodiment of the present application; according to the flow shown in fig. 3, the process of clustering the vectorized news texts in this embodiment may include the following steps:
algorithm input: clustering threshold values and text feature vectors;
step 1: adding the feature vector of the first text into a first cluster, and setting the feature vector as a cluster center;
step 2: traversing all text feature vectors;
step 3: traversing all cluster centers;
step 4: calculating Euclidean distance between the text feature vector and the center of the cluster;
step 5: recording a cluster with the smallest distance from the current text, and recording the value of the distance;
step 6: if the distance is smaller than the clustering threshold value, adding the text feature vector into the cluster with the smallest distance, updating the center of the cluster, and executing the step 7;
step 7: traversing the current cluster, if the distance between the vector and the center is greater than a threshold value, judging that the vector is an outlier, removing the vector from the current cluster, and executing the step 4 on the vector;
step 8: if the distance is greater than the clustering threshold, creating a cluster, inserting the vector into the cluster, and updating the cluster center;
algorithm output: a plurality of clusters, all vectors of each cluster, a center vector of each cluster.
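The steps above can be sketched as a small self-contained routine (Euclidean distance and centroid updates as described; re-queueing evicted outliers for reassignment is one plausible reading of step 7 and is an assumption of this sketch):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def center(members):
    n = len(members)
    return [sum(col) / n for col in zip(*members)]

def single_pass(vectors, threshold):
    clusters = []          # each cluster: {"center": [...], "members": [...]}
    queue = list(vectors)
    while queue:
        v = queue.pop(0)
        # Steps 3-5: find the nearest existing cluster center.
        best = min(clusters, key=lambda c: dist(v, c["center"]), default=None)
        if best is not None and dist(v, best["center"]) < threshold:
            best["members"].append(v)                      # step 6
            best["center"] = center(best["members"])
            # Step 7: evict members now too far from the updated center
            # and re-queue them for reassignment.
            far = [m for m in best["members"] if dist(m, best["center"]) > threshold]
            if far and len(far) < len(best["members"]):
                best["members"] = [m for m in best["members"] if m not in far]
                best["center"] = center(best["members"])
                queue.extend(far)
        else:
            clusters.append({"center": list(v), "members": [v]})  # step 8
    return clusters

clusters = single_pass([[0.0], [0.2], [5.0], [5.2]], threshold=1.0)
# Two well-separated groups yield two clusters.
```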
According to the improved SinglePass clustering algorithm, during clustering, any sentence vector is first taken as the first cluster and set as its cluster center. The Euclidean distances between the other sentence vectors and the cluster centers are then calculated in turn: if the distance is smaller than the clustering threshold, the sentence vector is added to that cluster and the cluster center is updated; if the distance is greater than the clustering threshold, a new cluster is created and the vector added to it. By cyclically calculating the Euclidean distance between each sentence vector and each cluster center obtained in this way, every sentence vector is assigned to a cluster, completing the clustering of all news texts.
In the embodiment of the application, for the newly acquired news text, the cluster to which the newly added news text belongs can be determined according to the clustering mode.
In a specific implementation, when a new news text is collected, the distances between its sentence vector and the cluster center vectors of the existing clusters can be calculated one by one. When the distance between the sentence vector of the new news text and the cluster center vector of a target cluster is smaller than a preset threshold, the new news text is added to that target cluster and the distance calculation against the other cluster centers is stopped; the target cluster may be any one of the plurality of clusters.
If the distances between the sentence vector of the newly added news text and the cluster center vectors of all the clusters are larger than the preset threshold, the newly added news text can be considered not to belong to any existing cluster; in this case a new cluster is created and the newly added news text inserted into it.
S206, determining a cluster center vector of any cluster; respectively calculating the distance between each sentence vector in the cluster and the cluster center vector;
after the news text is clustered, corresponding hot topics can be generated according to each cluster obtained by clustering.
S207, extracting the target news text corresponding to the sentence vector with the minimum distance to the cluster center vector;
in a specific implementation, for any cluster, the cluster center vector of the cluster is found first, the Euclidean distance between each sentence vector in the cluster and the center vector is then calculated, and the target news text corresponding to the vector with the smallest distance is selected; this target news text serves as the reference text for subsequently generating the hot topic.
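Steps S206 and S207 can be sketched as follows, assuming each cluster keeps its member sentence vectors alongside the corresponding news texts (the names `member_vectors` and `news_texts` are illustrative):

```python
import numpy as np

def pick_reference_text(member_vectors, news_texts):
    """Return the news text whose sentence vector lies closest to the
    cluster center, taken here as the mean of all member vectors."""
    center = np.mean(member_vectors, axis=0)
    distances = [np.linalg.norm(v - center) for v in member_vectors]
    best = int(np.argmin(distances))  # index of the minimum-distance vector
    return news_texts[best]
```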
S208, determining hot topics according to the news headlines of the target news text.
In the embodiment of the application, for the identified target news text, the news headline of the target news text can be directly used as the finally determined hot topic.
On the other hand, after the hot topic is generated, the news headlines corresponding to the remaining vectors in the cluster can be collected to form a similar-news list.

It should be noted that, because news articles are frequently reprinted, the generated similar-news list may contain multiple news items with the same title. For news items sharing a title, only one of them is kept.
In the embodiment of the application, each of the generated clusters can have a corresponding time attribute, which indicates the specific time period in which the news that produced the cluster was collected.

That is, hot topics may be generated over a sliding time window; for example, with one hour selected as the window, hot topic extraction is performed each time on roughly the most recent hour of news.

Since topics in different time windows may be duplicated or similar, historical hotspots need to be recalled, so that new news texts can be classified into hot topics that were extracted earlier.
In the embodiment of the application, the historical clusters to be processed can be determined according to the time attribute; then, for any cluster, the similarity between that cluster and each historical cluster is calculated as the distance between their center vectors, and if the distance is smaller than the similarity threshold, the cluster can be merged with the corresponding historical cluster.
Fig. 4 is a flowchart of the historical hot topic recall algorithm according to an embodiment of the present application; recalling historical hot topics according to the flowchart shown in fig. 4 may include the following steps:
algorithm input: historical hotspot cluster center vectors, similar news of the new hotspots, similarity threshold;

step 1: traverse all new hotspot cluster center vectors;

step 2: traverse all historical hotspot cluster center vectors;

step 3: calculate the Euclidean distance between each pair of center vectors, recording and sorting the distances;

step 4: select the historical hotspot cluster whose center is closest to the new hotspot center vector;

step 5: if that distance is smaller than the similarity threshold, recall the new hotspot into the historical hotspot and merge the similar news;

step 6: if the distance is greater than the similarity threshold, the recall fails and a new hotspot is generated;

algorithm output: historical hotspot list, new hotspot list.
According to the above algorithm, for each new hot topic generated in a time window, the distance between its cluster and each historical cluster (that is, the distance between the two clusters' center vectors) is calculated; if this distance is smaller than the preset similarity threshold, the new hot topic can be merged with the hot topic of the similar historical cluster, ensuring the accuracy of hot topic pushing.
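As a sketch of the recall flow in fig. 4, using hypothetical dict-based clusters with a `center` vector and a `news` list (a minimal illustration, not the patented implementation):

```python
import numpy as np

def recall_history(new_clusters, history_clusters, sim_threshold):
    """Merge each new hotspot cluster into its nearest historical cluster
    when the center-to-center Euclidean distance is below the threshold;
    otherwise keep it as a genuinely new hotspot."""
    new_hotspots = []
    for new in new_clusters:
        # Distance from this new hotspot's center to every historical center
        dists = [np.linalg.norm(new["center"] - h["center"])
                 for h in history_clusters]
        if dists and min(dists) < sim_threshold:
            nearest = history_clusters[int(np.argmin(dists))]
            nearest["news"].extend(new["news"])  # merge the similar news
        else:
            new_hotspots.append(new)  # recall failed: a brand-new hotspot
    return history_clusters, new_hotspots
```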
According to the embodiment of the application, based on the improved SinglePass clustering algorithm, outlier detection can be performed after each node is added: if a node's distance from the updated cluster center is too large, the node is removed from the current cluster, which keeps the cluster center representative and the clustering result accurate. Secondly, the historical hotspot recall algorithm provided by this embodiment can effectively judge the relation between new hotspots and historical hotspots and merge hotspots and similar news on the same theme, ensuring the accuracy of real-time pushing. Thirdly, in this embodiment, word2vec and TF-IDF are combined to vectorize sentences, which represents the global features of a sentence vector more accurately and eliminates the interference of irrelevant words, while also supporting real-time incremental processing, so that the timeliness requirements of practical applications can be met. The hot topic extraction method provided by the embodiment of the application implements news sentence vector representation, hot topic clustering, hot topic title screening, historical hotspot recall, and related functions; it solves the problems that existing algorithms represent sentence vectors inaccurately and do not support incremental clustering, requires neither prior knowledge of large-scale dynamic news data nor obvious news features, and the overall algorithm has good universality.
It should be noted that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not limit the implementation of the embodiments of the present application in any way.
Referring to fig. 5, a schematic diagram of a hot topic extraction apparatus according to an embodiment of the present application may specifically include the following modules:
a news text collection module 501, configured to collect a plurality of news texts;
a feature word extracting module 502, configured to extract, for any news text, a plurality of feature words of the news text;
a sentence vector generating module 503, configured to generate sentence vectors corresponding to the news text according to the plurality of feature words;
a news text clustering module 504, configured to cluster the plurality of news texts based on sentence vectors corresponding to the plurality of news texts, to obtain a plurality of clusters;
a hot topic extraction module 505, configured to extract hot topics from the plurality of clusters.
In the embodiment of the present application, the feature word extracting module 502 may specifically include the following sub-modules:
the target text acquisition sub-module is used for, for any news text, performing word segmentation on the news text and deleting non-target words after segmentation to obtain a target text, wherein the non-target words comprise at least one of stop words, numbers or single words;
and the feature word extraction sub-module is used for extracting a plurality of feature words in the preset text position of the target text.
In the embodiment of the present application, the sentence vector generating module 503 may specifically include the following sub-modules:
the dense vector mapping sub-module is used for mapping each feature word into dense vectors with preset dimensions according to a preset first language model, wherein the first language model is obtained by training a sample news text by adopting a preset word jump model;
the weight value determining submodule is used for determining the weight value of each characteristic word according to a preset second language model, wherein the second language model is obtained by counting the inverse document frequency of each word in the sample news text;
and the sentence vector generation sub-module is used for generating sentence vectors corresponding to the news text according to the dense vector of each feature word and the weight value.
In the embodiment of the present application, the sentence vector generating submodule may specifically include the following units:
the product calculation unit is used for calculating products of the values of the dense vectors corresponding to the feature words and the weight values of the feature words for any feature word respectively;
and the sentence vector generating unit is used for calculating the ratio between the product and the number of all the feature words, taking the ratio as a vector value of the dimension corresponding to the feature word in the sentence vector, and obtaining the sentence vector corresponding to the news text.
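Taken together, the product and ratio units above compute a TF-IDF-weighted average of the word vectors. A minimal sketch, assuming a word2vec-style embedding table `embeddings` and an inverse-document-frequency table `idf_weights` (both hypothetical names, standing in for the first and second language models):

```python
import numpy as np

def sentence_vector(feature_words, embeddings, idf_weights):
    """Compute a sentence vector as the IDF-weighted average of the
    dense word vectors of the feature words."""
    n = len(feature_words)
    # Product of each word's dense vector and its weight value
    vecs = [embeddings[w] * idf_weights[w] for w in feature_words]
    # Ratio of the summed products to the number of feature words
    return np.sum(vecs, axis=0) / n
```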
In the embodiment of the present application, the hot topic extraction module 505 may specifically include the following sub-modules:
a cluster center vector determination submodule, configured to determine, for any cluster, a cluster center vector of the cluster;
the distance calculation sub-module is used for calculating the distance between each sentence vector in the cluster and the cluster center vector respectively;
the target news text extraction sub-module is used for extracting the target news text corresponding to the sentence vector with the smallest distance to the cluster center vector;
and the hot topic determination submodule is used for determining hot topics according to the news headlines of the target news text.
In an embodiment of the present application, the apparatus may further include the following modules:
the new news text distance calculation module is used for calculating the distance between sentence vectors corresponding to the new news text and cluster center vectors of the plurality of clusters one by one when the new news text is acquired;
the new news text classifying module is used for adding the new news text into the target cluster when the distance between the sentence vector corresponding to the new news text and the cluster center vector of the target cluster is smaller than a preset threshold value, and stopping calculating the distance between the sentence vector corresponding to the new news text and the cluster center vector of other clusters; if the distance between the sentence vector corresponding to the new news text and the cluster center vector of the clusters is larger than the preset threshold, a cluster is newly built, the new news text is inserted into the newly built cluster, and the target cluster is any one of the clusters.
In the embodiment of the present application, the plurality of clusters respectively have corresponding time attributes, and the apparatus may further include the following modules:
the historical cluster determining module is used for determining a historical cluster to be processed according to the time attribute;
the similarity calculation module is used for calculating the similarity between any cluster and the historical cluster according to any cluster;
and the cluster merging module is used for merging the cluster smaller than the similarity threshold with the history cluster if the similarity is smaller than the similarity threshold.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference should be made to the description of the method embodiments.
Referring to fig. 6, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in fig. 6, the terminal device 600 of the present embodiment includes: a processor 610, a memory 620, and a computer program 621 stored in the memory 620 and executable on the processor 610. The processor 610, when executing the computer program 621, implements the steps in the embodiments of the hot topic extraction method described above, such as steps S101 to S105 shown in fig. 1. Alternatively, the processor 610, when executing the computer program 621, performs the functions of the modules/units of the apparatus embodiments described above, such as the functions of the modules 501 to 505 shown in fig. 5.
Illustratively, the computer program 621 may be partitioned into one or more modules/units that are stored in the memory 620 and executed by the processor 610 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which may be used to describe the execution of the computer program 621 in the terminal device 600. For example, the computer program 621 may be divided into a news text collection module, a feature word extraction module, a sentence vector generation module, a news text clustering module, and a hot topic extraction module, each of which specifically functions as follows:
the news text acquisition module is used for acquiring a plurality of news texts;
the feature word extraction module is used for extracting a plurality of feature words of the news text aiming at any news text;
the sentence vector generation module is used for generating sentence vectors corresponding to the news text according to the plurality of feature words;
the news text clustering module is used for clustering the news texts based on sentence vectors corresponding to the news texts to obtain a plurality of clustering clusters;
and the hot topic extraction module is used for extracting hot topics from the plurality of cluster clusters.
The terminal device 600 may be a computing device such as a desktop computer, a notebook computer, a palm computer, or a cloud server. The terminal device 600 may include, but is not limited to, a processor 610 and a memory 620. It will be appreciated by those skilled in the art that fig. 6 is merely an example of the terminal device 600 and is not meant to be limiting: the terminal device 600 may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input and output devices, network access devices, buses, etc.
The processor 610 may be a central processing unit (Central Processing Unit, CPU), but may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 620 may be an internal storage unit of the terminal device 600, for example, a hard disk or a memory of the terminal device 600. The memory 620 may also be an external storage device of the terminal device 600, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 600. Further, the memory 620 may also include both an internal storage unit and an external storage device of the terminal device 600. The memory 620 is used to store the computer program 621 and other programs and data required by the terminal device 600. The memory 620 may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A hot topic extraction method, characterized by comprising the following steps:
collecting a plurality of news texts;
extracting a plurality of feature words of a news text aiming at any news text;
generating sentence vectors corresponding to the news text according to the feature words;
clustering the news texts based on sentence vectors corresponding to the news texts respectively to obtain a plurality of clustering clusters;
extracting hot topics from the plurality of clusters;
the clustering the plurality of news texts based on the sentence vectors corresponding to the plurality of news texts respectively to obtain a plurality of clusters includes:
determining sentence vectors corresponding to any news text as a clustering center of a first cluster;
determining the distance between other sentence vectors and the clustering center of the first cluster, wherein the other sentence vectors are sentence vectors except the clustering center;
traversing the other sentence vectors, adding the sentence vectors into clusters corresponding to the cluster centers when the distance between any other sentence vector and the cluster center is smaller than a cluster threshold value, and updating the cluster centers of the clusters corresponding to the cluster centers;
and after the cluster center of any cluster is updated, determining an outlier in the cluster, and removing the outlier from the cluster corresponding to the cluster center, wherein the outlier is a sentence vector with a distance from the updated cluster center being greater than the cluster threshold value.
2. The method of claim 1, wherein extracting a plurality of feature words of the news text for any news text comprises:
aiming at any news text, word segmentation is carried out on the news text, non-target words after word segmentation are deleted, and target text is obtained, wherein the non-target words comprise at least one of stop words, numbers or single words;
and extracting a plurality of feature words in the preset text position of the target text.
3. The method of claim 1 or 2, wherein generating sentence vectors corresponding to the news text from the plurality of feature words comprises:
mapping each feature word into a dense vector with a preset dimension according to a preset first language model, wherein the first language model is obtained by training a sample news text by adopting a preset skip word model;
determining a weight value of each feature word according to a preset second language model, wherein the second language model is obtained by counting the inverse document frequency of each word in the sample news text;
and generating sentence vectors corresponding to the news text according to the dense vector of each feature word and the weight value.
4. The method of claim 3, wherein generating sentence vectors corresponding to the news text based on the weight value and the dense vector of each feature word comprises:
for any feature word, calculating the product of the value of the dense vector corresponding to the feature word and the weight value of the feature word;
and calculating the ratio between the product and the number of all the feature words, and taking the ratio as a vector value of the dimension corresponding to the feature word in the sentence vector to obtain the sentence vector corresponding to the news text.
5. The method of claim 1 or 2 or 4, wherein the extracting hot topics from the plurality of clusters comprises:
determining a cluster center vector of any cluster;
respectively calculating the distance between each sentence vector in the cluster and the cluster center vector;
extracting the target news text corresponding to the sentence vector with the minimum distance to the cluster center vector;
and determining hot topics according to the news headlines of the target news text.
6. The method as recited in claim 5, further comprising:
when a new news text is acquired, calculating the distance between sentence vectors corresponding to the new news text and cluster center vectors of the plurality of clusters one by one;
when the distance between the sentence vector corresponding to the newly added news text and the cluster center vector of the target cluster is smaller than a preset threshold value, adding the newly added news text into the target cluster, and stopping calculating the distance between the sentence vector corresponding to the newly added news text and the cluster center vector of other clusters, wherein the target cluster is any one of the plurality of clusters;
and if the distances between the sentence vectors corresponding to the new news text and the cluster center vectors of the clusters are larger than the preset threshold value, creating a cluster, and inserting the new news text into the created cluster.
7. The method of claim 1 or 2 or 4 or 6, wherein the plurality of clusters each have a corresponding temporal attribute, the method further comprising:
determining a history cluster to be processed according to the time attribute;
for any cluster, calculating the similarity between the cluster and the historical cluster respectively;
and if the similarity is smaller than the similarity threshold, merging the cluster clusters smaller than the similarity threshold with the historical cluster clusters.
8. A hot topic extraction apparatus, comprising:
the news text acquisition module is used for acquiring a plurality of news texts;
the feature word extraction module is used for extracting a plurality of feature words of the news text aiming at any news text;
the sentence vector generation module is used for generating sentence vectors corresponding to the news text according to the plurality of feature words;
the news text clustering module is used for clustering the news texts based on sentence vectors corresponding to the news texts to obtain a plurality of clustering clusters;
a hot topic extraction module, configured to extract hot topics from the plurality of clusters;
the news text clustering module is specifically used for:
determining sentence vectors corresponding to any news text as a clustering center of a first cluster;
determining the distance between other sentence vectors and the clustering center of the first cluster, wherein the other sentence vectors are sentence vectors except the clustering center;
traversing the other sentence vectors, adding the sentence vectors into clusters corresponding to the cluster centers when the distance between any other sentence vector and the cluster center is smaller than a cluster threshold value, and updating the cluster centers of the clusters corresponding to the cluster centers;
and after the cluster center of any cluster is updated, determining an outlier in the cluster, and removing the outlier from the cluster corresponding to the cluster center, wherein the outlier is a sentence vector with a distance from the updated cluster center being greater than the cluster threshold value.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the hot topic extraction method as claimed in any one of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the hot topic extraction method as claimed in any one of claims 1 to 7.
CN202010231954.9A 2020-03-27 2020-03-27 Hot topic extraction method, device, terminal equipment and storage medium Active CN111460153B (en)

Publications (2)

Publication Number Publication Date
CN111460153A CN111460153A (en) 2020-07-28
CN111460153B true CN111460153B (en) 2023-09-22



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN109710728A (en) * 2018-11-26 2019-05-03 西南电子技术研究所(中国电子科技集团公司第十研究所) News topic automatic discovering method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8078450B2 (en) * 2006-10-10 2011-12-13 Abbyy Software Ltd. Method and system for analyzing various languages and constructing language-independent semantic structures




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant