CN107688596B - Burst topic detection method and burst topic detection equipment - Google Patents

Info

Publication number
CN107688596B
Authority
CN
China
Prior art keywords
word
topic
topic data
word segmentation
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710433359.1A
Other languages
Chinese (zh)
Other versions
CN107688596A (en)
Inventor
王健宗
黄章成
吴天博
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201710433359.1A priority Critical patent/CN107688596B/en
Priority to PCT/CN2018/074870 priority patent/WO2018223718A1/en
Publication of CN107688596A publication Critical patent/CN107688596A/en
Application granted granted Critical
Publication of CN107688596B publication Critical patent/CN107688596B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23: Updating
    • G06F16/2358: Change logging, detection, and notification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/258: Heading extraction; Automatic titling; Numbering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a burst topic detection method and device, applicable to the technical field of the Internet. The method comprises the following steps: continuously acquiring topic data from an information sharing platform; each time a piece of topic data is acquired, matching the topic data against the words in a preset lexicon to output a plurality of word segmentation results; outputting the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data; updating the summary information associated with the topic data according to the keywords; and displaying the keywords and the summary information so that a user can learn of the burst topics at the current moment. Because the keywords corresponding to the topic data are determined and the summary information is updated based on those keywords, a user can quickly learn of the burst topics on the information sharing platform from the output keywords and summary information.

Description

Burst topic detection method and burst topic detection equipment
Technical Field
The invention belongs to the technical field of internet, and particularly relates to a burst topic detection method and a burst topic detection device.
Background
On information sharing platforms such as microblogs, Twitter and forums, the openness of the platform lets users share and forward all kinds of information anytime and anywhere. If a large number of users share or forward the same information within a short time, the topic corresponding to that information becomes a burst topic with high popularity. If such a burst topic relates to a specific enterprise, it may have a huge public opinion impact on that enterprise. If the enterprise cannot find and track burst topic events related to itself in time, it may miss the best opportunity to eliminate the negative public opinion impact, and its soft power is reduced as a result.
However, in the prior art it is difficult to quickly learn of the burst topics on an information sharing platform through technical means, and it is also difficult to determine whether each burst topic is related to the enterprise itself.
Disclosure of Invention
In view of this, embodiments of the present invention provide a burst topic detection method and a burst topic detection device, so as to solve the problems in the prior art that it is difficult to quickly learn of the burst topics on an information sharing platform through technical means and to determine whether each burst topic is related to the enterprise itself.
A first aspect of an embodiment of the present invention provides a burst topic detection method, including:
continuously acquiring topic data from the information sharing platform;
each time a piece of topic data is acquired, matching the topic data against each word in a preset lexicon so as to output a plurality of word segmentation results;
outputting the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating the summary information associated with the topic data according to the keywords;
and displaying the keywords and the summary information so that a user can learn of the burst topics at the current moment.
A second aspect of the embodiments of the present invention provides a burst topic detection device, which includes a memory, a processor, and a burst topic detection program that is stored in the memory and executable on the processor; when the processor executes the burst topic detection program, the following steps are implemented:
continuously acquiring topic data from the information sharing platform;
each time a piece of topic data is acquired, matching the topic data against each word in a preset lexicon so as to output a plurality of word segmentation results;
outputting the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating the summary information associated with the topic data according to the keywords;
and displaying the keywords and the summary information so that a user can learn of the burst topics at the current moment.
A third aspect of embodiments of the present invention provides a computer-readable storage medium storing a burst topic detection program which, when executed by at least one processor, implements the steps of:
continuously acquiring topic data from the information sharing platform;
each time a piece of topic data is acquired, matching the topic data against each word in a preset lexicon so as to output a plurality of word segmentation results;
outputting the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating the summary information associated with the topic data according to the keywords;
and displaying the keywords and the summary information so that a user can learn of the burst topics at the current moment.
In the embodiment of the invention, each time topic data is acquired from the information sharing platform, the keywords corresponding to the topic data are determined and the summary information is updated in real time based on those keywords. A user can therefore learn at once, from the output keywords and summary information, roughly what the burst topic on the information sharing platform is, and can quickly determine from the summary information whether the burst topic is related to the enterprise itself. Burst topic events related to the enterprise can thus be found, tracked and handled effectively, which improves the soft power of the enterprise.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of an implementation of the burst topic detection method provided in an embodiment of the present invention;
Fig. 2 is a flowchart of a specific implementation of step S103 of the burst topic detection method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a specific implementation of step S104 of the burst topic detection method according to an embodiment of the present invention;
Fig. 4 is a flowchart of a specific implementation of step S303 of the burst topic detection method according to an embodiment of the present invention;
Fig. 5 is a flowchart of a specific implementation of step S305 of the burst topic detection method according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the burst topic detection device provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of the burst topic detection equipment provided by an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Fig. 1 shows an implementation flow of the burst topic detection method provided by the embodiment of the present invention, where the method flow includes steps S101 to S105. The specific realization principle of each step is as follows:
S101: Continuously acquire topic data from the information sharing platform.
In the embodiment of the invention, the information sharing platform includes, but is not limited to, microblogs, Twitter, Facebook, large BBS forums and the like. Each piece of topic data is a piece of text information that is published by a user and can be displayed on the information sharing platform, and it can be associated with one or more burst events. The text information includes, but is not limited to, original posts, reposted posts, and the user comments on original or reposted posts on the information sharing platform.
Topic data on the information sharing platform can be acquired in either of two ways. In the first way, an application program created in advance to interact with the platform's Application Programming Interface (API) uses a pre-acquired account key to call the API provided by the information sharing platform, and thereby obtains the topic data returned by the platform. In the second way, a crawler program continuously crawls the topic data on the information sharing platform.
Because the topic data on the information sharing platform is continuously updated and keeps growing, the embodiment of the invention acquires topic data in real time, that is, continuously. This ensures that the system always holds the latest topic data, so that burst topic detection can be performed accurately and promptly.
S102: Each time a piece of topic data is acquired, match the topic data against each word in a preset lexicon so as to output a plurality of word segmentation results.
When a new piece of topic data is received, the system performs word matching on it. Specifically, starting from the first character of the topic data, the system determines whether the topic data contains a word in the preset lexicon. When a run of consecutive characters in the topic data forms a word identical to a word in the preset lexicon, those consecutive characters are determined to be a segmented word, and the word matching process is executed again starting from the first character after that segmented word. Once all segmented words in the topic data have been determined, one word matching pass is complete, and the pass outputs one word segmentation result containing a plurality of segmented words. Each segmented word contains at least two characters.
In fact, a character in the topic data may form a segmented word with one or more adjacent characters on its left, or with one or more adjacent characters on its right. Therefore, under different word segmentation rules, the same topic data can yield different word segmentation results. In the embodiment of the invention, for each piece of topic data, one word segmentation result is output for each pre-stored word segmentation rule. The matching degrees of different word segmentation results may differ. The matching degree represents the degree to which a user can grasp the actual semantics of the topic data from the segmented words in the word segmentation result.
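As a minimal sketch of how two segmentation rules can give different results for the same text, the block below implements forward and backward maximum matching against a lexicon. These two rules, the function names and the `max_len` parameter are illustrative assumptions; the source does not say which segmentation rules are pre-stored.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, taking the longest lexicon word (>= 2 chars) at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in lexicon:
                words.append(text[i:i + size])
                i += size
                break
        else:
            i += 1  # no lexicon word starts here; skip one character
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, taking the longest lexicon word (>= 2 chars) ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 1, -1):
            if text[j - size:j] in lexicon:
                words.append(text[j - size:j])
                j -= size
                break
        else:
            j -= 1  # no lexicon word ends here; skip one character
    return words[::-1]


def segment_all(text, lexicon):
    """Return one word segmentation result per (hypothetical) segmentation rule."""
    return {
        "forward": forward_max_match(text, lexicon),
        "backward": backward_max_match(text, lexicon),
    }
```

For instance, with the lexicon {"研究", "研究生", "生命", "起源"} and the text "研究生命起源", the forward rule yields ["研究生", "起源"] while the backward rule yields ["研究", "生命", "起源"], so the two results would then be compared by matching degree.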
S103: Output the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
In the embodiment of the present invention, the matching degree of each word segmentation result may be determined from the average number of characters per segmented word, or from the variance of the number of characters per segmented word; this is not limited herein.
Preferably, since the more characters a segmented word contains, the easier it is for a user to determine the actual semantics of the topic data from it, the matching degree of each word segmentation result is measured based on the longest-match principle. After the matching degrees of the word segmentation results are compared, each first segmented word contained in the result with the highest matching degree is output as a keyword corresponding to the topic data.
For example, when the topic data consists only of the three Chinese characters for "data line", both "data line" and "data" can form a segmented word, and "data line" has the higher matching degree. The segmented word contained in the word segmentation result with the highest matching degree is therefore "data line", which is output as the keyword.
As an embodiment of the present invention, a calculation method of the matching degree of the segmentation result is further defined. As shown in fig. 2, the step S103 specifically includes:
S201: Calculate the average number of characters per segmented word for each word segmentation result, according to the number of characters in each segmented word of the result and the total number of segmented words in the result.
Each word segmentation result contains a plurality of segmented words, and each segmented word contains at least two characters. In the embodiment of the present invention, the total number of segmented words is determined, and the number of characters in each segmented word is determined. The ratio of the sum of the character counts of all segmented words to the total number of segmented words is output as the average number of characters per segmented word.
For example, if segmenting a piece of topic data yields the word segmentation result { skyway group/data line/yield }, the three segmented words in the result are "skyway group", "data line" and "yield", their character counts are 4, 3 and 3, the total number of segmented words is 3, and the average number of characters per segmented word is (4+3+3)/3 ≈ 3.33.
S202: Weight the average number of characters per segmented word and the total number of segmented words of each word segmentation result to output the matching degree of that result.
In the embodiment of the invention, the average number of characters per segmented word $A_1$ has a preset weighting coefficient $a_1$, the total number of segmented words $A_2$ has a preset weighting coefficient $a_2$, and $a_1 + a_2 = 1$. The matching degree of each word segmentation result is $C = A_1 \times a_1 + A_2 \times a_2$.
S203: Output the segmented words contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
If word segmentation of the topic data yields m word segmentation results whose matching degrees are $C_1, C_2, \ldots, C_m$, the largest value $C_i$ among $C_1, C_2, \ldots, C_m$ is selected, and each segmented word in the word segmentation result corresponding to $C_i$ is output as a keyword corresponding to the topic data, where m is an integer greater than 1 and i is less than or equal to m.
In the embodiment of the invention, the average number of characters per segmented word and the total number of segmented words both strongly influence the word segmentation result and determine whether a user can grasp the actual semantics of the topic data. Weighting these two factors and using the weighted value as the matching degree of the word segmentation result therefore improves the accuracy and effectiveness of keyword selection, so that the event content of the burst topic can be located accurately.
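Steps S201 to S203 can be sketched as follows. The function names and the default weights 0.7 and 0.3 are illustrative assumptions; the source only requires that the two preset coefficients sum to 1.

```python
def matching_degree(words, a1=0.7, a2=0.3):
    """C = A1*a1 + A2*a2 for one word segmentation result (a list of segmented words)."""
    if not words:
        return 0.0
    avg_chars = sum(len(w) for w in words) / len(words)  # A1: average characters per word
    total_words = len(words)                             # A2: total number of segmented words
    return avg_chars * a1 + total_words * a2


def select_keywords(results, a1=0.7, a2=0.3):
    """Return the segmented words of the result with the highest matching degree."""
    return max(results, key=lambda words: matching_degree(words, a1, a2))
```

With character counts 4, 3 and 3 as in the example above, `matching_degree` computes A1 = 10/3 and A2 = 3 before applying the weights.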
S104: Update the summary information associated with the topic data according to the keywords.
At any moment, the system has cumulatively received multiple pieces of topic data. After determining the keywords of each piece of topic data, the system regenerates the summary information describing all topic data cumulatively received so far, so that a user can clearly learn the general content of the burst topic at the current moment from the summary information.
Keywords are decisive features of the topic data. To generate summary information associated with all topic data cumulatively received so far, the cumulative word frequency of each keyword across the topic data may be counted, and the summary information generated from the keywords whose cumulative word frequency exceeds a threshold. The summary information associated with the topic data and the keywords may be generated using the TextRank algorithm or the summary generation tool in the word tool.
Preferably, as an embodiment of the present invention, as shown in fig. 3, step S104 specifically includes:
S301: Acquire the cumulative word frequency of each keyword and calculate the growth acceleration of the cumulative word frequency, where the cumulative word frequency of a keyword is the cumulative number of times the keyword has occurred in all topic data acquired up to the current moment.
In the embodiment of the present invention, the cumulative word frequency of a keyword indicates the number of occurrences of the keyword in all topic data cumulatively received so far. Since the system continuously acquires topic data, the cumulative word frequency of the same keyword keeps increasing. If the system detects that the cumulative word frequency of a keyword increases by $\Delta S$ within a time period $\Delta T$, the growth rate of the cumulative word frequency of that keyword is $V = \Delta S / \Delta T$, and the growth acceleration $a$ of the cumulative word frequency is the derivative of the growth rate $V$ with respect to time, i.e., $a = V'(T)$. The larger the growth acceleration, the more rapidly the keyword's occurrences per unit time are increasing, and the higher the burstiness of the topic.
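Since the cumulative word frequency is only observed at discrete times, the derivative has to be approximated in practice. The sketch below estimates the growth acceleration from the last three observations using finite differences; the sampling scheme and the function name are assumptions, not something the source specifies.

```python
def growth_acceleration(samples):
    """Estimate a = V'(T) from the last three (time, cumulative_count) samples.

    `samples` must be sorted by strictly increasing time and hold at least
    three entries.
    """
    (t0, s0), (t1, s1), (t2, s2) = samples[-3:]
    v1 = (s1 - s0) / (t1 - t0)  # growth rate V over the earlier interval
    v2 = (s2 - s1) / (t2 - t1)  # growth rate V over the later interval
    # The two rates are centred on the interval midpoints, (t2 - t0) / 2 apart.
    return (v2 - v1) / ((t2 - t0) / 2)
```

A keyword whose cumulative count grows linearly has zero acceleration under this estimate, while one whose count grows quadratically has a constant positive acceleration.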
S302: Add the growth acceleration corresponding to each keyword to a pre-generated matrix.
Each time new topic data is received, the system determines the keywords of the topic data and the growth acceleration of each keyword's cumulative word frequency. If the topic data has K keywords, K growth accelerations are obtained. If the system has accumulated P growth accelerations in total (P ≥ K, P an integer), the matrix is expanded into a P × P matrix, and the K growth accelerations obtained in real time are added to it. Besides the P growth accelerations, the P × P matrix also contains null values.
S303: Calculate the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, determine from the matrix the growth accelerations that are greater than a second threshold.
The system monitors each growth acceleration in the matrix so as to compute the eigenvalue of the matrix in real time. As more topic data accumulates, the size of the matrix and the total number of growth accelerations it contains keep changing, and the eigenvalue of the matrix increases accordingly. When the eigenvalue is greater than the preset first threshold, the system locates, among the growth accelerations contained in the matrix, the one or more growth accelerations whose values are greater than the second threshold.
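The source does not state how the growth accelerations are laid out in the P × P matrix or which eigenvalue is compared against the first threshold. The following is therefore only a sketch under loudly stated assumptions: accelerations are placed on the diagonal, null entries are treated as 0, and the eigenvalue used is the dominant one (largest in magnitude), estimated by power iteration. All function names are hypothetical.

```python
def build_matrix(accelerations):
    """Assumed layout: accelerations on the diagonal, null entries stored as 0.0."""
    p = len(accelerations)
    return [[accelerations[i] if i == j else 0.0 for j in range(p)] for i in range(p)]


def mat_vec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def dominant_eigenvalue(matrix, iterations=200):
    """Power iteration; assumes a single eigenvalue of largest magnitude."""
    n = len(matrix)
    if n == 0:
        return 0.0
    v = [1.0] * n
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = max(abs(x) for x in w)
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged vector gives the eigenvalue estimate.
    av = mat_vec(matrix, v)
    return sum(a * b for a, b in zip(av, v)) / sum(b * b for b in v)


def locate_bursts(accelerations, first_threshold, second_threshold):
    """Return indices of accelerations above the second threshold, gated by the eigenvalue."""
    if dominant_eigenvalue(build_matrix(accelerations)) <= first_threshold:
        return []
    return [i for i, a in enumerate(accelerations) if a > second_threshold]
```

Under the diagonal assumption the dominant eigenvalue is simply the acceleration of largest magnitude, so the eigenvalue check acts as a cheap gate before the per-entry scan; a different matrix layout would change that behaviour.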
As an embodiment of the present invention, as shown in fig. 4, the step S303 specifically includes:
S401: Divide the growth accelerations in the matrix at the current moment into N groups, and map the growth accelerations of each group into a sub-matrix.
Because the matrix contains a large number of growth accelerations, the matrix is subjected to dimension reduction in order to speed up locating the growth accelerations whose values are greater than the second threshold.
Specifically, according to a preset rule, all growth accelerations present in the matrix are divided into N groups, so that each group contains a smaller number of growth accelerations. The number of growth accelerations in each group may be the same or different. The growth accelerations contained in each group are mapped into one sub-matrix, so that N groups give N sub-matrices. As the topic data gradually increases, the growth accelerations obtained in each update are likewise mapped into the N sub-matrices.
S402: Calculate the eigenvalue of each sub-matrix and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, screen out from that sub-matrix the growth accelerations that are greater than the second threshold.
The eigenvalue of each sub-matrix is calculated. If the eigenvalues of any of the N sub-matrices are greater than the preset fourth threshold, the growth accelerations greater than the second threshold are screened out from each sub-matrix whose eigenvalue is greater than the fourth threshold.
In the embodiment of the invention, each sub-matrix contains far fewer growth accelerations than the full matrix. By calculating the eigenvalue of each sub-matrix separately, the growth accelerations greater than the second threshold can be located quickly in those sub-matrices whose eigenvalues exceed the fourth threshold, which improves the efficiency of burst topic detection.
S304: According to the segmented word corresponding to each determined growth acceleration, screen out, from all acquired topic data, the topic data containing those segmented words.
Each growth acceleration in the matrix or sub-matrix corresponds to one keyword, and each keyword is one of the segmented words in the word segmentation result with the highest matching degree for a piece of topic data. The system can therefore look up, in a pre-stored mapping table of growth accelerations and segmented words, the segmented word corresponding to each growth acceleration whose value is greater than the second threshold. If L growth accelerations are greater than the second threshold, L segmented words are found.
The system examines in turn each piece of topic data acquired up to the current moment and judges whether it contains the L segmented words. If a piece of topic data contains the L segmented words, the system selects it and performs step S305 on it.
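The screening in step S304 can be sketched as a simple containment filter. The function name is hypothetical, and substring containment is an assumption about how "contains the segmented word" is tested.

```python
def screen_topics(topics, burst_words):
    """Keep only the topic texts that contain every one of the L burst segmented words."""
    return [text for text in topics if all(word in text for word in burst_words)]
```

For example, `screen_topics(["data line recall", "data plan", "line outage"], ["data", "line"])` keeps only the first text, which would then be passed on to step S305.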
S305: Perform word segmentation again on the topic data containing the segmented words, and calculate the word frequency feature value of each segmented word obtained from this segmentation.
For each piece of screened topic data, the system performs word segmentation again. This segmentation may use any existing word segmentation algorithm, including but not limited to algorithms based on string matching and algorithms based on statistics. After segmentation, a plurality of segmented words of the topic data are obtained again. To distinguish the segmented words obtained in S102 from those obtained in S305, the former are referred to as first segmented words and the latter as second segmented words. The first and second segmented words may be the same or different. To further screen out the second segmented words that strongly influence the summary information, the word frequency feature value of each second segmented word is calculated from its word frequency feature quantities. These feature quantities include, but are not limited to, word frequency (TF) and inverse document frequency (IDF).
As an embodiment of the present invention, as shown in fig. 5, the S305 specifically includes:
S501: Perform word segmentation again on the topic data containing the segmented words to obtain a plurality of segmented words.
S502: For each segmented word obtained from this segmentation, calculate its statistical word frequency and inverse document frequency over all topic data acquired up to the current moment.
In the embodiment of the invention, the number of times each second segmented word appears in the screened topic data is counted; this count is the statistical word frequency $F_{TF}$. If the total number of screened pieces of topic data is X, of which X' pieces contain a given second segmented word (X' ≤ X, X' an integer), the inverse document frequency $F_{IDF}$ of that second segmented word is:
[Formula image: Figure BDA0001317966290000101]
S503: Weight the statistical word frequency and the inverse document frequency of each segmented word to output the word frequency feature value of that segmented word.
The statistical word frequency $F_{TF}$ has a preset weighting coefficient $a_3$, the inverse document frequency $F_{IDF}$ has a preset weighting coefficient $a_4$, and $a_3 + a_4 = 1$. The word frequency feature value of each second segmented word is $F = F_{TF} \times a_3 + F_{IDF} \times a_4$.
In the embodiment of the invention, the word frequency feature value of each second segmented word is calculated from its TF and IDF values using the preset weighting coefficients. Taking both TF and IDF into account allows the importance of each second segmented word to be compared quantitatively across the screened pieces of topic data.
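Steps S502 and S503 can be sketched as follows. The formula for the inverse document frequency is not reproduced in this text, so the sketch assumes the common definition log(X / X'); that definition, the function name and the default weights of 0.5 are all assumptions.

```python
import math
from collections import Counter


def word_frequency_features(segmented_topics, a3=0.5, a4=0.5):
    """Return F = F_TF * a3 + F_IDF * a4 for every second segmented word.

    `segmented_topics` is a list of screened topics, each given as a list of
    second segmented words. F_IDF is ASSUMED to be log(X / X').
    """
    x = len(segmented_topics)
    # F_TF: total occurrences of each word across the screened topics.
    tf = Counter(word for words in segmented_topics for word in words)
    # X': number of screened topics that contain the word at least once.
    containing = Counter(word for words in segmented_topics for word in set(words))
    return {word: tf[word] * a3 + math.log(x / containing[word]) * a4 for word in tf}
```

The resulting dictionary can then be compared against the third threshold in step S306 to pick out high-frequency words.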
S306: Output the segmented words whose word frequency feature value is greater than a third threshold as high-frequency words, and connect the high-frequency words through a preset algorithm to obtain the summary information containing the high-frequency words.
Each second segmented word whose word frequency feature value F is greater than the preset third threshold is determined; these second segmented words are the high-frequency words appearing in the topic data. The high-frequency words are connected using the TextRank algorithm, the summary generation tool in the word tool, or another user-defined algorithm, to obtain the summary information associated with the topic data and the high-frequency words.
S105: Display the keywords and the summary information so that the user can learn of the burst topic at the current moment.
The system displays the keywords acquired in real time together with the updated summary information. In practice, the growth acceleration of the accumulated word frequency of a keyword exceeds the threshold, and the summary information is updated, only when the topic data relates to a burst topic. The text displayed by the system in real time therefore closely matches the actual content of the burst topic event and has reference value.
In this embodiment of the invention, each time topic data is obtained from the information sharing platform, the keywords corresponding to the topic data are determined and the summary information is updated in real time based on those keywords. From the output keywords and summary information, the user can learn at the earliest moment roughly what the burst topic on the information sharing platform is about, and can quickly determine whether it relates to the user's own enterprise. Burst topic events related to the enterprise can thus be discovered, tracked, and handled effectively, which improves the soft power of the enterprise.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.
Corresponding to the burst topic detection method described in the above embodiments, Fig. 6 shows a schematic diagram of the burst topic detection device provided in an embodiment of the present invention. For convenience of description, only the parts relevant to the embodiment of the present invention are shown.
Referring to Fig. 6, the apparatus includes:
an obtaining module 61, configured to continuously obtain topic data in the information sharing platform;
a matching module 62, configured to, each time a piece of topic data is obtained, match the topic data against each word in a preset word bank to output multiple word segmentation results;
an output module 63, configured to output the participles contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
an updating module 64, configured to update the summary information associated with the topic data according to the keywords; and
a display module 65, configured to display the keywords and the summary information so that the user can learn of the burst topic at the current moment.
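As an illustration of how a matching module might produce more than one word segmentation result from a preset word bank, the sketch below runs forward and backward maximum matching over the same text. This is an assumed technique chosen for the example; the lexicon, the `max_len` parameter, and the function names are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word
    starting at the current position, falling back to a single character."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                result.append(piece)
                i += size
                break
    return result


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation: take the longest lexicon word
    ending at the current position, falling back to a single character."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                result.append(piece)
                j -= size
                break
    return result[::-1]


lexicon = {"研究", "研究生", "生命", "起源"}  # hypothetical preset word bank
text = "研究生命起源"
candidates = [forward_max_match(text, lexicon), backward_max_match(text, lexicon)]
```

The two directions disagree on this input, which is exactly the situation in which a matching degree is needed to choose between candidate results.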
Optionally, the update module 64 includes:
a first calculation sub-module, configured to obtain the accumulated word frequency of each keyword and to calculate the growth acceleration of the accumulated word frequency, where the accumulated word frequency of a keyword represents the accumulated number of times the keyword occurs in all topic data obtained as of the current moment;
an adding sub-module, configured to add the growth acceleration corresponding to each keyword into a pre-generated matrix;
a determining sub-module, configured to calculate the eigenvalue of the matrix at the current moment and, when the eigenvalue is greater than a first threshold, to determine from the matrix each growth acceleration that is greater than a second threshold;
a screening sub-module, configured to screen, from all the obtained topic data, the topic data containing the participle corresponding to each determined growth acceleration;
a word segmentation sub-module, configured to perform word segmentation processing again on the topic data containing the participle and to calculate the word frequency characteristic value of each participle obtained after the word segmentation processing; and
a first output sub-module, configured to output the participles whose word frequency characteristic value is greater than a third threshold as high-frequency words, and to connect the high-frequency words by means of a preset algorithm to obtain the summary information containing the high-frequency words.
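The work of the first calculation sub-module can be sketched under one stated assumption: the growth acceleration is taken here as the second-order finite difference of the accumulated word frequency over the last three observations. The text reproduced in this section does not fix that definition, and the matrix and eigenvalue gate are omitted; all names below are hypothetical.

```python
def growth_acceleration(cumulative_counts):
    """ASSUMED definition: second-order finite difference of the last three
    cumulative counts, i.e. the change in the per-interval increase."""
    if len(cumulative_counts) < 3:
        return 0.0  # not enough history to estimate an acceleration
    c0, c1, c2 = cumulative_counts[-3:]
    return float((c2 - c1) - (c1 - c0))


def accelerating_keywords(history, second_threshold):
    """Return {keyword: acceleration} for keywords above the threshold.

    history maps each keyword to its cumulative counts over time.
    """
    selected = {}
    for keyword, counts in history.items():
        acc = growth_acceleration(counts)
        if acc > second_threshold:
            selected[keyword] = acc
    return selected


history = {"fire": [1, 3, 9], "weather": [10, 12, 14]}
bursting = accelerating_keywords(history, second_threshold=2.0)
```

"weather" grows steadily, so its acceleration is zero, whereas "fire" grows faster in each interval and is selected.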
Optionally, the determining sub-module is specifically configured to:
dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into a sub-matrix;
calculating the eigenvalue of each sub-matrix and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, screening out from that sub-matrix each growth acceleration that is greater than the second threshold;
wherein N is an integer greater than 1.
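The two-stage screening described above can be sketched as follows. How a group of growth accelerations is mapped into a sub-matrix is not specified in this passage, so the sketch assumes a diagonal sub-matrix, whose eigenvalues are simply its diagonal entries; under that assumption the largest eigenvalue is the largest acceleration in the group. The function names and sample values are hypothetical.

```python
def split_into_groups(items, n_groups):
    """Split a list into n_groups contiguous groups of nearly equal size."""
    if n_groups <= 1:
        raise ValueError("N must be an integer greater than 1")
    size, extra = divmod(len(items), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def largest_eigenvalue_of_diagonal(values):
    """ASSUMPTION: the sub-matrix is diagonal, so its eigenvalues are the
    diagonal entries and the largest one is max(values)."""
    return max(values) if values else float("-inf")


def screen_by_submatrix(accelerations, n_groups, fourth_threshold, second_threshold):
    """accelerations is a list of (keyword, growth_acceleration) pairs.
    A group is inspected element by element only if it passes the gate."""
    selected = []
    for group in split_into_groups(accelerations, n_groups):
        values = [acc for _, acc in group]
        if largest_eigenvalue_of_diagonal(values) > fourth_threshold:
            selected.extend((kw, acc) for kw, acc in group if acc > second_threshold)
    return selected


accelerations = [("fire", 4.0), ("mall", 3.0), ("weather", 0.0), ("lunch", 0.5)]
hits = screen_by_submatrix(accelerations, n_groups=2,
                           fourth_threshold=2.0, second_threshold=3.5)
```

The benefit of the grouping is that quiet groups are skipped after a single check instead of having every entry compared against the second threshold.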
Optionally, the word segmentation sub-module is specifically configured to:
performing word segmentation processing again on the topic data containing the participle to obtain a plurality of participles;
calculating, for each participle obtained after the word segmentation processing, its statistical word frequency and its reverse file frequency in all the topic data obtained as of the current moment;
and weighting the statistical word frequency and the reverse file frequency of each participle to output the word frequency characteristic value of the participle.
Optionally, the output module 63 includes:
a second calculation sub-module, configured to calculate the average number of characters per participle for each word segmentation result from the total number of characters of the participles in that result and the total number of participles in that result;
a weighting sub-module, configured to weight the average number of characters per participle and the total number of participles of each word segmentation result to output the matching degree of that result; and
a second output sub-module, configured to output the participles contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
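The matching-degree computation performed by these sub-modules can be sketched as below. The passage only says that the average number of characters per participle and the total number of participles are weighted; the weight values, and the choice of a negative weight on the total so that fewer, longer participles score higher, are assumptions made for this example.

```python
def matching_degree(segmentation, w_avg=0.7, w_count=-0.3):
    """Weighted combination of average characters per participle and the
    total number of participles. Both weights are hypothetical."""
    total_words = len(segmentation)
    if total_words == 0:
        return float("-inf")  # an empty result can never be the best one
    total_chars = sum(len(word) for word in segmentation)
    avg_chars = total_chars / total_words
    return w_avg * avg_chars + w_count * total_words


def best_segmentation(candidates):
    """Return the word segmentation result with the highest matching degree."""
    return max(candidates, key=matching_degree)


candidates = [
    ["研究", "生命", "起源"],       # 3 participles, 2 characters each on average
    ["研", "究", "生命", "起源"],   # 4 participles, 1.5 characters on average
]
keywords = best_segmentation(candidates)
```

With these assumed weights the first candidate wins; a different choice of weights could reverse the ranking, so the weights should be taken from the configured preset values.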
Fig. 7 is a schematic diagram of a burst topic detection device provided by an embodiment of the present invention. As shown in Fig. 7, the burst topic detection device 7 of this embodiment includes a processor 70, a memory 71, and a computer program 72, such as a burst topic detection program, that is stored in the memory 71 and executable on the processor 70. When executing the computer program 72, the processor 70 implements the steps of the burst topic detection method embodiments described above, such as steps S101 to S105 shown in Fig. 1. Alternatively, when executing the computer program 72, the processor 70 implements the functions of the modules/units in the device embodiments described above, such as the functions of modules 61 to 65 shown in Fig. 6.
Illustratively, the computer program 72 may be partitioned into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions; the instruction segments describe the execution of the computer program 72 in the burst topic detection device 7. For example, the computer program 72 may be divided into an obtaining module, a matching module, an output module, an updating module, and a display module, whose specific functions are as follows:
The obtaining module is configured to continuously obtain topic data in the information sharing platform.
The matching module is configured to, each time a piece of topic data is obtained, match the topic data against each word in a preset word bank to output multiple word segmentation results.
The output module is configured to output the participles contained in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
The updating module is configured to update the summary information associated with the topic data according to the keywords.
The display module is configured to display the keywords and the summary information so that the user can learn of the burst topic at the current moment.
The burst topic detection device 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will appreciate that Fig. 7 is merely an example of the burst topic detection device 7 and does not limit it; the device may include more or fewer components than shown, may combine certain components, or may use different components. For example, the burst topic detection device may also include input/output devices, a network access device, a bus, and the like.
The processor 70 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 71 may be an internal storage unit of the burst topic detection device 7, such as a hard disk or internal memory of the device. The memory 71 may also be an external storage device of the burst topic detection device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) fitted to the device. Further, the memory 71 may include both an internal storage unit and an external storage device of the burst topic detection device 7. The memory 71 is used to store the computer program and other programs and data required by the burst topic detection device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the above embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments described above are implemented. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately expanded or restricted according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (8)

1. A method for detecting a burst topic, characterized by comprising the following steps:
continuously acquiring topic data in the information sharing platform;
each time a piece of topic data is obtained, matching the topic data with each word in a preset word bank so as to output a plurality of word segmentation results;
outputting the plurality of participles included in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating summary information associated with the topic data according to the keywords;
displaying the keywords and the summary information so that a user can learn of the burst topic at the current moment;
wherein the updating of the summary information associated with the topic data according to the keywords comprises:
respectively acquiring the accumulated word frequency of each keyword, and calculating the growth acceleration of the accumulated word frequency, wherein the accumulated word frequency of a keyword represents the accumulated number of occurrences of the keyword in all the topic data acquired as of the current moment;
adding the growth acceleration corresponding to each keyword into a pre-generated matrix;
calculating an eigenvalue of the matrix at the current moment, and determining, from the matrix, each growth acceleration that is greater than a second threshold when the eigenvalue is greater than a first threshold; the first threshold is a threshold set for an eigenvalue of the matrix;
screening, from all the obtained topic data, the topic data containing the participle corresponding to each determined growth acceleration;
performing word segmentation processing again on the topic data containing the participle, and calculating the word frequency characteristic value of each participle obtained after the word segmentation processing;
and outputting the participles whose word frequency characteristic value is greater than a third threshold as high-frequency words, and connecting the high-frequency words through a preset algorithm to obtain the summary information containing the high-frequency words.
2. The method for detecting a burst topic according to claim 1, wherein the calculating of an eigenvalue of the matrix at the current moment, and the determining, from the matrix, of each growth acceleration that is greater than a second threshold when the eigenvalue is greater than a first threshold, comprises:
dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into a sub-matrix;
calculating the eigenvalue of each sub-matrix and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, screening out from that sub-matrix each growth acceleration that is greater than the second threshold;
wherein N is an integer greater than 1; the fourth threshold is a threshold set for an eigenvalue of the submatrix.
3. The method for detecting a burst topic according to claim 1, wherein the performing of word segmentation processing again on the topic data containing the participle, and the calculating of the word frequency characteristic value of each participle obtained after the word segmentation processing, comprises:
performing word segmentation processing again on the topic data containing the participle to obtain a plurality of participles;
calculating, for each participle obtained after the word segmentation processing, its statistical word frequency and its reverse file frequency in all the topic data obtained as of the current moment;
and weighting the statistical word frequency and the reverse file frequency of each participle to output the word frequency characteristic value of the participle.
4. The method for detecting a burst topic according to claim 1, wherein the outputting of the plurality of participles included in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data comprises:
calculating the average number of characters per participle for each word segmentation result according to the total number of characters of the participles in that word segmentation result and the total number of participles in that word segmentation result;
weighting the average number of characters per participle and the total number of participles of each word segmentation result to output the matching degree of each word segmentation result;
and outputting the plurality of participles included in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data.
5. A computer-readable storage medium storing a burst topic detection program, wherein the burst topic detection program, when executed by at least one processor, implements the steps of the burst topic detection method as recited in any one of claims 1-4.
6. A burst topic detection device, characterized in that the burst topic detection device comprises a memory, a processor, and a burst topic detection program stored on the memory and operable on the processor, the processor implementing the following steps when executing the burst topic detection program:
continuously acquiring topic data in the information sharing platform;
each time a piece of topic data is obtained, matching the topic data with each word in a preset word bank so as to output a plurality of word segmentation results;
outputting the plurality of participles included in the word segmentation result with the highest matching degree as the keywords corresponding to the topic data;
updating summary information associated with the topic data according to the keywords;
displaying the keywords and the summary information so that a user can learn of the burst topic at the current moment;
wherein the step of updating the summary information associated with the topic data according to the keywords specifically includes:
respectively acquiring the accumulated word frequency of each keyword, and calculating the growth acceleration of the accumulated word frequency, wherein the accumulated word frequency of a keyword represents the accumulated number of occurrences of the keyword in all the topic data acquired as of the current moment;
adding the growth acceleration corresponding to each keyword into a pre-generated matrix;
calculating an eigenvalue of the matrix at the current moment, and determining, from the matrix, each growth acceleration that is greater than a second threshold when the eigenvalue is greater than a first threshold; the first threshold is a threshold set for an eigenvalue of the matrix;
screening, from all the obtained topic data, the topic data containing the participle corresponding to each determined growth acceleration;
performing word segmentation processing again on the topic data containing the participle, and calculating the word frequency characteristic value of each participle obtained after the word segmentation processing;
and outputting the participles whose word frequency characteristic value is greater than a third threshold as high-frequency words, and connecting the high-frequency words through a preset algorithm to obtain the summary information containing the high-frequency words.
7. The burst topic detection device according to claim 6, wherein the step of calculating an eigenvalue of the matrix at the current moment, and determining, from the matrix, each growth acceleration that is greater than a second threshold when the eigenvalue is greater than a first threshold, specifically includes:
dividing the growth accelerations in the matrix at the current moment into N groups, and mapping the growth accelerations of each group into a sub-matrix;
calculating the eigenvalue of each sub-matrix and, when the eigenvalue of a sub-matrix is greater than a fourth threshold, screening out from that sub-matrix each growth acceleration that is greater than the second threshold;
wherein N is an integer greater than 1; the fourth threshold is a threshold set for an eigenvalue of the submatrix.
8. The burst topic detection device according to claim 6, wherein the step of performing word segmentation processing again on the topic data containing the participle, and calculating the word frequency characteristic value of each participle obtained after the word segmentation processing, specifically includes:
performing word segmentation processing again on the topic data containing the participle to obtain a plurality of participles;
calculating, for each participle obtained after the word segmentation processing, its statistical word frequency and its reverse file frequency in all the topic data obtained as of the current moment;
and weighting the statistical word frequency and the reverse file frequency of each participle to output the word frequency characteristic value of the participle.
CN201710433359.1A 2017-06-09 2017-06-09 Burst topic detection method and burst topic detection equipment Active CN107688596B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710433359.1A CN107688596B (en) 2017-06-09 2017-06-09 Burst topic detection method and burst topic detection equipment
PCT/CN2018/074870 WO2018223718A1 (en) 2017-06-09 2018-01-31 Trending topic detection method, apparatus and device, and medium

Publications (2)

Publication Number Publication Date
CN107688596A CN107688596A (en) 2018-02-13
CN107688596B true CN107688596B (en) 2020-02-21

Family

ID=61152644

Country Status (2)

Country Link
CN (1) CN107688596B (en)
WO (1) WO2018223718A1 (en)

Also Published As

Publication number Publication date
CN107688596A (en) 2018-02-13
WO2018223718A1 (en) 2018-12-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant