CN110245355A - Text topic detecting method, device, server and storage medium - Google Patents


Publication number
CN110245355A
Authority
CN
China
Prior art keywords: topic, word, center, text, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910549752.6A
Other languages
Chinese (zh)
Other versions
CN110245355B (en)
Inventor
张国校
李铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Domain Computer Network Co Ltd
Original Assignee
Shenzhen Tencent Domain Computer Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Tencent Domain Computer Network Co Ltd
Priority to CN201910549752.6A
Publication of CN110245355A
Application granted
Publication of CN110245355B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a text topic detection method, apparatus, server, and storage medium. The method includes: determining the center topic words of network text data; calculating the topic dispersion of each center topic word across all discussion group text data, where the discussion group text data is obtained by dividing the network text data by topic discussion group; and determining the text topic according to the target center topic words, where a target center topic word is a center topic word whose topic dispersion is greater than a first preset value. Unrelated topic items can thereby be filtered out, which improves the detection accuracy of text topics.

Description

Text topic detecting method, device, server and storage medium
Technical field
This application relates to the technical field of Internet information management, and in particular to a text topic detection method, apparatus, server, and storage medium.
Background art
Information sharing platforms such as forums, microblogs, and post bars are highly open: users can publish information according to their personal interests and habits, so a massive amount of user-published text is generated on the Internet every day. When a major event occurs in society or online, discussion of a specific topic suddenly increases across all information sharing platforms, and the corresponding topic is characterized by strong burstiness and high topic dispersion.
Monitoring the topics that users discuss helps the relevant personnel discover negative-news topics promptly and take remedial measures. For example, when a payment bug suddenly appears in a game, discussion of the payment bug in that game's post bar increases noticeably; by monitoring burst topics, operations staff can find the problem in time and notify maintenance staff to fix the payment bug.
In the related art, a generative probabilistic model is usually used to generate related word pairs for each piece of comment data, and burst topics are generated from the word pairs whose count increases by more than a preset value within a target time period. However, when a small number of users post large amounts of repeated, unrelated content on a particular information sharing platform, the related art cannot filter out the unrelated topic items, which lowers the detection accuracy of text topics.
Therefore, how to filter out unrelated topic items and improve the detection accuracy of text topics is a technical problem that those skilled in the art currently need to solve.
Summary of the invention
In view of this, the present application provides a text topic detection method, a text topic detection apparatus, a server, and a storage medium, which can filter out unrelated topic items and improve the detection accuracy of text topics.
To achieve the above object, a first aspect of the present application provides a text topic detection method, comprising:
determining the center topic words of network text data;
calculating the topic dispersion of each center topic word across all discussion group text data, wherein the discussion group text data is obtained by dividing the network text data by topic discussion group;
determining the text topic according to the target center topic words, wherein a target center topic word is a center topic word whose topic dispersion is greater than a first preset value.
With reference to the first aspect of the present application, in a first embodiment of the first aspect, setting a topic word candidate in the topic cluster that meets a preset condition as the center topic word comprises:
calculating the burst level of each topic word candidate according to the word frequency information and the fluctuation index of the topic word candidate;
setting the topic word candidate with the highest burst level in the topic cluster as the center topic word.
With reference to the first embodiment of the first aspect, in a second embodiment of the first aspect, determining the text topic according to the target center topic words comprises:
setting the topic clusters that contain the center topic words whose topic dispersion is greater than the first preset value as target topic clusters, and determining the text topic according to all the target topic clusters.
With reference to the second embodiment of the first aspect, in a third embodiment of the first aspect, determining the text topic according to all the target topic clusters comprises:
calculating a first co-occurrence rate between the center topic word of each target topic cluster and each non-center topic word;
setting the center topic word, and each non-center topic word whose first co-occurrence rate is greater than a second preset value, as target topic words;
determining the text topic according to all the target topic words.
With reference to the third embodiment of the first aspect, in a fourth embodiment of the first aspect, calculating the first co-occurrence rate between the center topic word of each target topic cluster and each non-center topic word comprises:
determining the text sub-data in the network text data that corresponds to the target topic cluster;
setting the proportion of co-occurrence sentences among all sentences in the text sub-data as the first co-occurrence rate, wherein a co-occurrence sentence is a sentence in the text sub-data that contains both the non-center topic word and the center topic word.
With reference to the third embodiment of the first aspect, in a fifth embodiment of the first aspect, before determining the text topic according to all the target topic words, the method further comprises:
setting each non-center topic word whose first co-occurrence rate is less than or equal to the second preset value and whose burst level is greater than a third preset value as a new center topic word;
calculating a second co-occurrence rate between the new center topic word of each target topic cluster and each non-new-center topic word, and setting the new center topic word, and each non-new-center topic word whose second co-occurrence rate is greater than the second preset value, as target topic words.
With reference to the second, third, fourth, and fifth embodiments of the first aspect, in a sixth embodiment of the first aspect, setting a topic word candidate in the topic cluster that meets a preset condition as the center topic word comprises:
calculating the burst level of each topic word candidate according to the word frequency information and the fluctuation index of the topic word candidate;
setting the topic word candidate with the highest burst level in the topic cluster as the center topic word.
To achieve the above object, a second aspect of the present application provides a text topic detection apparatus, comprising:
a center topic word determining module, configured to determine the center topic words of network text data;
a dispersion calculating module, configured to calculate the topic dispersion of each center topic word across all discussion group text data, wherein the discussion group text data is obtained by dividing the network text data by topic discussion group;
a topic determining module, configured to determine the text topic according to the target center topic words, wherein a target center topic word is a center topic word whose topic dispersion is greater than a first preset value.
To achieve the above object, a third aspect of the present application provides a server, comprising:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is configured to store the program, and the program is at least used for:
determining the center topic words of network text data;
calculating the topic dispersion of each center topic word across all discussion group text data, wherein the discussion group text data is obtained by dividing the network text data by topic discussion group;
determining the text topic according to the target center topic words, wherein a target center topic word is a center topic word whose topic dispersion is greater than a first preset value.
To achieve the above object, a fourth aspect of the present application provides a storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement the steps of the text topic detection method according to any one of the above.
It can be seen that, after determining the center topic words of the network text data, the present application determines the topic dispersion of each center topic word. Topic dispersion indicates how a center topic word is distributed across the network text data of the individual topic discussion groups. The center topic words are screened by topic dispersion, and the text topic is determined from the center topic words whose topic dispersion is higher than the first preset value. This reduces text topic misjudgment caused by individual users repeatedly posting specific content in large quantities in a particular community. The present application can therefore filter out unrelated topic items and improve the detection accuracy of text topics.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are merely embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a schematic diagram of the composition of a text topic detection system according to an embodiment of the present application;
Fig. 2 is a flow diagram of a text topic detection method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of dividing discussion group text data according to an embodiment of the present application;
Fig. 4 is a schematic diagram of determining center topic words according to an embodiment of the present application;
Fig. 5 is a flow diagram of another text topic detection method according to an embodiment of the present application;
Fig. 6 is a flow diagram of yet another text topic detection method according to an embodiment of the present application;
Fig. 7 is a clustering schematic of the Louvain community partitioning algorithm according to an embodiment of the present application;
Fig. 8 is a schematic diagram of the composition of a text topic detection apparatus according to an embodiment of the present application;
Fig. 9 is a schematic diagram of the composition of a server according to an embodiment of the present application.
Detailed description
While detecting topics in network text data such as microblog topics and post bar topics, the solution of the present application can filter out unrelated topic items, improve the detection accuracy of text topics, and accurately capture what users across the network are discussing.
In the embodiments of the present application, network text data is information published by users or official accounts on information sharing platforms such as microblogs, post bars, and forums. It includes both the content that a user or official account publishes itself and that user's or official account's comments on and reposts of other users' content.
In this embodiment, a center topic word of the network text data is a word that is mainly discussed or mentioned in the network text data. A center topic word may be a word that occurs frequently in the network text data, or a word that summarizes frequently occurring words. For example, if "potato", "sweet potato", "taro", and "radish" all occur frequently in the network text data, "root vegetables" can be used as the center topic word.
In this embodiment, the discussion group text data is obtained by dividing all network text data by topic discussion group. A topic discussion group is a group in which one or more users jointly publish text data. For example, a main post and its replies in the same thread can be regarded as one topic discussion group; all users commenting on a given topic on a microblog platform can be regarded as one topic discussion group; and all comments under a given short video can be regarded as one topic discussion group. A topic discussion group may itself contain multiple topic discussion groups. This embodiment does not limit the granularity of a topic discussion group: an entire post bar can be regarded as one topic discussion group, or a specific main post and its replies within a post bar can be regarded as one. For example, when a main-post/reply structure is regarded as one topic discussion group, all network text data can be divided into multiple groups according to the main-post/reply identifiers. Suppose the network text data includes a1, a2, a3, b1, b2, c1, c2, and c3, where a1, b1, and c1 are three different main posts, a2 and a3 are replies to a1, b2 is a reply to b1, and c2 and c3 are replies to c1. Dividing by main-post/reply structure yields three groups: the first includes a1, a2, and a3; the second includes b1 and b2; and the third includes c1, c2, and c3.
In this embodiment, topic dispersion refers to how widely a center topic word is spread across all discussion group text data, and it can be used to describe the distribution of the center topic word across the network. Continuing the example in which the network text data includes a1, a2, a3, b1, b2, c1, c2, and c3: to determine the topic dispersion of center topic word A, count the word frequency of A in the first main-post/reply structure (a1, a2, and a3), in the second (b1 and b2), and in the third (c1, c2, and c3). If the word frequencies of A in the three structures differ from one another by no more than a certain range, the dispersion of A is high; otherwise, the dispersion of A is low. Each topic discussion group can be pictured as a room. When the topic dispersion of a center topic word is high, people in most rooms are discussing the topic of that word; when it is low, people in only a few rooms are discussing it, and the topic of that word cannot represent what most people are discussing.
In this embodiment, the text topic is the content mainly discussed in the network text data, and a text topic may include one or more words.
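The word-frequency comparison for center topic word A can be sketched as follows. This is a minimal illustration, not the patent's implementation: the token lists, the helper names `word_frequency` and `is_widely_dispersed`, and the `max_spread` tolerance are all hypothetical, and "differ by no more than a certain range" is read here as "maximum frequency minus minimum frequency is below a tolerance".

```python
from collections import Counter


def word_frequency(tokens, word):
    """Share of the tokens in one discussion group that equal `word`."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


def is_widely_dispersed(groups, word, max_spread):
    """True when the per-group frequencies of `word` differ by less than max_spread.

    Assumption: the spread is measured as max minus min across groups.
    """
    freqs = [word_frequency(tokens, word) for tokens in groups.values()]
    return max(freqs) - min(freqs) < max_spread


# Hypothetical, already tokenized text of the three main-post/reply structures.
groups = {
    "a": ["A", "x", "A", "y"],  # a1, a2, a3 pooled
    "b": ["A", "z", "A", "w"],  # b1, b2 pooled
    "c": ["A", "A", "q", "r"],  # c1, c2, c3 pooled
}

a_is_dispersed = is_widely_dispersed(groups, "A", max_spread=0.1)
x_is_dispersed = is_widely_dispersed(groups, "x", max_spread=0.1)
```

Note that this check only makes sense for words that have already been selected as center topic words: a word that is absent from every group also has a spread of zero.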
To facilitate understanding of the text topic detection method of the present application, the system to which the method applies is introduced below. Fig. 1 shows a schematic diagram of the composition of a text topic detection system according to an embodiment of the present application.
As shown in Fig. 1, the text topic detection system provided by the embodiments of the present application includes a data acquisition device 10, a server 20, and a client 30, which communicate with one another through a network 40.
The data acquisition device 10 may be a server of an information sharing platform such as a post bar or microblog platform. The system may contain multiple data acquisition devices 10 so that topics can be detected on multiple information sharing platforms. The server 20 may be a device such as a tablet computer or personal computer and is used to analyze the network text data. The client 30 may be a device such as a mobile phone, tablet computer, or personal computer and is used to receive and display the text topic detection results of the server 20. In other system compositions, the server 20 and the client 30 may be the same device.
In the embodiments of the present application, the data acquisition device 10 can collect and aggregate network text data from information sharing platforms; the network text data may include main posts or replies published by users and information published by the platform's official accounts. Because a topic takes some time to emerge, the server 20 can select the network text data from a period of time, perform text topic detection on it, and obtain the text topic for that period. After obtaining the text topic, the server 20 sends it over the network to the client 30 so that the relevant staff can learn what users on the information sharing platform were generally discussing during that period.
The text topic detection process of the server is described in detail below.
Fig. 2 shows a flow diagram of a text topic detection method according to an embodiment of the present application. The method of this embodiment may include:
S101: the server determines the center topic words of the network text data.
The network text data may include information published by users and/or official accounts on any number of information sharing platforms. Because a topic takes some time to emerge and is usually time-sensitive, the network text data in this embodiment may be information generated within a specific time period; that is, the difference between the publication time of the network text data and the current time is less than a preset time difference.
In one feasible implementation, a day is used as the time unit of topic detection, and the network text data generated each day is detected. In another feasible implementation, all network text data from the hour before the current time is used as the topic detection object, so that the current topics of each information sharing platform can be learned promptly.
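Selecting the network text data of a time window can be sketched as below. The post records, the `published` field name, and the dates are hypothetical; the only rule taken from the text is that the gap between publication time and current time must be less than a preset time difference.

```python
from datetime import datetime, timedelta


def select_recent(posts, now, window):
    """Keep the posts published within `window` before `now`."""
    return [p for p in posts if timedelta(0) <= now - p["published"] < window]


# Hypothetical posts; only the publication time matters for this step.
now = datetime(2019, 6, 24, 12, 0)
posts = [
    {"id": 1, "published": datetime(2019, 6, 24, 11, 30)},
    {"id": 2, "published": datetime(2019, 6, 24, 9, 0)},
]

# Preset time difference of one hour, as in the second implementation.
recent = select_recent(posts, now, timedelta(hours=1))
```

Passing `timedelta(days=1)` instead gives a rolling 24-hour window, which approximates the day-based implementation.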
In this embodiment, the sources of network text data include but are not limited to information sharing platforms such as microblogs, post bars, news websites, and video websites, as long as they contain users' comments on some event. The purpose of determining the text topic is to help the relevant personnel understand people's feedback on a particular observed object, such as an application, an online game, a design drawing, typhoon weather, or Spring Festival travel tickets. To obtain user feedback more effectively, the source of the network text data can be set to the information sharing platforms related to that object. For example, to understand user feedback on online game A, network text data can be obtained from the post bar or forum of online game A, and the hot topic currently under discussion can then be determined. If the relevant personnel find that the hot topic for online game A is a pricing bug in the in-game store, the bug can be reported promptly so that it is fixed.
It should be noted that the center topic words in this embodiment are the core content of the topics discussed in the network text data, and they can be determined in various ways. For example, the words in the network text data can be sorted by frequency of occurrence from high to low and the top M words taken as center topic words; alternatively, the words whose frequency of occurrence is greater than a preset frequency can be taken as center topic words. In one feasible implementation, semantic analysis is performed on the high-frequency words in the network text data to obtain center topic words that summarize the words in each clustering result. For example, if "apple", "orange", and "watermelon" all occur frequently in the network text data, they can be classified as fruit, so the center topic word of the network text data is "fruit". Further, the historical fluctuation of each word can be taken into account when selecting center topic words: words whose frequency of occurrence changes by more than a preset change rate within a target time period are taken as center topic words. Other ways of selecting center topic words are also possible, as long as the selected words represent the core content of the topics discussed in the network text data; this embodiment imposes no specific limitation.
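Three of the selection rules above (top M by frequency, frequency above a preset value, and change rate above a preset value) can be sketched as follows. The function names, the token lists, and the definition of change rate as relative growth in frequency between two periods are assumptions made for illustration; the semantic-analysis variant is not covered.

```python
from collections import Counter


def top_m_words(tokens, m):
    """The M most frequent words, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(m)]


def words_above_frequency(tokens, min_freq):
    """Words whose relative frequency exceeds the preset frequency."""
    total = len(tokens)
    return {w for w, c in Counter(tokens).items() if c / total > min_freq}


def bursting_words(current_tokens, previous_tokens, min_change_rate):
    """Words whose relative frequency grew by more than min_change_rate."""
    cur, prev = Counter(current_tokens), Counter(previous_tokens)
    n_cur, n_prev = len(current_tokens), len(previous_tokens)
    result = set()
    for word, count in cur.items():
        f_cur = count / n_cur
        f_prev = prev[word] / n_prev if n_prev else 0.0
        # Assumption: a word unseen in the previous period counts as bursting.
        if f_prev == 0.0 or (f_cur - f_prev) / f_prev > min_change_rate:
            result.add(word)
    return result


# Hypothetical tokenized text for the target period and the period before it.
current = ["bug", "bug", "bug", "pay", "map", "bug"]
previous = ["map", "pay", "bug", "map", "map", "map"]

by_rank = top_m_words(current, 1)
by_frequency = words_above_frequency(current, 0.5)
by_change = bursting_words(current, previous, 1.0)
```

In this toy data all three rules select "bug"; on real data they generally disagree, which is why the text treats them as alternatives.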
S102: the server calculates the topic dispersion of each center topic word across all discussion group text data.
Text topics in network text data usually take the form of comments, and the network text data of one topic discussion group can be a set of text data discussing a specific topic. In this embodiment, all network text data can be divided by topic discussion group into multiple sets of discussion group text data in order to determine the topic dispersion of each center topic word across the discussion group text data.
The discussion group text data in this embodiment is obtained by dividing the network text data by topic discussion group. Fig. 3 shows a schematic diagram of dividing discussion group text data according to an embodiment of the present application; the following example illustrates how the network text data in Fig. 3 is divided by main-post/reply structure. Suppose forum A contains 10 pieces of network text data posted by users, numbered 1 to 10. Items 1, 5, and 8 are main posts, and the rest are replies. According to the main-post/reply identifiers, items 2, 3, and 4 are replies to main post 1; items 6 and 7 are replies to main post 5; and items 9 and 10 are replies to main post 8. Items 1 to 4 therefore form one set of discussion group text data, items 5 to 7 form a second, and items 8 to 10 form a third. Dividing all network text data by topic discussion group thus yields three sets of discussion group text data, and the data within one set belongs to the same topic discussion group, that is, the same main-post/reply structure.
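The division in the Fig. 3 example can be sketched as a simple grouping by main post. The record layout is an assumption: each post carries a hypothetical `parent` field that is `None` for a main post and otherwise holds the number of the main post it replies to (nested replies are not modeled).

```python
from collections import defaultdict


def group_by_main_post(posts):
    """Map each main post number to the numbers of that post and its replies."""
    groups = defaultdict(list)
    for post in posts:
        # A main post has no parent; a reply points at its main post.
        root = post["id"] if post["parent"] is None else post["parent"]
        groups[root].append(post["id"])
    return dict(groups)


# The ten posts of the example: 1, 5 and 8 are main posts, the rest are replies.
posts = [
    {"id": 1, "parent": None}, {"id": 2, "parent": 1},
    {"id": 3, "parent": 1}, {"id": 4, "parent": 1},
    {"id": 5, "parent": None}, {"id": 6, "parent": 5}, {"id": 7, "parent": 5},
    {"id": 8, "parent": None}, {"id": 9, "parent": 8}, {"id": 10, "parent": 8},
]

groups = group_by_main_post(posts)
```

Each value in `groups` corresponds to one set of discussion group text data.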
It can be understood that the topic dispersion in this embodiment is a value describing how a center topic word is distributed across all discussion group text data. A high topic dispersion for center topic word X means that X is mentioned frequently in multiple topic discussion groups rather than occurring frequently in only one specific group. As an example of determining topic dispersion: if a forum contains multiple topic discussion groups, the word frequency information of a center topic word can be determined from the number of times it occurs in each set of discussion group text data and the total word count of each topic discussion group. When the deviations of the word frequency information across all topic discussion groups are all less than a preset deviation, the topic dispersion of the center topic word is high; when they are all greater than or equal to the preset deviation, the topic dispersion is low.
S103: the server determines the text topic according to the target center topic words.
The network text data of one topic discussion group can be a set of discussions on a specific topic, so how widely a center topic word is discussed can be determined by calculating its topic dispersion across all discussion group text data. If the dispersion of a center topic word is very low, most topic discussion groups do not treat that center topic word as the core of their discussion.
In this step, a target center topic word is a center topic word whose topic dispersion is greater than the first preset value. Determining the text topic from the center topic words whose topic dispersion is greater than the first preset value amounts to filtering the center topic words by topic dispersion and rejecting those whose topic dispersion does not exceed the first preset value. A center topic word whose topic dispersion does not exceed the first preset value can be regarded as content that most topic discussion groups are not discussing, that is, an unrelated topic item.
On an information sharing platform, a small number of users may repeatedly post identical or similar text in one topic discussion group, for example by flooding a post bar, hyping a topic, or posting advertisements. This can cause certain words to be selected as center topic words even though they are not what most topic discussion groups are mainly discussing, and such center topic words lead to misjudged text topics. Fig. 4 shows a schematic diagram of determining center topic words according to an embodiment of the present application; consider the following example. A health forum has five topic discussion groups, A, B, C, D, and E, and three center topic words are determined from the word frequencies in their network text data: "winter", "keeping warm", and "a certain antihypertensive drug". The word frequencies of "winter" in groups A, B, C, D, and E are 21%, 22%, 19%, 18%, and 5%; those of "keeping warm" are 20%, 23%, 21%, 24%, and 6%; and those of "a certain antihypertensive drug" are 1%, 0%, 2%, 4%, and 56%. The topic dispersion of "winter" and "keeping warm" is therefore clearly higher than that of "a certain antihypertensive drug". The text topic of the health forum is related to "winter" and "keeping warm", whereas "a certain antihypertensive drug" is not what the forum mainly discusses; the network text of topic discussion group E may be information posted by a merchant to promote its product.
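The Fig. 4 example can be worked through numerically. The text does not give a formula for topic dispersion at this point, so the sketch below substitutes one plausible measure, the normalized Shannon entropy of a word's per-group frequencies (1 means evenly spread, 0 means concentrated in one group). The function name, the choice of entropy, and the threshold of 0.5 used as the first preset value are all assumptions.

```python
import math


def topic_dispersion(freqs):
    """Normalized Shannon entropy of per-group word frequencies, in [0, 1]."""
    total = sum(freqs)
    if total == 0 or len(freqs) < 2:
        return 0.0
    shares = [f / total for f in freqs if f > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(freqs))


# Word frequencies in topic discussion groups A to E, taken from the example.
word_freqs = {
    "winter": [0.21, 0.22, 0.19, 0.18, 0.05],
    "keeping warm": [0.20, 0.23, 0.21, 0.24, 0.06],
    "antihypertensive drug": [0.01, 0.00, 0.02, 0.04, 0.56],
}

FIRST_PRESET_VALUE = 0.5  # hypothetical threshold

dispersion = {word: topic_dispersion(f) for word, f in word_freqs.items()}
# Target center topic words: dispersion greater than the first preset value.
targets = [w for w, d in dispersion.items() if d > FIRST_PRESET_VALUE]
```

Under this measure the two seasonal words score close to 1 and are kept, while the drug name, concentrated in group E, falls well below the threshold and is filtered out as an unrelated topic item.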
In this embodiment, the center topic words are screened by topic dispersion, and the text topic is determined from the center topic words whose topic dispersion is greater than the first preset value. As one feasible implementation, the text topic can be determined by combining all such center topic words with their categories. Because the number of center topic words whose topic dispersion exceeds the first preset value is limited, center topic words alone may not fully express what the network text data discusses. Therefore, once these center topic words have been determined, the text topic can be determined by also taking into account the words in the network text data that are related to them. For example, if the center topic word whose topic dispersion across all discussion group text data exceeds the first preset value is "shutdown update", and 60% of the sentences in the network text data that contain "shutdown update" also contain the word "compensation", the text topic of the current network text data is: the shutdown update and how it is compensated. Other ways of determining the text topic from center topic words are also possible and are not specifically limited here; irrelevant topic items are filtered out by selecting only center topic words whose topic dispersion is greater than the first preset value.
S104, the client displays the text topic detected by the server.
In this embodiment, after the center topic words of the network text data are determined, the topic dispersion of each center topic word is determined. Topic dispersion indicates how a center topic word is distributed across the network text data of the individual topic discussion groups. A high topic dispersion means the center topic word is frequently discussed in most of the topic discussion groups corresponding to the network text data; a low topic dispersion means the word is rarely discussed across the groups, or is frequently discussed only in individual groups. A center topic word is the core word of a topic discussed in the network text data. The present application screens center topic words by topic dispersion and determines the text topic from the center topic words whose topic dispersion is higher than the first preset value, which reduces the text topic misjudgments caused by individual users repeatedly posting specific content in a text publishing community. This embodiment can therefore filter out irrelevant topic items and improve the accuracy of text topic detection.
Referring to Fig. 5, which shows a flow diagram of another text topic detection method according to an embodiment of the present application, the method of this embodiment may include:
S201, determine the topic word candidates in the network text data, and perform a clustering operation based on the main post and reply structure on the topic word candidates to obtain topic clusters;
In this embodiment, the network text data is first segmented into words, topic word candidates are preliminarily selected according to the frequency of occurrence and lexical information of each word, and a clustering operation based on topic discussion groups is performed on all topic word candidates to obtain multiple topic clusters. All topic word candidates in the same topic cluster correspond to the same topic discussion group. The network text data corresponding to a topic cluster is equivalent to the discussion group text data mentioned in the embodiment corresponding to Fig. 2.
In this embodiment, the topic word candidates may include single-word keywords, multi-word ordered frequent items, or both. A keyword is a single word; an ordered frequent item is an ordered combination of multiple words. Specifically, an ordered frequent item may consist of multiple words in a sentence whose relative order remains unchanged; the number of words in an ordered frequent item is not limited here. For example, in "system bug", "system internal bug" and "system has a bug", the words "system" and "bug" constitute one ordered frequent item.
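The defining property of an ordered frequent item, that its words keep the same relative order but need not be adjacent, can be sketched as a subsequence test. The helper name `contains_in_order` and the tokenized sentences are assumptions for illustration.

```python
def contains_in_order(tokens, item):
    """Return True if every word of `item` occurs in `tokens` in the same
    relative order (gaps between the words are allowed)."""
    remaining = iter(tokens)
    # `word in remaining` advances the iterator, so later words must be
    # found after the position where the earlier word matched.
    return all(word in remaining for word in item)

sentences = [
    ["system", "bug"],
    ["system", "internal", "bug"],
    ["system", "has", "a", "bug"],
    ["bug", "report", "for", "system"],  # reversed order, does not match
]
support = sum(contains_in_order(s, ("system", "bug")) for s in sentences)
```

Here `support` counts the sentences that contain the item, which is the quantity a frequency threshold would be applied to.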
As one feasible implementation, S201 in this embodiment may include the following operations: setting keywords in the network text data as topic word candidates, where the frequency of occurrence of a keyword is greater than a fourth preset value; and/or setting ordered frequent items in the network text data as topic word candidates, where an ordered frequent item is a group of words whose order is fixed within a target sentence pattern, and the number of occurrences of the target sentence pattern in the network text data is greater than a fifth preset value.
In the above feasible implementation, keywords whose frequency of occurrence is greater than the fourth preset value are set as topic word candidates, and ordered frequent items are set as topic word candidates. This embodiment treats the sentence pattern that contains an ordered frequent item as the target sentence pattern; the number of occurrences of the target sentence pattern containing the ordered frequent item is greater than the fifth preset value. As one feasible implementation, after the ordered frequent items are preliminarily selected, they can also be scored for importance and filtered using the boundary freedom degree.
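A minimal sketch of this candidate selection follows. It assumes pre-tokenized sentences and, for brevity, restricts ordered frequent items to two-word items; the function name and the parameters `fourth_preset` and `fifth_preset` simply mirror the thresholds named in the text and are otherwise hypothetical.

```python
from collections import Counter
from itertools import combinations

def select_candidates(sentences, fourth_preset, fifth_preset):
    """Return (keywords, ordered two-word items) that pass the thresholds."""
    word_counts = Counter(w for s in sentences for w in s)
    total_words = sum(word_counts.values())
    # Keywords: relative frequency above the fourth preset value.
    keywords = {w for w, c in word_counts.items()
                if c / total_words > fourth_preset}
    # Ordered items: (earlier word, later word) pairs, counted once per
    # sentence, whose sentence count exceeds the fifth preset value.
    pair_counts = Counter()
    for s in sentences:
        pair_counts.update(set(combinations(s, 2)))
    ordered_items = {p for p, c in pair_counts.items()
                     if c > fifth_preset and p[0] != p[1]}
    return keywords, ordered_items

sentences = [
    ["system", "bug"],
    ["system", "has", "bug"],
    ["system", "internal", "bug"],
    ["update", "today"],
]
keywords, ordered_items = select_candidates(sentences, 0.2, 2)
```

Longer items could be handled the same way with larger combinations, at a higher counting cost.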
S202, set the topic word candidate in each topic cluster that meets a preset condition as the center topic word.
In this embodiment, a center topic word is set in every topic cluster. Since a topic cluster corresponds to a topic discussion group, this step is equivalent to selecting a center topic word for the network text data of each topic discussion group. This avoids ignoring the topic discussed by a topic discussion group merely because that group contains little text data. This embodiment does not limit the number of center topic words per topic cluster; as one feasible implementation, one center topic word can be selected for each topic cluster. After the center topic word is set, each topic cluster may contain two types of topic word candidates: the candidate serving as the center topic word, and the candidates serving as non-central topic words.
As one feasible implementation, the process of setting the center topic word in S202 may include the following steps:
Step 1, calculate the burst grade of each topic word candidate according to its word frequency information and fluctuation index;
Step 2, set the topic word candidate with the highest burst grade in the topic cluster as the center topic word.
The above feasible implementation uses word frequency information and the fluctuation index as the criteria for selecting the center topic word. Word frequency information is the frequency of occurrence of a topic word candidate in the network text data; the fluctuation index describes how the candidate's word frequency has changed historically. Combining the two yields the burst grade of the candidate. A high burst grade means the candidate was rarely discussed in the past but is suddenly discussed frequently in the period covered by the network text data. When a major event occurs, a particular word tends to be mentioned frequently within a short time, so this implementation selects topic word candidates that are both frequent and bursty as center topic words. Determining the text topic from center topic words obtained in this way enables burst topic detection on network text data. A text topic is called a burst topic if the volume of discussion about it in a given period increases substantially compared with the preceding period.
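The text does not specify how word frequency and the fluctuation index are combined. The sketch below is one possible scoring under stated assumptions: the fluctuation index is taken as the relative increase of the current count over the historical mean, and it is multiplied by a log-scaled current count. All names, the smoothing constant and the sample counts are hypothetical.

```python
import math
from statistics import mean

def burst_grade(current_count, history_counts, smoothing=1.0):
    """Assumed score: relative rise over the historical mean (fluctuation
    index) times a log-scaled current frequency. Falls to 0 when the word
    is not more frequent than in the past."""
    baseline = mean(history_counts)
    fluctuation = max(0.0, current_count - baseline) / (baseline + smoothing)
    return fluctuation * math.log1p(current_count)

def pick_center_word(cluster):
    """cluster maps candidate -> (current_count, history_counts)."""
    return max(cluster, key=lambda w: burst_grade(*cluster[w]))

cluster = {
    "patch": (120, [5, 8, 6]),         # rare before, frequent now
    "login": (300, [290, 310, 305]),   # always frequent, not bursty
}
center = pick_center_word(cluster)
```

The design intent is only that a steadily popular word scores low while a suddenly popular word scores high; any monotone combination with that property would serve the same role.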
Further, after the burst grade of each topic word candidate is calculated, topic word candidates whose burst grade is lower than a preset grade may also be removed from the topic cluster. This filtering removes candidates with low burst grades, so that when S204 determines the text topic from the topic clusters, the text topic is determined from candidates with high burst grades and the final text topic represents the content that is suddenly being discussed in the network text data.
S203, calculate the topic dispersion of each center topic word across all discussion group text data.
In this embodiment, each center topic word is the content that its topic discussion group concentrates on, and it has the highest burst grade in its topic cluster. By calculating the topic dispersion of a center topic word across all discussion group text data, it can be determined whether the topic mainly discussed in one topic discussion group is also a focus of discussion in the other groups. For example, suppose a game forum has 4 topic discussion groups A, B, C and D, and the operations of S201 and S202 determine that the center topic word of group A is "shutdown", that of group B is "update", that of group C is "patch", and that of group D is "trade". The word frequencies of "shutdown" in groups A, B, C and D are 37%, 30%, 29% and 5%; those of "update" are 32%, 39%, 31% and 6%; those of "patch" are 12%, 19%, 51% and 3%; and those of "trade" are 1%, 0%, 2% and 76%. Among all center topic words, the topic dispersion of "shutdown" and "update" is therefore higher than that of "patch", and the topic dispersion of "patch" is higher than that of "trade". The shutdown and the update are discussed by most users of the game forum, which indicates that the text data in topic discussion groups A and B was generated in response to a sudden event. The content discussed in groups A and B can therefore be combined to determine the burst topic of the network text data, i.e., the text topic.
S204, set the topic clusters containing the center topic words whose topic dispersion is greater than the first preset value as target topic clusters, and determine the text topic according to all target topic clusters.
Although the center topic word has the highest burst grade among all candidate topic words in its topic cluster, a topic usually consists of multiple words, so a single center topic word cannot accurately describe the text topic. For example, suppose the center topic word of a topic discussion group is "bug" and the dispersion of "bug" across all topic discussion groups is greater than the first preset value. If only "bug" is taken as the text topic, it cannot be determined whether the network text data is discussing the existence of a bug, the repair of a bug, or compensation for a bug. Because a topic cluster also contains other candidate topic words related to the center topic word, i.e., other words that discuss the same topic, this embodiment determines the text topic jointly from multiple words in the target topic clusters. This embodiment screens the topic clusters by whether the topic dispersion of each cluster's center topic word is greater than the first preset value. The core content of the topic clusters that are filtered out is not frequently discussed by most topic discussion groups, whereas the candidate topic words in the target topic clusters are words frequently discussed by most topic discussion groups, so determining the text topic from the target topic clusters improves detection accuracy. Topic dispersion describes how the same topic is distributed across the external network within the same period.
As one feasible implementation, determining the text topic according to all target topic clusters in S204 may include the following operations:
S2041, set the topic clusters containing the center topic words whose topic dispersion is greater than the first preset value as target topic clusters;
S2042, calculate the first co-occurrence rate between the center topic word of each target topic cluster and each non-central topic word;
Here, the topic word candidates in a target topic cluster other than the center topic word are called non-central topic words. The first co-occurrence rate describes the probability that the center topic word and a non-central topic word appear together in the same sentence. The higher the co-occurrence rate between a non-central topic word and the center topic word, the more closely the two are associated.
For example, suppose the center topic word of a target topic cluster is "fruit" and the non-central topic words are "price" and "quality". If 50 sentences in the network text data of the target topic cluster contain "fruit", and of these 50 sentences 40 contain "price" and 15 contain "quality", then the first co-occurrence rate of "fruit" with "price" is 80% and that of "fruit" with "quality" is 30%. It can be inferred that "price" is more strongly related to "fruit" and that the target topic cluster discusses fruit prices. The first co-occurrence rate above is calculated over the sentences in the network text data of the target topic cluster that contain the center topic word. It can also be calculated over all sentences of the target topic cluster, as follows: determine the text sub-data in the network text data that corresponds to the target topic cluster, and set the ratio of the number of co-occurrence sentences to the number of all sentences in the text sub-data as the first co-occurrence rate, where a co-occurrence sentence is a sentence in the text sub-data that contains both the non-central topic word and the center topic word. Other ways of calculating the first co-occurrence rate are also possible and are not specifically limited here, as long as the resulting first co-occurrence rate describes the degree of association between the center topic word and each non-central topic word.
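Both ways of computing the first co-occurrence rate described above can be sketched with one function. The function name, the `over_all_sentences` flag and the synthetic sentences are assumptions; the counts reproduce the "fruit" example.

```python
def co_occurrence_rate(sentences, center, other, over_all_sentences=False):
    """Share of sentences containing both words. The denominator is either
    the sentences containing the center word (default) or all sentences of
    the cluster's text sub-data."""
    with_center = [s for s in sentences if center in s]
    both = sum(1 for s in with_center if other in s)
    denominator = len(sentences) if over_all_sentences else len(with_center)
    return both / denominator if denominator else 0.0

# 50 sentences contain "fruit": 40 also contain "price", 15 "quality".
sentences = (
    [["fruit", "price"]] * 35
    + [["fruit", "price", "quality"]] * 5
    + [["fruit", "quality"]] * 10
    + [["weather"]] * 50            # unrelated sentences in the sub-data
)
rate_price = co_occurrence_rate(sentences, "fruit", "price")
rate_quality = co_occurrence_rate(sentences, "fruit", "quality")
rate_price_all = co_occurrence_rate(sentences, "fruit", "price",
                                    over_all_sentences=True)
```

The two denominators give different absolute values, so the second preset value would need to be chosen to match whichever variant is used.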
S2043, set the center topic word and the non-central topic words whose first co-occurrence rate is greater than the second preset value as target topic words;
This step is equivalent to filtering the topic word candidates in the target topic cluster: non-central topic words that are weakly related to the center topic word are filtered out, and the resulting target topic words comprise the center topic word and the non-central topic words whose first co-occurrence rate is greater than the second preset value.
As one feasible implementation, if the number of non-central topic words whose first co-occurrence rate is greater than the second preset value exceeds a preset number, the Q non-central topic words with the highest first co-occurrence rates can be selected as target topic words.
The above approach is equivalent to selecting target topic words by a center diffusion mechanism: taking the center topic word as the reference, the other topic word candidates are sorted and filtered by their co-occurrence rate with the center topic word, and a given number of topic words is selected.
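A minimal sketch of the center diffusion selection follows. The function and parameter names are hypothetical, and the word "origin" with its rate is invented purely to show the top-Q cut.

```python
def select_target_words(center, first_rates, second_preset, top_q=None):
    """Keep the center word plus the non-central words whose first
    co-occurrence rate exceeds the second preset value, optionally
    limited to the top Q by rate."""
    kept = sorted(
        (w for w, r in first_rates.items() if r > second_preset),
        key=lambda w: first_rates[w],
        reverse=True,
    )
    if top_q is not None:
        kept = kept[:top_q]
    return [center] + kept

first_rates = {"price": 0.8, "quality": 0.3, "origin": 0.6}
targets = select_target_words("fruit", first_rates, second_preset=0.5)
targets_top1 = select_target_words("fruit", first_rates, 0.5, top_q=1)
```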
Note that the center diffusion mechanism described above ensures the precision of burst topic discovery but reduces recall. This is mainly due to the structure of the topic discussion groups this embodiment targets, such as the main post and reply structure: if the time span is large, the related topic may shift. Take the results for a certain game in a certain period as an example. The topic word candidates detected in one target topic cluster include "recycling", "missing", "activity bug" and "compensation". Analysis of this target topic cluster along the time dimension shows that a bug first appeared in an activity, users then commented on "missing" the chance to exploit the bug, and finally users discussed topics related to "recycling" and "compensation". With the center diffusion mechanism alone, only the "recycling" topic would be found. This embodiment can therefore further introduce a heterogeneous mechanism: assuming that some heterogeneous structure exists in the target topic cluster, a new round of screening is performed, after the center diffusion mechanism, on the topic word candidates that were not selected. The heterogeneous mechanism can identify target topic words such as "activity bug" and "missing".
The process of identifying target topic words based on the heterogeneous mechanism may include the following steps:
Step 1, set the non-central topic words whose first co-occurrence rate is less than or equal to the second preset value and whose burst grade is greater than a third preset value as new center topic words;
Step 2, calculate the second co-occurrence rate between the new center topic word of each target topic cluster and each word that is not a new center topic word, and set the new center topic words, together with the words whose second co-occurrence rate is greater than the second preset value, as target topic words.
Selecting target topic words again based on the heterogeneous mechanism is equivalent to re-executing the operations of S202 to S204 on the topic word candidates that were not selected as target topic words. The second co-occurrence rate is calculated in essentially the same way as the first co-occurrence rate and is not described again here.
Multiple heterogeneous structures may exist in the same target topic cluster, i.e., the topic discussed in one topic discussion group may change several times, so the identification of target topic words based on the heterogeneous mechanism can be executed multiple times. The number of executions can depend on the number of topic word candidates in the target topic cluster and on the time span: the more candidates and the larger the time span, the more executions.
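One round of the heterogeneous mechanism (Steps 1 and 2 above) might be sketched as follows. The function name, the lookup-table form of the second co-occurrence rate, and every number in the example are hypothetical; only the two selection conditions come from the text.

```python
def heterogeneous_round(unselected, first_rates, burst, second_rate,
                        second_preset, third_preset):
    """Step 1: promote unselected words with a low first co-occurrence rate
    but a high burst grade to new center words. Step 2: also keep words
    whose second co-occurrence rate with a new center exceeds the second
    preset value."""
    new_centers = [w for w in unselected
                   if first_rates[w] <= second_preset
                   and burst[w] > third_preset]
    targets = list(new_centers)
    for word in unselected:
        if word in new_centers:
            continue
        if any(second_rate(c, word) > second_preset for c in new_centers):
            targets.append(word)
    return new_centers, targets

# Candidates left over after center diffusion (illustrative values only).
unselected = ["activity bug", "missing", "compensation"]
first_rates = {"activity bug": 0.1, "missing": 0.2, "compensation": 0.3}
burst = {"activity bug": 9.0, "missing": 2.0, "compensation": 1.0}
pair_rates = {("activity bug", "missing"): 0.7,
              ("activity bug", "compensation"): 0.1}
new_centers, targets = heterogeneous_round(
    unselected, first_rates, burst,
    second_rate=lambda c, w: pair_rates.get((c, w), 0.0),
    second_preset=0.5, third_preset=5.0,
)
```

Running the function again on the words still unselected would correspond to the repeated execution described above.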
S2044, determine the text topic according to all target topic words.
The target topic words determined in this embodiment are words in the network text data that have high dispersion, high word frequency and a certain burstiness; all target topic words can be combined to determine the text topic. As one feasible implementation, when all target topic words belong to the same category, the text topic can be obtained from the relationships, grammatical relationships and generation times of the target topic words. For example, if the target topic words include "discover", "bug" and "fix", the relationships and grammar among them could yield either "after a bug was discovered, the bug was fixed" or "after a bug was fixed, a bug was discovered again"; if "discover" was generated earlier than "fix", the generation times determine that the text topic is "after a bug was discovered, the bug was fixed". If the target topic words span multiple categories or domains, they can first be classified and a text topic then generated for each class. Target topic words span multiple categories or domains when the selected sources of network text data are scattered; as a feasible implementation, related information sharing platforms can therefore be chosen as the sources of network text data.
After the text topic is obtained, the target sentence texts containing the text topic can also be determined, all target sentence texts scored by text similarity, and the top N target sentence texts by score reported so that the relevant staff can take appropriate countermeasures. Since texts discussing the same topic mostly share words, a sentence with high text similarity to the others represents most of the views expressed in the topic discussion group and is more representative. Scoring the target sentence texts by text similarity means calculating the similarity between each pair of target sentence texts. For example, if the target sentence texts are ABBB, ABBD, ABCB and ACBB, ABBB has the highest similarity to the other sentence texts and is the representative sentence.
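The similarity measure is not specified in the text. As a sketch, assuming a simple position-by-position match ratio between token sequences (an illustrative stand-in for whatever text similarity is actually used), the ABBB example can be reproduced. Function names are hypothetical.

```python
def position_similarity(a, b):
    """Assumed measure: share of positions at which the two sequences
    carry the same token, relative to the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest

def rank_sentences(sentences, top_n):
    """Score each sentence by its summed similarity to all the others and
    return the top N."""
    scored = []
    for i, s in enumerate(sentences):
        total = sum(position_similarity(s, t)
                    for j, t in enumerate(sentences) if j != i)
        scored.append((total, s))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_n]]

representative = rank_sentences(["ABBB", "ABBD", "ABCB", "ACBB"], top_n=1)
```

With this measure ABBB scores 2.25 against 1.75 for each of the other three, so it is returned as the representative sentence.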
As a further supplement to the embodiment corresponding to Fig. 5, the operation of calculating the boundary freedom degree of the ordered frequent items and removing the ordered frequent items with a low boundary freedom degree is as follows:
After the ordered frequent items are selected, the boundary freedom degree of each ordered frequent item is calculated using a boundary freedom degree formula, and the ordered frequent items are scored for importance and filtered based on the boundary freedom degree. The boundary freedom degree can be calculated from the entropy of the words on either side of the ordered frequent item. The specific boundary freedom degree formula is as follows:
Here, for an ordered frequent item k, R_k is its boundary freedom degree; across the sentences of the network text data, m is the total number of words on the left side of the ordered frequent item and n is the total number of words on the right side; P_{L,i|k} is the frequency of occurrence of the i-th word on the left given the ordered frequent item k, and P_{R,j|k} is the frequency of occurrence of the j-th word on the right given the ordered frequent item k. After the boundary freedom degree of each ordered frequent item is calculated, the ordered frequent items whose boundary freedom degree is less than the sixth preset value can be removed from the topic word candidates. Screening by boundary freedom degree selects the important ordered frequent items.
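The formula itself is not reproduced in this text, only its ingredients: the left and right neighbour frequencies P_{L,i|k} and P_{R,j|k} and the statement that the score is derived from their entropy. The sketch below therefore assumes one common form, the smaller of the left and right neighbour entropies; this combination is an assumption and may differ from the actual R_k.

```python
import math
from collections import Counter

def side_entropy(neighbors):
    """Shannon entropy of the words observed on one side of the item."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def boundary_freedom(left_neighbors, right_neighbors):
    """Assumed combination: the smaller of the two side entropies, so an
    item is only 'free' if both of its boundaries vary."""
    return min(side_entropy(left_neighbors), side_entropy(right_neighbors))

# An item with varied context on both sides versus one that always
# follows the same word (likely a fragment of a longer expression).
free_item = boundary_freedom(["the", "a", "this", "our"],
                             ["is", "was", "in", "got"])
bound_item = boundary_freedom(["the", "the", "the", "the"],
                              ["is", "was", "in", "got"])
```

Items scoring below the sixth preset value would then be dropped from the candidates.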
Referring to Fig. 6, which shows a flow diagram of another text topic detection method according to an embodiment of the present application, the method of this embodiment may include:
S301, set the keywords in the network text data as topic word candidates.
In this embodiment, the network text data may be segmented into words, and the words whose frequency of occurrence is greater than the fourth preset value are set as keywords.
S302, set the ordered frequent items in the network text data as topic word candidates.
Because network text takes many forms, many meaningful expressions consist of multiple words that do not necessarily appear consecutively. For example, the statement "activity bug" may be expressed as "the activity has a bug", "activity bug", "the activity has a bug again", and so on. This embodiment uses both keywords and ordered frequent items as topic word candidates, which improves the accuracy of topic detection.
S303, calculate the boundary freedom degree of each ordered frequent item, and remove the ordered frequent items whose boundary freedom degree is less than the sixth preset value from the topic word candidates.
S304, perform a clustering operation based on the main post and reply structure on the topic word candidates to obtain topic clusters.
In this step, the topic word candidates can be clustered with a community detection algorithm so that the candidates in the same resulting topic cluster share the same main post identifier. Refer to Fig. 7, a clustering schematic diagram of the Louvain community detection algorithm according to an embodiment of the present application. The Louvain algorithm has two phases; in Fig. 7, "1st pass" refers to the first phase and "2nd pass" to the second. The first phase repeatedly traverses the nodes of the network, trying to move each single node into the community that yields the largest modularity gain, until no node changes community. The second phase processes the result of the first phase: each small community is merged into a super node and the network is rebuilt, with the weight of an edge between two super nodes equal to the sum of the weights of the edges between the original nodes in the two super nodes. The two phases are iterated until the algorithm stabilizes, which gives the community partition, i.e., the clustering of the topic word candidates.
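The quantity that the first Louvain phase tries to increase is modularity. The sketch below computes the standard modularity of a given partition of a weighted, undirected graph without self-loops; it is not a full Louvain implementation, and the graph (two triangles joined by one edge) is an invented example.

```python
def modularity(edges, community_of):
    """Modularity Q = sum over communities of
    (internal weight / m) - (community degree / 2m)^2,
    for undirected weighted edges (u, v, w) without self-loops."""
    m = sum(w for _, _, w in edges)
    internal, degree = {}, {}
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + w
    return sum(internal.get(c, 0.0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
         ("c", "d", 1)]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the graph into its two triangles gives a higher modularity than keeping everything in one community, which is the kind of improvement the node moves of the first phase search for.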
S305, calculate the burst grade of each topic word candidate according to its word frequency information and fluctuation index, and set the topic word candidate with the highest burst grade in each topic cluster as the center topic word.
After the topic word candidates are generated, the fluctuation index can be introduced to measure the burstiness of each candidate. Note that when burst topic words are scored and ranked, both the fluctuation index and the frequency of occurrence of each candidate are taken into account.
S306, calculate the topic dispersion of each center topic word across all discussion group text data.
After the clustering operation of S304, each topic cluster may correspond to one or more topic word candidates. This embodiment constructs an index, the topic dispersion, that reflects how a center topic word is distributed across all network text data. Because the discussion of a genuine topic is widespread, the topic dispersion index can be used to filter out information items irrelevant to the topic. Different observation objects (such as an application, an online game, a design drawing, typhoon weather, or Spring Festival travel tickets) differ in how concentrated their topic discussion is. As one feasible implementation, the minimum topic dispersion, i.e., the first preset value, can therefore be calculated by the formula Y = l_1 + (log(g_{i,t}) - log(N_1) - 0.1)^2, where Y is the first preset value, l_1 is a given minimum value, g_{i,t} is the topic dispersion of observation object i at time t, and N_1 is a specified value.
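The threshold formula translates directly into code. The sketch assumes the natural logarithm, since the text does not state the base, and the sample arguments are arbitrary.

```python
import math

def first_preset_value(l1, g_it, n1):
    """Y = l1 + (log(g_it) - log(n1) - 0.1)^2, natural log assumed.
    l1: given minimum value; g_it: topic dispersion of observation
    object i at time t; n1: specified value."""
    return l1 + (math.log(g_it) - math.log(n1) - 0.1) ** 2

# When g_it equals n1 the log terms cancel and Y = l1 + 0.01.
threshold = first_preset_value(l1=0.05, g_it=2.0, n1=2.0)
```

Because the squared term is never negative, the computed threshold never falls below l1.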
S307, set the topic clusters containing the center topic words whose topic dispersion is greater than the first preset value as target topic clusters.
S308, determine the text sub-data in the network text data that corresponds to each target topic cluster.
S309, set the ratio of the number of co-occurrence sentences to the number of all sentences in the text sub-data as the first co-occurrence rate.
S310, set the center topic word and the non-central topic words whose first co-occurrence rate is greater than the second preset value as target topic words.
To further screen the topic word candidates in the target topic cluster, this embodiment adopts the center diffusion mechanism: taking the center topic word as the reference, the other topic words are sorted and filtered by their co-occurrence rate with the center topic word, and a given number of target topic words is selected.
S311, set the non-central topic words whose first co-occurrence rate is less than or equal to the second preset value and whose burst grade is greater than the third preset value as new center topic words.
S312, calculate the second co-occurrence rate between the new center topic word of each target topic cluster and each word that is not a new center topic word, set the new center topic words, together with the words whose second co-occurrence rate is greater than the second preset value, as target topic words, and determine the text topic according to all target topic words.
Further, after determining text topic, in order to which the details of topic is preferably presented, the present embodiment may be used also To carry out marking sequence to relevant sentence, then selects representative sentence and reported.
The present embodiment uses the adaptation mechanism for different observation objects, introduces including in low-frequency word filters Adaptation mechanism is adaptive selected candidate item and adaptively filters irrelevant information item.Distribution based on network text data Feature is clustered based on community partitioning algorithm, and according to centric keyword each topic discussion group topic dispersion mistake Filter unrelated topic item.The present embodiment belongs to the method for data-driven, does not need to estimate the parameter of model, without preparatory Dictionary is defined, different number of topic can be automatically generated according to the specific situation of data, unsupervised topic inspection may be implemented It surveys.During selection target topic word, the present embodiment takes center flooding mechanism and heterogeneous mechanism to carry out screening and other Unsupervised approaches are compared, the text topic detecting method provided in this embodiment accuracy with higher and recall rate.This reality Text topic can be understood in time by the monitoring of the topic fluctuation to network text data by applying example.The above process is without preparatory Topic cluster and topic word are defined, deliberately discusses that the fluctuation pattern of word itself finds topic word and topic unsupervisedly according to user Cluster, the topic detecting method can relatively well make up the shortcoming of measure of supervision.
To facilitate understanding of the scheme of the embodiments of the present application, a practical application scenario to which the scheme applies is introduced below.
In this embodiment, the observation object of text topic detection is the mobile game Honor of Kings. To obtain effective user feedback, this embodiment uses the Honor of Kings discussion forum and the WeChat game circle as the sources of network text data.
To reflect the timeliness and reliability of text topic detection, this embodiment performs topic detection on the network text data within 24 hours. According to word frequency, the topic word candidate items in the network text data are determined to be "ranking", "forming a team", "upper point", "skin" and "gal sieve". After the topic word candidate items are determined, all of them are scored and ranked according to fluctuation index and frequency of occurrence; their burst grades, from high to low, are "gal sieve", "skin", "ranking", "forming a team" and "upper point".
Further, the network text data are clustered based on topic discussion groups to obtain topic clusters A, B, C and D. The topic word candidate items in topic cluster A include "gal sieve", "skin" and "ranking", of which "gal sieve" has the highest burst grade. Those in topic cluster B include "gal sieve", "upper point" and "skin", of which "upper point" has the highest burst grade. Those in topic cluster C include "gal sieve" and "ranking", of which "gal sieve" has the highest burst grade. Those in topic cluster D include "gal sieve", "forming a team" and "ranking", of which "gal sieve" has the highest burst grade. Thus "gal sieve" can be taken as the center topic word of topic clusters A, C and D, and "upper point" as the center topic word of topic cluster B. The calculated topic dispersion of "gal sieve" is greater than the first preset value, while that of "upper point" is less than the first preset value, so topic cluster B can be filtered out and topic clusters A, C and D set as target topic clusters.
To further screen out irrelevant information items in the target topic clusters, the topic word candidate items in topic clusters A, C and D can be screened using the center flooding mechanism. For example, among all sentences in the network text data corresponding to topic clusters A, C and D that include the center topic word "gal sieve", 80% also include the non-central keyword "skin", 30% also include the non-central keyword "ranking", and 20% also include the non-central keyword "upper point". The co-occurrence rate of "skin" with the central keyword is therefore much higher than that of the other non-central keywords. "gal sieve" and "skin" are taken as target topic words, and combining the target topic words gives the text topic "skin of gal sieve". If it is found that the content about "gal sieve" and "skin" was published markedly earlier than the content about "ranking", this indicates that a heterogeneous structure exists in the network text data; a new center topic word can then be selected from the target topic clusters and the screening based on the center flooding mechanism repeated, yielding another text topic, "ranking". It follows that the user feedback about the mobile game Honor of Kings mainly comprises discussion of "skin of gal sieve" and discussion of "ranking"; that is, users first discussed the skin of gal sieve and then discussed ranking.
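The two filtering stages of this scenario can be traced with a short sketch. The center topic words and the 80% / 30% / 20% co-occurrence figures are taken from the example; the dispersion values, both preset values, and all function names are hypothetical and chosen only so that the example runs.

```python
# Center topic word of each topic cluster, as stated in the example.
CENTER = {"A": "gal sieve", "B": "upper point", "C": "gal sieve", "D": "gal sieve"}

# Hypothetical topic dispersion values; the example only says which side of
# the first preset value each word falls on.
DISPERSION = {"gal sieve": 0.9, "upper point": 0.2}
FIRST_PRESET = 0.5   # hypothetical threshold

# Co-occurrence with the center topic word "gal sieve", from the example.
CO_OCCURRENCE = {"skin": 0.8, "ranking": 0.3, "upper point": 0.2}
SECOND_PRESET = 0.5  # hypothetical threshold


def target_clusters(center, dispersion, first_preset):
    """Keep clusters whose center topic word is dispersed widely enough."""
    return sorted(c for c, word in center.items() if dispersion[word] > first_preset)


def target_topic_words(center_word, co_occurrence, second_preset):
    """Center word plus every non-central word that co-occurs often enough."""
    kept = [w for w, rate in co_occurrence.items() if rate > second_preset]
    return [center_word] + kept


clusters = target_clusters(CENTER, DISPERSION, FIRST_PRESET)
words = target_topic_words("gal sieve", CO_OCCURRENCE, SECOND_PRESET)
print(clusters)  # ['A', 'C', 'D']
print(words)     # ['gal sieve', 'skin']
```

With these assumed thresholds the sketch reproduces the outcome described above: topic cluster B is dropped, and "gal sieve" and "skin" remain as target topic words.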
In another aspect, the present application also provides a text topic detection device. For example, referring to Fig. 8, which shows a schematic structural diagram of a text topic detection device according to an embodiment of the present application, the device of this embodiment can be applied to the server in the above embodiments, and the device includes:
A center topic word determining module 21, for determining the center topic words of network text data;
A dispersion computing module 22, for calculating the topic dispersion of each center topic word across all discussion group text data; wherein the discussion group text data are obtained by dividing the network text data according to topic discussion group;
A topic determining module 23, for determining the text topic according to all center topic words whose topic dispersion is greater than the first preset value.
In this embodiment, after the center topic words of the network text data are determined, the topic dispersion of each center topic word is determined. Topic dispersion indicates how a center topic word is distributed across the network text data of the individual topic discussion groups. A high topic dispersion means that the center topic word is frequently discussed in most of the topic discussion groups corresponding to the network text data; a low topic dispersion means that the center topic word is rarely discussed across the topic discussion groups, or is frequently discussed only in individual topic discussion groups. A center topic word is the central word of a topic discussed in the network text data. The present application screens center topic words based on topic dispersion and determines the text topic from the center topic words whose topic dispersion is higher than the first preset value, which reduces misjudgment of text topics caused by individual users repeatedly posting specific content in large volume in a particular text publishing community. This embodiment can therefore filter out unrelated topic items and improve the accuracy of text topic detection.
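This passage describes topic dispersion qualitatively and does not state a formula. One measure that behaves as described, assumed here purely for illustration, is the normalized Shannon entropy of a word's mention counts across discussion groups: it approaches 1 when the word is discussed evenly in most groups and approaches 0 when the mentions are concentrated in a single group. The function name and the sample counts are hypothetical.

```python
import math


def topic_dispersion(counts):
    """Normalized Shannon entropy of per-group mention counts, in [0, 1].

    This is an assumed stand-in for topic dispersion, not a formula
    taken from the text.
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


# Discussed at a similar rate in all four discussion groups.
spread = topic_dispersion([30, 28, 35, 31])
# Posted repeatedly in one group and almost nowhere else.
concentrated = topic_dispersion([120, 1, 0, 0])
print(round(spread, 3), round(concentrated, 3))
```

Under this assumption, a center topic word flooded into one community by a single user receives a low score and is filtered out by the first preset value, which matches the misjudgment case the paragraph describes.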
Further, the center topic word determining module 21 includes:
A cluster unit, for determining the topic word candidate items in the network text data and performing a clustering operation on the topic word candidate items to obtain topic clusters; wherein all topic word candidate items in the same topic cluster correspond to the same topic discussion group;
A candidate item screening unit, for setting the topic word candidate items in the topic cluster that meet a preset condition as the center topic words.
Further, the topic determining module 23 is specifically configured to set the topic clusters containing the center topic words whose topic dispersion is greater than the first preset value as target topic clusters, and to determine the text topic according to all target topic clusters.
Further, the topic determining module 23 includes:
A topic cluster screening unit, for setting the topic clusters containing the center topic words whose topic dispersion is greater than the first preset value as target topic clusters;
A co-occurrence rate computing unit, for calculating the first co-occurrence rate between the center topic word of each target topic cluster and each non-central topic word;
A first target topic word setting unit, for setting the center topic word and each non-central topic word whose first co-occurrence rate is greater than a second preset value as target topic words;
A text topic determination unit, for determining the text topic according to all target topic words.
Further, the co-occurrence rate computing unit includes:
A text sub-data determining subunit, for determining the text sub-data in the network text data that correspond to the target topic cluster;
A co-occurrence rate setting subunit, for setting the ratio of the number of co-occurrence sentences to the number of all sentences in the text sub-data as the first co-occurrence rate; wherein a co-occurrence sentence is a sentence in the text sub-data that includes both the non-central topic word in question and the center topic word.
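The ratio computed by the co-occurrence rate setting subunit can be sketched as follows. Sentences are assumed to be already tokenized into sets of words (word segmentation is outside the scope of this sketch), and the function name and sample sentences are hypothetical.

```python
def co_occurrence_rate(sentences, center_word, other_word):
    """Share of sentences in the text sub-data that contain both words."""
    if not sentences:
        return 0.0
    both = sum(1 for s in sentences if center_word in s and other_word in s)
    return both / len(sentences)


# Hypothetical text sub-data of one target topic cluster.
sub_data = [
    {"gal sieve", "skin", "new"},
    {"gal sieve", "skin"},
    {"gal sieve", "ranking"},
    {"skin", "price"},
    {"gal sieve", "skin", "ranking"},
]

rate_skin = co_occurrence_rate(sub_data, "gal sieve", "skin")
rate_ranking = co_occurrence_rate(sub_data, "gal sieve", "ranking")
print(rate_skin, rate_ranking)
```

The denominator here is every sentence in the text sub-data, as the subunit is defined; restricting it to sentences that contain the center topic word would be a different, stricter ratio.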
Further, the device also includes:
A new center word setting unit, for setting each non-central topic word whose first co-occurrence rate is less than or equal to the second preset value and whose burst grade is greater than a third preset value as a new center topic word;
A second target topic word setting unit, for calculating the second co-occurrence rate between the new center topic word of each target topic cluster and each non-new-center topic word, and setting the new center topic word and each non-new-center topic word whose second co-occurrence rate is greater than the second preset value as target topic words.
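The two units above can be sketched as a second screening round. The conditions (first co-occurrence rate at or below the second preset value, burst grade above the third preset value) follow the text; the function names, burst grades, preset values, and second-round rates are hypothetical.

```python
def pick_new_centers(first_rates, burst_grades, second_preset, third_preset):
    """Non-central words that rarely co-occur with the center yet burst strongly."""
    return [
        word
        for word, rate in first_rates.items()
        if rate <= second_preset and burst_grades[word] > third_preset
    ]


def second_round_targets(new_center, second_rates, second_preset):
    """New center word plus the words that co-occur with it often enough."""
    kept = [w for w, rate in second_rates.items() if rate > second_preset]
    return [new_center] + kept


# First co-occurrence rates with the original center topic word.
first_rates = {"skin": 0.8, "ranking": 0.3, "upper point": 0.2}
# Hypothetical burst grades and preset values.
burst_grades = {"skin": 4.0, "ranking": 3.0, "upper point": 1.0}
SECOND_PRESET, THIRD_PRESET = 0.5, 2.5

new_centers = pick_new_centers(first_rates, burst_grades, SECOND_PRESET, THIRD_PRESET)
# Hypothetical second co-occurrence rates measured against "ranking".
targets = second_round_targets(new_centers[0], {"upper point": 0.1}, SECOND_PRESET)
print(new_centers, targets)  # ['ranking'] ['ranking']
```

A word such as "ranking" is thus not discarded merely because it seldom appears next to the first center topic word; a high burst grade lets it seed a separate topic.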
Further, the cluster unit includes:
A keyword selection subunit, for setting the keywords in the network text data as topic word candidate items; wherein the frequency of occurrence of each keyword is greater than a fourth preset value;
And/or an ordered frequent item selection subunit, for setting the ordered frequent items in the network text data as topic word candidate items; wherein an ordered frequent item is a group of words that appear in a fixed order in a target clause, and the number of occurrences of the target clause in the network text data is greater than a fifth preset value.
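Both selection subunits amount to counting and thresholding, as the sketch below shows. It makes two simplifying assumptions that are not stated in the text: frequency of occurrence is taken as a raw count, and an ordered frequent item is approximated by a contiguous word sequence of fixed length. The function names and sample data are hypothetical.

```python
from collections import Counter


def keyword_candidates(tokenized_sentences, fourth_preset):
    """Words whose occurrence count exceeds the fourth preset value."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    return {word for word, n in counts.items() if n > fourth_preset}


def ordered_frequent_items(tokenized_clauses, fifth_preset, size=2):
    """Contiguous word sequences of length `size` seen more than fifth_preset times."""
    counts = Counter(
        tuple(clause[i:i + size])
        for clause in tokenized_clauses
        for i in range(len(clause) - size + 1)
    )
    return {item for item, n in counts.items() if n > fifth_preset}


corpus = [
    ["new", "skin", "gal sieve"],
    ["new", "skin", "price"],
    ["ranking", "lost"],
    ["new", "skin"],
]
keywords = keyword_candidates(corpus, fourth_preset=2)
phrases = ordered_frequent_items(corpus, fifth_preset=2)
print(sorted(keywords), sorted(phrases))
```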
Further, the cluster unit also includes:
An ordered frequent item screening subunit, for calculating the boundary freedom degree of each ordered frequent item and removing from the topic word candidate items the ordered frequent items whose boundary freedom degree is less than a sixth preset value.
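This section does not restate how the boundary freedom degree is computed. A common choice in phrase extraction, assumed here for illustration only, is the smaller of the left-neighbor and right-neighbor entropies: an item that is always preceded or followed by the same word is probably a fragment of a longer phrase and scores near zero. The sketch treats an ordered frequent item as a contiguous tuple of words; the function names and sample clauses are hypothetical.

```python
import math
from collections import Counter


def _entropy(counter):
    """Shannon entropy of a counter of neighbor words."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counter.values())


def boundary_freedom(item, clauses):
    """Assumed measure: min(left-neighbor entropy, right-neighbor entropy)."""
    size = len(item)
    left, right = Counter(), Counter()
    for tokens in clauses:
        for i in range(len(tokens) - size + 1):
            if tuple(tokens[i:i + size]) == item:
                # Clause boundaries are counted as their own neighbor symbols.
                left[tokens[i - 1] if i > 0 else "<s>"] += 1
                right[tokens[i + size] if i + size < len(tokens) else "</s>"] += 1
    return min(_entropy(left), _entropy(right))


clauses = [
    ["buy", "new", "skin", "now"],
    ["the", "new", "skin", "rocks"],
    ["a", "new", "skin", "soon"],
    ["want", "new", "skin", "now"],
]
free_item = boundary_freedom(("new", "skin"), clauses)
bound_item = boundary_freedom(("skin", "now"), clauses)
print(round(free_item, 3), bound_item)
```

Here ("new", "skin") has varied neighbors on both sides and would survive a sixth preset value of, say, 0.5, while ("skin", "now") is always preceded by "new" and would be rejected.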
Further, the candidate item screening unit includes:
A burst grade computing subunit, for calculating the burst grade of each topic word candidate item according to the word frequency information and fluctuation index of the topic word candidate item;
A center topic word setting subunit, for setting the topic word candidate item with the highest burst grade in the topic cluster as the center topic word.
Further, the candidate item screening unit also includes:
A topic word candidate item screening subunit, for removing the topic word candidate items in the topic cluster whose burst grade is lower than a predetermined level.
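The selection logic of these subunits can be sketched as below. The text says only that the burst grade is computed from word frequency information and a fluctuation index; the product used in `burst_score` is a hypothetical placeholder, and the function names, sample values, and predetermined level are likewise assumptions.

```python
def burst_score(frequency, fluctuation):
    # Hypothetical combination; the text only states that both inputs are used.
    return frequency * fluctuation


def screen_cluster(cluster, predetermined_level):
    """Drop low-burst candidates, then pick the highest-burst one as center.

    cluster maps each candidate word to a (frequency, fluctuation) pair.
    Returns (center_word, kept_words ordered by descending burst score).
    """
    scores = {w: burst_score(f, v) for w, (f, v) in cluster.items()}
    kept = {w: s for w, s in scores.items() if s >= predetermined_level}
    if not kept:
        return None, []
    ordered = sorted(kept, key=kept.get, reverse=True)
    return ordered[0], ordered


# Hypothetical statistics for the candidates of one topic cluster.
cluster_a = {"gal sieve": (120, 3.0), "skin": (90, 2.5), "ranking": (200, 0.4)}
center, kept_words = screen_cluster(cluster_a, predetermined_level=100)
print(center, kept_words)  # gal sieve ['gal sieve', 'skin']
```

In this sample "ranking" is the most frequent word but barely fluctuates, so it falls below the predetermined level, which illustrates why frequency alone is not used to pick the center topic word.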
Further, the device also includes:
A text determining module, for determining the target sentence texts in which the text topic appears;
An uploading module, for scoring all target sentence texts according to text similarity and uploading the N target sentence texts with the highest scores.
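The scoring step of the uploading module can be sketched as a ranking by similarity. The text does not specify the similarity measure or what each sentence is compared against; this sketch assumes Jaccard similarity over word sets and scores each sentence by its mean similarity to the other target sentences, so that the most representative sentences rank first. All names and sample sentences are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_n_sentences(sentences, n):
    """Rank tokenized sentences by mean similarity to the others; return the top n."""
    sets = [set(s) for s in sentences]

    def score(i):
        others = [jaccard(sets[i], sets[j]) for j in range(len(sets)) if j != i]
        return sum(others) / len(others) if others else 0.0

    ranked = sorted(range(len(sets)), key=score, reverse=True)
    return [sentences[i] for i in ranked[:n]]


target_sentences = [
    ["gal sieve", "skin", "new"],
    ["gal sieve", "skin"],
    ["gal sieve", "skin", "price", "high"],
    ["ranking", "lost"],
]
best = top_n_sentences(target_sentences, 2)
print(best)
```

With this sample input the short sentence shared by most posts ranks first and the unrelated sentence about ranking ranks last.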
In another aspect, the present application also provides a server. For example, referring to Fig. 9, which shows a schematic structural diagram of a server according to an embodiment of the present application, the server 2100 of this embodiment may include a processor 2101 and a memory 2102.
Optionally, the server may also include a communication interface 2103, an input unit 2104, a display 2105 and a communication bus 2106.
The processor 2101, memory 2102, communication interface 2103, input unit 2104 and display 2105 communicate with one another via the communication bus 2106.
In the embodiments of the present application, the processor 2101 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device.
The processor can call the program stored in the memory 2102. Specifically, the processor can execute the operations performed on the server side in the embodiments of the text topic detection method.
The memory 2102 is used to store one or more programs. A program may include program code, and the program code includes computer operation instructions. In the embodiments of the present application, the memory stores at least a program for implementing the following functions:
Determining the center topic words of network text data;
Calculating the topic dispersion of each center topic word across all discussion group text data; wherein the discussion group text data are obtained by dividing the network text data according to topic discussion group;
Determining the text topic according to all center topic words whose topic dispersion is greater than the first preset value.
In one possible implementation, the memory 2102 may include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function (such as the topic detection function); the data storage area can store the data created during use of the computer.
In addition, the memory 2102 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device or other volatile solid-state storage device.
The communication interface 2103 may be an interface of a communication module, such as an interface of a GSM module.
The server of the present application may also include the display 2105, the input unit 2104 and so on.
Of course, the structure of the server shown in Fig. 9 does not limit the server in the embodiments of the present application. In practical applications, the server may include more or fewer components than shown in Fig. 9, or may combine certain components.
In another aspect, the embodiments of the present application also provide a storage medium in which a computer program is stored; when the computer program is loaded and executed by a processor, it implements the text topic detection method described in any one of the above embodiments.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, see the description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A text topic detection method, characterized by comprising:
Determining the center topic words of network text data;
Calculating the topic dispersion of each center topic word across all discussion group text data; wherein the discussion group text data are obtained by dividing the network text data according to topic discussion group;
Determining the text topic according to target center topic words; wherein a target center topic word is a center topic word whose topic dispersion is greater than a first preset value.
2. The text topic detection method according to claim 1, characterized in that determining the center topic words of the network text data comprises:
Determining the topic word candidate items in the network text data, and performing a clustering operation on the topic word candidate items to obtain topic clusters; wherein all topic word candidate items in the same topic cluster correspond to the same topic discussion group;
Setting the topic word candidate items in the topic cluster that meet a preset condition as the center topic words.
3. The text topic detection method according to claim 2, characterized in that determining the text topic according to the target center topic words comprises:
Setting the topic clusters containing the center topic words whose topic dispersion is greater than the first preset value as target topic clusters, and determining the text topic according to all target topic clusters.
4. The text topic detection method according to claim 3, characterized in that determining the text topic according to all target topic clusters comprises:
Calculating the first co-occurrence rate between the center topic word of each target topic cluster and each non-central topic word;
Setting the center topic word and each non-central topic word whose first co-occurrence rate is greater than a second preset value as target topic words;
Determining the text topic according to all target topic words.
5. The text topic detection method according to claim 4, characterized in that calculating the first co-occurrence rate between the center topic word of each target topic cluster and each non-central topic word comprises:
Determining the text sub-data in the network text data that correspond to the target topic cluster;
Setting the ratio of the number of co-occurrence sentences to the number of all sentences in the text sub-data as the first co-occurrence rate; wherein a co-occurrence sentence is a sentence in the text sub-data that includes both the non-central topic word in question and the center topic word.
6. The text topic detection method according to claim 4, characterized in that, before determining the text topic according to all target topic words, the method further comprises:
Setting each non-central topic word whose first co-occurrence rate is less than or equal to the second preset value and whose burst grade is greater than a third preset value as a new center topic word;
Calculating the second co-occurrence rate between the new center topic word of each target topic cluster and each non-new-center topic word, and setting the new center topic word and each non-new-center topic word whose second co-occurrence rate is greater than the second preset value as target topic words.
7. The text topic detection method according to claim 2, characterized in that determining the topic word candidate items in the network text data comprises:
Setting the keywords in the network text data as topic word candidate items; wherein the frequency of occurrence of each keyword is greater than a fourth preset value;
And/or setting the ordered frequent items in the network text data as topic word candidate items; wherein an ordered frequent item is a group of words that appear in a fixed order in a target clause, and the number of occurrences of the target clause in the network text data is greater than a fifth preset value.
8. The text topic detection method according to claim 7, characterized in that, before performing the clustering operation on the topic word candidate items to obtain topic clusters, the method further comprises:
Calculating the boundary freedom degree of each ordered frequent item, and removing from the topic word candidate items the ordered frequent items whose boundary freedom degree is less than a sixth preset value.
9. The text topic detection method according to any one of claims 2 to 8, characterized in that setting the topic word candidate items in the topic cluster that meet the preset condition as the center topic words comprises:
Calculating the burst grade of each topic word candidate item according to the word frequency information and fluctuation index of the topic word candidate item;
Setting the topic word candidate item with the highest burst grade in the topic cluster as the center topic word.
10. The text topic detection method according to claim 1, characterized in that, after determining the text topic according to all target topic clusters, the method further comprises:
Determining the target sentence texts in which the text topic appears;
Scoring all target sentence texts according to text similarity, and uploading the N target sentence texts with the highest scores.
11. A text topic detection device, characterized by comprising:
A center topic word determining module, for determining the center topic words of network text data;
A dispersion computing module, for calculating the topic dispersion of each center topic word across all discussion group text data; wherein the discussion group text data are obtained by dividing the network text data according to topic discussion group;
A topic determining module, for determining the text topic according to target center topic words; wherein a target center topic word is a center topic word whose topic dispersion is greater than a first preset value.
12. A server, characterized by comprising:
A processor and a memory;
Wherein the processor is used to execute the program stored in the memory;
The memory is used to store a program, and the program is at least used for:
Determining the center topic words of network text data;
Calculating the topic dispersion of each center topic word across all discussion group text data; wherein the discussion group text data are obtained by dividing the network text data according to topic discussion group;
Determining the text topic according to target center topic words; wherein a target center topic word is a center topic word whose topic dispersion is greater than a first preset value.
13. A storage medium, characterized in that computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are loaded and executed by a processor, the steps of the text topic detection method according to any one of claims 1 to 10 are implemented.
CN201910549752.6A 2019-06-24 2019-06-24 Text topic detection method, device, server and storage medium Active CN110245355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910549752.6A CN110245355B (en) 2019-06-24 2019-06-24 Text topic detection method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN110245355A true CN110245355A (en) 2019-09-17
CN110245355B CN110245355B (en) 2024-02-13

Family

ID=67889064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910549752.6A Active CN110245355B (en) 2019-06-24 2019-06-24 Text topic detection method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN110245355B (en)

Also Published As

Publication number Publication date
CN110245355B (en) 2024-02-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant