CN107122494A

CN107122494A - Topic model construction method based on community discovery

Info

Publication number: CN107122494A
Application number: CN201710361414.0A
Authority: CN
Inventors: 张雷; 赵鑫; 宋岳; 李宁
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2017-05-22
Filing date: 2017-05-22
Publication date: 2017-09-01
Anticipated expiration: 2037-05-22
Also published as: CN107122494B

Abstract

The present invention discloses a technical scheme for a topic model construction method based on community discovery, comprising the following steps in order: extracting the relational network implied in short text data; dividing the relational network into multiple communities using a community discovery algorithm; expanding the short texts extracted from each community to obtain a long document with word co-occurrence relations, and assembling the resulting long documents into a long-document set; and performing topic mining on the long-document set to obtain the community-discovery-based TMCD topic model. Starting from the intrinsic community relations contained in the data, the method performs self-expansion of short texts based on a community discovery algorithm and thereby addresses the data sparsity problem.

Description

Topic model construction method based on community discovery

Technical field

The present invention relates to a topic model construction method based on community discovery, and more particularly to a topic mining technique for social short text data that contains an internal social network.

Background technology

In today's network environment, the proliferation of online platforms has generated large volumes of social data, and social networks have become a data source for information mining. Most of the data produced in this setting takes the form of short text. Compared with long text, short text is semantically terse and conveys information quickly, and it represents a clear trend in information dissemination. Short text has become one of the most important information carriers in today's society.

Among current methods for analyzing such data, mining the semantic content of text with a topic model is an effective approach. Classical topic model algorithms such as PLSA and LDA perform semantic analysis of text mainly on the basis of two-mode data and word co-occurrence relations. Such algorithms work well on long documents, but on short texts the word co-occurrence relations are insufficient, so the algorithms face a data sparsity problem that seriously degrades model quality.

At present, academia mainly offers the following five approaches to topic modeling for such short texts: 1) simple concatenation, in which short texts are joined directly; 2) aggregating short texts into long texts by introducing an external database; 3) heuristic methods, for example extending the text based on the hashtag information of tweet content, the timeline information of when content was sent, or the author who sent the content; 4) a relaxed assumption about the topics of a text, namely that one short text contains only one topic; 5) changing the modeling object, a representative example being the BTM model proposed by Yan et al. in 2013.

The above approaches either forcibly erase document boundaries or suffer interference from external data, and they have many other shortcomings.

Summary of the invention

The present invention proposes a construction method for a topic model based on community discovery (the TMCD model, Topic Model based on Community Detection). The method builds a topic model for social datasets; that is, it uses a community discovery algorithm to provide a solution for topic mining on social short text data. Starting from the intrinsic community relations contained in the data, the TMCD model performs self-expansion of short texts based on a community discovery algorithm and thereby addresses the data sparsity problem.

To solve the above problems, the technical scheme of the community-discovery-based topic model construction method disclosed in the present invention comprises the following steps:

Step 1, extracting the relational network implied in short text data;

Step 2, dividing the relational network into multiple communities using a community discovery algorithm;

Step 3, expanding the short texts extracted from each community to obtain a long document with word co-occurrence relations, and assembling the resulting long documents into a long-document set;

Step 4, performing topic mining on the long-document set to obtain the community-discovery-based TMCD topic model.

Further, the relational network is extracted in step 1 as follows: the subjects in the short text data are taken as nodes, the associations formed by interactions between subjects are abstracted as edges, and the resulting nodes and edges together form a relational network.

Further, the closeness of the interaction between subjects is used as the edge weight, and the active/passive relation of the association is used as the edge direction.

Further, the community discovery algorithm in step 2 includes one or more of agglomerative, divisive, label propagation, and global search approaches.

Further, in step 3 the short texts are expanded using a self-expansion method.

Further, the short text data is social data that contains an internal social network, and the relational network is a social network.

The topic model construction method based on community discovery disclosed in the present invention provides a new solution for topic mining on social short text data and has the following advantages:

(1) The method mines the community network associations contained in the data and uses them as the basis for grouping texts. On this basis it expands the short texts and thereby addresses the data sparsity problem in short-text topic mining, providing a solution for building topic models on such social short text datasets.

(2) By using a self-expansion method based on content similarity and introducing no external auxiliary data, the method addresses two problems of existing short-text topic modeling solutions: the forcible erasure of document boundaries caused by simply concatenating all texts regardless of whether they are related in content, and the external noise introduced by auxiliary corpora. It also avoids, at the root, the effect of insufficient word co-occurrence relations on the topic model.

Brief description of the drawings

Fig. 1 is a schematic diagram of the relations among documents, topics, and words in a topic model.

Fig. 2 is a schematic diagram of a social network.

Fig. 3 is a flow chart of the topic model construction method for social datasets in the embodiment.

Fig. 4 is a flow chart of the short text expansion part of Fig. 3.

Embodiment

For a better understanding of the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.

Fig. 1 shows the relations among documents, topics, and words in a topic model. Once the concept of a "topic" is introduced, the topic serves as a "bridge" connecting documents and words: by observing the probability distribution between documents and topics and the probability distribution between topics and words, the distribution of topics can be obtained through the relevant mathematical models. When the relation between topics and words is estimated, the number of word co-occurrence relations affects the accuracy of the observations, and this accuracy in turn affects the quality of the final topic model. Long texts offer enough word co-occurrence relations to support the observations, whereas short texts lack them, which gives rise to the data sparsity problem. The TMCD model construction method proposed by the present invention is designed to address this problem.
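The sparsity argument can be made concrete with a small count. The following is a minimal sketch, not part of the disclosed method: the helper `cooccurrence_pairs` and the sample texts are hypothetical, and "co-occurrence" is assumed here to mean two distinct words appearing in the same document.

```python
from itertools import combinations


def cooccurrence_pairs(documents):
    """Return the set of distinct unordered word pairs that share a document."""
    pairs = set()
    for doc in documents:
        words = sorted(set(doc.split()))  # distinct words, in a stable order
        pairs.update(combinations(words, 2))
    return pairs


# Three very short texts (hypothetical sample data).
short_texts = ["match tonight", "tonight tickets", "tickets stadium"]

# Treated separately, each text contributes a single word pair.
separate = cooccurrence_pairs(short_texts)

# Treated as one merged long document, every pair of its words co-occurs.
merged = cooccurrence_pairs([" ".join(short_texts)])

print(len(separate), len(merged))  # prints "3 6"
```

The merged document exposes twice as many word pairs as the three short texts do on their own, which is the effect the expansion step relies on.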

As shown in Fig. 2 TMCD models are directed to koinotropic type's data in embodiment, by the key body concentrated to data (i.e. The object of data, generally contact person are produced in data set) carried out with intersubjective associate (route of transmission for producing data) The topic model obtained after abstract, can show an obvious community network.Here abstract refers to data to be concentrated with reality The abstract node and side in community network such as the relation between the contact person and contact person of border meaning.Wherein, abstract data set In agent object be node, to contact artificial node in such as social data；Side is associated as between abstract subject, with the close of association Degree is as the weight on side, and contact person mutually sends out message as side in such as social data, sends out the bar number of message as weight, is disappeared with sending out The passive relation of master of breath as side direction.One key character of obtained community network is exactly to contain community structure, and Community structure refers to that community network acts on the data that can be divided into some corporations, and same corporations by some algorithms and had Similitude.In the result of division, the node relation inside corporations is more close, and contact is close, and the node contact between corporations Than sparse.

Fig. 3 is a flow chart of a topic model construction method for social datasets in the embodiment. The method builds the model based on community discovery and comprises the following steps:

Step 1: Extract the implied social network from the propagation relations of data between subjects within the social data. Here, social data includes any dataset that contains an internal social network, for example: datasets of messages generated in real time by contacts in instant messaging services such as QQ and WeChat, and datasets produced through reposts and comments on online social platforms such as Weibo and Zhihu. The specific extraction process is as follows:

1) The subjects in the data to be abstracted (i.e. the social data) are taken as nodes, where the subjects include any object that can serve as a node when building the social network, such as a person, thing, or event;

2) The associations formed by interactions between subjects are abstracted as edges, where the interactions include any relation that can form a valid association between two subjects, for example: message transmission associates contacts in instant messaging, and reposting, commenting, and sharing associate subjects on online social platforms;

3) The nodes and edges abstracted in the above steps form a clear social network.
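The extraction in step 1 can be sketched as follows for message-style data. This is an illustrative sketch under stated assumptions rather than the patented implementation: the record layout `(sender, receiver, text)` and the function name `build_social_network` are hypothetical, the message count stands in for the closeness weight, and the sender-to-receiver order stands in for the edge direction.

```python
from collections import defaultdict


def build_social_network(messages):
    """Abstract (sender, receiver, text) records into nodes and directed edges.

    Returns (nodes, edges, texts) where
      nodes is the set of subjects,
      edges maps (sender, receiver) to the number of messages sent,
      texts maps each sender to the short texts that sender produced.
    """
    nodes = set()
    edges = defaultdict(int)
    texts = defaultdict(list)
    for sender, receiver, text in messages:
        nodes.update((sender, receiver))      # subjects become nodes
        edges[(sender, receiver)] += 1        # message count as edge weight
        texts[sender].append(text)            # keep the short text with its author
    return nodes, dict(edges), dict(texts)


# Hypothetical sample data.
messages = [
    ("alice", "bob", "lunch at noon?"),
    ("bob", "alice", "sure"),
    ("alice", "bob", "see you there"),
    ("carol", "dave", "build failed"),
]
nodes, edges, texts = build_social_network(messages)
print(edges[("alice", "bob")])  # prints 2
```

Keeping the short texts keyed by node at this stage makes the later per-community extraction a simple lookup.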

Step 2: Divide the social network into multiple communities using a community discovery algorithm. Community discovery algorithms include any algorithm that can effectively partition a social network into communities, including but not limited to algorithms based on agglomerative processes, divisive processes, label propagation, and global search (including spectral analysis). These are the design ideas behind most community discovery algorithms and cover nearly all algorithms that can produce an effective partition.
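Of the algorithm families named in step 2, label propagation is the simplest to sketch. The version below is one possible illustration, not the algorithm prescribed by the invention: the function name `label_propagation` is hypothetical, edge direction is ignored for simplicity, nodes are visited in sorted order, and ties are broken by keeping the current label or else taking the smallest label so that the result is deterministic.

```python
from collections import defaultdict


def label_propagation(edges, max_iter=100):
    """Partition a weighted graph into communities by label propagation.

    edges maps (u, v) to a weight; direction is ignored in this sketch.
    Returns a sorted list of communities, each a sorted list of nodes.
    """
    neighbors = defaultdict(lambda: defaultdict(float))
    for (u, v), weight in edges.items():
        neighbors[u][v] += weight
        neighbors[v][u] += weight

    labels = {node: node for node in neighbors}  # every node starts alone
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbors):
            # Total edge weight towards each label among the neighbours.
            scores = defaultdict(float)
            for other, weight in neighbors[node].items():
                scores[labels[other]] += weight
            best = max(scores.values())
            candidates = [lab for lab, s in scores.items() if s == best]
            if labels[node] not in candidates:
                labels[node] = min(candidates)  # deterministic tie-break
                changed = True
        if not changed:
            break

    groups = defaultdict(list)
    for node, lab in labels.items():
        groups[lab].append(node)
    return sorted(sorted(members) for members in groups.values())


# Two tightly knit triangles joined by one weak edge (hypothetical data).
sample_edges = {
    ("a", "b"): 3, ("b", "c"): 3, ("a", "c"): 3,
    ("d", "e"): 3, ("e", "f"): 3, ("d", "f"): 3,
    ("c", "d"): 1,
}
communities = label_propagation(sample_edges)
print(communities)  # prints [['a', 'b', 'c'], ['d', 'e', 'f']]
```

Any of the other families (agglomerative, divisive, or spectral) could be substituted here, provided it returns the same kind of node partition.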

Step 3: Expand the short texts contained in each community according to the community division result. The expansion method mainly comprises the following sub-steps:

1) Extract the short text data corresponding to the nodes contained in each resulting community;

2) Expand the short texts into a long document using a conventional self-expansion method;

3) Through the above steps, self-expanding data that share textual similarity within the social network yields several long documents (as many as there are communities), each containing rich word co-occurrence relations; the long documents obtained from all communities constitute a long-document set.

It is worth noting that the expansion relies on the dataset itself and introduces no external auxiliary data. Direct concatenation serves as the example here: the extracted short texts are joined directly. This expansion method does not consider whether the texts themselves are similar; concretely, in this scenario all texts corresponding to the nodes in the same community are expanded into a single long document.
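The direct-concatenation variant of step 3 can be sketched as below. The names `expand_by_community`, `communities`, and `texts_by_node` are hypothetical, and joining with a single space is an assumption made for illustration.

```python
def expand_by_community(communities, texts_by_node):
    """Concatenate the short texts of each community into one long document.

    communities is a list of node lists; texts_by_node maps a node to the
    short texts it produced. Nodes without texts contribute nothing.
    """
    long_documents = []
    for members in communities:
        pieces = []
        for node in members:
            pieces.extend(texts_by_node.get(node, []))
        long_documents.append(" ".join(pieces))  # one long document per community
    return long_documents


# Hypothetical sample data.
communities = [["alice", "bob"], ["carol", "dave"]]
texts_by_node = {
    "alice": ["lunch at noon?", "see you there"],
    "bob": ["sure"],
    "carol": ["build failed"],
}
long_documents = expand_by_community(communities, texts_by_node)
print(long_documents)
# prints ['lunch at noon? see you there sure', 'build failed']
```

The output contains exactly one long document per community, so the size of the long-document set equals the number of communities found in step 2.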

Step 4: Perform topic modeling on the long-document set to obtain the TMCD model. A conventional topic model construction method (for example LDA or probabilistic latent semantic analysis, PLSA) is used: the rich word co-occurrence relations in the documents yield the word-topic observations, which are combined with the observed document-topic results, and a suitable mathematical method (for example Gibbs sampling) completes the topic analysis and mining process to produce the TMCD model for the social dataset. The TMCD model outputs, in an intuitive form, the topics contained in the documents and the corresponding keywords. Compared with applying a conventional topic model directly to short texts, the TMCD model adds a community-discovery-based text expansion step, so that the texts contain sufficient word co-occurrence relations and the quality of the topic mining results improves considerably.
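As one way to picture step 4, the following is a compact collapsed Gibbs sampler for LDA written with the standard library only. It is a teaching sketch under stated assumptions, not the implementation used in the embodiment: the function name `lda_gibbs`, the whitespace tokenization, the hyperparameter defaults `alpha` and `beta`, the iteration count, and the sample documents are all hypothetical.

```python
import random


def lda_gibbs(documents, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (vocab, doc_topic, topic_word), where doc_topic[d][k] counts the
    tokens of document d assigned to topic k and topic_word[k][w] counts the
    assignments of vocabulary word w to topic k.
    """
    rng = random.Random(seed)
    docs = [doc.split() for doc in documents]
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        current = []
        for word in doc:
            z = rng.randrange(num_topics)
            current.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(current)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                z = assignments[d][i]
                # Remove the token from the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return vocab, doc_topic, topic_word


# Hypothetical long documents, one per community.
long_docs = [
    "match tonight tickets stadium match goal tickets",
    "build failed deploy server build logs server",
]
vocab, doc_topic, topic_word = lda_gibbs(long_docs, num_topics=2, iterations=50)
for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda w: -counts[w])[:3]
    print(k, [vocab[w] for w in top])
```

The document-topic counts and topic-word counts correspond to the two observations described above; the top words per topic are the keywords that the model reports. Because sampling is random, the printed keywords depend on the seed and the iteration count.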

Fig. 4 is a flow chart of the short text expansion described in sub-step 2) of step 3 in the embodiment, which comprises the following steps:

S3-1 is the short text extraction operation: according to the community division result of step 2 in Fig. 3, select a community that has not yet been expanded, obtain the nodes it contains, and extract the corresponding short text data from the information of each node;

S3-2 is the short text expansion operation: the short texts extracted in S3-1 are expanded by self-expansion. Direct concatenation, one of the self-expansion approaches, is used as the example here: the extracted short texts are joined directly, so that all texts in the community are expanded into one long document;

S3-3 is the decision step: determine from the community division result whether all short texts have been expanded. If any community has not been expanded, return to S3-1; otherwise proceed to S3-4;

S3-4 returns the expanded long-document set, and the short text expansion step based on the community division result ends.

In summary, the community-discovery-based topic model construction method of the embodiment offers a new approach to topic mining on social datasets. By discovering the community structure contained in a social dataset and, on that basis, self-expanding the short texts into a set of long documents, the method addresses the data sparsity problem faced when topic mining is performed directly on short texts, improves the quality of the topic model, and provides a solution for building topic models on social datasets.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Those of ordinary skill in the art to which the invention pertains may make various modifications and variations without departing from the spirit and scope of the invention. Accordingly, the scope of protection of the present invention is defined by the claims.

Claims

1. A topic model construction method based on community discovery, characterized by comprising the following steps:

Step 1, extracting the relational network implied in short text data;

Step 2, dividing the relational network into multiple communities using a community discovery algorithm;

Step 3, expanding the short texts extracted from each community to obtain a long document with word co-occurrence relations, and assembling the resulting long documents into a long-document set;

Step 4, performing topic mining on the long-document set to obtain the community-discovery-based TMCD topic model.

2. The topic model construction method according to claim 1, characterized in that the relational network is extracted in step 1 as follows: the subjects in the short text data are taken as nodes, the associations formed by interactions between subjects are abstracted as edges, and the resulting nodes and edges together form a relational network.

3. The topic model construction method according to claim 2, characterized in that the closeness of the interaction between subjects is used as the edge weight, and the active/passive relation of the association is used as the edge direction.

4. The topic model construction method according to claim 1, characterized in that the community discovery algorithm in step 2 includes one or more of agglomerative, divisive, label propagation, and global search approaches.

5. The topic model construction method according to claim 1, characterized in that in step 3 the short texts are expanded using a self-expansion method.

6. The topic model construction method according to any one of claims 1 to 5, characterized in that the short text data is social data that contains an internal social network, and the relational network is a social network.