CN107122494A - Topic model construction method based on community discovery - Google Patents

Topic model construction method based on community discovery

Publication number
CN107122494A
Authority
CN
China
Prior art keywords
short text
topic model
construction method
community discovery
community
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710361414.0A
Other languages
Chinese (zh)
Other versions
CN107122494B (en)
Inventor
张雷
赵鑫
宋岳
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201710361414.0A
Publication of CN107122494A
Application granted
Publication of CN107122494B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Primary Health Care (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing a topic model based on community discovery, comprising the following steps in order: extracting the relational network implicit in short text data; dividing the relational network into multiple communities using a community discovery algorithm; expanding the short texts extracted from each community into a long document with word co-occurrence relations, and assembling the resulting long documents into a long-document set; and performing topic mining on the long-document set to obtain the community-discovery-based TMCD topic model. Starting from the community relations inherent in the data, the method self-expands short texts on the basis of a community discovery algorithm and thereby addresses the data sparsity problem.

Description

Topic model construction method based on community discovery
Technical field
The present invention relates to a topic model construction method based on community discovery, and in particular to topic mining techniques for social short text data that contains an implicit social network.
Background art
In today's network environment, the proliferation of online platforms has generated large amounts of social data, and social networks have become a data source for information mining. Most of the data produced in this setting takes the form of short text. Compared with long text, short text expresses meaning tersely and transmits information quickly, and it is a clear trend in information dissemination. Short text has become one of the most important information carriers in today's society.
Among current methods for analyzing such data, mining the semantic content of text with a topic model is an effective approach. Classical topic model algorithms such as PLSA and LDA analyze text semantics mainly on the basis of the bag-of-words representation and word co-occurrence relations. These algorithms work well on long documents, but when applied to short text the word co-occurrence relations are insufficient, so the algorithms face a data sparsity problem that seriously degrades model quality.
Academia currently has five main approaches to topic modeling of short text: 1) simple concatenation, in which short texts are joined together directly; 2) aggregating short texts into long text by introducing an external database; 3) heuristic methods that expand text based on, for example, the hashtag information of tweets, the time-stream information of when content was sent, or the author who sent it; 4) a relaxed assumption about the topics of a text, namely that each short text contains only one topic; 5) changing the object being modeled, of which a representative example is the BTM model proposed by Yan et al. in 2013.
These approaches either forcibly erase document boundaries or suffer interference from external data, and so have many shortcomings.
Summary of the invention
The present invention proposes a method for constructing a topic model based on community discovery (the TMCD model, Topic Model based on Community Detection). The method builds a topic model for social datasets, that is, it uses a community discovery algorithm to provide a solution for topic mining of social short text data. Starting from the community relations inherent in the data, the TMCD model self-expands short texts on the basis of a community discovery algorithm and addresses the data sparsity problem.
To solve the above problem, the technical solution of the disclosed community-discovery-based topic model construction method comprises the following steps:
Step 1, extracting the relational network implicit in short text data;
Step 2, dividing the relational network into multiple communities using a community discovery algorithm;
Step 3, expanding the short texts extracted from each community to obtain a long document with word co-occurrence relations, and assembling the resulting long documents into a long-document set;
Step 4, performing topic mining on the long-document set to obtain the community-discovery-based TMCD topic model.
Further, the relational network is extracted in step 1 as follows: the subjects in the short text data are taken as nodes, the associations formed by interactions between subjects are abstracted as edges, and the resulting nodes and edges together form a relational network.
Further, the closeness of the interaction between subjects is used as the edge weight, and the active/passive relation of the association (which subject initiates and which receives) is used as the edge direction.
Further, the community discovery algorithm in step 2 includes one or more of agglomeration, division, label propagation and global exploration.
Further, in step 3 the short texts are expanded using a self-expansion method.
Further, the short text data is social data that contains an implicit social network, and the relational network is that social network.
The disclosed community-discovery-based topic model construction method provides a new solution for topic mining of social short text data and has the following advantages:
(1) By mining the community network associations contained in the data and using them as the basis for grouping text, the method expands the short texts and thereby addresses the data sparsity problem in short-text topic mining, providing a solution for building topic models on such social short text datasets.
(2) By using a content-similarity-based self-expansion method that introduces no external auxiliary data, the method addresses two problems of existing short-text topic modeling solutions: the forced erasure of document boundaries caused by simply concatenating all texts regardless of whether they are related in content, and the external noise introduced by an external auxiliary corpus. It also fundamentally avoids the effect of insufficient word co-occurrence relations on the topic model.
Brief description of the drawings
Fig. 1 is a schematic diagram of the relationship among documents, topics and words in a topic model.
Fig. 2 is a schematic diagram of a social network.
Fig. 3 is a flowchart of the topic model construction method for social datasets in the embodiment.
Fig. 4 is a flowchart of the short text expansion part of Fig. 3.
Embodiment
To make the technical content of the invention easier to understand, specific embodiments are described below with reference to the accompanying drawings.
Fig. 1 shows the relationship among documents, topics and words in a topic model. Once the concept of a "topic" is introduced, the topic serves as a "bridge" linking documents and words: by observing the probability distribution between documents and topics and the probability distribution between topics and words, the distribution of topics can be obtained through the relevant mathematical model. When the topic-word relation is estimated, the number of word co-occurrence relations affects the accuracy of the observed result, and this accuracy in turn affects the quality of the final topic model. For long text there are enough word co-occurrence relations to support the observation, whereas short text lacks them, which gives rise to the data sparsity problem. The TMCD model construction method proposed by the invention is developed precisely to address this problem.
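The "bridge" role of the topic can be written compactly. The expression below is the standard mixture decomposition used by PLSA and LDA, given as an illustration rather than a formula taken from the patent; the symbols d (document), w (word), z_k (topic) and K (number of topics) are notation assumed for this note.

```latex
P(w \mid d) = \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)
```

Here P(z_k | d) is the document-topic distribution and P(w | z_k) is the topic-word distribution. Both are estimated from the words observed together in each document, so a document of only a few words supplies few observations to constrain them.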
As shown in Fig. 2 TMCD models are directed to koinotropic type's data in embodiment, by the key body concentrated to data (i.e. The object of data, generally contact person are produced in data set) carried out with intersubjective associate (route of transmission for producing data) The topic model obtained after abstract, can show an obvious community network.Here abstract refers to data to be concentrated with reality The abstract node and side in community network such as the relation between the contact person and contact person of border meaning.Wherein, abstract data set In agent object be node, to contact artificial node in such as social data;Side is associated as between abstract subject, with the close of association Degree is as the weight on side, and contact person mutually sends out message as side in such as social data, sends out the bar number of message as weight, is disappeared with sending out The passive relation of master of breath as side direction.One key character of obtained community network is exactly to contain community structure, and Community structure refers to that community network acts on the data that can be divided into some corporations, and same corporations by some algorithms and had Similitude.In the result of division, the node relation inside corporations is more close, and contact is close, and the node contact between corporations Than sparse.
Fig. 3 is a flowchart of a topic model construction method for social datasets in the embodiment. The method builds the model on the basis of community discovery and comprises the following steps:
Step 1: Extract the implicit social network from the subjects in the social data and the propagation of data between subjects. Here social data covers any dataset that contains an implicit social network, for example: datasets of messages generated in real time by contacts in instant messaging tools such as QQ and WeChat, and datasets produced by forwarding and commenting on online social platforms such as Weibo and Zhihu. The extraction proceeds as follows:
1) Abstract the subjects in the data (i.e. the social data) as nodes, where a subject is any object that can serve as a node when building the social network, such as a person, a thing or an event;
2) Abstract the associations formed by interactions between subjects as edges, where an interaction is any relation that forms an effective association between two subjects, for example: sending messages associates contacts in instant messaging, and forwarding, commenting and sharing associate subjects on online social platforms;
3) The nodes and edges obtained in the above steps form a clear social network.
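The extraction in step 1 can be sketched as follows. This is a minimal illustration, assuming the social data arrives as (sender, receiver, text) message records; the record layout, the function name `build_social_graph` and the sample contacts are hypothetical and not taken from the patent. The edge weight is the number of messages sent and the edge direction runs from sender to receiver, following the instant-messaging example in the description.

```python
from collections import defaultdict


def build_social_graph(messages):
    """Build a directed, weighted graph from (sender, receiver, text) records.

    Returns (nodes, edges), where edges maps (sender, receiver) to the
    number of messages sent in that direction.
    """
    nodes = set()
    edges = defaultdict(int)
    for sender, receiver, _text in messages:
        nodes.update((sender, receiver))   # subjects become nodes
        edges[(sender, receiver)] += 1     # message count becomes the weight
    return nodes, dict(edges)


# Hypothetical sample data, for illustration only.
messages = [
    ("alice", "bob", "lunch at noon?"),
    ("bob", "alice", "sure, the usual place"),
    ("alice", "bob", "bring the report"),
    ("carol", "dave", "match tonight"),
]
nodes, edges = build_social_graph(messages)
```

Other measures of closeness, such as counts of forwards or comments, would fit the same structure by changing what is accumulated per edge.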
Step 2: Divide the social network into multiple community structures using a community discovery algorithm. Any algorithm that can effectively partition a social network into communities may be used, including but not limited to algorithms based on agglomerative processes, divisive processes, label propagation and global exploration (including spectral analysis). These are the design ideas behind most community discovery algorithms and cover almost all algorithms capable of an effective partition.
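As one illustration of step 2, the sketch below implements a simple weighted label propagation, which is one of the algorithm families the description names. It is not the patent's prescribed algorithm. Several choices are assumptions made here for a short, deterministic example: edge direction is ignored by summing weights in both directions, nodes are visited in sorted order, and ties are broken by keeping the current label or else taking the smallest candidate. Label propagation in general can give different partitions under different visiting orders.

```python
from collections import defaultdict


def label_propagation(edges, max_iter=100):
    """Partition the nodes of a weighted graph into communities.

    edges maps (u, v) to a weight; direction is ignored.
    Returns a list of sets of nodes.
    """
    adj = defaultdict(lambda: defaultdict(float))
    for (u, v), w in edges.items():
        adj[u][v] += w
        adj[v][u] += w
    labels = {node: node for node in adj}      # every node starts alone
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            # Total edge weight towards each neighbouring label.
            scores = defaultdict(float)
            for nbr, w in adj[node].items():
                scores[labels[nbr]] += w
            best = max(scores.values())
            candidates = [lab for lab, s in scores.items() if s == best]
            if labels[node] not in candidates:
                labels[node] = min(candidates)
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, lab in labels.items():
        groups[lab].add(node)
    return list(groups.values())


# Two tightly connected triangles joined by one weak edge (hypothetical data).
edges = {
    ("a", "b"): 3, ("b", "c"): 3, ("a", "c"): 3,
    ("c", "d"): 1,
    ("d", "e"): 3, ("e", "f"): 3, ("d", "f"): 3,
}
communities = label_propagation(edges)
```

On this small graph the two triangles end up in separate communities, matching the description's point that connections inside a community are dense and connections between communities are sparse.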
Step 3: Expand the short texts contained in each community according to the community partition. The expansion method mainly comprises the following sub-steps:
1) Extract the short text data corresponding to the nodes contained in each community;
2) Expand the short texts into a long document using a conventional expansion method based on self-expansion;
3) The above steps yield several long documents (as many as there are communities), each obtained by self-expanding data with text similarity in the social network and each rich in word co-occurrence relations; the long documents obtained from all communities constitute a long-document set.
It is worth noting that self-expansion works on the dataset's own data and introduces no external auxiliary data. Direct concatenation is used here as an example: the extracted short texts are joined together directly, without considering whether the texts themselves are similar. In this setting, the concrete operation is to merge all texts corresponding to the nodes in one community into a single long document.
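The effect of direct concatenation on word co-occurrence can be checked with a small count. The sketch below is an illustration under stated assumptions: it tokenizes on whitespace (real Chinese text would need a word segmenter), treats two words as co-occurring when they appear in the same document, and uses made-up sample texts. The function name `cooccurrence_pairs` is hypothetical.

```python
from itertools import combinations


def cooccurrence_pairs(documents):
    """Return the set of unordered word pairs that share at least one document."""
    pairs = set()
    for doc in documents:
        words = sorted(set(doc.split()))
        pairs.update(combinations(words, 2))
    return pairs


# Short texts assumed to come from the nodes of one community.
short_texts = ["match tonight", "tonight tickets sold", "tickets for match"]
long_document = " ".join(short_texts)      # direct concatenation

before = cooccurrence_pairs(short_texts)
after = cooccurrence_pairs([long_document])
print(len(before), len(after))
```

Every pair observed in the separate short texts is still present after merging, and pairs that never shared a short text (for example "sold" and "match") become observable, which is the extra evidence the topic model needs.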
Step 4: Perform topic modeling on the long-document set to obtain the TMCD model. A conventional topic model construction method (e.g. LDA or probabilistic latent semantic analysis, PLSA) is used: the word-topic observations are obtained from the rich word co-occurrence relations in the documents and combined with the observed document-topic results, and a suitable mathematical method (e.g. Gibbs sampling) completes the topic analysis and mining to yield the TMCD model for the social dataset. The TMCD model outputs the topics contained in the documents and information such as the corresponding keywords. Compared with applying a conventional topic model directly to short text, the TMCD model adds a community-discovery-based text expansion step, so that the text has enough word co-occurrence relations, which greatly improves the quality of the topic mining results.
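Step 4 names LDA and Gibbs sampling as examples. The following is a minimal collapsed Gibbs sampler for LDA, written as a sketch rather than as the patent's implementation. The hyperparameters `alpha` and `beta`, the iteration count, the fixed seed and the toy documents are all assumptions chosen for illustration, and a production system would more likely use an established library.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word, vocab): two count matrices and the
    vocabulary list that indexes the columns of topic_word.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n highest-count words of each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Hypothetical long documents, one per community.
long_documents = [
    "match tonight tickets sold tickets match goal".split(),
    "report deadline meeting report budget meeting".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(long_documents, num_topics=2, iters=50)
keywords = top_words(topic_word, vocab)
```

The rows of `doc_topic` correspond to the document-topic observations and the rows of `topic_word` to the word-topic observations mentioned above; `keywords` gives the per-topic keyword lists that the model outputs.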
Fig. 4 is a flowchart of the short text expansion described in sub-step 2) of step 3 in the embodiment, which comprises the following steps:
S3-1 is the short text extraction operation: according to the community partition from step 2 of Fig. 3, take the nodes contained in one community that has not yet been expanded, then extract the corresponding short text data from the information of each node;
S3-2 is the short text expansion operation: the short texts extracted in S3-1 are expanded by self-expansion. Direct concatenation is used here as the example of self-expansion, that is, the extracted short texts are joined directly so that all texts in this community are merged into one long document;
S3-3 is the decision step: according to the community partition, determine whether all short texts have been expanded. If any community has not been expanded, return to S3-1; otherwise proceed to S3-4;
S3-4 returns the expanded long-document set, and the short text expansion according to the community partition ends.
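The S3-1 to S3-4 flow maps onto a short loop. The sketch below assumes the communities are given as sets of node names and that a dictionary `texts_by_node` maps each node to its short texts; both structures, the function name and the sample data are hypothetical. Nodes are visited in sorted order only to make the output reproducible.

```python
def expand_short_texts(communities, texts_by_node):
    """Produce one long document per community by direct concatenation."""
    pending = list(communities)
    long_documents = []
    while pending:                          # S3-3: is any community left?
        community = pending.pop(0)          # S3-1: take an unexpanded community
        short_texts = [
            text
            for node in sorted(community)
            for text in texts_by_node.get(node, [])
        ]
        long_documents.append(" ".join(short_texts))   # S3-2: concatenate
    return long_documents                   # S3-4: return the long-document set


# Hypothetical input, for illustration only.
communities = [{"alice", "bob"}, {"carol", "dave"}]
texts_by_node = {
    "alice": ["lunch at noon?", "bring the report"],
    "bob": ["sure, the usual place"],
    "carol": ["match tonight"],
}
long_documents = expand_short_texts(communities, texts_by_node)
```

A node with no recorded text, such as "dave" here, simply contributes nothing to its community's document.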
In summary, the community-discovery-based topic model construction method of the embodiment offers a new approach to topic mining on social datasets. By discovering the community structure implicit in a social dataset and self-expanding the short texts on that basis to form a long-document set, the method addresses the data sparsity problem faced when topic mining is performed directly on short text, greatly improves the quality of the topic model, and provides a solution for topic modeling of social datasets.
Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. A person of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from its spirit and scope. The scope of protection of the invention is therefore defined by the claims.

Claims (6)

1. A topic model construction method based on community discovery, characterized by comprising the following steps:
Step 1, extracting the relational network implicit in short text data;
Step 2, dividing the relational network into multiple communities using a community discovery algorithm;
Step 3, expanding the short texts extracted from each community to obtain a long document with word co-occurrence relations, and assembling the resulting long documents into a long-document set;
Step 4, performing topic mining on the long-document set to obtain the community-discovery-based TMCD topic model.
2. The topic model construction method of claim 1, characterized in that the relational network is extracted in step 1 as follows: the subjects in the short text data are taken as nodes, the associations formed by interactions between subjects are abstracted as edges, and the resulting nodes and edges together form a relational network.
3. The topic model construction method of claim 2, characterized in that the closeness of the interaction between subjects is used as the edge weight, and the active/passive relation of the association is used as the edge direction.
4. The topic model construction method of claim 1, characterized in that the community discovery algorithm in step 2 includes one or more of agglomeration, division, label propagation and global exploration.
5. The topic model construction method of claim 1, characterized in that in step 3 the short texts are expanded using a self-expansion method.
6. The topic model construction method of any one of claims 1 to 5, characterized in that the short text data is social data containing an implicit social network, and the relational network is that social network.
CN201710361414.0A 2017-05-22 2017-05-22 Topic model construction method based on community discovery Active CN107122494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710361414.0A CN107122494B (en) 2017-05-22 2017-05-22 Topic model construction method based on community discovery

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710361414.0A CN107122494B (en) 2017-05-22 2017-05-22 Topic model construction method based on community discovery

Publications (2)

Publication Number Publication Date
CN107122494A (en) 2017-09-01
CN107122494B (en) 2020-06-26

Family

ID=59727788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710361414.0A Active CN107122494B (en) 2017-05-22 2017-05-22 Topic model construction method based on community discovery

Country Status (1)

Country Link
CN (1) CN107122494B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7877407B2 (en) * 1998-10-05 2011-01-25 Smith Iii Julius O Method and apparatus for facilitating use of hypertext links on the world wide web
CN103778207A (en) * 2014-01-15 2014-05-07 杭州电子科技大学 LDA-based news comment topic digging method
EP2751720A1 (en) * 2011-08-31 2014-07-09 Metaswitch Networks Ltd Processing communications data
CN104123336A (en) * 2014-05-21 2014-10-29 深圳北航新兴产业技术研究院 Deep Boltzmann machine model and short text subject classification system and method
CN104391942A (en) * 2014-11-25 2015-03-04 中国科学院自动化研究所 Short text characteristic expanding method based on semantic atlas
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
CN106055604A (en) * 2016-05-25 2016-10-26 南京大学 Short text topic model mining method based on word network to extend characteristics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIN T TIAN: "The dual-sparse topic model: mining focused topics and focused terms in short text", 《PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, INTERNATIONAL WORLD WIDE WEB CONFERENCES STEERING COMMITTEE》 *
熊小兵: "微博网络传播行为中的关键问题研究", 《中国博士学位论文全文数据库 信息科技辑》 *
陈静,刘琰,王煦中: "主题概率模型在微博主题挖掘方面的研究综述", 《信息工程大学学报》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681557A (en) * 2018-04-08 2018-10-19 中国科学院信息工程研究所 Based on the short text motif discovery method and system indicated from expansion with similar two-way constraint
CN110264372A (en) * 2019-05-16 2019-09-20 西安交通大学 A kind of theme Combo discovering method indicated based on node
CN110264372B (en) * 2019-05-16 2022-03-08 西安交通大学 Topic community discovery method based on node representation

Also Published As

Publication number Publication date
CN107122494B (en) 2020-06-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant