CN105868186A - Simple and efficient topic extracting method - Google Patents
- Publication number
- CN105868186A
- Authority
- CN
- China
- Prior art keywords
- topic
- row
- column
- lexical item
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/258—Heading extraction; Automatic titling; Numbering
- G06F40/30—Semantic analysis
- G06F2216/03—Data mining
Abstract
The invention discloses a simple and efficient topic extraction method that increases the speed of topic extraction. The method comprises the following steps: S1) in the word/document phase, each computing node processes the columns/rows of some blocks of a topic matrix, scanning in sequence each column/row assigned to that node and performing an acceptance step and a proposal step on each column/row; S2) judging whether the number of iterations has reached a preset constant; if so, stopping iteration; if not, incrementing the iteration count by 1 and repeating S1 and S2.
Description
Technical field
The present invention relates to the field of data mining, and in particular to a simple and efficient topic extraction method.
Background art

Topic models show clear advantages in mining the semantic information of documents and handling complex document structures. When topic models are used to mine the semantics and structure of large document collections, the main problems to be solved are: the number of documents is huge, so an efficient algorithm is needed; the number of topics to extract and the vocabulary of the data set are large, so storage must be specially optimized to save space; and the algorithm must be simple to implement so that more users can adopt it.

The data to which topic models are applied have grown from small text collections to large-scale social networks, and even to the whole Internet. Traditional single-machine learning methods cannot meet the demands of big data; fast algorithms that can run in distributed computing environments are needed.

In the prior art, algorithms combining the Metropolis-Hastings algorithm, model parallelism, and parameter servers can run in time linear in the size of the data set. However, they require extensive random memory access and cannot make full use of the CPU cache, and they must store a huge topic count matrix.

As the above shows, existing topic extraction algorithms are slow, have high storage complexity, and are complicated to implement.
Summary of the invention
In view of this, embodiments of the present invention provide a simple and efficient topic extraction method that increases the speed of topic extraction.

An embodiment of the present invention proposes a simple and efficient topic extraction method, comprising:

S1. In the word/document phase, each computing node processes the columns/rows of some blocks of a topic matrix, scanning in sequence each column/row assigned to that computing node and performing an acceptance step and a proposal step on each column/row. All terms of the training data are represented as a sparse matrix in which each document is a row and each word is a column; each entry of the sparse matrix stores the current topic of the corresponding term and several proposed topics. The acceptance step and the proposal step are performed alternately in the word phase and the document phase. In the acceptance step, the topic count vector of the column/row is computed from the current topics of the terms in that column/row, the acceptance probabilities of the proposed topics are then computed from that topic count vector, and the current topics of the terms in the column/row are updated. In the proposal step, new proposed topics are generated from the current topics of the terms in the column/row.

S2. Judge whether the number of iterations has reached a preset constant; if so, stop iterating; if not, increment the iteration count by 1 and repeat S1 and S2.

The simple and efficient topic extraction method provided by the embodiments of the present invention processes large-scale data by distributed computation across computing nodes, updating the current topics and generating new proposed topics. It increases the speed of topic extraction, and the algorithm is easy to implement.
Brief description of the drawings

Fig. 1 is a schematic flowchart of one embodiment of the simple and efficient topic extraction method of the present invention.
Detailed description of the invention
To make the purpose, technical solution, and advantages of the embodiments of the present invention clearer, the technical solution is described clearly below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative work fall within the scope of protection of the present invention.
Referring to Fig. 1, this embodiment discloses a simple and efficient topic extraction method, comprising:

S1. In the word/document phase, each computing node processes the columns/rows of some blocks of a topic matrix, scanning in sequence each column/row assigned to that computing node and performing an acceptance step and a proposal step on each column/row. All terms of the training data are represented as a sparse matrix in which each document is a row and each word is a column; each entry of the sparse matrix stores the current topic of the corresponding term and several proposed topics. The acceptance step and the proposal step are performed alternately in the word phase and the document phase. In the acceptance step, the topic count vector of the column/row is computed from the current topics of the terms in that column/row, the acceptance probabilities of the proposed topics are then computed from that topic count vector, and the current topics of the terms in the column/row are updated. In the proposal step, new proposed topics are generated from the current topics of the terms in the column/row.

It should be noted that a topic extraction system comprises several computing nodes connected by a high-speed network, such as InfiniBand.
The training data set comprises a vocabulary of size V. Each item in the vocabulary is called a word; in addition, each individual occurrence of a word is called a term. The training data set comprises D documents, with L_d terms in the d-th document. The n-th term of document d is denoted w_dn, meaning that it is the w_dn-th word of the vocabulary, i.e. w_dn ∈ {1, ..., V}. The difference between a word and a term is that a term is one occurrence of a word at a particular position of a particular document; one word may correspond to many terms, namely the occurrences of that word at different positions of different documents.

Each term has a current topic z_dn and M proposed topics (z'_d1, ..., z'_dM), where M is a preset constant. The set consisting of the current topic and the proposed topics of a term is called the topic vector of that term, y_dn = (z_dn, z'_d1, ..., z'_dM). Current and proposed topics take values in {1, ..., K}, where K is a preset constant giving the number of topics. For each term w_dn, its topic vector is stored at position (d, w_dn) of a matrix Y, i.e. Y_{d,w} = {y_dn | w_dn = w, 1 ≤ n ≤ L_d}. Y is a D × V sparse matrix, called the topic matrix. Observe that row d of the topic matrix contains the topic vectors of all terms of document d, and column w contains the topic vectors of all terms corresponding to word w.
In step S1, the topic matrix is stored in compressed sparse column (CSC) format, i.e. the terms of each word are stored contiguously, column by column; the topic matrix stored in this format is denoted Y_CSC. In addition, pointers to the terms are stored in compressed sparse row (CSR) format, i.e. the pointers to all terms of each document are stored contiguously, row by row; the matrix of pointers into the topic matrix stored in this format is denoted Y_CSR.
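To illustrate the two storage layouts, the following sketch (not taken from the patent; names such as `build_storage` are hypothetical) lays the per-term topic vectors out contiguously by word, CSC-style, and keeps per-document CSR-style pointers into that storage:

```python
def build_storage(docs, V, M):
    """docs: list of documents, each a list of word ids in [0, V).
    Returns (csc_terms, col_ptr, row_ptr) where
      csc_terms[j] = [current_topic, M proposals] of the j-th term in
                     word-major (CSC) order,
      col_ptr[w]   = start of word w's terms in csc_terms (length V+1),
      row_ptr[d]   = CSR-style pointers into csc_terms for document d."""
    # count terms per word so each word's column is contiguous
    counts = [0] * V
    for doc in docs:
        for w in doc:
            counts[w] += 1
    col_ptr = [0] * (V + 1)
    for w in range(V):
        col_ptr[w + 1] = col_ptr[w] + counts[w]
    csc_terms = [None] * col_ptr[V]
    fill = col_ptr[:V]                      # next free slot in each column
    row_ptr = []
    for doc in docs:
        ptrs = []
        for w in doc:
            j = fill[w]
            fill[w] += 1
            csc_terms[j] = [0] * (M + 1)    # current topic + M proposals
            ptrs.append(j)                  # pointer from document into CSC
        row_ptr.append(ptrs)
    return csc_terms, col_ptr, row_ptr
```

The word phase can then scan `csc_terms[col_ptr[w]:col_ptr[w+1]]` cache-friendly, while the document phase follows `row_ptr[d]` into the same storage.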
Before computation, the topic matrix is cut into (number of computing nodes) × (number of computing nodes) blocks. Let the number of computing nodes be P, and denote the block in the i-th block-row and j-th block-column by Y^(i,j).

The computation is divided into a document phase and a word phase. In the document phase, computing node i processes the blocks of the i-th block-row, Y^(i,1), ..., Y^(i,P). Each computing node processes its documents in sequence; processing document d means reading and writing the terms of row d, located through row d of the CSR-stored pointer matrix Y_CSR; the concrete reads and writes are described below. In the word phase, computing node j processes the blocks of the j-th block-column, Y^(1,j), ..., Y^(P,j). Each computing node processes its words in sequence; processing word w means reading and writing column w of the CSC-stored topic matrix Y_CSC; the concrete reads and writes are described below.
For each column of the word phase and each row of the document phase, the concrete reads and writes are divided into an acceptance step and a proposal step. The acceptance step comprises:

S110: Compute the topic count vector of this column/row from the current topics of all its terms. If row d contains L_d terms with current topics z_d1, ..., z_dL_d, the topic count vector of row d is C_d = (C_d1, ..., C_dK), where C_dk is the number of times topic k occurs in the corresponding document, i.e. C_dk = |{n ∈ {1, ..., L_d} | z_dn = k}|, with |·| denoting the cardinality of a set. Similarly, if column w contains L_w terms, the topic count vector of column w is C_w = (C_w1, ..., C_wK), where C_wk is the number of times topic k occurs for the corresponding word, i.e. C_wk = |{(d, n) | d ∈ {1, ..., D}, n ∈ {1, ..., L_d}, z_dn = k, w_dn = w}|.
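The count vectors above are plain histograms of the current topics in a row or column. A minimal sketch (hypothetical name, 0-based topics), usable for both C_d and C_w:

```python
def topic_count_vector(current_topics, K):
    """C_dk (or C_wk): number of terms in a row (or column) whose
    current topic is k, for k = 0..K-1."""
    C = [0] * K
    for z in current_topics:
        C[z] += 1
    return C
```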
S111: Compute the acceptance probabilities from the topic count vector of this column/row and the proposed topics, and update the current topics of the terms of this column/row.

For each term, let the current topic be k_0 and the proposed topics be k'_1, ..., k'_M. Construct a Metropolis-Hastings chain k_1, ..., k_M, where k_i = k'_i with probability π_i and k_i = k_{i-1} otherwise.

In the document phase, with current document d, the acceptance probability π_i is

π_i = min{1, [(C_{d,k'_i} + α_{k'_i})(C_{k_{i-1}} + β̄)] / [(C_{d,k_{i-1}} + α_{k_{i-1}})(C_{k'_i} + β̄)]}.

In the word phase, with current word w of the vocabulary, the acceptance probability π_i is

π_i = min{1, [(C_{w,k'_i} + β)(C_{k_{i-1}} + β̄)] / [(C_{w,k_{i-1}} + β)(C_{k'_i} + β̄)]}.

Here α_1, ..., α_K and β are previously given constants, β̄ = Vβ, and C = (C_1, ..., C_K) is the global topic count vector, C_k = Σ_d C_dk. Finally, the current topic is updated to k_M.
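A sketch of the document-phase acceptance step for a single term follows. The ratio it implements, π_i = min{1, ((C_d[k'] + α[k'])(C[k] + β̄)) / ((C_d[k] + α[k])(C[k'] + β̄))}, is a reconstruction of the standard Metropolis-Hastings acceptance for the stated proposal distributions, not quoted verbatim from the patent; the function and argument names are hypothetical, and topics are 0-based:

```python
import random

def accept_step_doc(term, C_d, C_global, alpha, beta_bar, rng=random.random):
    """Document-phase acceptance step for one term.

    term = [k_0, k'_1, ..., k'_M]: current topic followed by M proposals.
    Walks the Metropolis-Hastings chain k_1..k_M and writes k_M back."""
    k = term[0]
    for kp in term[1:]:
        num = (C_d[kp] + alpha[kp]) * (C_global[k] + beta_bar)
        den = (C_d[k] + alpha[k]) * (C_global[kp] + beta_bar)
        if rng() < min(1.0, num / den):
            k = kp                      # accept the proposal
    term[0] = k                         # update the current topic to k_M
    return k
```

The word-phase step is identical with C_w and β substituted for C_d and α.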
The proposal step comprises:

S120: Generate new proposed topics from the current topics of the terms of this column/row. When the document phase processes document d, topic k is generated with probability proportional to C_dk + α_k; when the word phase processes word w, topic k is generated with probability proportional to C_wk + β.

In this embodiment the new proposed topics may be generated by the alias table method, which is prior art and is not repeated here.
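For illustration only, the proposal distribution of S120 can be sampled with a simple cumulative-sum sampler (hypothetical names; the embodiment would use the prior-art alias table for amortized O(1) draws instead of this O(K) scan):

```python
import bisect
import itertools
import random

def propose_doc(C_d, alpha, M, rng=random):
    """Draw M proposed topics with P(k) proportional to C_d[k] + alpha[k]
    (document phase; 0-based topics). rng needs a .random() method."""
    weights = [c + a for c, a in zip(C_d, alpha)]
    cum = list(itertools.accumulate(weights))       # cumulative weights
    # invert the CDF: r falls in the interval belonging to the drawn topic
    return [bisect.bisect(cum, rng.random() * cum[-1]) for _ in range(M)]
```

The word-phase proposal is the same with weights C_w[k] + β.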
In each acceptance step of the word/document phase, the acceptance probabilities of several proposed topics can be computed at once; in each proposal step, several proposed topics can be proposed at once.
In each word/document phase, each computing node processes one row/column of blocks; after each word/document phase ends, each block on the current computing node is sent to the computing node that needs it in the next document/word phase.

In addition, to support pipelined iterative computation, each of the aforesaid blocks may be further cut into B × B particles, where B is a preset constant. In each word/document phase the columns/rows of each particle are processed in order, and as soon as a particle has been processed it is asynchronously sent to the computing node that needs it in the next document/word phase.
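The block routing implied by the two phases can be stated in two lines; this is a sketch with hypothetical names, assuming 0-based node indices and the row/column ownership described above:

```python
def owner(i, j, phase):
    """Node holding block Y^(i, j) in a phase: node i owns block-row i
    in the document phase; node j owns block-column j in the word phase."""
    return i if phase == "document" else j

def destination(i, j, current_phase):
    """Where to send block Y^(i, j) when current_phase ends: the node
    that owns it in the next (alternating) phase."""
    next_phase = "word" if current_phase == "document" else "document"
    return owner(i, j, next_phase)
```

Only off-diagonal blocks actually move: `destination(i, i, phase) == i` for every phase.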
S2. Judge whether the number of iterations has reached a preset constant; if so, stop iterating; if not, increment the iteration count by 1 and repeat S1 and S2.

The simple and efficient topic extraction method provided by this embodiment processes large-scale data by distributed computation across computing nodes, updating the current topics and generating new proposed topics. It increases the speed of topic extraction, and the algorithm is easy to implement.
Optionally, another embodiment of the simple and efficient topic extraction method of the present invention further comprises:

after each iteration completes, computing the probability of the joint distribution of the terms and the current topics, and judging the convergence of the algorithm from said probability.
In this embodiment of the present invention, the log-probability of the joint distribution of terms and topics is computed as f = f_d + f_k + f_w, where, up to an additive constant,

f_d = Σ_{d=1..D} Σ_{k=1..K} log Γ(C_dk + α_k),
f_k = −Σ_{k=1..K} log Γ(C_k + β̄),
f_w = Σ_{w=1..V} Σ_{k=1..K} log Γ(C_wk + β),

with Γ(·) the Euler gamma function and β̄ = Vβ. f_d can be computed in the document phase, and f_k and f_w in the word phase.

It should be noted that this algorithm does not need to store the topic count matrices composed of C_dk and C_wk; they are computed only when used, which saves storage space.
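A sketch of the computation of f, assuming the standard collapsed-LDA joint terms up to an additive constant (a reconstruction, since the patent's formulas are given only in outline; names are hypothetical):

```python
import math

def log_joint(C_docs, C_words, C_global, alpha, beta, V):
    """f = f_d + f_k + f_w, up to an additive constant.

    C_docs[d][k] = C_dk, C_words[w][k] = C_wk, C_global[k] = C_k."""
    beta_bar = V * beta
    f_d = sum(math.lgamma(c + a) for C_d in C_docs for c, a in zip(C_d, alpha))
    f_k = -sum(math.lgamma(c + beta_bar) for c in C_global)
    f_w = sum(math.lgamma(c + beta) for C_w in C_words for c in C_w)
    return f_d + f_k + f_w
```

In the distributed setting, f_d would be accumulated during the document phase and f_k, f_w during the word phase, then summed across nodes.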
When judging whether the algorithm has converged: if the absolute value of the difference between the joint-distribution probability computed after one iteration and that computed after the previous iteration is less than some small constant, the algorithm has converged; otherwise, it has not.
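The outer S1/S2 loop with this convergence test can be sketched as follows; `sweep` and `log_joint_fn` are hypothetical hooks standing in for one distributed word+document phase and the joint log-probability computation:

```python
def run(max_iters, sweep, log_joint_fn, eps=1e-4):
    """Iterate S1 until |f_t - f_{t-1}| < eps or the preset iteration
    cap is reached. Returns (iterations_run, converged)."""
    f_prev = None
    for it in range(max_iters):
        sweep()                               # one S1 iteration
        f = log_joint_fn()
        if f_prev is not None and abs(f - f_prev) < eps:
            return it + 1, True               # converged
        f_prev = f
    return max_iters, False                   # hit the preset constant
```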
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present invention, and all such modifications and variations fall within the scope defined by the appended claims.
Claims (6)

1. A simple and efficient topic extraction method, characterized by comprising:
S1. In the word/document phase, each computing node processes the columns/rows of some blocks of a topic matrix, scanning in sequence each column/row assigned to that computing node and performing an acceptance step and a proposal step on each column/row, wherein all terms of the training data are represented as a sparse matrix in which each document is a row and each word is a column, each entry of the sparse matrix stores the current topic of the corresponding term and several proposed topics, the acceptance step and the proposal step are performed alternately in the word phase and the document phase, the acceptance step computes the topic count vector of the column/row from the current topics of the terms of that column/row, then computes the acceptance probabilities of the proposed topics from that topic count vector and updates the current topics of the terms of the column/row, and the proposal step generates new proposed topics from the current topics of the terms of the column/row;
S2. Judging whether the number of iterations has reached a preset constant; if so, stopping iteration; if not, incrementing the iteration count by 1 and repeating S1 and S2.

2. The method according to claim 1, characterized in that said S1 comprises: in the word/document phase, each acceptance step computes acceptance probabilities and each proposal step proposes new proposed topics according to the Metropolis-Hastings algorithm, wherein the probability of generating each topic among the new proposed topics is proportional to the sum of the number of times that topic occurs in the corresponding document and the Dirichlet prior of that topic.

3. The method according to claim 2, characterized in that said S1 comprises: in each acceptance step of the word/document phase, computing the acceptance probabilities of several proposed topics at once; in each proposal step, proposing several proposed topics at once.

4. The method according to claim 1, characterized in that the sparse matrix stores the terms contiguously by column and stores the pointers to the terms contiguously by row; in the word phase, the terms are accessed through the terms stored contiguously by column; in the document phase, the terms are accessed through the pointers stored contiguously by row.

5. The method according to claim 1, characterized in that said S1 comprises: cutting the topic matrix into M × M blocks; in each word/document phase, processing each column/row of blocks in order, and after each column/row of blocks has been processed, asynchronously sending the blocks of that column/row to the computing nodes that need them in the next document/word phase, where M is a preset constant.

6. The method according to claim 1, characterized by further comprising: after each iteration completes, computing the probability of the joint distribution of the terms and the current topics, and judging the convergence of the algorithm from said probability.
Publication
- Publication number: CN105868186A
- Publication date: 2016-08-17
- Application number: CN201610382578.7A
- Filing date: 2016-06-01
- Legal status: Pending