CN108090231A - A topic model optimization method based on information entropy - Google Patents

A topic model optimization method based on information entropy

Info

Publication number
CN108090231A
Authority
CN
China
Prior art keywords
theme
topic
lexical item
phrase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810029097.7A
Other languages
Chinese (zh)
Inventor
孙新
申长虹
唐正
姚晶旭
张颖捷
欧阳童
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201810029097.7A priority Critical patent/CN108090231A/en
Publication of CN108090231A publication Critical patent/CN108090231A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a topic model optimization method based on information entropy, belonging to the technical field of text classification. The main technical scheme of the present invention involves constructing a topic model and using the constructed topic model to compute the topic relevance of document content. Specifically, information entropy and mutual information are used to mine, from a topic corpus, feature lexical items that can uniquely characterize a topic; the feature lexical items satisfying threshold conditions serve as the topic dictionary, the topic model is trained, and the topic relevance of document content is computed. The invention is particularly suitable for computing the topic relevance of document content when the topic keyword items are interdependent: guided by statistics such as information entropy, it can mine high-granularity, strongly characteristic topic feature lexical items and effectively aggregate feature lexical items, achieving optimized classification of document content under correlated topics.

Description

A topic model optimization method based on information entropy
Technical field
The present invention relates to a topic model optimization method based on information entropy, and belongs to the technical field of text classification.
Background technology
In recent years, massive amounts of data have brought great convenience, but they also pose enormous challenges for information analysis and retrieval. Against the backdrop of big data, how to quickly obtain the required information from massive data has become an urgent problem.
Data comes in many complex forms. Compared with intuitive forms such as video and audio, text is the most abstract and condensed form of data. In machine learning and natural language processing, it is often necessary to mine the latent semantic relations among lexical items from large volumes of text. Earlier information retrieval websites performed a preliminary semantic analysis of text content through shallow semantic parsing to determine the correlation between documents and queries. However, as society and technology continue to develop, people expect to obtain accurate answers quickly in a question-and-answer manner, and such frequent and efficient interaction requires machines to have a deeper ability to analyze and understand text semantics.
Through the training and inference of a topic model, the topic distribution of text can be obtained, enabling tasks such as text clustering, classification, retrieval, expansion and recommendation, and supporting applications in text mining, sentiment analysis, recommender systems, digital books, public opinion monitoring, data acquisition, social networking sites and personalized retrieval.
Traditional topic representation models mainly include the Boolean model, the vector space model, the probabilistic model and the language model. The Boolean model represents a particular topic by a set of topic keywords, so the relevance between a document and a topic can be judged simply by computing the intersection of keyword sets. Although the Boolean model is easy to implement, it does not consider keyword weights, cannot compute similarity precisely, and its binary output cannot effectively distinguish degrees of topic relevance. The vector space model remedies the Boolean model's assumption that all keywords are equally important: it refines the binary keyword weights and quantifies the different contributions of different keywords to a topic. However, the vector space model does not take the semantic information of lexical items into account and cannot recognize text content whose lexical items differ but whose semantics are related.
Because document semantics are closely related to document topics, methods that model document topics from the perspective of document generation have emerged. The PLSA (Probabilistic Latent Semantic Analysis) topic model models the document generation process from the frequentist point of view: the model parameters are unknown but fixed, so methods such as maximum likelihood estimation can be applied. The Bayesian school, which stands opposed to the frequentist school, regards unknown parameters as random variables that themselves follow corresponding distributions. Adding corresponding prior distributions to the parameters of the PLSA model yields the LDA (Latent Dirichlet Allocation) topic model.
As a complete generative probabilistic topic model, LDA has a three-layer Bayesian network structure consisting of feature words, topics and documents; by modeling a corpus, it mines the latent semantic information contained in the corpus. With the development and application of LDA, extended models based on LDA have gradually been proposed. To better discover the correlations between latent topics, the CTM model replaces the Dirichlet distribution in LDA with a logistic-normal distribution; the PAM model uses a directed acyclic graph to represent the latent semantic information between topics, so as to mine the hierarchical relationships between topics more effectively; the sLDA model adds class labels, making the construction of topic structure and the prediction more accurate. These extended models make full use of LDA's powerful expressive ability for text. Compared with other topic models, LDA introduces probability theory into the model, has a clear hierarchical structure that matches the actual situation of text, and exhibits strong semantic classification characteristics in big data environments; meanwhile, by constructing the topic layer and the feature-word layer with Dirichlet distributions, it can quickly process huge topic corpora and effectively avoid overfitting during training.
Whether a topic model's representation is accurate is an important factor restricting the measurement of text-topic relevance. In practice, however, the texts in a text library can be grouped into sets according to certain structured attributes, and the texts within each set share commonalities; these commonalities are ignored by assumptions such as the document independence assumed by topic models like LDA. Therefore, when document topics share a majority of the same keyword items, how to exploit the advantages of the LDA model and build a new topic model that effectively distinguishes the topic categories to which documents belong is the key problem we study.
Summary of the invention
The object of the present invention is to improve the classification accuracy of correlated topics in text classification. Aiming at document contents whose lexical items differ while their semantics are related, a topic model optimization method based on information entropy is proposed: information entropy and mutual information statistics are used to mine, from a topic corpus, feature lexical items that can uniquely characterize a topic; the feature lexical items satisfying threshold conditions are selected and the topic model is trained; the topic relevance of document content is then computed with this topic model to distinguish the topic category to which a document belongs. The method is particularly suitable for topics whose keyword sets have large intersections or have superior-subordinate inclusion relationships, and can effectively distinguish the topic category to which a document belongs.
A topic model optimization method based on information entropy includes the following steps:
Step 1. Train an LDA topic model to obtain a topic dictionary;
Specifically, the LDA topic model is trained on a topic corpus to obtain the topic dictionary, where the topic corpus is selected by the user as needed. Step 1 includes the following sub-steps:
Step 1.1 Select a document d_m with probability p(d_m), where m ∈ [1, M] and M is the number of documents;
Step 1.2 Generate the topic multinomial distribution θ_m of document d_m from a Dirichlet prior distribution, where α is the parameter of the Dirichlet prior distribution;
Step 1.3 Generate the topic k = z_{m,n} of the n-th word of document d_m according to θ_m, where z_{m,n} denotes the topic of the n-th word in document m;
Step 1.4 Again using a Dirichlet prior distribution, generate the lexical-item multinomial distribution φ_k of topic k = z_{m,n}, where β is the parameter of the Dirichlet prior distribution;
Step 1.5 Generate the topic word w_{m,n} from the lexical-item multinomial distribution φ_k, where w_{m,n} denotes the n-th word in document m;
Step 1.6 For the N_m words in document d_m, repeat steps 1.2 to 1.5 to generate the corresponding topic words;
Step 1.7 Repeat steps 1.1 to 1.6 M-1 more times for the M documents, generating the topic dictionary w_{M,N}, where N is the number of words in the M documents (see the sketch below);
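The generative process of steps 1.1 to 1.7 can be summarized in a short sketch. The snippet below is a minimal illustration rather than the patented implementation; the corpus sizes and the hyperparameters α, β and K are placeholder values chosen only for the example.

```python
import numpy as np

def generate_corpus(M, N_m, V, K, alpha=0.5, beta=0.1, seed=0):
    """Minimal LDA generative process (steps 1.1-1.7): draw a topic distribution
    per document and a term distribution per topic, then emit each word by
    sampling a topic and a term."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(K, alpha), size=M)   # step 1.2: doc-topic distributions
    phi = rng.dirichlet(np.full(V, beta), size=K)      # step 1.4: topic-term distributions
    docs, topics = [], []
    for m in range(M):                                 # step 1.7: every document
        z_m = rng.choice(K, size=N_m, p=theta[m])      # step 1.3: topic of each word
        w_m = np.array([rng.choice(V, p=phi[k]) for k in z_m])  # step 1.5: emit the word
        docs.append(w_m)
        topics.append(z_m)
    return docs, topics, theta, phi

# Example: 5 documents of 20 words over a 50-term vocabulary and 3 topics.
docs, topics, theta, phi = generate_corpus(M=5, N_m=20, V=50, K=3)
```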
Step 2. Using information entropy, mine from the topic corpus the candidate feature lexical items that uniquely characterize each topic;
Specifically, the topic corpus is scanned, and candidate feature lexical items satisfying certain threshold conditions are mined based on left/right information entropy and mutual information statistics;
Step 2 includes two sub-steps, step 2.1 and step 2.2:
Step 2.1 Use left/right information entropy and mutual information statistics to determine the feature lexical items that can uniquely characterize a topic; the computation is given by formula (1):
$$p(\mathrm{phrase}) = \sum_{w_i \in \mathrm{phrase}} p(\mathrm{topic} \mid w_i) \Big/ \sum p(\mathrm{phrase}) \qquad (1)$$
A candidate feature lexical item phrase must satisfy HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, where a_1, a_2 and a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase) respectively;
It should be noted that the left information entropy HL(phrase) and the right information entropy HR(phrase) of a candidate feature lexical item phrase are defined by formulas (2) and (3):
$$HR(\mathrm{phrase}) = -\sum_{y} p(y \mid \mathrm{phrase}) \log p(y \mid \mathrm{phrase}) \qquad (2)$$
$$HL(\mathrm{phrase}) = -\sum_{x} p(x \mid \mathrm{phrase}) \log p(x \mid \mathrm{phrase}) \qquad (3)$$
where string x and string y denote the left-adjacent string and the right-adjacent string of the candidate feature lexical item phrase respectively: p(x | phrase) is the probability that string x is the left-adjacent string of phrase, and p(y | phrase) is the probability that string y is the right-adjacent string of phrase;
Meanwhile, the mutual information of a candidate feature lexical item phrase is computed by formula (4):
$$I(\mathrm{phrase}) = \sum_{y \in \mathrm{phrase}} \sum_{x \in \mathrm{phrase}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (4)$$
where string x and string y denote the left-adjacent and right-adjacent strings of phrase, p(x, y) is the probability that the candidate feature lexical item phrase occurs, and p(x) and p(y) are the probabilities that strings x and y occur individually; it should be noted that the larger the mutual information I(phrase), the stronger the correlation between strings x and y, i.e., the higher the probability that x and y join to form a word;
Meanwhile, w_i ∈ phrase denotes a sub-string w_i (1 ≤ i ≤ 3) composing phrase, and Σ p(phrase) is a normalization coefficient with 0 ≤ p(phrase) ≤ 1; p(topic | w_i) is obtained from the Bayesian formula, as shown in formula (5):
$$p(\mathrm{topic} \mid w_i) = \frac{p(w_i \mid \mathrm{topic})\, p(\mathrm{topic})}{p(w_i)} \qquad (5)$$
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i | topic) is obtained from step 1;
Step 2.2 Sort the candidate feature lexical items by their probability values p(phrase) in descending order and screen those satisfying the threshold conditions; specifically, set different threshold parameters for formulas (1) and (5), and compute and select the top K candidate feature lexical items (a code sketch follows);
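As a concrete illustration of step 2, the sketch below (not part of the original disclosure) computes left/right information entropy and a mutual-information score for candidate phrases from raw n-gram counts and applies the thresholds a1, a2, a3 of step 2.1. It assumes bigram candidates and caller-supplied count tables; the toy counts and threshold values in the usage example are arbitrary.

```python
import math
from collections import Counter

def left_right_entropy(neighbor_counts):
    """Entropy of the left- (or right-) adjacent string distribution of a
    candidate phrase, per formulas (2)/(3)."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())

def mutual_information(p_xy, p_x, p_y):
    """Single-term mutual-information score of the two sub-strings of a bigram
    candidate, a simplified reading of formula (4)."""
    return p_xy * math.log(p_xy / (p_x * p_y))

def candidate_phrases(bigram_counts, unigram_counts, left_ctx, right_ctx,
                      a1=1.0, a2=1.0, a3=0.0):
    """Keep the bigrams whose left entropy, right entropy and mutual information
    all exceed the thresholds a1, a2, a3 (the condition of step 2.1)."""
    n_uni = sum(unigram_counts.values())
    n_bi = sum(bigram_counts.values())
    kept = []
    for (x, y), c_xy in bigram_counts.items():
        hl = left_right_entropy(left_ctx[(x, y)])    # entropy of left neighbours
        hr = left_right_entropy(right_ctx[(x, y)])   # entropy of right neighbours
        mi = mutual_information(c_xy / n_bi,
                                unigram_counts[x] / n_uni,
                                unigram_counts[y] / n_uni)
        if hl > a1 and hr > a2 and mi > a3:
            kept.append((x, y))
    return kept

# Toy usage: one candidate bigram ("topic", "model") with made-up counts.
bigrams = Counter({("topic", "model"): 8})
unigrams = Counter({"topic": 10, "model": 9, "the": 30, "a": 25})
left = {("topic", "model"): Counter({"the": 5, "a": 3})}
right = {("topic", "model"): Counter({"training": 4, "inference": 4})}
print(candidate_phrases(bigrams, unigrams, left, right, a1=0.5, a2=0.5, a3=0.0))
```

The surviving candidates would then be scored with p(phrase) per formula (1), using p(topic | w_i) from the LDA model of step 1, and only the top K would be kept, as described in step 2.2.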
Step 3. Using the candidate feature lexical items as the topic dictionary, train the topic model;
The training of step 3 includes the following sub-steps:
Step 3.1 Random initialization: for each word w_i in every document of the topic corpus, randomly assign a topic number z_i;
Step 3.2 Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update the topic corpus accordingly; the Gibbs sampling computation is given by formula (6):
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \hat{\theta}_{m,k} \cdot \hat{\varphi}_{k,t} = \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K} n_{m,\neg i}^{(k)} + \alpha_k} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} n_{k,\neg i}^{(t)} + \beta_t} \qquad (6)$$
where lexical item w_i = t; z_i denotes the topic variable of the i-th word; ¬i means that the i-th item is excluded; n_k^(t) denotes the number of occurrences of lexical item t in topic k; β_t is the Dirichlet prior of lexical item t; n_m^(k) denotes the number of occurrences of topic k in document m; α_k is the Dirichlet prior of topic k; φ_{k,t} denotes the probability of lexical item t in topic k; θ_{m,k} is the probability of topic k in document m; V is the number of lexical items in the topic corpus, K is the number of topics in the topic dictionary, and M is the number of documents;
Step 3.3 Repeat the sampling process of step 3.2 until Gibbs sampling converges, where convergence means that the probability values obtained by sampling with formula (6) approach the joint distribution of words and topics;
Step 3.4 Count the co-occurrence frequency matrix of document topics and lexical items in the topic corpus to obtain the topic model (a training sketch follows);
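A minimal collapsed Gibbs sampler corresponding to formula (6) and steps 3.1 to 3.4 is sketched below. It is an illustrative implementation under simplifying assumptions (symmetric priors α and β, documents given as lists of term ids over the topic dictionary), not the exact implementation of the invention.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.5, beta=0.1, iters=2000, seed=0):
    """Collapsed Gibbs sampling for LDA (formula (6)).
    docs: list of lists of term ids drawn from a topic dictionary of size V."""
    rng = np.random.default_rng(seed)
    n_mk = np.zeros((len(docs), K))          # topic counts per document
    n_kt = np.zeros((K, V))                  # term counts per topic
    n_k = np.zeros(K)                        # total terms per topic
    z = [rng.integers(K, size=len(d)) for d in docs]   # step 3.1: random init
    for m, d in enumerate(docs):
        for i, t in enumerate(d):
            k = z[m][i]
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    for _ in range(iters):                   # steps 3.2-3.3: resample until convergence
        for m, d in enumerate(docs):
            for i, t in enumerate(d):
                k = z[m][i]                  # remove the current assignment (the "not i" counts)
                n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
                p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[m][i] = k
                n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1
    phi = (n_kt + beta) / (n_k[:, None] + V * beta)      # step 3.4: topic-term matrix
    theta = (n_mk + alpha) / (n_mk.sum(1, keepdims=True) + K * alpha)
    return phi, theta, z
```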
Step 4. Compute the topic relevance of document content, taking the document's probability distribution values over the predetermined topics as its topic relevance values;
Specifically, using the topic model, the topic distribution probability values of a document are taken as its topic relevance values;
Note that the documents in step 4 differ from the documents in the topic corpus of steps 1 to 3: the latter are training documents used to train the topic model, while the former are prediction documents to which the topic model is applied (see the inference sketch below);
Thus, from step 1 to step 4, the topic model optimization method based on information entropy is completed.
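For step 4, the topic relevance of a prediction document is read off its inferred topic distribution. The fold-in sketch below reuses the phi matrix produced by the training sketch of step 3; it is one common way to infer θ for an unseen document and is shown only as an illustration, not as a procedure mandated by the invention.

```python
import numpy as np

def topic_relevance(doc, phi, alpha=0.5, iters=200, seed=0):
    """Infer the topic distribution theta of an unseen document by sampling
    topic assignments against a fixed topic-term matrix phi, and return theta
    as the document's topic relevance values (step 4)."""
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    n_k = np.zeros(K)
    z = rng.integers(K, size=len(doc))
    for i, t in enumerate(doc):
        n_k[z[i]] += 1
    for _ in range(iters):
        for i, t in enumerate(doc):
            n_k[z[i]] -= 1
            p = (n_k + alpha) * (phi[:, t] + 1e-12)  # small floor so unseen terms do not zero out p
            z[i] = rng.choice(K, p=p / p.sum())
            n_k[z[i]] += 1
    theta = (n_k + alpha) / (len(doc) + K * alpha)
    return theta   # e.g. roughly [0.52, 0.31, 0.16] for a technology document, as in embodiment 4

# relevance = topic_relevance(new_doc_term_ids, phi)
```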
Advantageous effects
Compared with existing topic model optimization methods, the topic model optimization method based on information entropy of the present invention has the following advantages:
1. The method is better suited to computing the topic relevance of document content when the topic keyword items are interdependent;
2. The method can mine high-granularity, strongly characteristic topic feature lexical items according to information entropy and mutual information statistics and effectively aggregate feature lexical items, avoiding the defect that low-granularity lexical items are ambiguous across topics and word senses, and achieving optimized classification of document content under correlated topics;
3. The method only needs to compute the information entropy and mutual information statistics of lexical items before training the model; the operation is simple and the running cost of the method is low.
Description of the drawings
Fig. 1 is a schematic flowchart of the topic model optimization method based on information entropy of the present invention and of embodiment 1;
Fig. 2 is a schematic diagram of the LDA probabilistic graphical model in embodiment 2 of the topic model optimization method based on information entropy of the present invention;
Fig. 3 is a schematic diagram of the doc-topic-word path probabilities sampled with the Gibbs sampling formula in embodiment 3 of the topic model optimization method based on information entropy of the present invention.
Specific embodiments
The invention is further described below with reference to the accompanying drawings and specific embodiments.
To make the technical solutions and advantages in the examples of the present application clearer, exemplary embodiments of the application are described in more detail below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the application rather than an exhaustive list of all embodiments. It should be noted that, where no conflict arises, the examples in the application may be combined with each other.
The present invention defines independent topics as topics whose keyword sets differ significantly and have no superior-subordinate inclusion relationship, and correlated topics as topics whose keyword sets have large intersections or have superior-subordinate inclusion relationships. The purpose of defining the two concepts of independent topics and correlated topics is to further subdivide the specific category to which document content belongs.
The applicant analyzed existing technical methods that model document topics from the perspective of document generation. Taking the LDA model as an example, if the model is directly applied to computing the relevance of document content, it is not accurate enough for classifying documents of correlated topics when the topics share a majority of the same keyword items.
An example of the present application provides a topic model optimization method based on information entropy. Information entropy statistics measure the degree of freedom of a word string from the outside, and mutual information statistics measure the cohesion of a word string from the inside, so feature lexical items that uniquely characterize a topic can be determined; a topic dictionary is generated accordingly, the topic model is trained with the Gibbs sampling method, and the topic relevance values of document content are computed. Using statistics such as information entropy, high-granularity, strongly characteristic topic feature lexical items can be mined and effectively aggregated, avoiding the defect that low-granularity lexical items are ambiguous across topics and word senses. It should be noted that high granularity here refers to combinations of multiple words, and low granularity refers to single words.
The scheme in this example of the application can be applied to fields such as text mining, sentiment analysis, recommender systems, digital books, public opinion monitoring, data acquisition, social networking sites and personalized retrieval.
Embodiment 1
Embodiment 1 of the present invention elaborates the topic model optimization method based on information entropy. Fig. 1 is the implementation flowchart of the invention, which includes the following steps:
S1. Train an LDA topic model to obtain a topic dictionary;
For an LDA model with M documents and K topics, the document generation process in the LDA model is as follows:
S1.A For each document d_m, m ∈ [1, M], select document d_m with probability p(d_m), where M is the number of documents;
S1.B For each document d_m, m ∈ [1, M], sample the topic multinomial distribution θ_m of document d_m, where α is the parameter of the Dirichlet prior distribution;
S1.C For each document d_m, m ∈ [1, M], sample the topic k = z_{m,n} of the n-th word of document d_m, where z_{m,n} denotes the topic of the n-th word in document m;
S1.D For each topic k ∈ [1, K], sample the lexical-item multinomial distribution φ_k of topic k, where β is the parameter of the Dirichlet prior distribution;
S1.E For each document d_m, m ∈ [1, M], generate the topic word w_{m,n} from the lexical-item multinomial distribution φ_k, where w_{m,n} denotes the n-th word in document m;
S1.F For the N_m words in document d_m, repeat steps S1.B to S1.E to generate the corresponding topic words;
S1.G Repeat steps S1.A to S1.F M-1 more times for the M documents, generating the topic dictionary w_{M,N}, where N is the number of words in the M documents;
S2. Using information entropy, mine from the topic corpus the candidate feature lexical items that uniquely characterize each topic;
S2.A Use information entropy and mutual information statistics to determine the feature lexical items that can uniquely characterize a topic; the computation of p(phrase) is as given in formula (1) of step 2.1;
A candidate feature lexical item phrase must satisfy HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, where a_1, a_2 and a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase) respectively;
The left information entropy HL(phrase) and the right information entropy HR(phrase) of a candidate feature lexical item phrase are defined as in formulas (2) and (3);
where string x and string y denote the left-adjacent string and the right-adjacent string of the candidate feature lexical item phrase respectively: p(x | phrase) is the probability that string x is the left-adjacent string of phrase, and p(y | phrase) is the probability that string y is the right-adjacent string of phrase;
The mutual information of a candidate feature lexical item phrase is computed as in formula (4);
where p(x, y) is the probability that the candidate feature lexical item phrase occurs, and p(x) and p(y) are the probabilities that strings x and y occur individually. It should be noted that the larger the mutual information I(phrase), the stronger the correlation between strings x and y, i.e., the higher the probability that x and y join to form a word;
Meanwhile, w_i ∈ phrase denotes a sub-string w_i (1 ≤ i ≤ 3) composing phrase, and Σ p(phrase) is a normalization coefficient with 0 ≤ p(phrase) ≤ 1;
Meanwhile, p(topic | w_i) is obtained from the Bayesian formula, as in formula (5);
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i | topic), the joint distribution of words and topics, is obtained from S1;
S2.B Sort the feature lexical items by their probability values p(phrase) in descending order and screen the candidate feature lexical items satisfying the threshold conditions;
Specifically, set different threshold parameters for formulas (1) and (5), and compute and select the top K feature lexical items;
S3. Using the candidate feature lexical items as the topic dictionary, train the topic model;
The specific training steps are:
S3.A Random initialization: for each word w_i in every document of the topic corpus, randomly assign a topic number z_i;
S3.B Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update the topic corpus accordingly; the Gibbs sampling computation is as given in formula (6) of step 3.2;
where lexical item w_i = t; z_i denotes the topic variable of the i-th word; ¬i means that the i-th item is excluded; n_k^(t) denotes the number of occurrences of lexical item t in topic k; β_t is the Dirichlet prior of lexical item t; n_m^(k) denotes the number of occurrences of topic k in document m; α_k is the Dirichlet prior of topic k; φ_{k,t} denotes the probability of lexical item t in topic k; θ_{m,k} is the probability of topic k in document m; V is the total number of lexical items in the topic corpus, K is the total number of topics in the topic dictionary, and M is the number of documents;
S3.C Repeat the above sampling process until Gibbs sampling converges;
S3.D Count the co-occurrence frequency matrix of document topics and lexical items in the topic corpus to obtain the topic model;
S4. Compute the topic relevance of document content, taking the document's probability distribution values over the predetermined topics as its topic relevance values.
Thus, through S1 to S4, the topic model optimization method based on information entropy of this embodiment is completed.
Embodiment 2
This embodiment details the topic dictionary generation process of the LDA topic model described in step 1 of the present invention. As a generative document-topic model, LDA contains a three-layer structure of feature words, topics and documents and is a three-layer Bayesian probabilistic model, as shown in Fig. 2. As can be seen from the probabilistic graphical model of Fig. 2, the generation of the topic dictionary corresponds to two independent Dirichlet-Multinomial conjugate structures.
S2.A Generation process 1: α → θ → z. Corresponding to step 1, this process generates the topic numbers corresponding to all words in document m.
From the conjugate-structure property and the mutual independence between documents, the generation probability of the topics in the entire topic corpus factorizes over the documents (a reconstruction of the formula is given after this embodiment);
where n_m^(k) denotes the number of lexical items generated by the k-th topic in document m, K is the number of topics, and M is the number of documents.
S2.B Generation process 2: β → w. Corresponding to step 1, this process generates all lexical items whose topic number is k in the M documents. Since generating the topic number of a lexical item and generating the lexical item itself are exchangeable processes, the generation of a document can be regarded as first generating the topic numbers of all lexical items in the document and then regenerating the lexical items for all identical topic numbers. From the conjugate-structure property and the mutual independence between documents, the generation probability of the lexical items in the entire corpus likewise factorizes over the topics (see the reconstruction below);
where n_k^(t) denotes the number of occurrences of lexical item t among the lexical items generated by the k-th topic.
S2.C From the two generation probabilities above, the joint distribution of the words of the keyword items and the topics is obtained as their product.
Thus, the topic dictionary generation process of the LDA topic model is completed through steps S2.A to S2.C.
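The generation probabilities referred to in S2.A to S2.C are not reproduced above; in all likelihood they are the standard Dirichlet-Multinomial results that the surrounding text paraphrases. The reconstruction below is offered under that assumption, using the count notation of formula (6).

```latex
% Generation process 1 (alpha -> theta -> z): integrating out theta_m gives
p(\vec{z}\mid\vec{\alpha}) = \prod_{m=1}^{M}\frac{\Delta(\vec{n}_m+\vec{\alpha})}{\Delta(\vec{\alpha})},
\qquad \vec{n}_m=\bigl(n_m^{(1)},\dots,n_m^{(K)}\bigr),
\qquad \Delta(\vec{\alpha})=\frac{\prod_{k=1}^{K}\Gamma(\alpha_k)}{\Gamma\bigl(\sum_{k=1}^{K}\alpha_k\bigr)}.

% Generation process 2 (beta -> w): integrating out phi_k gives
p(\vec{w}\mid\vec{z},\vec{\beta}) = \prod_{k=1}^{K}\frac{\Delta(\vec{n}_k+\vec{\beta})}{\Delta(\vec{\beta})},
\qquad \vec{n}_k=\bigl(n_k^{(1)},\dots,n_k^{(V)}\bigr).

% S2.C: the joint distribution of words and topics is the product of the two
p(\vec{w},\vec{z}\mid\vec{\alpha},\vec{\beta}) = p(\vec{w}\mid\vec{z},\vec{\beta})\; p(\vec{z}\mid\vec{\alpha}).
```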
Embodiment 3
This embodiment details the Gibbs sampling method described in step 3 of the present invention and gives the doc-topic-word path probability schematic diagram shown in Fig. 3. The specific steps are as follows:
S3.A Random initialization: for each word w_i in every document of the topic corpus, randomly assign a topic number z_i;
S3.B Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update the topic corpus accordingly; the Gibbs sampling computation is as given in formula (6) of step 3.2;
where lexical item w_i = t; z_i denotes the topic variable of the i-th word; ¬i means that the i-th item is excluded; n_k^(t) denotes the number of occurrences of lexical item t in topic k; β_t is the Dirichlet prior of lexical item t; n_m^(k) denotes the number of occurrences of topic k in document m; α_k is the Dirichlet prior of topic k; φ_{k,t} denotes the probability of lexical item t in topic k; θ_{m,k} is the probability of topic k in document m; V is the total number of lexical items in the topic corpus, K is the total number of topics in the topic dictionary, and M is the number of documents;
The right-hand side of the formula is p(topic | doc) · p(word | topic), i.e., the path probability of doc → topic → word. Since there are K topics, the physical meaning of the Gibbs sampling formula is to sample among these K paths, as shown in Fig. 3.
S3.C Repeat the above sampling process until Gibbs sampling converges;
S3.D Count the co-occurrence frequency matrix of document topics and lexical items in the topic corpus to obtain the topic model.
Embodiment 4
Based on embodiment 1 above, this embodiment provides a concrete topic model optimization method based on information entropy. The method is implemented on a Toutiao ("today's headlines") news topic corpus and uses the topic model to compute the topic relevance of document content.
S4.A Based on empirical values, the Dirichlet prior distribution parameters are set to α = 0.5 and β = 0.1, the maximum number of Gibbs sampling iterations is 2000, and the number of topics is 70; based on information entropy and mutual information statistics, the corresponding candidate feature lexical items of the topics are computed (a hypothetical driver call is sketched below).
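Tying the concrete settings of S4.A to the training sketch given under step 3, the call below shows how those values (α = 0.5, β = 0.1, 70 topics, at most 2000 Gibbs iterations) would be passed. The corpus variables are tiny placeholders, not data from the embodiment.

```python
# Hypothetical driver for embodiment 4, reusing the gibbs_lda sketch from step 3.
# toutiao_docs and V stand in for the user's preprocessed corpus (term-id lists
# over the mined topic dictionary) and the topic-dictionary size.
toutiao_docs = [[0, 1, 2, 1], [3, 4, 3, 0]]   # placeholder documents
V = 5                                          # placeholder topic-dictionary size

phi, theta, z = gibbs_lda(
    docs=toutiao_docs,
    V=V,
    K=70,         # number of topics used in S4.A
    alpha=0.5,    # Dirichlet prior on document-topic distributions (S4.A)
    beta=0.1,     # Dirichlet prior on topic-term distributions (S4.A)
    iters=2000,   # maximum number of Gibbs sampling iterations (S4.A)
)
```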
Specifically, candidate feature lexical items such as "Group Co., Ltd.", "capital market", "China's economy" and "financial institution" can roughly represent the finance and economics topic; candidate feature lexical items such as "smartphone", "Internet era" and "artificial intelligence technology" can roughly represent the technology topic; candidate feature lexical items such as "US Navy" and "weaponry" can roughly represent the military topic; and candidate feature lexical items such as "Empress Dowager Cixi" and "Chinese history" can roughly represent the history topic.
S4.B Using the candidate feature lexical items as the topic dictionary of the topic model, training is carried out again to obtain the topic model.
Specifically, 3 topics, 9,877 lexical items and 121 documents are chosen for re-training. The feature lexical items of the "data" topic (the 29 highest-weighted lexical items are listed here): 11st, 90, company, double ten, technology, work, platform, ai, more than, China, internet, artificial intelligence, Alibaba, Tmall, product, enterprise, JD.com, development, Ali, future, brand, user, global, 100 million yuan, 2017, 10, number, automation, Didi. The feature lexical items of the "equipment" topic (29 highest-weighted lexical items): dnf, player, version, damage, class, skill, Lu Ke, game, trade council, upgrade, attribute, increase, epic, amplification, true, update, abyss, robot, Korean server, time, sweep the floor, stone, cross-region, clean, 90, strength, correction, Chinese server, weapon. The feature lexical items of the "vehicle" topic (29 highest-weighted lexical items): automobile, design, price, engine, engine oil, configuration, use, standard, Ai Erfa, power, space, buying a car, 4S shop, interior trim, sales, market, feel, BMW, friend, influence, domestic, 310w, car owner, automatic, styling, vehicle, consumer, suv, car purchase.
S4.C Apply the topic model to compute the topic relevance values of document content, taking the document's probability distribution values over the predetermined topics as its topic relevance values.
Specifically, using the topic dictionaries of the "data", "equipment" and "vehicle" topics above, technology, automobile and game corpora are chosen and the topic relevance values of document content in the corresponding corpora are computed. The topic relevance values of document content in the "technology" corpus (the first 3 documents are cited as examples): doc:0, topic:0 (0.520574), topic:1 (0.314514), topic:2 (0.164912); doc:1, topic:0 (0.738012), topic:1 (0.135914), topic:2 (0.126075); doc:2, topic:0 (0.813056), topic:2 (0.122989), topic:1 (0.063955). The topic relevance values of document content in the "automobile" corpus (first 3 documents): doc:0, topic:2 (0.755955), topic:0 (0.143801), topic:1 (0.100244); doc:1, topic:2 (0.736144), topic:0 (0.144676), topic:1 (0.119180); doc:2, topic:2 (0.614256), topic:0 (0.298078), topic:1 (0.087666). The topic relevance values of document content in the "game" corpus (first 3 documents): doc:0, topic:0 (0.395853), topic:1 (0.336999), topic:2 (0.267147); doc:1, topic:1 (0.607892), topic:2 (0.252507), topic:0 (0.139601); doc:2, topic:1 (0.420732), topic:0 (0.314079), topic:2 (0.265189).
The foregoing are only preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (5)

1. A topic model optimization method based on information entropy, characterized in that: information entropy and mutual information statistics are used to mine, from a topic corpus, feature lexical items that can uniquely characterize a topic; the feature lexical items satisfying threshold conditions are selected, the topic model is trained, the topic relevance of document content is computed, and the topic category to which a document belongs is distinguished; the method is particularly suitable for topics whose keyword sets have large intersections or have superior-subordinate inclusion relationships, and can effectively distinguish the topic category to which a document belongs; the method includes the following steps:
Step 1. Train an LDA topic model to obtain a topic dictionary;
Specifically, the LDA topic model is trained on a topic corpus to obtain the topic dictionary;
Step 2. Using information entropy, mine from the topic corpus the candidate feature lexical items that uniquely characterize each topic;
Specifically, the topic corpus is scanned, and candidate feature lexical items satisfying specific threshold conditions are mined based on left/right information entropy and mutual information statistics;
Step 3. Using the candidate feature lexical items as the topic dictionary, train the topic model;
Step 4. Compute the topic relevance of document content, taking the document's probability distribution values over the predetermined topics as its topic relevance values;
Specifically, using the topic model, the topic distribution probability values of the document are taken as its topic relevance values;
Thus, from step 1 to step 4, the topic model optimization method based on information entropy is completed.
2. The topic model optimization method based on information entropy according to claim 1, characterized in that: the topic corpus in step 1 is selected by the user as needed, and step 1 includes the following sub-steps:
Step 1.1 Select a document d_m with probability p(d_m), where m ∈ [1, M] and M is the number of documents;
Step 1.2 Generate the topic multinomial distribution θ_m of document d_m from a Dirichlet prior distribution, where α is the parameter of the Dirichlet prior distribution;
Step 1.3 Generate the topic k = z_{m,n} of the n-th word of document d_m according to θ_m, where z_{m,n} denotes the topic of the n-th word in document m;
Step 1.4 Again using a Dirichlet prior distribution, generate the lexical-item multinomial distribution φ_k of topic k = z_{m,n}, where β is the parameter of the Dirichlet prior distribution;
Step 1.5 Generate the topic word w_{m,n} from the lexical-item multinomial distribution φ_k, where w_{m,n} denotes the n-th word in document m;
Step 1.6 For the N_m words in document d_m, repeat steps 1.2 to 1.5 to generate the corresponding topic words;
Step 1.7 Repeat steps 1.1 to 1.6 M-1 more times for the M documents, generating the topic dictionary w_{M,N}, where N is the number of words in the M documents.
3. The topic model optimization method based on information entropy according to claim 1, characterized in that: step 2 includes the following sub-steps:
Step 2.1 Use information entropy and mutual information statistics to determine the feature lexical items that can uniquely characterize a topic; the computation is given by formula (1):
$$p(\mathrm{phrase}) = \sum_{w_i \in \mathrm{phrase}} p(\mathrm{topic} \mid w_i) \Big/ \sum p(\mathrm{phrase}) \qquad (1)$$
A candidate feature lexical item phrase must satisfy HL(phrase) > a_1 ∩ HR(phrase) > a_2 ∩ I(phrase) > a_3, where a_1, a_2 and a_3 are the thresholds of HL(phrase), HR(phrase) and I(phrase) respectively;
The left information entropy HL(phrase) and the right information entropy HR(phrase) of a candidate feature lexical item phrase are defined by formulas (2) and (3):
$$HR(\mathrm{phrase}) = -\sum_{y} p(y \mid \mathrm{phrase}) \log p(y \mid \mathrm{phrase}) \qquad (2)$$
$$HL(\mathrm{phrase}) = -\sum_{x} p(x \mid \mathrm{phrase}) \log p(x \mid \mathrm{phrase}) \qquad (3)$$
where string x and string y denote the left-adjacent string and the right-adjacent string of the candidate feature lexical item phrase respectively: p(x | phrase) is the probability that string x is the left-adjacent string of phrase, and p(y | phrase) is the probability that string y is the right-adjacent string of phrase;
The mutual information of a candidate feature lexical item phrase is computed by formula (4):
$$I(\mathrm{phrase}) = \sum_{y \in \mathrm{phrase}} \sum_{x \in \mathrm{phrase}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (4)$$
where string x and string y denote the left-adjacent and right-adjacent strings of phrase, p(x, y) is the probability that the candidate feature lexical item phrase occurs, and p(x) and p(y) are the probabilities that strings x and y occur individually; it should be noted that the larger the mutual information I(phrase), the stronger the correlation between strings x and y, i.e., the higher the probability that x and y join to form a word;
Meanwhile, w_i ∈ phrase denotes a sub-string w_i (1 ≤ i ≤ 3) composing phrase, and Σ p(phrase) is a normalization coefficient with 0 ≤ p(phrase) ≤ 1;
Meanwhile, p(topic | w_i) is obtained from the Bayesian formula, as shown in formula (5):
$$p(\mathrm{topic} \mid w_i) = \frac{p(w_i \mid \mathrm{topic})\, p(\mathrm{topic})}{p(w_i)} \qquad (5)$$
where p(topic) and p(w_i) are counted from the topic corpus, and p(w_i | topic) is obtained from step 1;
Step 2.2 Sort the candidate feature lexical items by their probability values p(phrase) in descending order and screen those satisfying the threshold conditions; specifically, set different threshold parameters for formulas (1) and (5), and compute and select the top K candidate feature lexical items.
4. The topic model optimization method based on information entropy according to claim 1, characterized in that: the training of step 3 includes the following sub-steps:
Step 3.1 Random initialization: for each word w_i in every document of the topic corpus, randomly assign a topic number z_i;
Step 3.2 Scan the topic corpus and, for each word w_i, resample its topic number according to the Gibbs sampling formula and update the topic corpus accordingly; the Gibbs sampling computation is given by formula (6):
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \hat{\theta}_{m,k} \cdot \hat{\varphi}_{k,t} = \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K} n_{m,\neg i}^{(k)} + \alpha_k} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} n_{k,\neg i}^{(t)} + \beta_t} \qquad (6)$$
where lexical item w_i = t; z_i denotes the topic variable of the i-th word; ¬i means that the i-th item is excluded; n_k^(t) denotes the number of occurrences of lexical item t in topic k; β_t is the Dirichlet prior of lexical item t; n_m^(k) denotes the number of occurrences of topic k in document m; α_k is the Dirichlet prior of topic k; φ_{k,t} denotes the probability of lexical item t in topic k; θ_{m,k} is the probability of topic k in document m; V is the total number of lexical items in the topic corpus, K is the total number of topics in the topic dictionary, and M is the number of documents;
Step 3.3 Repeat the sampling process of step 3.2 until Gibbs sampling converges;
Convergence of Gibbs sampling means that the probability values obtained by sampling with formula (6) approach the joint distribution of words and topics;
Step 3.4 Count the co-occurrence frequency matrix of document topics and lexical items in the topic corpus to obtain the topic model.
5. The topic model optimization method based on information entropy according to claim 1, characterized in that: the documents in step 4 differ from the documents in the topic corpus of steps 1 to 3; the latter are training documents used to train the topic model, while the former are prediction documents to which the topic model is applied.
CN201810029097.7A 2018-01-12 2018-01-12 A kind of topic model optimization method based on comentropy Pending CN108090231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810029097.7A CN108090231A (en) 2018-01-12 2018-01-12 A kind of topic model optimization method based on comentropy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810029097.7A CN108090231A (en) 2018-01-12 2018-01-12 A kind of topic model optimization method based on comentropy

Publications (1)

Publication Number Publication Date
CN108090231A true CN108090231A (en) 2018-05-29

Family

ID=62183108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810029097.7A Pending CN108090231A (en) 2018-01-12 2018-01-12 A kind of topic model optimization method based on comentropy

Country Status (1)

Country Link
CN (1) CN108090231A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324700A (en) * 2013-06-08 2013-09-25 同济大学 Noumenon concept attribute learning method based on Web information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JULY: "通俗理解LDA主题模型" ("A plain-language understanding of the LDA topic model"), 《HTTPS://BLOG.CSDN.NET/V_JULY_V/ARTICLE/DETAILS/41209515》 *
LIN YULAN: "Research on Interactive Text Topic Mining Based on LDA Model Taking customer service chat records as an example", 《2017 INTERNATIONAL CONFERENCE ON COMPUTER TECHNOLOGY, ELECTRONICS AND COMMUNICATION (ICCTEC)》 *
码农场 > 自然语言处理 (hankcs.com, Natural Language Processing): "基于互信息和左右信息熵的短语提取识别" ("Phrase extraction and recognition based on mutual information and left/right information entropy"), 《HTTP://WWW.HANKCS.COM/NLP/EXTRACTION-AND-IDENTIFICATION-OF-MUTUAL-INFORMATION-ABOUT-THE-PHRASE-BASED-ON-INFORMATION-ENTROPY.HTML》 *
黄勇 (Huang Yong): "改进的互信息与LDA结合的特征降维方法研究" ("Research on an improved feature dimensionality reduction method combining mutual information and LDA"), 《中国优秀硕士学位论文全文数据库(电子期刊)信息科技辑》 (China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology series) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271623A (en) * 2018-08-16 2019-01-25 龙马智芯(珠海横琴)科技有限公司 Text emotion denoising method and system
CN109344252A (en) * 2018-09-12 2019-02-15 东北大学 Microblogging file classification method and system based on high-quality topic expansion
CN109376347A (en) * 2018-10-16 2019-02-22 北京信息科技大学 A kind of HSK composition generation method based on topic model
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Keyword Automatic method based on gravitational model
CN109919427A (en) * 2019-01-24 2019-06-21 平安科技(深圳)有限公司 Model subject under discussion duplicate removal appraisal procedure, server and computer readable storage medium
WO2020199591A1 (en) * 2019-03-29 2020-10-08 平安科技(深圳)有限公司 Text categorization model training method, apparatus, computer device, and storage medium
CN110347977A (en) * 2019-06-28 2019-10-18 太原理工大学 A kind of news automated tag method based on LDA model
CN111507098A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium
CN111507098B (en) * 2020-04-17 2023-03-21 腾讯科技(深圳)有限公司 Ambiguous word recognition method and device, electronic equipment and computer-readable storage medium
CN112100492A (en) * 2020-09-11 2020-12-18 河北冀联人力资源服务集团有限公司 Batch delivery method and system for resumes of different versions
CN113032573A (en) * 2021-04-30 2021-06-25 《中国学术期刊(光盘版)》电子杂志社有限公司 Large-scale text classification method and system combining theme semantics and TF-IDF algorithm
CN113032573B (en) * 2021-04-30 2024-01-23 同方知网数字出版技术股份有限公司 Large-scale text classification method and system combining topic semantics and TF-IDF algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180529)