CN101710333B - Network text segmenting method based on genetic algorithm - Google Patents

Info

Publication number
CN101710333B
Authority
CN
China
Prior art keywords
text
population
vocabulary
expansion
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009102191638A
Other languages
Chinese (zh)
Other versions
CN101710333A (en)
Inventor
蔡皖东
赵煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANTONG LONGXIANG ELECTRIC EQUIPMENT CO., LTD.
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN2009102191638A
Publication of CN101710333A
Application granted
Publication of CN101710333B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a network text segmentation method based on a genetic algorithm, used for segmenting short network texts. The method comprises the following steps: estimating the Latent Dirichlet Allocation (LDA) model corresponding to a corpus by Gibbs sampling, inferring latent topic information with the model, and representing texts by that latent topic information; then converting the text segmentation process into a multi-objective optimization process by means of a parallel genetic algorithm, and computing the coherence within segmentation units, the divergence among segmentation units, and the fitness function from this deeper semantic information; and carrying out the genetic iteration of the text segmentation process, deciding whether segmentation terminates according to the similarity among the results of multiple iterations or an upper limit on the number of iterations, so as to obtain the globally optimal text segmentation. The invention thereby improves the accuracy of segmenting short network texts.

Description

Network text segmenting method based on genetic algorithm
Technical field
The present invention relates to a network text segmentation method, in particular to a network text segmentation method based on a genetic algorithm, suitable for segmenting short network texts.
Background art
Network text segmentation is an important technique for network public opinion monitoring and for sentiment analysis of network text, and helps to uncover the deeper semantic information in network text.
The document "Text segmentation model based on multiple discriminant analysis, Journal of Software, 2007, 18(3), pp. 555-564" discloses a method that performs text segmentation using word frequency information. The method applies multiple discriminant analysis, represents text in a vector space model built from word frequencies, and defines 4 global evaluation functions from 3 factors (the distance within segmentation units, the distance between segmentation units, and the length of segmentation units) in order to evaluate segmentation patterns globally. However, for the short texts found in network text, data sparsity means that sufficient word frequency information is not available. Moreover, word frequency is shallow semantic information: computing the similarity between segmentation units from word frequency alone degrades the accuracy of the similarity computation and, in turn, the accuracy of the segmentation result.
Summary of the invention
To address the low accuracy of prior-art methods when segmenting short network texts, the present invention proposes a network text segmentation method based on a genetic algorithm. Gibbs sampling is used to estimate the Latent Dirichlet Allocation (LDA) model corresponding to a corpus; the model is used to infer the latent topic information of the target text, and the text is represented by that latent topic information. A parallel genetic algorithm then converts the text segmentation process into a multi-objective optimization process: the coherence within segmentation units, the divergence among segmentation units, and the fitness function are computed from deep semantic information, and the genetic iteration of the segmentation process is carried out. Whether segmentation ends is decided by the similarity among the results of multiple iterations or by an upper limit on the number of iterations. The result is a globally optimal text segmentation, which can improve the accuracy of segmenting short network texts.
The technical solution of the present invention is a network text segmentation method based on a genetic algorithm, characterized by comprising the following steps:
(a) Use a web crawler to collect web pages from the network; preprocess the collected pages so that only the text is kept; classify the denoised text with a naive Bayes text classifier; and build an expansion corpus for each category;
(b) Cluster the expansion corpus by hierarchical clustering to determine the number of sub-topics, and estimate the LDA model of the expansion corpus by Gibbs sampling; the parameters involved in the estimation take the empirical values α=0.01 and β=0.01, with a burn-in period of 2000 and a thinning interval of 100;
(c) Preprocess the text to be segmented by word segmentation, part-of-speech tagging, named entity recognition and word sense disambiguation; count the frequencies of the nouns and verbs in the text and select the high-frequency words as the feature vocabulary of the text. Then, based on HowNet, compute the similarity between the feature vocabulary of the text and the feature vocabulary of each expansion corpus, and choose the corpus with the highest similarity as the external corpus for text segmentation. Finally, use Gibbs sampling and the LDA model of that expansion corpus to infer the semantic structure information contained in the text to be segmented; the inferred information comprises the sub-topic to which each word belongs and the probability of the word within the segmentation unit. The sub-topic of each word is used to represent the text to be segmented: the sub-topics of the words are counted sentence by sentence, and each sentence is expressed as a vector in sub-topic space, $S_j = s_{j1} s_{j2} \ldots s_{jl} \ldots s_{jT}$, where $s_{jl}$ is the frequency with which the words of sentence $j$ belong to sub-topic $l$;
(d) Perform text segmentation with a parallel genetic algorithm. Individuals are binary-coded; the population is initialized by random number generation, and unqualified initial individuals are filtered out using two criteria, the minimum length of a semantic paragraph and the minimum number of semantic paragraphs in the text. The coherence within semantic paragraphs is computed according to the formula
$$Coh = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$
where $a_{nl} = \frac{1}{|b_n|} \sum_{s_i \in b_n} s_{il}$, $|b_n|$ is the number of sentences in the $n$-th semantic paragraph, $a_n$ is the mean vector of that semantic paragraph, and $a_{nl}$ is the $l$-th component of this vector;
The divergence among semantic paragraphs is computed according to the formula
$$Dis = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$
where $c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}$;
The fitness value of each individual in the genetic iteration is computed from the coherence within semantic paragraphs and the divergence among semantic paragraphs, according to the following formula:
Figure DEST_PATH_GSB00000491306400025
In the formula, the symbol
Figure DEST_PATH_GSB00000491306400026
denotes the expansion population, which stores the optimal solutions found during iteration;
In population selection, an elitist strategy is applied first: the elite individuals of the population and of the expansion population are retained and pass directly into the next generation. Roulette-wheel selection is then used to pick one individual from the population and one from the expansion population; the fitness values of the two are compared, and the individual with the smaller fitness is selected for crossover and mutation;
Crossover uses the single-point method. To prevent inbreeding, crossover between the population and the expansion population is allowed only when the Hamming distance between the two individuals exceeds a threshold, which is set to 20% of the average Hamming distance between individuals. The mutation operator is adjusted adaptively according to the similarity of the population, which is computed as follows:
Figure DEST_PATH_RE-GSB00000738153300011
The similarity of the optimal individuals in the expansion population across different iteration rounds is computed according to the formula
Figure DEST_PATH_RE-GSB00000738153300012
When the similarity exceeds a threshold for 50 consecutive rounds, the iteration ends and the individual in the expansion population is taken as the text segmentation result. In the binary representation of the individual, the sentences corresponding to the digit "1" are the segmentation boundaries.
The beneficial effects of the invention are as follows. Gibbs sampling is used to estimate the Latent Dirichlet Allocation (LDA) model corresponding to a corpus; the model is used to infer the latent topic information of the target text, and the text is represented by that latent topic information. A parallel genetic algorithm then converts the text segmentation process into a multi-objective optimization process: the coherence within segmentation units, the divergence among segmentation units, and the fitness function are computed from deep semantic information, and the genetic iteration of the segmentation process is carried out. Whether segmentation ends is decided by the similarity among the results of multiple iterations or by an upper limit on the number of iterations. A globally optimal text segmentation is obtained, which improves the accuracy of segmenting short network texts.
Text segmentation accuracy is usually measured by precision and recall; in addition to these measures, the background art also uses the $P_\mu$ value as a criterion. In tests on 50 texts to be segmented in the above environment, the method of the present invention outperforms the background art on all 3 measures, and by 15% on the $P_\mu$ value in particular.
The present invention is described in detail below with reference to the accompanying drawing and an embodiment.
Description of drawings
The accompanying drawing is a flowchart of the network text segmentation method based on a genetic algorithm according to the present invention.
Embodiment
With reference to the accompanying drawing, this embodiment targets a text on the topic "Beijing Olympics"; its language use is standard and the text is short. The concrete steps of text segmentation are as follows:
Step 1: set the crawler's search topic to Olympics-related vocabulary and use the crawler to collect web pages from the network. The Olympic topic vocabulary is determined in three steps: 1) manually select several texts representative of the search topic, typically 10 to 20; 2) count the frequencies of the nouns and verbs in those texts and collect the high-frequency words into a candidate topic word set, with the frequency threshold set to 30; 3) manually choose 10 to 15 words from the candidate set as the topic vocabulary.
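The counting in sub-step 2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the input format (lists of `(word, pos_tag)` pairs), the tag names `"n"` and `"v"`, and the reading of the threshold as "frequency of at least 30" are all assumptions.

```python
from collections import Counter

def candidate_topic_words(tagged_docs, min_freq=30, keep_pos=("n", "v")):
    """Count nouns and verbs across the seed texts and keep the frequent ones.

    tagged_docs: list of documents, each a list of (word, pos_tag) pairs.
    The tag names "n"/"v" are an assumption about the tagger's output.
    """
    counts = Counter(
        word
        for doc in tagged_docs
        for word, pos in doc
        if pos in keep_pos
    )
    # Keep words whose frequency reaches the threshold, most frequent first.
    return [(w, c) for w, c in counts.most_common() if c >= min_freq]
```

The final choice of 10 to 15 topic words from this candidate list remains a manual step, as described above.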
Web pages are HTML documents, so the collected pages must be preprocessed, and the HTML markup must be filtered out when the text is extracted. Besides the title and body, a page also contains many links that are unrelated to the body text; these useless links must likewise be filtered out when the page content is extracted.
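A rough sketch of this kind of preprocessing, using only Python's standard `html.parser` module. Dropping the text of every `<a>` element is a simplifying assumption made here; the source only says that links unrelated to the body are filtered, without giving a rule for deciding which ones those are.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and link (<a>) text."""

    SKIP = {"script", "style", "a"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped elements
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

A production crawler would likely need a more tolerant parser and a better heuristic for separating navigation links from in-body links.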
A naive Bayes binary text classifier is applied to the denoised text; pages unrelated to the topic are removed according to the classification result, and the topic corpus is built. Features can be selected with methods such as information gain (IG) or mutual information (MI). The topic corpus contains at least 1000 texts.
Step 2: estimate the LDA model of the corpus by Gibbs sampling. The Gibbs sampling iteration follows the formula
$$P(z_i = j \mid z_{-i}, w_i) = \frac{\dfrac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(*)} + W\beta} \cdot \dfrac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,*}^{(d_i)} + \alpha}}{\sum_{j=1}^{T} \dfrac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(*)} + W\beta} \cdot \dfrac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,*}^{(d_i)} + \alpha}}$$
where $n_{-i,j}^{(w_i)}$ is the number of times the word corresponding to $w_i$ is assigned to topic $j$, $n_{-i,j}^{(*)}$ is the total number of words assigned to topic $j$, $n_{-i,j}^{(d_i)}$ is the number of words in text $d_i$ assigned to topic $j$, and $n_{-i,*}^{(d_i)}$ is the total number of words in text $d_i$. All of these can be obtained by counting over the texts, and the counts exclude the current token $w_i$.
The Gibbs sampling procedure comprises three steps:
1) At the start of iteration, each $z_i$ is assigned an arbitrary value from 1 to T;
2) Using the formula, compute the probability that $w_i$ is assigned to each of topics 1 to T, and update the topic assignment of word $w_i$ with the topic of maximum probability, giving the next state of the Markov chain;
3) Decide whether iteration ends from the similarity between successive states of the Markov chain and the burn-in period: iteration ends when the similarity exceeds a threshold or the burn-in period is reached.
In Gibbs sampling, the number of sub-topics is determined by hierarchical clustering; the other parameters take the empirical values α=0.01 and β=0.01, the burn-in period and thinning interval are 2000 and 100 respectively, and the iteration is run with the GibbsLDA++ tool.
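The update loop described above can be illustrated with a small collapsed Gibbs sampler. This is a sketch in plain Python under stated assumptions, not the GibbsLDA++ implementation: the function name `gibbs_lda`, the use of integer word ids, zero-based topic ids, and a fixed iteration count in place of the burn-in and similarity stopping test are all choices made here. By default it draws the new topic at random from the conditional distribution, which is the standard Gibbs step; `greedy=True` instead takes the maximum-probability topic, as step 2) is worded.

```python
import random

def gibbs_lda(docs, T, W, alpha=0.01, beta=0.01, iters=200, greedy=False, seed=0):
    """docs: list of documents, each a list of word ids in range(W).
    Returns (z, nwt, ndt): topic assignments and the two count tables."""
    rng = random.Random(seed)
    nwt = [[0] * T for _ in range(W)]   # nwt[w][j]: times word w is assigned to topic j
    nt = [0] * T                        # nt[j]: total words assigned to topic j
    ndt = [[0] * T for _ in docs]       # ndt[d][j]: words of doc d assigned to topic j

    def bump(d, w, j, delta):
        nwt[w][j] += delta
        nt[j] += delta
        ndt[d][j] += delta

    # Step 1: assign every token an arbitrary initial topic.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(T) for _ in doc])
        for w, j in zip(doc, z[d]):
            bump(d, w, j, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, z[d][i], -1)  # exclude the current token from the counts
                # Unnormalised conditional; the document-length denominator is
                # the same for every topic, so it cancels on normalisation.
                p = [(nwt[w][t] + beta) / (nt[t] + W * beta) * (ndt[d][t] + alpha)
                     for t in range(T)]
                if greedy:
                    z[d][i] = max(range(T), key=p.__getitem__)
                else:
                    z[d][i] = rng.choices(range(T), weights=p)[0]
                bump(d, w, z[d][i], +1)
    return z, nwt, ndt
```

With the parameters quoted above, a real run would discard the first 2000 sweeps and keep every 100th sample afterwards; the sketch omits that bookkeeping.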
Step 3: preprocess the text to be segmented by word segmentation, part-of-speech tagging, named entity recognition, word sense disambiguation and so on; count the frequencies of the nouns and verbs in the text and select the high-frequency words as the feature vocabulary of the text. Based on HowNet, the contextual relations among sememes are used to compute the similarity between the feature vocabulary of the text and the feature vocabulary of each corpus. Because the text to be segmented is most similar to the "Beijing Olympics" expansion corpus built in step 1, that corpus is chosen as the external corpus for text segmentation.
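The corpus selection can be pictured with the sketch below. The HowNet sememe-based word similarity is not reproduced; `word_sim` is a hypothetical pluggable function, and averaging each text word's best match in the corpus is an assumed way of aggregating word similarities into a vocabulary-level score, since the source does not spell out the aggregation.

```python
def set_similarity(text_words, corpus_words, word_sim):
    """Average, over the text's feature words, of the best match in the corpus."""
    best = [max(word_sim(w, c) for c in corpus_words) for w in text_words]
    return sum(best) / len(best)

def pick_corpus(text_words, corpora, word_sim):
    """corpora: dict mapping corpus name -> feature word list.
    Returns the name of the corpus most similar to the text."""
    return max(
        corpora,
        key=lambda name: set_similarity(text_words, corpora[name], word_sim),
    )

def exact_match(a, b):
    """Toy stand-in for a HowNet-based similarity: 1.0 if identical, else 0.0."""
    return 1.0 if a == b else 0.0
```

For example, `pick_corpus(["medal", "beijing", "athlete"], {"olympics": ["games", "medal", "beijing"], "finance": ["stock", "bank"]}, exact_match)` selects `"olympics"`.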
Gibbs sampling and the LDA model estimated in step 2 are used to infer the semantic structure information contained in the text to be segmented, which comprises the sub-topic to which each word belongs. The inference still uses the formula of step 2, except that in step 3 $d_i$ denotes sentence $i$; that is, words are counted sentence by sentence.
The sub-topic of each word in a sentence is counted to construct the sub-topic space vector of the sentence, $S_j = s_{j1} s_{j2} \ldots s_{jl} \ldots s_{jT}$, where $s_{jl}$ is the frequency with which the words of sentence $j$ belong to sub-topic $l$.
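As an illustration of this representation, the sketch below turns per-word topic assignments into sentence vectors. It assumes zero-based topic ids and interprets "frequency" as a raw count; both are assumptions, and a normalised variant would divide each vector by the sentence length.

```python
def sentence_topic_vector(topic_assignments, T):
    """topic_assignments: topic id (0..T-1) inferred for each word of one sentence.
    Returns the length-T vector of per-topic word counts."""
    vec = [0] * T
    for t in topic_assignments:
        vec[t] += 1
    return vec

def text_to_vectors(sentences, T):
    """Build one sub-topic vector per sentence of the text."""
    return [sentence_topic_vector(s, T) for s in sentences]
```

For instance, a sentence whose three words were assigned to topics 0, 0 and 2 becomes `[2, 0, 1]` when `T` is 3.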
Step 4: perform text segmentation with a parallel genetic algorithm. Individuals are binary-coded; the population is initialized by random number generation, and unqualified initial individuals are filtered out using two criteria, the minimum length of a semantic paragraph and the minimum number of semantic paragraphs in the text: the minimum paragraph length is no less than 3, and the number of paragraphs is no less than 5. The coherence within semantic paragraphs is computed according to the formula
$$Coh = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$
where $a_{nl} = \frac{1}{|b_n|} \sum_{s_i \in b_n} s_{il}$, $|b_n|$ is the number of sentences in the $n$-th semantic paragraph, $a_n$ is the mean vector of that semantic paragraph, and $a_{nl}$ is the $l$-th component of this vector.
The divergence among semantic paragraphs is computed according to the formula
$$Dis = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$
where $c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}$.
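The coherence and divergence formulas above can be evaluated for one binary-coded individual as in the following sketch. Several points are readings or assumptions rather than statements from the source: $j$ is taken as the number of semantic paragraphs and $k$ as the number of sentences, a bit value of 1 is assumed to mark the last sentence of a paragraph, and the final sentence always closes a paragraph.

```python
def split_segments(bits):
    """Decode a binary individual into lists of sentence indices (one list per paragraph)."""
    segments, current = [], []
    for i, bit in enumerate(bits):
        current.append(i)
        if bit == 1 or i == len(bits) - 1:
            segments.append(current)
            current = []
    return segments

def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def coherence_and_divergence(bits, S):
    """S: list of k sentence vectors of length T. Returns (Coh, Dis)."""
    k = len(S)
    c = mean_vector(S)                          # c_l: mean over all k sentences
    within, dis = 0.0, 0.0
    for seg in split_segments(bits):
        a = mean_vector([S[i] for i in seg])    # a_n: mean vector of paragraph n
        within += sum(
            (S[i][l] - a[l]) ** 2 for i in seg for l in range(len(a))
        ) / k
        dis += len(seg) / k * sum((a[l] - c[l]) ** 2 for l in range(len(a)))
    return 1.0 - within, dis
```

With four sentences `[[1, 0], [1, 0], [0, 1], [0, 1]]`, the individual `[0, 1, 0, 0]` splits them into two homogeneous paragraphs and gives `Coh = 1.0`, `Dis = 0.5`, while the unsplit individual `[0, 0, 0, 0]` gives `Coh = 0.5`, `Dis = 0.0`.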
The fitness value of the genetic algorithm is computed from the coherence within semantic paragraphs and the divergence among semantic paragraphs, according to the following formula:
Figure G2009102191638D00055
In population selection, an elitist strategy is applied first: the individual with the smallest fitness value in the population and the one in the expansion population are chosen as the elites, and the elites pass directly into the next generation. Next, roulette-wheel selection picks one individual from the population and one from the expansion population; the fitness values of the two are compared, and the one with the smaller fitness is selected for crossover and mutation.
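The selection step might look like the sketch below. The fitness formula itself is not recoverable from this text, so `fitness` is a caller-supplied function. How roulette-wheel weights are derived when smaller fitness is better is not stated in the source; giving each individual a slice proportional to its gap from the worst fitness is an assumption made here.

```python
import random

def elite(pop, fitness):
    """Return the individual with the smallest fitness value."""
    return min(pop, key=fitness)

def roulette(pop, fitness, rng):
    """Roulette-wheel pick; lower fitness gets a larger slice (assumed weighting)."""
    values = [fitness(ind) for ind in pop]
    worst = max(values)
    # The small constant keeps every weight positive, even for the worst individual.
    weights = [worst - v + 1e-9 for v in values]
    return rng.choices(pop, weights=weights)[0]

def select_parent(pop, ext_pop, fitness, rng):
    """Draw one candidate from each population and keep the one with lower fitness."""
    a = roulette(pop, fitness, rng)
    b = roulette(ext_pop, fitness, rng)
    return a if fitness(a) <= fitness(b) else b
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible.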
Crossover is done by single-point crossover. To prevent inbreeding, the individuals taking part in crossover must come from different populations, and crossover between them is allowed only when their Hamming distance exceeds a threshold, which is usually set to 20% of the average Hamming distance between individuals.
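A minimal sketch of the distance-gated single-point crossover follows. Computing the average Hamming distance over all pairs of a given list of individuals is an assumption; the source does not say over which set of individuals the average is taken.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def average_hamming(individuals):
    """Mean pairwise Hamming distance; needs at least two individuals."""
    pairs = [(a, b) for i, a in enumerate(individuals) for b in individuals[i + 1:]]
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

def crossover(parent_a, parent_b, threshold, rng):
    """Single-point crossover, applied only if the parents differ enough."""
    if hamming(parent_a, parent_b) <= threshold:
        return parent_a[:], parent_b[:]          # too similar: no crossover
    point = rng.randrange(1, len(parent_a))      # cut strictly inside the string
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])
```

A caller would set `threshold = 0.2 * average_hamming(individuals)` to match the 20% rule quoted above.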
The mutation operator is adjusted adaptively according to the similarity of the population, which is computed as follows:
Figure G2009102191638D00056
where
Figure G2009102191638D00057
and $x_i$, $x_j$ denote two individuals in the population. Mutation takes into account whether the mutated result satisfies the requirements on a segmentation result, which are the same as the filtering requirements at population initialization; if they are not satisfied, a new individual is generated to replace the mutated one.
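The validity filter shared by initialization and mutation can be sketched as below. The adaptive rule that maps population similarity to a mutation rate is not recoverable from this text, so `rate` is simply passed in. Measuring paragraph length in sentences, treating a 1 bit as the last sentence of a paragraph, and the boundary probability `p_boundary` are assumptions.

```python
import random

def segment_lengths(bits):
    """Lengths of the paragraphs encoded by a binary individual."""
    lengths, run = [], 0
    for i, bit in enumerate(bits):
        run += 1
        if bit == 1 or i == len(bits) - 1:
            lengths.append(run)
            run = 0
    return lengths

def is_valid(bits, min_len=3, min_segments=5):
    """Filter used both at initialization and after mutation."""
    lengths = segment_lengths(bits)
    return len(lengths) >= min_segments and min(lengths) >= min_len

def random_individual(k, rng, p_boundary=0.2, max_tries=10000, **limits):
    """Draw random bit strings of length k until one passes the filter."""
    for _ in range(max_tries):
        bits = [1 if rng.random() < p_boundary else 0 for _ in range(k)]
        if is_valid(bits, **limits):
            return bits
    raise RuntimeError("no valid individual found; relax the limits")

def mutate(bits, rate, rng, **limits):
    """Flip each bit with probability `rate`; an invalid result is replaced
    by a freshly generated valid individual."""
    child = [1 - b if rng.random() < rate else b for b in bits]
    if is_valid(child, **limits):
        return child
    return random_individual(len(bits), rng, **limits)
```

Rejection sampling is the simplest way to honour the filter; for long texts a constructive generator that places boundaries at least three sentences apart would waste fewer draws.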
The similarity of the optimal individuals in the expansion population across different iteration rounds is computed according to the formula
Figure G2009102191638D00058
When the similarity exceeds a threshold for 50 consecutive rounds, the iteration ends. The individual in the expansion population is taken as the text segmentation result; in the binary representation of the individual, the sentences corresponding to the digit "1" are the segmentation boundaries.
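The stopping rule and the final decoding can be sketched as follows. The similarity formula is only available as an image, so the fraction of matching bits is used here as a stand-in; the threshold of 0.95 and the cap of 1000 iterations are likewise illustrative values, not figures from the source. Only the 50-round persistence comes from the text.

```python
def similarity(a, b):
    """Fraction of matching bits (assumed measure; the source formula is an image)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

class StopRule:
    """Stop when the best individual stays similar for `patience` consecutive
    rounds, or when `max_iters` rounds have been run."""

    def __init__(self, threshold=0.95, patience=50, max_iters=1000):
        self.threshold = threshold
        self.patience = patience
        self.max_iters = max_iters
        self.prev = None
        self.streak = 0
        self.rounds = 0

    def update(self, best):
        """Feed the best individual of the current round; returns True to stop."""
        self.rounds += 1
        if self.prev is not None and similarity(self.prev, best) > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        self.prev = list(best)
        return self.streak >= self.patience or self.rounds >= self.max_iters

def boundaries(bits):
    """Indices of the sentences marked '1', i.e. the segmentation boundaries."""
    return [i for i, b in enumerate(bits) if b == 1]
```

In a main loop, `rule.update(best)` would be called once per generation, and `boundaries(best)` applied to the final individual.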
Text segmentation accuracy is usually measured by precision and recall; in addition to these measures, the background art also uses the $P_\mu$ value as a criterion. In tests on 50 texts to be segmented in the above environment, the method of the present invention outperforms the background art on all 3 measures, and by 15% on the $P_\mu$ value in particular.

Claims (1)

1. A network text segmentation method based on a genetic algorithm, characterized by comprising the following steps:
(a) Use a web crawler to collect web pages from the network; preprocess the collected pages so that only the text is kept; classify the denoised text with a naive Bayes text classifier; and build an expansion corpus for each category;
(b) Cluster the expansion corpus by hierarchical clustering to determine the number of sub-topics, and estimate the LDA model of the expansion corpus by Gibbs sampling; the parameters involved in the estimation take the empirical values α=0.01 and β=0.01, with a burn-in period of 2000 and a thinning interval of 100;
(c) Preprocess the text to be segmented by word segmentation, part-of-speech tagging, named entity recognition and word sense disambiguation; count the frequencies of the nouns and verbs in the text and select the high-frequency words as the feature vocabulary of the text. Then, based on HowNet, compute the similarity between the feature vocabulary of the text and the feature vocabulary of each expansion corpus, and choose the corpus with the highest similarity as the external corpus for text segmentation. Finally, use Gibbs sampling and the LDA model of that expansion corpus to infer the semantic structure information contained in the text to be segmented; the inferred information comprises the sub-topic to which each word belongs and the probability of the word within the segmentation unit. The sub-topic of each word is used to represent the text to be segmented: the sub-topics of the words are counted sentence by sentence, and each sentence is expressed as a vector in sub-topic space, $S_j = s_{j1} s_{j2} \ldots s_{jl} \ldots s_{jT}$, where $s_{jl}$ is the frequency with which the words of sentence $j$ belong to sub-topic $l$;
(d) Perform text segmentation with a parallel genetic algorithm. Individuals are binary-coded; the population is initialized by random number generation, and unqualified initial individuals are filtered out using two criteria, the minimum length of a semantic paragraph and the minimum number of semantic paragraphs in the text. The coherence within semantic paragraphs is computed according to the formula
$$Coh = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2$$
where $a_{nl} = \frac{1}{|b_n|} \sum_{s_i \in b_n} s_{il}$, $|b_n|$ is the number of sentences in the $n$-th semantic paragraph, $a_n$ is the mean vector of that semantic paragraph, and $a_{nl}$ is the $l$-th component of this vector;
The divergence among semantic paragraphs is computed according to the formula
$$Dis = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2$$
where $c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}$;
The fitness value of each individual in the genetic iteration is computed from the coherence within semantic paragraphs and the divergence among semantic paragraphs, according to the following formula:
Figure FSB00000738153200021
In the formula, the symbol
Figure FSB00000738153200022
denotes the expansion population, which stores the optimal solutions found during iteration;
In population selection, an elitist strategy is applied first: the elite individuals of the population and of the expansion population are retained and pass directly into the next generation. Roulette-wheel selection is then used to pick one individual from the population and one from the expansion population; the fitness values of the two are compared, and the individual with the smaller fitness is selected for crossover and mutation;
Crossover uses the single-point method. To prevent inbreeding, crossover between the population and the expansion population is allowed only when the Hamming distance between the two individuals exceeds a threshold, which is set to 20% of the average Hamming distance between individuals. The mutation operator is adjusted adaptively according to the similarity of the population, which is computed as follows:
The similarity of the optimal individuals in the expansion population across different iteration rounds is computed according to the formula
Figure FSB00000738153200024
When the similarity exceeds a threshold for 50 consecutive rounds, the iteration ends and the individual in the expansion population is taken as the text segmentation result. In the binary representation of the individual, the sentences corresponding to the digit "1" are the segmentation boundaries.
CN2009102191638A 2009-11-26 2009-11-26 Network text segmenting method based on genetic algorithm Active CN101710333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102191638A CN101710333B (en) 2009-11-26 2009-11-26 Network text segmenting method based on genetic algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102191638A CN101710333B (en) 2009-11-26 2009-11-26 Network text segmenting method based on genetic algorithm

Publications (2)

Publication Number Publication Date
CN101710333A CN101710333A (en) 2010-05-19
CN101710333B true CN101710333B (en) 2012-07-04

Family

ID=42403123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102191638A Active CN101710333B (en) 2009-11-26 2009-11-26 Network text segmenting method based on genetic algorithm

Country Status (1)

Country Link
CN (1) CN101710333B (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968798A (en) * 2010-09-10 2011-02-09 中国科学技术大学 Community recommendation method based on on-line soft constraint LDA algorithm
CN102024065B (en) * 2011-01-18 2013-01-02 中南大学 SIMD optimization-based webpage duplication elimination and concurrency method
WO2012106885A1 (en) * 2011-07-13 2012-08-16 华为技术有限公司 Latent dirichlet allocation-based parameter inference method, calculation device and system
CN102609407B (en) * 2012-02-16 2014-10-29 复旦大学 Fine-grained semantic detection method of harmful text contents in network
CN102855312B (en) * 2012-08-24 2013-08-14 武汉大学 Domain-and-theme-oriented Web service clustering method
CN102929937B (en) * 2012-09-28 2015-09-16 福州博远无线网络科技有限公司 Based on the data processing method of the commodity classification of text subject model
CN103365978B (en) * 2013-07-01 2017-03-29 浙江大学 TCM data method for digging based on LDA topic models
CN103914445A (en) * 2014-03-05 2014-07-09 中国人民解放军装甲兵工程学院 Data semantic processing method
CN105095228A (en) * 2014-04-28 2015-11-25 华为技术有限公司 Method and apparatus for monitoring social information
CN104281567A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Latent semantic analysis method and system
CN104281692A (en) * 2014-10-13 2015-01-14 安徽华贞信息科技有限公司 Method and system for realizing paragraph dimensionalized description
CN104317579A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Method and system for business performance of text document
CN104317785A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Internet paragraph level topic identifying system
CN106355628B (en) * 2015-07-16 2019-07-05 中国石油化工股份有限公司 The modification method and system of picture and text knowledge point mask method and device, picture and text mark
CN105138665B (en) * 2015-09-02 2017-06-20 东南大学 A kind of internet topic online mining method based on improvement LDA models
CN105136714B (en) * 2015-09-06 2017-10-10 河南工业大学 A kind of tera-hertz spectra Wavelength selecting method based on genetic algorithm
CN105389306A (en) * 2015-11-02 2016-03-09 国网福建省电力有限公司 Latent semantic analysis based intelligent parsing method for application form
CN105787088B (en) * 2016-03-14 2018-12-07 南京理工大学 A kind of text information classification method based on segment encoding genetic algorithm
CN107239438B (en) * 2016-03-28 2020-07-28 阿里巴巴集团控股有限公司 Document analysis method and device
CN106502983B (en) * 2016-10-17 2019-05-10 清华大学 The event driven collapse Gibbs sampling method of implicit Di Li Cray model
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN106709011B (en) * 2016-12-26 2019-07-23 武汉大学 A kind of position concept level resolution calculation method based on space orientation cluster
CN108009151B (en) * 2017-11-29 2021-04-16 深圳中泓在线股份有限公司 News text automatic segmentation method and device, server and readable storage medium
CN108038173B (en) * 2017-12-07 2021-11-26 广东工业大学 Webpage classification method and system and webpage classification equipment
CN109299239B (en) * 2018-09-29 2021-11-23 福建弘扬软件股份有限公司 ES-based electronic medical record retrieval method
CN109325092A (en) * 2018-11-27 2019-02-12 中山大学 Merge the nonparametric parallelization level Di Li Cray process topic model system of phrase information
CN109829151B (en) * 2018-11-27 2023-04-21 国网浙江省电力有限公司 Text segmentation method based on hierarchical dirichlet model
CN109918659B (en) * 2019-02-28 2023-06-20 华南理工大学 Method for optimizing word vector based on unreserved optimal individual genetic algorithm
CN109977227B (en) * 2019-03-19 2021-06-22 中国科学院自动化研究所 Text feature extraction method, system and device based on feature coding
CN110110326B (en) * 2019-04-25 2020-10-27 西安交通大学 Text cutting method based on subject information
CN110222654A (en) * 2019-06-10 2019-09-10 北京百度网讯科技有限公司 Text segmenting method, device, equipment and storage medium
CN113366511B (en) * 2020-01-07 2022-03-25 支付宝(杭州)信息技术有限公司 Named entity identification and extraction using genetic programming
CN111797634B (en) * 2020-06-04 2023-09-08 语联网(武汉)信息技术有限公司 Document segmentation method and device
CN112667817B (en) * 2020-12-31 2022-05-31 杭州电子科技大学 Text emotion classification integration system based on roulette attribute selection
CN113191133B (en) * 2021-04-21 2021-12-21 北京邮电大学 Audio text alignment method and system based on Doc2Vec
CN112988981B (en) * 2021-05-14 2021-10-15 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Automatic labeling method based on genetic algorithm
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101287229A (en) * 2008-05-26 2008-10-15 北京捷讯畅达科技发展有限公司 Natural language processing technique and device applying to query by short message service of mobile phone

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
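Liu Na et al. Research on linear text segmentation methods. Computer Engineering and Applications. 2008, (No. 21), 212-216. *
Shi Jing et al. Text segmentation based on the LDA model. Chinese Journal of Computers. 2008, Vol. 31 (No. 10), 1865-1873. *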

Also Published As

Publication number Publication date
CN101710333A (en) 2010-05-19

Similar Documents

Publication Publication Date Title
CN101710333B (en) Network text segmenting method based on genetic algorithm
CN100353361C (en) New method of characteristic vector weighting for text classification and its device
Zamani et al. Neural query performance prediction using weak supervision from multiple signals
CN106844424B (en) LDA-based text classification method
CN104268197B (en) A kind of industry comment data fine granularity sentiment analysis method
Misra et al. Text segmentation via topic modeling: an analytical study
Das et al. A heuristic-driven uncertainty based ensemble framework for fake news detection in tweets and news articles
CN105468713A (en) Multi-model fused short text classification method
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN108763484A (en) A kind of law article recommendation method based on LDA topic models
CN107944014A (en) A kind of Chinese text sentiment analysis method based on deep learning
CN105045812A (en) Text topic classification method and system
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
García-Hernández et al. Single extractive text summarization based on a genetic algorithm
CN105912576A (en) Emotion classification method and emotion classification system
CN105095183A (en) Text emotional tendency determination method and system
CN101714135A (en) Emotional orientation analytical method of cross-domain texts
Foong et al. Text summarization using latent semantic analysis model in mobile android platform
CN106202530A (en) Data processing method and device
Sun et al. Twitter part-of-speech tagging using pre-classification Hidden Markov model
Kang et al. Utilization strategy of user engagements in korean fake news detection
CN110851733A (en) Community discovery and emotion interpretation method based on network topology and document content
CN117474126A (en) LLaMa2 big data model design method for initial examination and evaluation of manuscript
Medagoda et al. Keywords based temporal sentiment analysis
Xu et al. KDSTM: Neural Semi-supervised Topic Modeling with Knowledge Distillation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NANTONG LONGXIANG ELECTRICAL EQUIPMENT CO., LTD.

Free format text: FORMER OWNER: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

Owner name: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 710072 XI AN, SHAANXI PROVINCE TO: 226600 NANTONG, JIANGSU PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140814

Address after: 226600 No. 69 Donghai Road, Haian Development Zone, Nantong, Jiangsu

Patentee after: NANTONG LONGXIANG ELECTRIC EQUIPMENT CO., LTD.

Patentee after: Northwestern Polytechnical University

Address before: 710072 Xi'an friendship West Road, Shaanxi, No. 127

Patentee before: Northwestern Polytechnical University