CN101710333B

CN101710333B - Network text segmenting method based on genetic algorithm

Info

Publication number: CN101710333B
Application number: CN2009102191638A
Authority: CN
Inventors: 蔡皖东; 赵煜
Original assignee: Northwestern Polytechnical University
Current assignee: NANTONG LONGXIANG ELECTRIC EQUIPMENT CO., LTD.; Northwestern Polytechnical University
Priority date: 2009-11-26
Filing date: 2009-11-26
Publication date: 2012-07-04
Anticipated expiration: 2029-11-26
Also published as: CN101710333A

Abstract

The invention discloses a network text segmentation method based on a genetic algorithm, used for segmenting short network texts. The method comprises the following steps: estimating the latent Dirichlet allocation (LDA) model of a corpus by Gibbs sampling, inferring latent topic information with the model, and representing the text by that latent topic information; then using a parallel genetic algorithm to convert the text segmentation process into a multi-objective optimization process, and using the deeper semantic information to calculate the coherence within segmentation units, the divergence between segmentation units, and the fitness function; and carrying out the genetic iteration of the segmentation process, deciding whether segmentation terminates according to the similarity among the results of multiple iterations or an upper limit on the number of iterations, so as to obtain a globally optimal segmentation of the text. The invention thereby improves the accuracy of segmenting short network texts.

Description

Network text segmenting method based on genetic algorithm

Technical field

The present invention relates to a network text segmentation method, and in particular to a network text segmentation method based on a genetic algorithm that is suitable for segmenting short network texts.

Background art

Network text segmentation is an important technique for network public-opinion monitoring and for sentiment analysis of network text, and it helps uncover the deep-level semantic information in network text.

The document "Text segmentation model based on multivariate discriminant analysis", Journal of Software, 2007, 18(3), pp. 555-564, discloses a text segmentation method that uses word-frequency information. The method applies multivariate discriminant analysis, represents text with a vector space model built from word frequencies, and defines four global evaluation functions from three factors (the distance within segmentation units, the distance between segmentation units, and segmentation unit length) to evaluate a segmentation pattern globally. For short network texts, however, the data are sparse, so not enough word-frequency information is available. Moreover, word frequency is shallow semantic information: computing the similarity between segmentation units from word frequency alone degrades the accuracy of the similarity computation and, in turn, the accuracy of the segmentation result.

Summary of the invention

To address the low accuracy of prior-art methods in segmenting short network texts, the present invention proposes a network text segmentation method based on a genetic algorithm. Gibbs sampling is used to estimate the latent Dirichlet allocation (LDA) model of a corpus, the model is used to infer the latent topic information of the target text, and the text is represented by that latent topic information. A parallel genetic algorithm then converts text segmentation into a multi-objective optimization process: the deep semantic information is used to calculate the coherence within segmentation units, the divergence between segmentation units, and the fitness function; the genetic iteration of the segmentation process is carried out; and whether segmentation ends is decided by the similarity among the results of multiple iterations or by an upper limit on the number of iterations. The globally optimal segmentation is thus obtained, which can improve the accuracy of segmenting short network texts.

The technical solution of the present invention is a network text segmentation method based on a genetic algorithm, characterized by comprising the following steps:

(a) use a web crawler to collect web pages from the network; preprocess the collected pages so that only the text information is kept; classify the denoised text with a naive Bayes text classifier; and build extended corpora by category;

(b) cluster the extended corpus with hierarchical clustering to determine the number of sub-topics, and estimate the LDA model of the extended corpus by Gibbs sampling; the estimation parameters take the empirical values α = 0.01 and β = 0.01, with a burn-in interval of 2000 and a thinning interval of 100;

(c) preprocess the text to be segmented by word segmentation, part-of-speech tagging, named entity recognition, and word sense disambiguation; count the frequencies of the nouns and verbs in the text and select the high-frequency words as the feature words of the text; then, using HowNet, calculate the similarity between the feature words of the text and the feature words of each extended corpus, and choose the corpus with the maximum similarity as the external corpus for segmentation; finally, use Gibbs sampling and the LDA model of that extended corpus to infer the semantic structure information contained in the text to be segmented, which comprises the sub-topic to which each word belongs and the probability of the word within a segmentation unit; the sub-topics of the words are used to represent the text to be segmented: taking the sentence as the unit, the sub-topic of each word is counted and each sentence is expressed as a sub-topic space vector S_j = (s_{j1}, s_{j2}, ..., s_{jl}, ..., s_{jT}), where s_{jl} is the frequency with which the words of sentence j belong to sub-topic l;

(d) perform text segmentation with a parallel genetic algorithm; individuals are binary-coded; the population is initialized by random number generation, and unqualified initial individuals are filtered out using two indicators, the minimum length of a semantic paragraph and the minimum number of semantic paragraphs in the text; according to the formula

C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2

the coherence within semantic paragraphs is calculated; in the formula,

a_{nl} = \frac{1}{|b_n|} \sum_{s_i \in b_n} s_{il},

|b_n| is the number of sentences in the n-th semantic paragraph, a_n is the mean vector of that semantic paragraph, and a_{nl} is the l-th component of this vector;

According to the formula

D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2

the divergence between semantic paragraphs is calculated; in the formula,

c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il};

The fitness function value of each individual in the genetic iteration is calculated from the coherence within semantic paragraphs and the divergence between semantic paragraphs; the calculation formula is as follows:

In the formula, the extended population is the population used to store the optimal solutions of the iteration;

In population selection, an elitist retention strategy is applied first: the elite individuals of the population and of the extended population are kept and pass directly into the next generation; roulette-wheel selection is then used to pick one individual from the population and one from the extended population, the fitness values of the two are compared, and the individual with the smaller fitness is selected for crossover and mutation;

Crossover uses the single-point method; to prevent inbreeding, crossover between the population and the extended population is allowed only when the Hamming distance between the individuals exceeds a threshold, which is set to 20% of the average Hamming distance between individuals; the mutation operator is adjusted adaptively according to the similarity of the population, which is calculated as follows:

Figure DEST_PATH_RE-GSB00000738153300011

The similarity of the optimal individuals in the extended population across iteration rounds is calculated according to the formula

Figure DEST_PATH_RE-GSB00000738153300012

When the similarity exceeds the threshold for 50 consecutive rounds, the iteration ends; an individual in the extended population is chosen as the text segmentation result; in the binary representation of the individual, a sentence corresponding to the digit "1" is a text segmentation boundary.

The beneficial effects of the invention are as follows. Gibbs sampling is used to estimate the latent Dirichlet allocation (LDA) model of a corpus, the model is used to infer the latent topic information of the target text, and the text is represented by that latent topic information. A parallel genetic algorithm then converts text segmentation into a multi-objective optimization process: the deep semantic information is used to calculate the coherence within segmentation units, the divergence between segmentation units, and the fitness function; the genetic iteration of the segmentation process is carried out; and whether segmentation ends is decided by the similarity among the results of multiple iterations or by an upper limit on the number of iterations. The globally optimal segmentation is thus obtained, and the accuracy of segmenting short network texts is improved.

The accuracy of text segmentation is usually measured by precision and recall; in addition to these two measures, the background art also uses the Pμ value as a criterion. In tests on 50 texts to be segmented in the above environment, the method of the present invention outperforms the background art on all 3 measures, and in particular is 15% better in terms of the Pμ value.

The present invention is described in detail below with reference to the accompanying drawing and an embodiment.

Description of drawings

The accompanying drawing is a flowchart of the network text segmentation method based on a genetic algorithm according to the present invention.

Embodiment

With reference to the accompanying drawing, this embodiment addresses a target text on the topic "Beijing Olympics"; the language use is standard and the text is short. The specific steps of text segmentation are as follows:

Step 1: set the crawler's search topic to words related to the Olympics and use the crawler to collect web pages from the network. The Olympic topic words are determined in three steps: 1) manually select several texts that represent the search topic, usually 10 to 20; 2) count the frequencies of the nouns and verbs in those texts and collect the high-frequency words into a candidate topic-word set, with the word-frequency threshold set to 30; 3) manually choose 10 to 15 words from the candidate topic-word set as the topic words.
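The frequency-threshold part of sub-step 2) can be sketched as follows. This is only an illustration: the input format (lists of `(word, tag)` pairs), the tag labels `"n"` and `"v"`, and the reading of the threshold as "at least 30 occurrences" are assumptions, since the text does not specify them.

```python
from collections import Counter

def candidate_topic_words(tagged_docs, min_count=30, keep_tags=("n", "v")):
    """Return nouns/verbs whose total frequency reaches min_count.

    tagged_docs: list of documents, each a list of (word, pos_tag) pairs.
    The tag names and the >= comparison are illustrative assumptions.
    """
    counts = Counter(
        word
        for doc in tagged_docs
        for word, tag in doc
        if tag in keep_tags
    )
    return sorted(word for word, count in counts.items() if count >= min_count)

# Toy input: "olympics" and "open" each occur 30 times, "medal" only 5 times.
docs = [[("olympics", "n"), ("open", "v"), ("the", "x")]] * 30 + [[("medal", "n")]] * 5
candidates = candidate_topic_words(docs)
```

The final choice of 10 to 15 topic words from this candidate set remains a manual step, as described above.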

Web pages are HTML documents, so the collected pages need text preprocessing, and the HTML tags must be filtered out when extracting the text information. Besides the title and body, a page also contains many links that are unrelated to the body text; these useless links must also be filtered out when extracting page content.

A naive Bayes binary text classifier is applied to the denoised text; pages unrelated to the topic are removed according to the classification result, and the topic corpus is built. Feature selection can use methods such as information gain (IG) or mutual information (MI). The topic corpus contains at least 1000 texts.

Step 2: estimate the LDA model of the corpus by Gibbs sampling. The Gibbs sampling iteration follows the formula

P(z_i = j \mid z_{-i}, w_i) = \frac{\frac{n^{w_i}_{-i,j} + \beta}{n^{*}_{-i,j} + W\beta} \cdot \frac{n^{d_i}_{-i,j} + \alpha}{n^{d_i}_{-i,*} + \alpha}}{\sum_{j=1}^{T} \frac{n^{w_i}_{-i,j} + \beta}{n^{*}_{-i,j} + W\beta} \cdot \frac{n^{d_i}_{-i,j} + \alpha}{n^{d_i}_{-i,*} + \alpha}}

where n^{w_i}_{-i,j} is the number of times the word corresponding to w_i is assigned to topic j, n^{*}_{-i,j} is the total number of words assigned to topic j, n^{d_i}_{-i,j} is the number of words in text d_i assigned to topic j, and n^{d_i}_{-i,*} is the total number of words in text d_i. All of these can be obtained by counting over the text, and the counts exclude the current term w_i.

Gibbs sampling proceeds in three steps:

1) at the start of the iteration, z_i is assigned an arbitrary value from 1 to T;

2) the probabilities of assigning w_i to topics 1 through T are calculated with the formula, and the topic with the maximum probability is taken to update the topic assignment of word w_i, giving the next state of the Markov chain;

3) whether the iteration ends is judged from the similarity of successive Markov chain states and the burn-in interval: the iteration ends when the similarity exceeds a threshold or the burn-in interval is reached.

In Gibbs sampling, the number of sub-topics is determined by hierarchical clustering, and the other parameters take the empirical values α = 0.01 and β = 0.01; the burn-in interval and the thinning interval are 2000 and 100 respectively. The iteration is run with the GibbsLDA++ tool.
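The per-word update in sub-step 2) of the Gibbs procedure can be illustrated with a small sketch. It evaluates the Step 2 formula term by term for one word and then takes the most probable topic, as the text describes. The function and argument names are hypothetical, W is assumed to be the vocabulary size (the text does not define it), and this is not the GibbsLDA++ implementation.

```python
def topic_probabilities(n_wt, n_t, n_dt, n_d, W, alpha=0.01, beta=0.01):
    """Normalised P(z_i = j | z_-i, w_i) for every topic j.

    All counts are assumed to exclude the current token w_i:
      n_wt[j]  times the word w_i is assigned to topic j
      n_t[j]   total words assigned to topic j
      n_dt[j]  words of text d_i assigned to topic j
      n_d      total words of text d_i
    W is assumed to be the vocabulary size.
    """
    weights = [
        (n_wt[j] + beta) / (n_t[j] + W * beta)
        * (n_dt[j] + alpha) / (n_d + alpha)
        for j in range(len(n_t))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy counts for T = 2 topics and a 5-word vocabulary.
probs = topic_probabilities(n_wt=[3, 0], n_t=[10, 10], n_dt=[2, 1], n_d=3, W=5)
# The text updates w_i with the topic of maximum probability.
best_topic = max(range(len(probs)), key=probs.__getitem__)
```

Because the factor `n_d + alpha` is the same for every topic, it cancels in the normalisation and does not affect which topic is selected.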

Step 3: preprocess the text to be segmented by word segmentation, part-of-speech tagging, named entity recognition, and word sense disambiguation; count the frequencies of the nouns and verbs in the text and select the high-frequency words as its feature words. Using HowNet and the contextual relations between sememes, calculate the similarity between the feature words of the text and the feature words of each corpus. Because the text to be segmented is most similar to the "Beijing Olympics" extended corpus generated in Step 1, that corpus is chosen as the external corpus for segmentation.

Gibbs sampling and the LDA model estimated in Step 2 are used to infer the semantic structure information contained in the text to be segmented, which includes the sub-topic to which each word belongs. The inference still uses the formula of Step 2, except that in Step 3 d_i denotes sentence i; that is, the word counts are taken per sentence.

The sub-topic of each word in a sentence is counted to construct the sub-topic space vector S_j = (s_{j1}, s_{j2}, ..., s_{jl}, ..., s_{jT}), where s_{jl} is the frequency with which the words of sentence j belong to sub-topic l.
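As a minimal sketch of this vector construction, assume the inference step has already produced one sub-topic index (0 to T-1) per word of the sentence; "frequency" is read here as a raw count, which is an assumption.

```python
def sentence_topic_vector(word_topics, num_topics):
    """Count how many words of one sentence fall into each sub-topic.

    word_topics: sub-topic index (0-based) inferred for each word.
    Returns a list of length num_topics with raw counts.
    """
    vector = [0] * num_topics
    for topic in word_topics:
        vector[topic] += 1
    return vector

# A three-word sentence: one word in sub-topic 0, two words in sub-topic 2.
vec = sentence_topic_vector([0, 2, 2], num_topics=3)
```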

Step 4: perform text segmentation with a parallel genetic algorithm. Individuals are binary-coded. The population is initialized by random number generation, and unqualified initial individuals are filtered out using two indicators, the minimum length of a semantic paragraph and the minimum number of semantic paragraphs in the text: the minimum paragraph length is no less than 3, and the number of paragraphs is no less than 5. According to the formula

C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2

the coherence within semantic paragraphs is calculated. In the formula,

a_{nl} = \frac{1}{|b_n|} \sum_{s_i \in b_n} s_{il},

|b_n| is the number of sentences in the n-th semantic paragraph, a_n is the mean vector of that semantic paragraph, and a_{nl} is the l-th component of this vector.
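The coherence formula can be checked on a toy example with the sketch below. It assumes each semantic paragraph is given as a list of sentence vectors and that k is the total number of sentences in the text, which the definition of c_l suggests but the text does not state outright.

```python
def coherence(paragraphs):
    """C_oh = 1 - (1/k) * sum of squared distances to the paragraph means.

    paragraphs: list of semantic paragraphs b_n, each a list of
    sentence vectors s_i of equal length T. k is assumed to be the
    total number of sentences.
    """
    k = sum(len(p) for p in paragraphs)
    scatter = 0.0
    for p in paragraphs:
        # a_n: component-wise mean of the sentence vectors in b_n
        mean = [sum(column) / len(p) for column in zip(*p)]
        for sentence in p:
            scatter += sum((s - a) ** 2 for s, a in zip(sentence, mean))
    return 1 - scatter / k

# Two paragraphs of two sentences each, T = 2 sub-topics.
toy = [[[1, 0], [1, 0]], [[0, 1], [0, 3]]]
c_oh = coherence(toy)
```

In the toy input the first paragraph is perfectly uniform and contributes nothing, while the second contributes a squared scatter of 2 over k = 4 sentences, giving a coherence of 0.5.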

According to the formula

D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2

the divergence between semantic paragraphs is calculated. In the formula,

c_l = \frac{1}{k} \sum_{i=1}^{k} s_{il}.
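A matching sketch for the divergence follows, under the same assumptions: paragraphs are lists of sentence vectors, and k is the total number of sentences, so c is the mean vector of the whole text.

```python
def divergence(paragraphs):
    """D_is = sum over paragraphs of (|b_n| / k) * ||a_n - c||^2.

    paragraphs: list of semantic paragraphs, each a list of sentence
    vectors of equal length T; c is the mean over all k sentences.
    """
    sentences = [s for p in paragraphs for s in p]
    k = len(sentences)
    overall_mean = [sum(column) / k for column in zip(*sentences)]
    total = 0.0
    for p in paragraphs:
        mean = [sum(column) / len(p) for column in zip(*p)]
        distance = sum((a - c) ** 2 for a, c in zip(mean, overall_mean))
        total += len(p) / k * distance
    return total

# Same toy layout: two paragraphs of two sentences, T = 2 sub-topics.
toy = [[[1, 0], [1, 0]], [[0, 1], [0, 3]]]
d_is = divergence(toy)
```

Here the overall mean is (0.5, 1.0) and both paragraph means lie at squared distance 1.25 from it, so the weighted sum is 1.25.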

The fitness function value of the genetic algorithm is calculated from the coherence within semantic paragraphs and the divergence between semantic paragraphs; the calculation formula is as follows:

In population selection, an elitist retention strategy is applied first: the individual with the smallest fitness function value in the population and the one in the extended population are chosen as elites, and the elites pass directly into the next generation. Roulette-wheel selection is then used to pick one individual from the population and one from the extended population; the fitness of the two is compared, and the individual with the smaller fitness is selected for crossover and mutation.

Crossover is carried out by single-point crossover. To prevent inbreeding, the individuals taking part in crossover must belong to different populations, and crossover between the two is allowed only when the Hamming distance between them exceeds a threshold, which is usually set to 20% of the average Hamming distance between individuals.
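The distance-gated crossover might look like the sketch below. Individuals are assumed to be equal-length lists of 0/1, and the average Hamming distance is taken over all pairs in a given pool; the text does not say which set of individuals the average is computed over, so that choice is an assumption, as are the function names.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length individuals differ."""
    return sum(x != y for x, y in zip(a, b))

def average_hamming(individuals):
    """Mean pairwise Hamming distance over a pool of individuals."""
    pairs = [(a, b) for i, a in enumerate(individuals) for b in individuals[i + 1:]]
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

def gated_crossover(parent_a, parent_b, threshold, rng=random):
    """Single-point crossover, applied only if the parents are far enough apart."""
    if hamming(parent_a, parent_b) <= threshold:
        return parent_a[:], parent_b[:]  # too similar: return unchanged copies
    point = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

a, b = [0, 0, 0, 0], [1, 1, 1, 1]
threshold = 0.2 * average_hamming([a, b])  # 20% of the average distance
child_a, child_b = gated_crossover(a, b, threshold, random.Random(0))
```

In a full implementation `parent_a` would come from the population and `parent_b` from the extended population, in line with the requirement that the two parents belong to different populations.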

The mutation operator is adjusted adaptively according to the similarity of the population, which is calculated as follows:

where x_i and x_j denote two individuals in the population. Mutation takes into account whether the mutated result satisfies the requirements on segmentation results, which are the same as the filtering requirements of population initialization; if it does not, a new individual is generated to replace the mutated one.
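The shared validity check used by initialization and mutation can be sketched as follows. Several points are assumptions made for illustration: a "1" is taken to mark the last sentence of a semantic paragraph, paragraph length is measured in sentences, mutation is a plain per-bit flip with a fixed rate (the adaptive rate is not modelled because its formula is not given here), and all function names are hypothetical.

```python
import random

def decode(individual):
    """Group sentence indices into paragraphs; a 1 closes the current paragraph."""
    paragraphs, current = [], []
    for index, bit in enumerate(individual):
        current.append(index)
        if bit == 1:
            paragraphs.append(current)
            current = []
    if current:  # trailing sentences form the last paragraph
        paragraphs.append(current)
    return paragraphs

def is_valid(individual, min_len=3, min_paragraphs=5):
    """Filter rule: enough paragraphs, each with enough sentences."""
    paragraphs = decode(individual)
    return len(paragraphs) >= min_paragraphs and all(
        len(p) >= min_len for p in paragraphs
    )

def random_valid_individual(n_sentences, rng, max_tries=10000, **limits):
    """Draw random bit strings until one passes the filter."""
    for _ in range(max_tries):
        candidate = [rng.randint(0, 1) for _ in range(n_sentences)]
        if is_valid(candidate, **limits):
            return candidate
    raise RuntimeError("no valid individual found")

def mutate(individual, rate, rng, **limits):
    """Flip bits with probability rate; replace invalid results with a new individual."""
    mutated = [1 - bit if rng.random() < rate else bit for bit in individual]
    if is_valid(mutated, **limits):
        return mutated
    return random_valid_individual(len(individual), rng, **limits)

rng = random.Random(1)
limits = {"min_len": 2, "min_paragraphs": 2}  # relaxed limits for a short toy text
parent = random_valid_individual(8, rng, **limits)
child = mutate(parent, 0.3, rng, **limits)
```

Rejection sampling is the simplest way to honour the filter, though for long texts with the limits of 3 and 5 a constructive generator would likely be faster.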

When the similarity of the optimal individuals in the extended population across iteration rounds exceeds the threshold for 50 consecutive rounds, the iteration ends. An individual in the extended population is chosen as the text segmentation result; in the binary representation of the individual, a sentence corresponding to the digit "1" is a text segmentation boundary.
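The stopping rule can be sketched as a check over the history of best individuals. The similarity formula itself is not reproduced in this text, so the sketch substitutes a simple stand-in (the share of matching bits); the threshold value and the optional round limit are likewise assumptions.

```python
def similarity(a, b):
    """Stand-in similarity: share of positions where two individuals agree."""
    return 1 - sum(x != y for x, y in zip(a, b)) / len(a)

def should_stop(best_history, threshold, patience=50, max_rounds=None):
    """True once consecutive best individuals stay similar for `patience` rounds.

    best_history: best individual of the extended population, one per round.
    max_rounds: optional upper limit on the number of iterations.
    """
    if max_rounds is not None and len(best_history) >= max_rounds:
        return True
    streak = 0
    # Walk backwards over consecutive (previous, current) pairs.
    for prev, cur in zip(reversed(best_history[:-1]), reversed(best_history[1:])):
        if similarity(prev, cur) <= threshold:
            break
        streak += 1
        if streak >= patience:
            return True
    return False

stable = [[0, 1, 0]] * 51             # 50 unchanged transitions
stop_now = should_stop(stable, threshold=0.9)
```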

Claims

1. A network text segmentation method based on a genetic algorithm, characterized by comprising the following steps:

C_{oh} = 1 - \sum_{n=1}^{j} \frac{1}{k} \sum_{s_i \in b_n} \sum_{l=1}^{T} (s_{il} - a_{nl})^2

the coherence within semantic paragraphs is calculated; in the formula,

According to the formula

D_{is} = \sum_{n=1}^{j} \frac{|b_n|}{k} \sum_{l=1}^{T} (a_{nl} - c_l)^2

the divergence between semantic paragraphs is calculated; in the formula,

In the formula, the extended population is the population used to store the optimal solutions of the iteration;