CN103631856B

CN103631856B - Subject visualization method for Chinese document set

Info

Publication number: CN103631856B
Application number: CN201310488312.7A
Authority: CN
Inventors: 朱敏; 梁婷; 甘启宏; 李明召; 李; 李�一
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2013-10-17
Filing date: 2013-10-17
Publication date: 2017-01-11
Anticipated expiration: 2033-10-17
Also published as: CN103631856A

Abstract

The invention discloses a theme visualization method for a Chinese document set. The method comprises the steps of classifying the document set by theme, dividing the document set into time periods, calculating theme frequencies, ordering the themes, generating a theme flow graph, extracting keywords that express the content of each theme, calculating and ranking keyword weights, and generating a word cloud. The invention further provides an ordering method based on theme frequency and geometric complementarity, a word cloud layout method, and a method for generating a detailed word cloud. The advantages are as follows: theme visualization of a Chinese document set is achieved; the theme flow graph produced with the ordering method based on theme frequency and geometric complementarity is more attractive, flatter, uses space more efficiently, and is better suited to word cloud placement; the word cloud layout method uses space effectively and greatly improves layout efficiency; and the detailed word cloud can show all keyword content of a theme.

Description

A theme visualization method for a Chinese document set

Technical field

The present invention relates to the fields of text visualization and theme analysis, and specifically to a theme visualization method for a Chinese document set.

Background technology

Large document collections, such as news, scientific literature, web pages, electronic publications, and bulletins, contain a great deal of information. As digitization spreads, these collections grow daily. Reading and understanding such vast amounts of information quickly, and extracting useful knowledge from it, has become an urgent problem.

A "theme" generally comprises a core event or activity together with all events and activities directly related to it. Mainstream theme detection methods use techniques such as clustering, classification, retrieval, and topic tracking to classify and organize a document set hierarchically by theme, so that users can retrieve, select, and browse documents. However, after the documents have been classified, users still need to spend a great deal of time reading all documents under a theme to understand its main content, discover latent knowledge, and obtain the information they need.

Building on theme detection, multi-document automatic summarization gathers the content of a theme, removes redundant information, and generates a comprehensive, concise text, which greatly improves the efficiency of information acquisition. However, existing multi-document summaries are usually rather complex and hard for users to understand, the summary generation process is difficult to control, and friendly user interfaces and human-computer interaction are lacking. In addition, multi-document summarization often ignores attributes other than the text content, such as time and quantity. It therefore has difficulty showing how themes and their content evolve over time within a document set, and it cannot reflect the relationships among the themes of the same document set.

Text visualization, an important branch of information visualization, exploits the innate human ability to recognize, remember, and analyze graphics. It converts textual information into graphical images to help people understand, read, and analyze the content and structure of text intuitively and efficiently, and it helps them discover valuable knowledge and patterns through corresponding interactive operations.

Word Cloud visualization abstracts the content of a text into a set of words, uses font size to express the frequency of each word, and then arranges the words compactly and aesthetically according to certain rules to show the characteristics of the text. However, a word cloud can only visualize a single document. For multiple documents, ThemeRiver visualizes the themes in a document set and shows how the strength of each theme in the document set changes over time. The original ThemeRiver contains only theme strength and time information, and the themes are ordered randomly. Later, Liu Shixia et al. proposed TIARA, an improved theme flow that embeds word clouds in the theme flow to further visualize the content of each theme, helping users quickly analyze how the content of the themes changes over time.

The text visualization techniques above lack generality and are not suitable for Chinese documents, and so far there is still no visualization technique for theme analysis of Chinese documents. In addition, the TIARA technique, which targets theme visualization of English documents only, has the following problems: 1) the shape and layout of the word clouds in the theme flow are unstable, which easily misleads users and impairs theme analysis; 2) because the area is limited, the generated word cloud cannot show all of the key content of each theme.

Summary of the invention

The object of the invention is to provide a theme visualization method for a Chinese document set, which counts and processes the information of each theme extracted from the Chinese document set, quantifies the strength of each theme and the weight of its content, and then presents them graphically.

The technical solution that achieves the object of the invention is as follows. A theme visualization method for a Chinese document set comprises a step of classifying the document set by theme: let the document set have n themes l_j, j=0,1,2,...,n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j=0,1,2,...,n-1, where the document subset corresponding to theme l_j is D_j;

A step of dividing the document set into time periods: let the start time of the document set be t_start and the end time be t_end; divide the time span [t_start, t_end] of the document set equally to obtain time periods T_p = (t_start + (p-1)Δt, t_start + pΔt], where p=1,2,...,m-1, Δt = (t_end - t_start)/(m-1), and m is the total number of time points consisting of the start time, the end time, and the equal-division time points;

A step of calculating theme frequencies: let the theme frequencies comprise v_j,0 and v_j,p, where v_j,0 is the number of documents of the document subset D_j corresponding to theme l_j at the start time t_start, and v_j,p is the number of documents of D_j within time period T_p; calculate the theme frequencies of each theme;

A step of ordering the themes: order all themes to obtain an ordered theme list;

A step of generating the theme flow graph: generate the theme flow graph with the theme flow algorithm, according to the ordered theme list and the theme frequencies;

A step of extracting the keywords that represent theme content: let W_j,p be the keyword subset that represents the content of theme l_j in the documents of the corresponding document subset D_j within time period T_p; use the General Word Segmentation System for Modern Chinese to extract, from the document subset of each theme, the keyword subset that represents the theme content in the documents of each time period;

A step of calculating and ranking keyword weights: let the weight of a keyword be the number of times the keyword occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort all keywords in each keyword subset by weight in descending order;

A step of generating the word cloud: generate the word cloud on the theme flow graph according to the keyword subsets and keyword weights.
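The two counting steps above (theme frequency and keyword weight) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: document times are plain numbers (for example issue numbers), each document is a hypothetical `(theme, time)` pair, and the function names `period_index`, `theme_frequencies`, and `keyword_weights` are invented here rather than taken from the patent.

```python
import math
from collections import Counter


def period_index(t, t_start, t_end, m):
    """Map a document time t to 0 (t == t_start) or to the p with t in T_p."""
    dt = (t_end - t_start) / (m - 1)          # length of one time period
    if t == t_start:
        return 0
    # T_p = (t_start + (p-1)*dt, t_start + p*dt], so p is the ceiling of the offset in periods
    return math.ceil((t - t_start) / dt)


def theme_frequencies(docs, t_start, t_end, m):
    """docs: iterable of (theme, time) pairs. Returns {theme: [v_j0, v_j1, ..., v_j(m-1)]}."""
    freq = {}
    for theme, t in docs:
        row = freq.setdefault(theme, [0] * m)
        row[period_index(t, t_start, t_end, m)] += 1
    return freq


def keyword_weights(tokens):
    """tokens: segmented keywords of one subset W_jp. Returns (keyword, weight) pairs, heaviest first."""
    return Counter(tokens).most_common()


docs = [("db", 1), ("db", 2), ("db", 2), ("os", 9)]
freq = theme_frequencies(docs, t_start=1, t_end=9, m=9)
ranked = keyword_weights(["索引", "查询", "索引"])
```

With nine issues (m = 9) the period length is one issue, so a document from issue 2 falls into T_1 and a document from issue 1 is counted in v_j,0.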

In the above technical solution, the step of ordering the themes may use an ordering method based on theme frequency and geometric complementarity, comprising:

Step 1: let the initial time of theme l_j be OT_j. When v_j,0 is not zero, take the start time t_start of the document set as OT_j; when v_j,0 is zero, take as OT_j the smallest left endpoint among the time periods T_p whose v_j,p is not zero. Calculate the initial time of each theme;

Step 2: let the frequency sum of theme l_j be the sum of its theme frequencies v_j,p over p=0,1,...,m-1. Calculate the frequency sum of each theme;

Step 3: create an empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme l_up, and write the theme with the second largest frequency sum into the second row as the lower endpoint theme l_down. If n is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper endpoint theme l_up and the lower endpoint theme l_down;

Step 4: select a theme l_i that is not in list B, and calculate the mean of the summed frequencies of l_up and l_i:

\overline{V(l_{up} + l_{i})} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{up,p} + v_{i,p}) \qquad (1)

Calculate the geometric complementarity of l_up and l_i, expressed by the variance σ_up,i:

\sigma_{up,i} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{up,p} + v_{i,p}) - \overline{V(l_{up} + l_{i})} \right)^{2}} \qquad (2)

After normalizing σ_up,i and OT_i, calculate the weighted value D_i:

D_i = s·OT_i + (1-s)·σ_up,i (3)

where s is a control parameter, 0≤s≤1;

Step 5: repeat step 4 until the weighted value of every theme not in list B has been obtained;

Step 6: select the theme with the smallest weighted value D_i and insert it above the upper endpoint theme in list B as the new upper endpoint theme l_up;

Step 7: select a theme l_k that is not in list B, and calculate the mean of the summed frequencies of l_down and l_k:

\overline{V(l_{down} + l_{k})} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{down,p} + v_{k,p}) \qquad (4)

Calculate the geometric complementarity of l_down and l_k, expressed by the variance σ_down,k:

\sigma_{down,k} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{down,p} + v_{k,p}) - \overline{V(l_{down} + l_{k})} \right)^{2}} \qquad (5)

After normalizing σ_down,k and OT_k, calculate the weighted value D_k:

D_k = s·OT_k + (1-s)·σ_down,k (6)

where s is a control parameter, 0≤s≤1;

Step 8: repeat step 7 until the weighted value of every theme not in list B has been obtained;

Step 9: select the theme with the smallest weighted value D_k and insert it below the lower endpoint theme in list B as the new lower endpoint theme l_down;

Step 10: repeat steps 4 to 9 until all themes have been added to list B.

In the present invention, the control parameter s takes the value 0.3.
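Steps 1 to 10 can be sketched as follows. Several details are assumptions made for illustration, because the text does not fix them: the normalization is taken to be min-max scaling over the themes still outside list B, OT_j is measured as a number of periods after t_start, ties are broken by the order of decreasing frequency sum, and a theme with no documents at all is treated as starting last. The names `initial_time` and `order_themes` are hypothetical.

```python
from statistics import pstdev


def initial_time(v):
    """OT_j in periods after t_start: 0 if v[0] != 0, else left end of the first non-empty T_p."""
    if v[0] != 0:
        return 0
    for p in range(1, len(v)):
        if v[p] != 0:
            return p - 1                      # left endpoint of T_p is t_start + (p-1)*dt
    return len(v) - 1                         # assumption: an empty theme starts last


def _normalize(values):
    """Min-max scaling to [0, 1] (assumed; the source only says 'after normalization')."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in values]


def _pick(end, candidates, freq, s):
    """Return the candidate with the smallest D = s*OT + (1-s)*sigma relative to endpoint `end`."""
    sigma = [pstdev([a + b for a, b in zip(freq[end], freq[c])]) for c in candidates]
    ot = [initial_time(freq[c]) for c in candidates]
    d = [s * o + (1 - s) * g for o, g in zip(_normalize(ot), _normalize(sigma))]
    return candidates[d.index(min(d))]


def order_themes(freq, s=0.3):
    """freq: {theme: [v_j0, ..., v_j(m-1)]}. Returns list B from top to bottom."""
    remaining = sorted(freq, key=lambda t: sum(freq[t]), reverse=True)
    up = down = remaining.pop(0)              # largest frequency sum
    order = [up]
    if len(freq) % 2 == 0:                    # even n: second largest becomes the lower endpoint
        down = remaining.pop(0)
        order.append(down)
    while remaining:
        up = _pick(up, remaining, freq, s)    # steps 4 to 6
        remaining.remove(up)
        order.insert(0, up)
        if remaining:
            down = _pick(down, remaining, freq, s)   # steps 7 to 9
            remaining.remove(down)
            order.append(down)
    return order


freq = {"T1": [6, 1, 6, 1], "T2": [3, 3, 3, 4], "T3": [1, 5, 1, 5], "T4": [2, 0, 2, 0]}
order = order_themes(freq)
```

In this toy input, T3 rises where T1 falls, so their sum is nearly flat and T3 is chosen as the neighbour above T1, while T4 ends up below T2.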

In the foregoing solution, the word cloud is generated on the theme flow graph as follows:

Step 1: select the region G_j corresponding to theme l_j on the theme flow graph; its start time and end time equal the start time t_start and end time t_end of the document set, respectively. Divide the time span [t_start, t_end] of region G_j into m-1 segments, each of length Δt = (t_end - t_start)/(m-1), to obtain the equal-division time points t_start + pΔt, where p=1,2,...,m-2;

Step 2: centred in turn on each equal-division time point t_start + pΔt, cut out a subregion R_j,p from region G_j according to Δt; R_j,p is the closed region formed by a point set, curves, and line segments;

Step 3: place the keywords of keyword subset W_j,p in each subregion R_j,p to generate the word cloud of theme l_j, comprising:

3.1 connect each of the N points of the point set in sequence with line segments to obtain an approximate polygon of subregion R_j,p;

3.2 let y_max be the y-coordinate of the vertex with the largest y-coordinate among the vertices of the approximate polygon of subregion R_j,p, and let y_min be the y-coordinate of the vertex with the smallest y-coordinate;

3.3 intersect a set of horizontal lines H = {y=c | y_min≤c≤y_max, c∈Z} with subregion R_j,p to obtain a number of intersection segments; for each intersection line, take the sub-segments that lie inside the polygon, expressed as r_c(i) = \overline{s_c(i) e_c(i)}, 0<i≤M, where M is the number of sub-segments of that intersection line lying inside R_j,p; R_j,p is thereby represented as a set of horizontal line segments:

L_{j,p} = \{ r_{c}(i) = \overline{s_{c}(i) e_{c}(i)},\ y_{\min} \leq c \leq y_{\max},\ 0 < i \leq M \}

3.4 select keywords from W_j,p one at a time in descending order of weight; set a rectangle of height h and width w to stand in for the keyword during layout, then place the keyword at the layout position, comprising:

A. Check whether a rectangle of width w and height h can be placed at position c = (y_max - y_min)/2 in the set L_j,p corresponding to R_j,p. The check is as follows: examine the segments r_c(i) of the rows from c to c-h and determine whether there is a common index i for which the length of segment r_c(i) is greater than w in every row. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, go to step B;

B. Starting from c = (y_max - y_min)/2, traverse L_j,p by alternately setting c = c+1 and c = c-1, and check whether the rectangle can be placed at position c. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, continue the alternating traversal until an r_c(i) satisfying the condition is found or all r_c(i) have been traversed. If no position c satisfying the condition is found after all r_c(i) have been traversed, discard the keyword;

3.5 repeat step 3.4 until all keywords in W_j,p have been placed;

Step 4: repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
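The placement loop of step 3.4 can be sketched on a region that is already stored as rows of free horizontal segments. The sketch deviates from the text in a few labeled ways: it starts at the vertical midpoint (y_min + y_max) // 2, which coincides with (y_max - y_min)/2 when y_min = 0; it uses the h rows c-h+1, ..., c; it accepts a segment whose free length is at least w; and it advances the start of the segment in every covered row so that later rectangles cannot overlap. `rows`, `place_keywords`, and `_fit` are hypothetical names.

```python
def _fit(rows, c, w, h):
    """Try to place a w x h rectangle whose top row is c; return its left x or None."""
    band = range(c - h + 1, c + 1)
    if any(y not in rows for y in band):
        return None
    n_seg = min(len(rows[y]) for y in band)
    for i in range(n_seg):                       # same segment index i in every row
        x0 = max(rows[y][i][0] for y in band)    # common free interval starts here
        x1 = min(rows[y][i][1] for y in band)
        if x1 - x0 >= w:
            for y in band:
                rows[y][i][0] = x0 + w           # consume the space in all covered rows
            return x0
    return None


def place_keywords(rows, rects):
    """rows: {y: [[start, end], ...]}; rects: (word, w, h) by descending weight.

    Returns [(word, x, c)] for the placed words; rows is updated in place.
    """
    y_min, y_max = min(rows), max(rows)
    center = (y_min + y_max) // 2
    candidates = [center]
    for d in range(1, y_max - y_min + 1):        # center+1, center-1, center+2, ...
        candidates += [center + d, center - d]
    placed = []
    for word, w, h in rects:
        for c in candidates:
            x = _fit(rows, c, w, h)
            if x is not None:
                placed.append((word, x, c))
                break
        # a word that fits nowhere is discarded, as in step B
    return placed


rows = {y: [[0, 10]] for y in range(5)}
placed = place_keywords(rows, [("a", 6, 2), ("b", 6, 2), ("c", 4, 1), ("d", 20, 1)])
```

Because placing a word only shortens stored segments, no pairwise collision test between words is needed, which is the efficiency argument made for this layout.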

The invention also proposes a method for generating a detailed word cloud, comprising:

Step 1: select the keyword set that expresses the content of theme l_j;

Step 2: set up a circular region C and discretize the boundary of C into a set of conflict points P;

Step 3: take a keyword from the keyword set in descending order of weight, and use the random greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it at the centre of region C;

Step 4: set the font size according to the weight of the keyword; then, according to the font size and the number of characters in the keyword, approximate the keyword by a rectangle r whose lower-left corner is at the position coordinate;

Step 5: for each conflict point in P, check whether the point conflicts with r; if a conflict exists, go to step 6; if no conflict exists, go to step 7;

Step 6: update the position coordinate along a spiral path and repeat steps 4 and 5 until a position coordinate satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;

Step 7: place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;

Step 8: repeat steps 3 to 7 until all keywords in the keyword set have been placed.

When generating the detailed word cloud, any single keyword subset W_j,p of theme l_j may be selected as the keyword set that expresses the content of the theme, or the keyword set formed by merging all keyword subsets W_j,p of theme l_j may be selected.
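The detailed word cloud procedure can be sketched as below. The text does not specify the spiral, the weight-to-font-size mapping, or the random greedy candidate generation, so the sketch substitutes simple stand-ins and labels them: every keyword starts at the centre of C, the spiral is Archimedean (r = 0.5·θ), the font size grows linearly with weight, and each character occupies a square cell. The corner test `inside_circle` is an extra safeguard that is not part of the described steps. All function names are hypothetical.

```python
import math


def circle_boundary(cx, cy, radius, step=1.0):
    """Discretize the boundary of circle C into conflict points about `step` apart."""
    n = max(8, int(2 * math.pi * radius / step))
    return [(cx + radius * math.cos(2 * math.pi * k / n),
             cy + radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def rect_points(x, y, w, h, step=1.0):
    """Discretize an occupied rectangle into grid points, including its far edges."""
    xs = [x + k * step for k in range(int(w / step) + 1)] + [x + w]
    ys = [y + k * step for k in range(int(h / step) + 1)] + [y + h]
    return [(px, py) for px in xs for py in ys]


def conflicts(x, y, w, h, points):
    """True if any conflict point lies inside the rectangle with lower-left corner (x, y)."""
    return any(x <= px <= x + w and y <= py <= y + h for px, py in points)


def inside_circle(x, y, w, h, cx, cy, radius):
    """Added safeguard (not in the source): all four corners must lie within C."""
    return all(math.hypot(px - cx, py - cy) <= radius
               for px in (x, x + w) for py in (y, y + h))


def layout_detail_cloud(words, radius=80.0, max_spiral=100.0):
    """words: list of (text, weight). Returns {text: (x, y, w, h)} for the placed keywords."""
    if not words:
        return {}
    cx = cy = 0.0
    points = circle_boundary(cx, cy, radius)
    top = max(weight for _, weight in words)
    placed = {}
    for text, weight in sorted(words, key=lambda kw: kw[1], reverse=True):
        size = 4.0 + 12.0 * weight / top       # assumed weight-to-font-size mapping
        w, h = size * len(text), size          # assumed box: one square cell per character
        theta = 0.0
        while 0.5 * theta <= max_spiral:       # assumed Archimedean spiral r = 0.5 * theta
            r = 0.5 * theta
            x, y = cx + r * math.cos(theta), cy + r * math.sin(theta)
            if inside_circle(x, y, w, h, cx, cy, radius) and not conflicts(x, y, w, h, points):
                placed[text] = (x, y, w, h)
                points.extend(rect_points(x, y, w, h))
                break
            theta += 0.1
        # a keyword whose spiral radius exceeds max_spiral is dropped (step 6)
    return placed


placed = layout_detail_cloud([("操作系统", 10), ("进程", 6), ("调度", 3)])
```

Each placed rectangle is turned into conflict points on a unit grid, so a later rectangle whose sides are at least one unit long cannot overlap an earlier one without covering one of those points.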

Compared with the prior art, the invention has the following technical effects. 1. It achieves theme visualization of a Chinese document set. 2. After the themes are ordered with the ordering method based on theme frequency and geometric complementarity, the generated theme flow graph is more attractive, flatter, and uses space more efficiently, which favours word cloud placement. 3. The word cloud layout method of the invention uses space effectively, so more keywords can be placed for the same region size and font size; the generated layout is stable and does not change with subsequent interactive operations; and the algorithm is considerably more efficient, because it represents an irregular region as ordered discrete entities according to a fixed rule, so that placing a word only requires traversing the entities to find one that satisfies the placement condition, with no collision detection or boundary detection. 4. The detailed word cloud can show all keyword content of a theme.

Brief description of the drawings

Fig. 1 is a flow chart of the invention.

Fig. 2 shows, for the first embodiment of the invention, the result of visualizing theme content with the word cloud layout algorithm of the TIARA technique after the themes are ordered randomly and the theme flow graph is generated.

Fig. 3 is a flow chart of the ordering method based on theme frequency and geometric complementarity in the second embodiment of the invention.

Fig. 4 shows the theme flow graph generated in the second embodiment of the invention.

Fig. 5 is a schematic diagram of cutting out a subregion in the third embodiment of the invention.

Fig. 6 is a schematic diagram of approximating a subregion by a set of horizontal line segments in the third embodiment of the invention.

Fig. 7 shows the result of placing keywords on the theme flow graph in the third embodiment of the invention, where the theme flow graph is generated after the themes are ordered randomly.

Fig. 8 shows the result of theme visualization of a Chinese document set in the fourth embodiment of the invention, where the theme flow graph is generated after ordering with the method based on theme frequency and geometric complementarity, and the word cloud is generated by the improved layout algorithm.

Fig. 9 shows the result of theme visualization of a Chinese document set with the detailed word cloud added, in the fifth embodiment of the invention.

Detailed description of the invention

Embodiment 1: the theme visualization method for Chinese documents is illustrated below with reference to Fig. 1, using data from the journal "Journal of Software" as an example.

Step 1, classify the document set by theme: let the document set have n themes l_j, j=0,1,2,...,n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j=0,1,2,...,n-1, where the document subset corresponding to theme l_j is D_j. Specifically, the paper data of issues 1 to 9 of the journal "Journal of Software" are input, and the document set is classified into five themes (system software and software engineering, database technology, computer network and information security, pattern recognition and artificial intelligence, and operating systems), giving five document subsets.

Step 2, divide the document set into time periods: let the start time of the document set be t_start and the end time be t_end; divide the time span [t_start, t_end] of the document set equally to obtain time periods T_p = (t_start + (p-1)Δt, t_start + pΔt], where p=1,2,...,m-1, Δt = (t_end - t_start)/(m-1), and m is the total number of time points consisting of the start time, the end time, and the equal-division time points. Here the start time of the document set is issue 1 and the end time is issue 9; the time span is divided into 8 time periods with an interval of 1 issue.

Step 3, calculate the theme frequencies: let the theme frequencies comprise v_j,0 and v_j,p, where v_j,0 is the number of documents of the document subset D_j corresponding to theme l_j at the start time t_start, and v_j,p is the number of documents of D_j within time period T_p; calculate the theme frequencies of each theme. Here the number of documents of each theme appearing in issue 1 and in each of the other issues is calculated, i.e. the number of documents each theme contains at the start time of the document set and within each time period.

Step 4, order the themes: order all themes to obtain an ordered theme list. In this embodiment the themes are ordered with the conventional random arrangement. The choice of ordering method directly affects the appearance of the generated theme flow graph.

Step 5, generate the theme flow graph: generate the theme flow graph with the theme flow algorithm, according to the ordered theme list and the theme frequencies, thereby visualizing theme strength. In this embodiment the theme flow graph shown by the coloured bands in Fig. 2 is generated: the blue band is pattern recognition and artificial intelligence, the purple band is computer network and information security, the red band is operating systems, the yellow band is database technology, and the green band is system software and software engineering. The theme flow algorithm (from "ThemeRiver: Visualizing thematic changes in large document collections") interpolates the weights of each theme at the discrete times (the interpolation function must satisfy the constraint that its derivative is zero at extreme points) and then draws a stacked graph to generate the theme flow graph. In the theme flow graph, the horizontal axis represents time, the vertical extent of a band represents theme strength, and bands of different colours represent different themes. A band that widens or narrows over time shows how the strength of the theme evolves.

Step 6, extract the keywords that represent theme content: let W_j,p be the keyword subset that represents the content of theme l_j in the documents of the corresponding document subset D_j within time period T_p; use the "General Word Segmentation System for Modern Chinese" to extract, from the document subset of each theme, the keyword subset that represents the theme content in the documents of each time period. In this embodiment the bag-of-words text analysis technique is used to extract the keyword subsets. Specifically, the "General Word Segmentation System for Modern Chinese" developed by the Institute of Language Information Processing of Beijing Language and Culture University is used to segment all documents of each time period in the document subset of each theme; stop words such as modal particles, adverbs, prepositions, and conjunctions are removed; and multiple keyword subsets are finally obtained.

Step 7, calculate and rank keyword weights: the weight of a keyword is the number of times the keyword occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort all keywords by weight in descending order.

Step 8, generate the word cloud: according to the keyword subsets and keyword weights, use the TIARA algorithm to generate the word cloud on the theme flow graph, thereby visualizing theme content.

The result of visualizing the themes of the Chinese document set with the method of this embodiment is shown in Fig. 2.
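The stacking and interpolation mentioned in step 5 can be illustrated with a small sketch. It is not the ThemeRiver implementation: it assumes that the stack is centred on the horizontal axis, as is common for stream graphs, and it uses a smoothstep blend between neighbouring samples as one simple interpolant whose slope is zero at every sample point (and therefore at the extreme points). `stack_bands` and `smooth` are hypothetical names.

```python
def stack_bands(order, freq):
    """Return {theme: [(y_low, y_high), ...]}, one pair per time point.

    order: themes from top to bottom; freq: {theme: [v_j0, ..., v_j(m-1)]}.
    The whole stack is centred on y = 0 at every time point.
    """
    m = len(freq[order[0]])
    bands = {t: [] for t in order}
    for p in range(m):
        total = sum(freq[t][p] for t in order)
        y = total / 2                          # top edge of the stack
        for t in order:
            bands[t].append((y - freq[t][p], y))
            y -= freq[t][p]                    # next band starts where this one ends
    return bands


def smooth(v0, v1, u):
    """Blend from v0 (u = 0) to v1 (u = 1) with zero slope at both ends (smoothstep)."""
    k = u * u * (3 - 2 * u)
    return v0 + (v1 - v0) * k


bands = stack_bands(["A", "B"], {"A": [2, 4], "B": [2, 0]})
midpoint = smooth(0, 10, 0.5)
```

The height of each `(y_low, y_high)` pair equals the theme frequency at that time point, which is what the width of a coloured band encodes in the theme flow graph.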

Embodiment 2: in the above theme visualization method the themes are ordered randomly. When the theme flow is generated, if the strength of a theme changes too sharply, the shapes of the adjacent themes are distorted, so the result is unattractive and the relative strength of the themes is hard to discern. Distorted themes also hamper word cloud placement. At the same time, among all themes of a document set, users tend to care most about the content of the strongest themes. The invention therefore further improves the theme ordering step of Embodiment 1 and provides an ordering method based on theme frequency and geometric complementarity. The ordering method is described in detail below with reference to Fig. 3:

Step 2: let the frequency sum of theme l_j be the sum of its theme frequencies v_j,p over p=0,1,...,m-1. Calculate the frequency sum of each theme;

Step 3: create an empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme l_up, and write the theme with the second largest frequency sum into the second row as the lower endpoint theme l_down. If n is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper endpoint theme l_up and the lower endpoint theme l_down;

Step 4: select a theme l_i that is not in list B, and calculate the mean of the summed frequencies of l_up and l_i:

\overline{V(l_{up} + l_{i})} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{up,p} + v_{i,p}) \qquad (1)

Calculate the geometric complementarity of l_up and l_i, expressed by the variance σ_up,i:

\sigma_{up,i} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{up,p} + v_{i,p}) - \overline{V(l_{up} + l_{i})} \right)^{2}} \qquad (2)

After normalizing σ_up,i and OT_i, calculate the weighted value D_i:

D_i = s·OT_i + (1-s)·σ_up,i (3)

where s is a control parameter, 0≤s≤1;

For the lower endpoint theme l_down and a theme l_k that is not in list B, the mean of the summed frequencies and the variance σ_down,k are calculated in the same way:

\overline{V(l_{down} + l_{k})} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{down,p} + v_{k,p}) \qquad (4)

\sigma_{down,k} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{down,p} + v_{k,p}) - \overline{V(l_{down} + l_{k})} \right)^{2}} \qquad (5)

After normalizing σ_down,k and OT_k, calculate the weighted value D_k:

D_k = s·OT_k + (1-s)·σ_down,k (6)

where s is a control parameter, 0≤s≤1;

In this embodiment, the control parameter s takes the value 0.3.

After the themes are ordered with the ordering method of this embodiment, the generated theme flow graph is as shown in Fig. 4. It can be seen that the theme flow graph is more attractive, flatter, and uses space more efficiently, which favours word cloud placement.

Embodiment 3: to address the unstable shape and layout of word clouds in the TIARA technique, the invention also improves the word cloud. The theme is first divided into several subregions; each region is then represented as a set of horizontal line segments with the scalable algorithm (from "Tag Cloud++ - Scalable Tag Clouds for Arbitrary Layouts"); finally the keywords are placed in turn to generate the word cloud. The visual characteristics are as follows: 1) the larger the weight of a keyword, the larger its font; 2) the larger the weight of a keyword, the closer it is to the centre of the region. The method is described in detail below with reference to Fig. 5 and Fig. 6:

Step 1: select the region G_j corresponding to theme l_j on the theme flow graph; its start time and end time equal the start time t_start and end time t_end of the document set, respectively. Divide the time span [t_start, t_end] of region G_j into m-1 segments, each of length Δt = (t_end - t_start)/(m-1), to obtain the equal-division time points t_start + pΔt, where p=1,2,...,m-2;

Step 2: as shown in Fig. 5, centred in turn on each equal-division time point t_start + pΔt, cut out a subregion R_j,p from region G_j according to Δt; R_j,p is the closed region formed by a point set, curves, and line segments;

Step 3: place the keywords of keyword subset W_j,p in each subregion R_j,p to generate the word cloud of theme l_j, comprising:

3.1 connect each of the N points of the point set in sequence with line segments to obtain an approximate polygon of subregion R_j,p;

3.3 as shown in Fig. 6, intersect a set of horizontal lines H = {y=c | y_min≤c≤y_max, c∈Z} with subregion R_j,p to obtain a number of intersection segments; for each intersection line, take the sub-segments that lie inside the polygon, expressed as r_c(i) = \overline{s_c(i) e_c(i)}, 0<i≤M, where M is the number of sub-segments of that intersection line lying inside R_j,p; R_j,p is thereby represented as a set of horizontal line segments;

3.5 repeat step 3.4 until all keywords in W_j,p have been placed;

Fig. 7 shows the result of generating the word cloud with this method, where the theme flow graph is generated with the themes ordered randomly. Comparison with Fig. 2 shows that the word cloud layout algorithm of the invention has the following advantages: 1) it uses space effectively, so more keywords can be placed for the same region size and font size; 2) the generated layout is stable and does not change with subsequent interactive operations; 3) the algorithm is considerably more efficient. It represents an irregular region as ordered discrete entities according to a fixed rule, so placing a word only requires traversing the entities to find one that satisfies the placement condition, with no collision detection or boundary detection, which greatly improves layout efficiency.
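The representation of a subregion as ordered horizontal segments (steps 3.1 to 3.3) can be sketched with a standard even-odd scanline pass over the approximate polygon. This is a generic technique used for illustration, not code from the cited paper; the half-open edge rule and the function name `polygon_rows` are assumptions. Under that rule the topmost integer row of a polygon with a flat top comes out empty.

```python
import math


def polygon_rows(vertices):
    """vertices: [(x, y), ...] in order around the polygon.

    Returns {c: [[s_c(i), e_c(i)], ...]} for every integer row c between y_min and y_max.
    """
    ys = [y for _, y in vertices]
    n = len(vertices)
    rows = {}
    for c in range(math.ceil(min(ys)), math.floor(max(ys)) + 1):
        xs = []
        for k in range(n):
            (x1, y1), (x2, y2) = vertices[k], vertices[(k + 1) % n]
            # half-open rule: count an edge only if it spans row c, so shared
            # vertices are not counted twice and horizontal edges are skipped
            if min(y1, y2) <= c < max(y1, y2):
                xs.append(x1 + (c - y1) * (x2 - x1) / (y2 - y1))
        xs.sort()
        # consecutive pairs of crossings bound the parts of the row inside the polygon
        rows[c] = [[xs[i], xs[i + 1]] for i in range(0, len(xs) - 1, 2)]
    return rows


triangle = polygon_rows([(0, 0), (8, 0), (4, 4)])
square = polygon_rows([(0, 0), (10, 0), (10, 4), (0, 4)])
```

Each row stores its segments from left to right, which is the ordered discrete form that lets a word be placed by scanning segments instead of testing collisions against the region boundary.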

Embodiment 4: this embodiment combines the theme ordering method of Embodiment 2 with the improved word cloud placement method of Embodiment 3; the other steps are unchanged. Fig. 8 shows the result of theme visualization of the Chinese document set in this embodiment.

Embodiment 5: in TIARA, the limited size of a region makes it difficult to place all keywords in it. The invention therefore uses a detailed word cloud to further visualize the full content of each theme, or the full content of each theme in each time period. The background colour of the word cloud corresponds to the colour of the theme in the theme flow graph, and the size of a keyword corresponds to its weight. The invention uses the random greedy algorithm (from "TIARA: A Visual Exploratory Text Analytic System") to generate the detailed word cloud, specifically:

Step 1: select the keyword set that expresses the content of theme l_j;

Step 6: update the position coordinate along a spiral path and repeat steps 4 and 5 until a position coordinate satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;

Fig. 9 shows the result of adding the detailed word cloud to the theme visualization method for a Chinese document set. The detailed word cloud is placed at the lower right of the figure; clicking the band of a theme on the theme flow graph switches the display to the detailed word cloud of that theme. The figure shows the detailed word cloud of the operating systems theme. It can be seen that, because of the limited region size, the word cloud of the operating systems theme on the theme flow graph does not contain all of the keyword content, whereas the detailed word cloud shows all keyword content of the theme.

Claims

1. the theme method for visualizing of a Chinese document collection, it is characterised in that include

Step document sets classified by theme: setting document sets has n theme l_j, j=0,1,2 ..., n-1, according to theme to literary composition All documents that shelves are concentrated are classified, and obtain n document subset D_j, j=0,1,2 ..., n-1；Wherein, theme l_jCorresponding Document subset is D_j；

Divide the step of document sets time period: set the document sets time started as t_start, the end time is t_end, to the document sets time Section [t_start,t_end] carry out decile, obtain time period T_p=(t_start+(p-1)Δt,t_start+ p Δ t], wherein, p=1,2 ..., M-1,

A step of calculating the theme frequency: let the theme frequency comprise v_j,0 and v_j,p, where v_j,0 is the number of documents of the document subset D_j corresponding to theme l_j at the start time t_start, and v_j,p is the number of documents of D_j within time period T_p;

Calculate the theme frequency of each theme respectively；
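The time-period division and theme-frequency steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `theme_frequencies`, the use of plain numeric timestamps, and the choice Δt = (t_end - t_start)/(m-1) (inferred from the division into m-1 equal periods) are all assumptions.

```python
import math

def theme_frequencies(timestamps, t_start, t_end, m):
    """Return [v_0, v_1, ..., v_{m-1}] for one theme's document subset.

    v_0 counts documents stamped exactly at t_start; v_p (p >= 1) counts
    documents in the half-open period (t_start + (p-1)*dt, t_start + p*dt].
    """
    dt = (t_end - t_start) / (m - 1)  # m-1 equal periods (assumed)
    freq = [0] * m
    for t in timestamps:
        if t < t_start or t > t_end:
            continue  # outside the document set's time span
        if t == t_start:
            freq[0] += 1
            continue
        # ceil maps (t_start + (p-1)*dt, t_start + p*dt] to index p
        p = math.ceil((t - t_start) / dt)
        freq[min(p, m - 1)] += 1  # guard against floating-point overshoot
    return freq

# Hypothetical timestamps for one theme, with the span [0, 6] and m = 4.
example = theme_frequencies([0, 1, 2, 2.5, 4, 6], t_start=0, t_end=6, m=4)
```

Running the function once per document subset D_j yields the frequency vectors used by the later sorting and flow-graph steps.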

A step of generating the theme flow graph: according to the sorted theme sequence list and the theme frequencies, generate the theme flow graph using a theme flow algorithm;

A step of extracting keywords representing theme content: let W_j,p be the keyword subset representing the theme content in the documents of the document subset D_j corresponding to theme l_j within time period T_p; use a general-purpose modern Chinese word segmentation system to extract, from the documents of each theme's document subset in each time period, the keyword subset representing that theme's content;

A step of calculating and sorting keyword weights: let the weight of a keyword be the number of times the keyword occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort all keywords within each keyword subset by weight in descending order;

A step of generating the word cloud: according to the keyword subsets and keyword weights, generate the word cloud on the theme flow graph;

The step of sorting the themes is performed according to a sorting method based on theme frequency and geometric complementarity, comprising:

Step 1: let the initial time of theme l_j be OT_j; when v_j,0 is not equal to zero, take the start time t_start of the document set as OT_j;

when v_j,0 is equal to zero, take as OT_j the minimum left endpoint among those time periods T_p for which v_j,p is nonzero; calculate the initial time of each theme;

Step 2: let the frequency sum of theme l_j be the sum of its theme frequencies; calculate the frequency sum of each theme;

Step 3: create a new empty list B; if n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme l_up, and write the theme with the second-largest frequency sum into the second row of the list as the lower endpoint theme l_down; if n is odd, write the theme with the largest frequency sum into the first row of the list, serving as both the upper endpoint theme l_up and the lower endpoint theme l_down;

\overline{V(l_{up} + l_{i})} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{up,p} + v_{i,p}) \qquad (1);

Calculate the geometric complementarity of l_up and l_i, represented by the variance σ_up,i:

\sigma_{up,i} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{up,p} + v_{i,p}) - \overline{V(l_{up} + l_{i})} \right)^{2}} \qquad (2);

After normalizing σ_up,i and OT_i, calculate the weighted value D_i:

D_i = s·OT_i + (1 − s)·σ_up,i (3);

where s is a control parameter, 0 ≤ s ≤ 1;

Step 6: select the theme with the smallest weighted value D_i and insert it above the upper endpoint theme in list B, as the new upper endpoint theme l_up;

\overline{V(l_{down} + l_{k})} = \frac{1}{m} \sum_{p=0}^{m-1} (v_{down,p} + v_{k,p}) \qquad (4);

\sigma_{down,k} = \sqrt{\frac{1}{m} \sum_{p=0}^{m-1} \left( (v_{down,p} + v_{k,p}) - \overline{V(l_{down} + l_{k})} \right)^{2}} \qquad (5);

After normalizing σ_down,k and OT_k, calculate the weighted value D_k:

D_k = s·OT_k + (1 − s)·σ_down,k (6);

where s is a control parameter, 0 ≤ s ≤ 1.
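Equations (1) to (6) can be illustrated with a short sketch that scores candidate themes against the current endpoint theme. It is a sketch under assumptions rather than the claimed procedure: the claim does not state how σ and OT are normalized, so min-max normalization is assumed here, and the names `complementarity`, `min_max`, and `pick_next_theme` are hypothetical. Note that equations (2) and (5) as written take a square root, and the code follows the equations.

```python
import math

def complementarity(v_end, v_cand):
    """Equations (2)/(5): spread of the summed frequencies around their mean."""
    m = len(v_end)
    total = [a + b for a, b in zip(v_end, v_cand)]
    mean = sum(total) / m  # equations (1)/(4)
    return math.sqrt(sum((x - mean) ** 2 for x in total) / m)

def min_max(values):
    """Min-max normalization to [0, 1] (an assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

def pick_next_theme(v_end, candidates, s=0.3):
    """candidates: {name: (initial_time, frequency_vector)}.

    Returns the name with the smallest D = s*OT + (1-s)*sigma, per
    equations (3)/(6), together with all weighted values.
    """
    names = list(candidates)
    ot = min_max([candidates[n][0] for n in names])
    sigma = min_max([complementarity(v_end, candidates[n][1]) for n in names])
    d = {n: s * o + (1 - s) * g for n, o, g in zip(names, ot, sigma)}
    return min(d, key=d.get), d

# Hypothetical data: theme "a" fills the valleys of the endpoint theme,
# theme "b" amplifies its peaks.
endpoint = [4, 1, 4, 1]
best, weights = pick_next_theme(
    endpoint, {"a": (0, [1, 4, 1, 4]), "b": (0, [4, 1, 4, 1])}
)
```

A small σ means the stacked outline of the two themes is nearly flat, which is why the theme with the smallest weighted value is placed next to the endpoint.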

2. The theme visualization method for a Chinese document set as claimed in claim 1, characterized in that the control parameter s = 0.3.

3. The theme visualization method for a Chinese document set as claimed in claim 1, characterized in that the step of generating the word cloud comprises:

Step 1: select the region G_j corresponding to theme l_j on the theme flow graph, whose start time and end time equal the start time t_start and end time t_end of the document set, respectively; divide the time span [t_start, t_end] of region G_j into m-1 segments of equal length Δt, obtaining the equal-division time points t_start + pΔt, where p = 1, 2, ..., m-2;

Step 2: taking each equal-division time point t_start + pΔt in turn as the center, cut a subregion R_j,p from region G_j according to Δt; R_j,p is the closed region enclosed by a point set, curves, and line segments;

Step 3: place the keywords of keyword subset W_j,p on each subregion R_j,p to generate the word cloud of theme l_j, comprising:

3.1 Connect the points of the point set N with line segments to obtain an approximate polygon of subregion R_j,p;

3.2 Let y_max be the largest y-coordinate among the vertices of the approximate polygon of subregion R_j,p, and let y_min be the smallest y-coordinate among the vertices of the polygon;

3.3 Intersect a group of horizontal lines H = {y = c | y_min ≤ c ≤ y_max, c ∈ Z} with subregion R_j,p to obtain a number of intersecting line segments; take the sub-segments of each intersecting line segment that lie inside the polygon, where M is the number of sub-segments of that intersecting line segment lying inside R_j,p; represent R_j,p as a set of horizontal line segments;

3.4 Select keywords one at a time from W_j,p in descending order of weight; use a rectangle of height h and width w in place of the keyword for layout, then place the keyword at the layout position; comprising:

A. Detect whether, in the L_j,p corresponding to R_j,p, a rectangle of width w and height h can be placed at position c = (y_max − y_min)/2; the detection method is: check whether, among the r_c(i) corresponding to rows c through c − h, there exists a common i for which the length of the line segment is greater than w; if so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, proceed to step B;

B. Centered on c = (y_max − y_min)/2, traverse L_j,p by alternately setting c = c + 1 and c = c − 1, detecting whether the rectangle can be placed at position c; if so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, continue traversing L_j,p by alternately setting c = c + 1 and c = c − 1 until an r_c(i) satisfying the condition is found or all r_c(i) have been traversed; if no position c satisfying the condition is found after traversing all r_c(i), discard the keyword;

3.5 Repeat step 3.4 until all keywords in W_j,p have been processed.
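Steps 3.4 A and B can be sketched as a scanline search over horizontal free segments. The sketch below is an illustration under assumptions, not the claimed implementation: the subregion is stored as a hypothetical dictionary `rows` mapping each integer y to a list of `[start, end]` free segments, the consumed width is advanced in every covered row (the claim states only the update of s_c(i)), and the center row is taken as the midpoint of y_min and y_max, which coincides with (y_max − y_min)/2 when y_min = 0.

```python
def try_place(rows, c, w, h):
    """Step A: try to fit a w-by-h rectangle whose top row is c.

    Returns the x position on success, else None.
    """
    span = range(c - h, c + 1)  # rows c-h .. c, as in the text
    if any(y not in rows for y in span):
        return None
    n_seg = min(len(rows[y]) for y in span)
    for i in range(n_seg):
        # the same index i must offer more than w free width in every row
        if all(rows[y][i][1] - rows[y][i][0] > w for y in span):
            x = rows[c][i][0]
            for y in span:
                rows[y][i][0] += w  # consume width: s_c(i) = s_c(i) + w
            return x
    return None

def place_keyword(rows, w, h):
    """Step B: search outward from the center row, alternating up and down."""
    y_min, y_max = min(rows), max(rows)
    center = (y_min + y_max) // 2
    for offset in range(0, y_max - y_min + 1):
        tries = [center] if offset == 0 else [center + offset, center - offset]
        for c in tries:
            if y_min <= c <= y_max:
                x = try_place(rows, c, w, h)
                if x is not None:
                    return (c, x)
    return None  # no row fits: the keyword is discarded

# Hypothetical rectangular subregion: rows 0..6, each 10 units wide.
rows = {y: [[0, 10]] for y in range(7)}
first = place_keyword(rows, 4, 2)
second = place_keyword(rows, 4, 2)
third = place_keyword(rows, 4, 2)
too_wide = place_keyword(rows, 20, 2)
```

Because keywords are processed in descending order of weight, the heaviest keywords land nearest the vertical center of the subregion and lighter ones are pushed toward its edges.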

4. The theme visualization method for a Chinese document set as claimed in claim 1, characterized by further comprising a step of generating a detailed word cloud, comprising:

Step 1: select the keyword set expressing the content of theme l_j;

Step 3: select keywords one at a time from the keyword set in descending order of weight, and use a randomized greedy algorithm to generate a candidate position coordinate (word.x, word.y) for the keyword within region C;

Step 4: set the font size according to the weight of the keyword, then approximate the keyword with a rectangle r determined by the font size and the number of characters in the keyword, and set the lower-left corner of rectangle r to the position coordinate;

Step 5: for each conflict point in P, detect whether the point conflicts with r; if a conflict exists, proceed to Step 6; if no conflict exists, proceed to Step 7;

Step 6: update the position coordinate along a spiral path, then repeat Step 4 and Step 5 until a position coordinate satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;

Step 7: place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P.
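Steps 3 to 7 can be sketched as follows. This is a minimal sketch under several assumptions that the claim does not fix: an Archimedean spiral with hypothetical step sizes, a uniformly random candidate position inside a `width` by `height` area standing in for region C, an integer grid for conflict points, and a made-up mapping from weight to font size. The sketch does not clip placements to region C.

```python
import math
import random

def conflicts(points, x, y, w, h):
    """Step 5: does any stored conflict point fall inside rectangle r?"""
    return any(x <= px < x + w and y <= py < y + h for px, py in points)

def place_on_spiral(points, x0, y0, w, h, max_radius=100, step=0.5):
    """Steps 5-7: walk a spiral from (x0, y0) until rectangle r fits."""
    theta = 0.0
    while True:
        radius = step * theta
        if radius > max_radius:
            return None  # spiral radius exceeds the limit: discard keyword
        x = int(round(x0 + radius * math.cos(theta)))
        y = int(round(y0 + radius * math.sin(theta)))
        if not conflicts(points, x, y, w, h):
            # Step 7: discretize the occupied area into conflict points
            points.update((px, py) for px in range(x, x + w)
                          for py in range(y, y + h))
            return (x, y)
        theta += 0.1

def layout(keywords, width, height, seed=0):
    """keywords: list of (text, weight). Returns {text: (x, y, w, h)}."""
    rng = random.Random(seed)
    points, placed = set(), {}
    for text, weight in sorted(keywords, key=lambda kw: -kw[1]):
        h = 8 + 2 * weight   # font size from weight (assumed mapping)
        w = h * len(text)    # roughly one square glyph per character
        x0, y0 = rng.randrange(width), rng.randrange(height)
        pos = place_on_spiral(points, x0, y0, w, h)
        if pos is not None:
            placed[text] = (*pos, w, h)
    return placed

# Hypothetical keywords and weights for one theme.
placed = layout(
    [("system", 5), ("kernel", 3), ("process", 2), ("thread", 1)], 400, 300
)
```

Storing occupied cells as a set of conflict points keeps the collision test simple at the cost of scanning every point; a spatial index would be a natural refinement for large keyword sets.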

5. The theme visualization method for a Chinese document set as claimed in claim 4, characterized in that the keyword set expressing the content of theme l_j is any one keyword subset W_j,p of theme l_j.

6. The theme visualization method for a Chinese document set as claimed in claim 4, characterized in that the keyword set expressing the content of theme l_j is obtained by the following steps:

Step 1: merge all keyword subsets W_j,p of theme l_j, p = 1, 2, ..., m-1;

Step 2: calculate the weight of every keyword in the merged set, the weight of a keyword being the number of times the keyword occurs across all keyword subsets.
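The merge-and-reweight procedure of claim 6 amounts to summing occurrence counts across the per-period subsets. A minimal sketch follows, assuming each subset W_j,p is held as a list of segmented keywords with repetitions; the function name `merged_keyword_weights` and the sample words are hypothetical.

```python
from collections import Counter

def merged_keyword_weights(subsets):
    """subsets: one list of extracted keywords per time period (W_j,p).

    Returns (keyword, weight) pairs sorted by descending weight, where the
    weight is the total number of occurrences across all subsets.
    """
    totals = Counter()
    for words in subsets:
        totals.update(words)  # adds each word's occurrences to its count
    return totals.most_common()

# Hypothetical keyword subsets for three time periods of one theme.
ranked = merged_keyword_weights(
    [["kernel", "process", "kernel"], ["process", "thread"], ["kernel"]]
)
```

The sorted result can be fed directly to the detailed word cloud step, which consumes keywords in descending order of weight.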