CN103631856B - Subject visualization method for Chinese document set - Google Patents

Info

Publication number
CN103631856B
Authority
CN
China
Prior art keywords
theme
key word
word
document
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310488312.7A
Other languages
Chinese (zh)
Other versions
CN103631856A (en
Inventor
朱敏
梁婷
甘启宏
李明召
李�一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201310488312.7A
Publication of CN103631856A
Application granted
Publication of CN103631856B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a subject visualization method for a Chinese document set. The method comprises the steps of classifying the document set by subject, dividing the document set into time periods, calculating subject frequencies, ordering the subjects, generating a subject flow graph, extracting keywords that express the content of each subject, calculating and ranking keyword weights, and generating word clouds. The invention further provides an ordering method based on subject frequency and geometric complementarity, a word cloud layout method, and a method for generating a detailed word cloud. The advantages are as follows: subject visualization of a Chinese document set is achieved; the subject flow graph produced by the ordering method based on subject frequency and geometric complementarity is more attractive and smoother, uses space more efficiently, and is better suited to word cloud placement; the word cloud layout method uses space effectively and greatly improves layout efficiency; and the detailed word cloud can show all the keyword content of a subject.

Description

A theme visualization method for a Chinese document set
Technical field
The present invention relates to the fields of text visualization and theme analysis, and specifically to a theme visualization method for a Chinese document set.
Background art
Large document collections, such as news, scientific literature, web pages, electronic publications, and bulletins, contain a large amount of information. With the development and spread of digitized information, document collections grow larger every day. Quickly reading and understanding this vast body of information, and extracting useful knowledge from it, has become an urgent problem.
A "theme" generally comprises a core event or activity together with all directly related events and activities. Mainstream topic detection methods use techniques such as clustering, classification, retrieval, and topic tracking to classify and organize a document set hierarchically by theme, so that users can retrieve, select, and browse documents. After the documents are classified, however, the user still has to spend a great deal of time reading every document under a theme in order to understand its main content, discover latent knowledge, and obtain the information needed.
Building on topic detection, multi-document automatic summarization gathers the content of a theme, removes redundancy, and generates a comprehensive, concise text, which greatly improves the efficiency of information acquisition. Existing multi-document summaries, however, are usually rather complex and hard for users to understand; the summarization process is difficult to control; and a friendly user interface and human-computer interaction are lacking. In addition, multi-document summarization often ignores attributes other than the text content, such as time and quantity. It therefore has difficulty showing how themes and their content evolve over time within a document set, and it cannot reflect the relationships among the themes of the same document set.
Text visualization, an important branch of information visualization, exploits the innate human ability to recognize, remember, and analyze graphics. It converts textual information into graphical images that help people understand, read, and analyze the content and structure of text intuitively and efficiently, and, through corresponding interactive operations, helps them discover valuable knowledge and patterns.
The word cloud visualization technique abstracts text content into a set of words, uses font size to express each word's frequency, and arranges the words compactly and aesthetically according to certain rules to represent the features of the text. A word cloud, however, can visualize only a single document. For multiple documents, ThemeRiver (theme stream) visualizes the themes in a document set and shows how the intensity of each theme changes over time. The original theme stream contains only theme intensity and time information, and the themes are arranged in random order. Later, Liu Shixia et al. proposed TIARA, an improved theme stream that embeds word clouds in the theme stream to further visualize the content of each theme, helping users quickly analyze how theme content changes over time.
The text visualization techniques above all lack generality and are not suited to Chinese documents; to date there is still no visualization technique for analyzing the themes of Chinese documents. Moreover, TIARA, which targets English document themes only, has the following problems: 1) the shape and layout of the word clouds in the theme stream are unstable, which can easily mislead users and impair theme analysis; 2) because the area is limited, the generated word clouds cannot show all the key content of each theme.
Summary of the invention
The object of the present invention is to provide a theme visualization method for a Chinese document set: the information of each theme extracted from the Chinese document set is counted and processed, the intensity of each theme and the weight of its content are measured, and the results are then displayed graphically.
The technical scheme that achieves this object is as follows. A theme visualization method for a Chinese document set comprises:
A step of classifying the document set by theme: let the document set have $n$ themes $l_j$, $j = 0, 1, 2, \ldots, n-1$; classify all documents in the document set by theme to obtain $n$ document subsets $D_j$, $j = 0, 1, 2, \ldots, n-1$, where the document subset corresponding to theme $l_j$ is $D_j$;
A step of dividing the document set time period: let the start time of the document set be $t_{start}$ and the end time be $t_{end}$; divide the time period $[t_{start}, t_{end}]$ equally to obtain the time periods $T_p = (t_{start} + (p-1)\Delta t,\ t_{start} + p\Delta t]$, where $p = 1, 2, \ldots, m-1$ and $\Delta t = (t_{end} - t_{start})/(m-1)$;
A step of calculating the theme frequency: let the theme frequency comprise $v_{j,0}$ and $v_{j,p}$, where $v_{j,0}$ is the number of documents in the document subset $D_j$ of theme $l_j$ at the start time $t_{start}$, and $v_{j,p}$ is the number of documents of $D_j$ within time period $T_p$; calculate the theme frequency of every theme;
A step of sorting the themes: sort all themes to obtain the sorted theme sequence table;
A step of generating the theme flow graph: from the sorted theme sequence table and the theme frequencies, generate the theme flow graph using the theme flow algorithm;
A step of extracting keywords that represent theme content: let $W_{j,p}$ be the subset of keywords that represent the content of theme $l_j$ in the documents of $D_j$ within time period $T_p$; use the "Modern Chinese general word segmentation system" to extract, from the documents of each theme's document subset in each time period, the keyword subset representing that theme's content;
A step of calculating and ranking keyword weights: the weight of a keyword is the number of times it occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and within each keyword subset sort all keywords by weight in descending order;
A step of generating word clouds: generate word clouds on the theme flow graph from the keyword subsets and keyword weights.
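The time-period division and theme-frequency steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `period_index` and `theme_frequencies`, the representation of documents as `(theme, time)` pairs, and the sample data are all hypothetical, and the interval length is taken as Δt = (t_end - t_start)/(m - 1), as implied by dividing the period into m - 1 equal parts.

```python
import math
from collections import defaultdict


def period_index(t, t_start, t_end, m):
    """Return 0 for a document dated t_start, otherwise the p for which
    t lies in T_p = (t_start + (p-1)*dt, t_start + p*dt].
    Assumes t_start <= t <= t_end."""
    dt = (t_end - t_start) / (m - 1)
    if t == t_start:
        return 0
    return math.ceil((t - t_start) / dt)


def theme_frequencies(docs, t_start, t_end, m):
    """docs: iterable of (theme, time) pairs (hypothetical input format).
    Returns {theme: [v_0, v_1, ..., v_{m-1}]}."""
    freq = defaultdict(lambda: [0] * m)
    for theme, t in docs:
        freq[theme][period_index(t, t_start, t_end, m)] += 1
    return dict(freq)


# Hypothetical sample: time stamps are issue numbers 1..9, so m = 9 and dt = 1.
docs = [("db", 1), ("db", 2), ("db", 2), ("os", 9), ("os", 5)]
print(theme_frequencies(docs, t_start=1, t_end=9, m=9))
```

With issue numbers 1 to 9 as time stamps, index 0 counts the documents of issue 1 and index p counts the documents of issue p + 1.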
In the above technical scheme, the step of sorting the themes may use a sorting method based on theme frequency and geometric complementarity, comprising:
Step 1: let the start time of theme $l_j$ be $OT_j$. When $v_{j,0}$ is not zero, take the start time $t_{start}$ of the document set as $OT_j$; when $v_{j,0}$ is zero, take as $OT_j$ the smallest left endpoint among the time periods $T_p$ for which $v_{j,p}$ is not zero. Calculate the start time of every theme;
Step 2: let the frequency sum of theme $l_j$ be the sum of its frequencies over all time points, $\sum_{p=0}^{m-1} v_{j,p}$. Calculate the frequency sum of every theme;
Step 3: create an empty list B. If $n$ is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme $l_{up}$, and write the theme with the second-largest frequency sum into the second row as the lower endpoint theme $l_{down}$. If $n$ is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper endpoint theme $l_{up}$ and the lower endpoint theme $l_{down}$;
Step 4: select a theme $l_i$ that is not in list B and calculate the mean of the summed frequencies of $l_{up}$ and $l_i$:
$$\overline{V(l_{up}+l_i)} = \frac{1}{m}\sum_{p=0}^{m-1}\left(v_{up,p}+v_{i,p}\right) \qquad (1)$$
Calculate the geometric complementarity of $l_{up}$ and $l_i$, expressed as the variance $\sigma_{up,i}$:
$$\sigma_{up,i} = \frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{up,p}+v_{i,p}\right)-\overline{V(l_{up}+l_i)}\right)^2 \qquad (2)$$
After normalizing $\sigma_{up,i}$ and $OT_i$, calculate the weighted value $D_i$:
$$D_i = s\,OT_i + (1-s)\,\sigma_{up,i} \qquad (3)$$
where $s$ is a control parameter, $0 \le s \le 1$;
Step 5: repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: select the theme with the smallest weighted value $D_i$ and insert it above the upper endpoint theme in list B, as the new upper endpoint theme $l_{up}$;
Step 7: select a theme $l_k$ that is not in list B and calculate the mean of the summed frequencies of $l_{down}$ and $l_k$:
$$\overline{V(l_{down}+l_k)} = \frac{1}{m}\sum_{p=0}^{m-1}\left(v_{down,p}+v_{k,p}\right) \qquad (4)$$
Calculate the geometric complementarity of $l_{down}$ and $l_k$, expressed as the variance $\sigma_{down,k}$:
$$\sigma_{down,k} = \frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{down,p}+v_{k,p}\right)-\overline{V(l_{down}+l_k)}\right)^2 \qquad (5)$$
After normalizing $\sigma_{down,k}$ and $OT_k$, calculate the weighted value $D_k$:
$$D_k = s\,OT_k + (1-s)\,\sigma_{down,k} \qquad (6)$$
where $s$ is a control parameter, $0 \le s \le 1$;
Step 8: repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: select the theme with the smallest weighted value and insert it below the lower endpoint theme in list B, as the new lower endpoint theme $l_{down}$;
Step 10: repeat steps 4 to 9 until all themes have been added to list B.
In the present invention, the control parameter $s$ takes the value 0.3.
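The ten sorting steps can be sketched in Python as below. This is an illustrative reading of the steps, not the inventors' code. Several details are assumptions made for the sketch: all function names are hypothetical; times are expressed in units where the document set starts at 0 and every period has length 1; and min-max scaling over the current candidates stands in for the normalization, which the text does not specify.

```python
def start_time(v):
    """OT_j, in units where t_start = 0 and each period has length 1 (assumption)."""
    if v[0] != 0:
        return 0
    for p in range(1, len(v)):
        if v[p] != 0:
            return p - 1          # left endpoint of T_p
    return len(v) - 1             # theme never occurs (case not covered by the text)


def variance_of_sum(a, b):
    """Equations (1) and (2): variance of the summed frequencies of two themes."""
    sums = [x + y for x, y in zip(a, b)]
    mean = sum(sums) / len(sums)
    return sum((x - mean) ** 2 for x in sums) / len(sums)


def normalize(xs):
    """Min-max normalization (assumed; the text does not name a scheme)."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def pick(endpoint, candidates, freq, s):
    """Return the candidate with the smallest D = s*OT + (1-s)*sigma."""
    sigma = normalize([variance_of_sum(freq[endpoint], freq[c]) for c in candidates])
    ot = normalize([start_time(freq[c]) for c in candidates])
    scores = [s * o + (1 - s) * g for o, g in zip(ot, sigma)]
    return candidates[scores.index(min(scores))]


def order_themes(freq, s=0.3):
    """freq: {theme: [v_0, ..., v_{m-1}]}. Returns the themes from top to bottom."""
    by_sum = sorted(freq, key=lambda j: sum(freq[j]), reverse=True)
    order = by_sum[:2] if len(by_sum) % 2 == 0 else by_sum[:1]   # step 3
    remaining = by_sum[len(order):]
    while remaining:
        top = pick(order[0], remaining, freq, s)                 # steps 4 to 6
        order.insert(0, top)
        remaining.remove(top)
        if remaining:
            bottom = pick(order[-1], remaining, freq, s)         # steps 7 to 9
            order.append(bottom)
            remaining.remove(bottom)
    return order


# Hypothetical frequencies for five themes over four time points.
freq = {
    "A": [5, 5, 5, 5],
    "B": [1, 2, 3, 4],
    "C": [4, 3, 2, 1],
    "D": [0, 0, 1, 1],
    "E": [1, 1, 0, 0],
}
print(order_themes(freq))
```

In this sample the strongest theme A ends up in the middle, and each newly attached theme is the one whose profile best flattens the combined band at that end while starting early.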
In the foregoing technical scheme, the method for generating word clouds on the theme flow graph is:
Step 1: on the theme flow graph, select the region $G_j$ corresponding to theme $l_j$, whose start and end times equal the start time $t_{start}$ and end time $t_{end}$ of the document set; divide the time period $[t_{start}, t_{end}]$ of region $G_j$ into $m-1$ segments, each of length $\Delta t = (t_{end}-t_{start})/(m-1)$, to obtain the equal-division time points $t_{start} + p\Delta t$, where $p = 1, 2, \ldots, m-2$;
Step 2: taking each equal-division time point $t_{start} + p\Delta t$ in turn as the center, cut a subregion $R_{j,p}$ out of region $G_j$ according to $\Delta t$; $R_{j,p}$ is the closed region formed by a point set $N$, curves, and line segments;
Step 3: place the keywords of keyword subset $W_{j,p}$ on each subregion $R_{j,p}$ to generate the word cloud of theme $l_j$, comprising:
3.1 connect the points of point set $N$ in sequence with line segments to obtain the approximate polygon of subregion $R_{j,p}$;
3.2 let $y_{max}$ be the y-coordinate of the vertex with the largest y-coordinate among the vertices of the approximate polygon of $R_{j,p}$, and let $y_{min}$ be the y-coordinate of the vertex with the smallest y-coordinate;
3.3 intersect the set of horizontal lines $H = \{y = c \mid y_{min} \le c \le y_{max},\ c \in \mathbb{Z}\}$ with subregion $R_{j,p}$ to obtain a number of intersection segments; for each intersection segment take the sub-segments that lie inside the polygon, expressed as $r_c(i) = \overline{s_c(i)\,e_c(i)}$, $0 < i \le M$, where $M$ is the number of sub-segments of this intersection segment lying inside $R_{j,p}$; $R_{j,p}$ is thus represented as a set of horizontal line segments
$$L_{j,p} = \left\{\, r_c(i) = \overline{s_c(i)\,e_c(i)},\ y_{min} \le c \le y_{max},\ 0 < i \le M \,\right\}$$
3.4 select keywords one at a time from $W_{j,p}$ in descending order of weight; substitute a rectangle of height $h$ and width $w$ for the keyword during layout, then place the keyword at the position found, comprising:
A. Check whether, in the $L_{j,p}$ corresponding to $R_{j,p}$, a rectangle of width $w$ and height $h$ can be placed at $c = (y_{max} - y_{min})/2$. The check is: among the $r_c(i)$ for the rows from $c$ to $c-h$, determine whether there is a common $i$ for which every line segment $\overline{s_c(i)\,e_c(i)}$ is longer than $w$. If so, place the keyword at position $(c, s_c(i))$ and update $s_c(i)$ to $s_c(i) = s_c(i) + w$; if not, go to step B;
B. Centered on $c = (y_{max} - y_{min})/2$, traverse $L_{j,p}$ by alternately setting $c = c+1$ and $c = c-1$, checking whether the rectangle can be placed at row $c$. If so, place the keyword at $(c, s_c(i))$ and update $s_c(i) = s_c(i) + w$; if not, continue alternating $c = c+1$ and $c = c-1$ through $L_{j,p}$ until an $r_c(i)$ satisfying the condition is found or all $r_c(i)$ have been traversed. If no position $c$ satisfies the condition after all $r_c(i)$ have been traversed, discard the keyword;
3.5 repeat step 3.4 until all keywords in $W_{j,p}$ have been processed;
Step 4: repeat steps 1 to 3 until the word cloud of every theme has been generated on the theme flow graph.
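Steps 3.1 to 3.3, which turn a subregion into a set of horizontal line segments, can be sketched in Python as follows. The sketch is illustrative only: the function name `scanline_segments` and the sample polygon are hypothetical, and the even-odd pairing of crossings with a half-open edge test is a common scan-line convention assumed here rather than something the text prescribes. Under this convention the topmost row of a polygon with a flat top edge comes out empty.

```python
import math


def scanline_segments(polygon):
    """polygon: list of (x, y) vertices in order (the approximate polygon).
    Returns {c: [(s, e), ...]} for every integer row c in [y_min, y_max]."""
    ys = [y for _, y in polygon]
    y_min, y_max = math.ceil(min(ys)), math.floor(max(ys))
    n = len(polygon)
    segments = {}
    for c in range(y_min, y_max + 1):
        xs = []
        for k in range(n):
            (x1, y1), (x2, y2) = polygon[k], polygon[(k + 1) % n]
            # Half-open test: an edge counts if row c is at or above its lower
            # end and strictly below its upper end; horizontal edges never count.
            if (y1 <= c < y2) or (y2 <= c < y1):
                xs.append(x1 + (c - y1) * (x2 - x1) / (y2 - y1))
        xs.sort()
        # Consecutive pairs of crossings bound the parts of the row inside the polygon.
        segments[c] = [(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
    return segments


# Hypothetical U-shaped region: rows 2 and 3 split into two sub-segments (M = 2).
u_shape = [(0, 0), (6, 0), (6, 4), (4, 4), (4, 2), (2, 2), (2, 4), (0, 4)]
for row, segs in scanline_segments(u_shape).items():
    print(row, segs)
```

Rows of a concave region can contain more than one sub-segment, which is why each row stores a list indexed by i.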
The present invention also proposes a method for generating a detailed word cloud, comprising:
Step 1: select the keyword set that expresses the content of theme $l_j$;
Step 2: set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: select a keyword from the keyword set in descending order of weight, and use the random greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it inside region C;
Step 4: set the font size according to the keyword's weight; then, from the font size and the number of characters in the keyword, approximate the keyword with a rectangle r whose lower-left corner is at the position coordinate;
Step 5: for each conflict point in P, check whether the point conflicts with r; if any conflict exists, go to step 6; if none exists, go to step 7;
Step 6: update the position coordinate along a spiral path and repeat steps 4 and 5 until a position coordinate satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;
Step 7: place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: repeat steps 3 to 7 until all keywords in the keyword set have been placed.
When generating the detailed word cloud, any one keyword subset $W_{j,p}$ of theme $l_j$ may be selected as the keyword set expressing the content of the theme; alternatively, the keyword set formed by merging all keyword subsets $W_{j,p}$ of theme $l_j$ may be selected.
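The detailed word cloud procedure can be sketched in Python as below. This is a rough illustration under several assumptions, none of which come from the patent: all function names and numeric constants except the spiral limit of 100 are hypothetical; a random point near the center stands in for the "random greedy" candidate generator, which the text does not detail; font size is a linear function of weight; each character is treated as a square of side equal to the font size; and an explicit check that the rectangle stays inside the circle is added, because a sparse set of boundary points alone would not stop a rectangle lying wholly outside the region.

```python
import math
import random


def fits(x, y, w, h, radius, conflicts):
    """True if rectangle (x, y, w, h) lies inside the circle and contains no conflict point."""
    corners = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
    if any(math.hypot(cx, cy) > radius for cx, cy in corners):
        return False          # extra containment check (assumption, see lead-in)
    return not any(x <= px <= x + w and y <= py <= y + h for px, py in conflicts)


def sample(x, y, w, h, step=4.0):
    """Discretize an occupied rectangle into conflict points on a grid."""
    nx, ny = int(w // step) + 1, int(h // step) + 1
    return [(min(x + i * step, x + w), min(y + j * step, y + h))
            for i in range(nx + 1) for j in range(ny + 1)]


def place_words(words, radius=150.0, max_spiral=100.0, seed=0):
    """words: list of (text, weight). Returns [(text, x, y, w, h), ...]."""
    rng = random.Random(seed)
    # Step 2: boundary of circle C (centered at the origin) as conflict points.
    conflicts = [(radius * math.cos(math.radians(a)), radius * math.sin(math.radians(a)))
                 for a in range(360)]
    max_weight = max(wt for _, wt in words)
    placed = []
    for text, weight in sorted(words, key=lambda kw: kw[1], reverse=True):
        size = 10 + 20 * weight / max_weight         # font size (assumed linear scale)
        w, h = size * len(text), size                # rectangle r
        x0, y0 = rng.uniform(-20, 20), rng.uniform(-20, 20)   # step 3 stand-in
        theta = 0.0
        while 0.5 * theta <= max_spiral:             # step 6: stop at radius 100
            r_sp = 0.5 * theta                       # Archimedean spiral
            x, y = x0 + r_sp * math.cos(theta), y0 + r_sp * math.sin(theta)
            if fits(x, y, w, h, radius, conflicts):  # step 5
                placed.append((text, x, y, w, h))    # step 7
                conflicts.extend(sample(x, y, w, h))
                break
            theta += 0.2
    return placed


# Hypothetical keywords and weights.
for item in place_words([("数据库", 10), ("网络", 6), ("算法", 3), ("安全", 1)]):
    print(item)
```

Because conflicts are detected only at sampled points, a coarse `step` can let small overlaps through; a finer grid trades speed for accuracy.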
Compared with the prior art, the technical effects of the present invention are: 1. Theme visualization of a Chinese document set is achieved. 2. When the themes are sorted with the method based on theme frequency and geometric complementarity, the generated theme flow graph is more attractive and smoother, uses space more efficiently, and is better suited to the placement of word clouds. 3. The word cloud layout method of the present invention uses space effectively: for the same region size and font size, more keywords can be placed. The generated layout is stable and does not change with subsequent interactive operations. The efficiency of the algorithm is also markedly improved: the algorithm represents an irregular region, according to a fixed rule, as a set of discrete entities, so placing a word only requires traversing them to find an entity that satisfies the word's placement condition, with no collision detection or boundary detection, which greatly improves layout efficiency. 4. The detailed word cloud can show all the keyword content of a theme.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 shows, for the first embodiment, the result of visualizing theme content when the theme flow graph is generated after randomly ordering the themes and the word clouds are laid out with the layout algorithm of the TIARA technique.
Fig. 3 is a flow chart of the sorting method based on theme frequency and geometric complementarity in the second embodiment.
Fig. 4 shows the theme flow graph generated in the second embodiment.
Fig. 5 is a schematic diagram of cutting out a subregion in the third embodiment.
Fig. 6 is a schematic diagram of approximating a subregion as a set of horizontal line segments in the third embodiment.
Fig. 7 shows the result of placing keywords on the theme flow graph in the third embodiment, where the theme flow graph is generated after randomly ordering the themes.
Fig. 8 shows the result of theme visualization of a Chinese document set in the fourth embodiment, where the theme flow graph is generated after sorting with the method based on theme frequency and geometric complementarity, and the word clouds are generated by the improved layout algorithm.
Fig. 9 shows the result of theme visualization of a Chinese document set in the fifth embodiment after the detailed word cloud is added.
Detailed description
Embodiment 1: taking data from the periodical "Journal of Software" as an example, the theme visualization method for Chinese documents is described below with reference to Fig. 1.
Step one, classify the document set by theme: let the document set have $n$ themes $l_j$, $j = 0, 1, 2, \ldots, n-1$; classify all documents in the document set by theme to obtain $n$ document subsets $D_j$, $j = 0, 1, 2, \ldots, n-1$, where the document subset corresponding to theme $l_j$ is $D_j$. Specifically: input the paper data of issues 1 to 9 of the "Journal of Software". Classify the document set into five themes (system software and software engineering; database technology; computer networks and information security; pattern recognition and artificial intelligence; operating systems) to obtain five document subsets.
Step two, divide the document set time period: let the start time of the document set be $t_{start}$ and the end time be $t_{end}$; divide the time period $[t_{start}, t_{end}]$ equally to obtain the time periods $T_p = (t_{start} + (p-1)\Delta t,\ t_{start} + p\Delta t]$, where $p = 1, 2, \ldots, m-1$, $\Delta t = (t_{end} - t_{start})/(m-1)$, and $m$ is the total number of time points, counting the start time, the end time, and the equal-division time points. Here the start time of the document set is issue 1 and the end time is issue 9; the document set time is divided into 8 time periods with a time interval of 1 issue.
Step three, calculate the theme frequency: let the theme frequency comprise $v_{j,0}$ and $v_{j,p}$, where $v_{j,0}$ is the number of documents in the document subset $D_j$ of theme $l_j$ at the start time $t_{start}$, and $v_{j,p}$ is the number of documents of $D_j$ within time period $T_p$; calculate the theme frequency of every theme. Here, the number of documents of each theme appearing in issue 1 and in each of the other issues is calculated, that is, the number of documents each theme contains at the document set start time and in each time period.
Step four, sort the themes: sort all themes to obtain the sorted theme sequence table. In this embodiment the themes are ordered by the traditional random arrangement. The choice of sorting method directly affects the appearance of the generated theme flow graph.
Step five, generate the theme flow graph: from the sorted theme sequence table and the theme frequencies, generate the theme flow graph using the theme flow algorithm, visualizing theme intensity. In this embodiment, the theme flow graph shown by the colored bands in Fig. 2 is generated: the blue band is pattern recognition and artificial intelligence, the purple band is computer networks and information security, the red band is operating systems, the yellow band is database technology, and the green band is system software and software engineering. The theme flow algorithm (from "ThemeRiver: Visualizing thematic changes in large document collections") interpolates the weights of each theme at the discrete time points (the interpolating function must satisfy the constraint that its derivative is zero at extreme points) and then draws a stacked graph to generate the theme flow graph. In the theme flow graph, the horizontal axis represents time, the vertical extent of a band represents theme intensity, and bands of different colors represent different themes. A band widening or narrowing over time shows how the theme's intensity evolves.
Step six, extract the keywords representing theme content: let $W_{j,p}$ be the subset of keywords that represent the content of theme $l_j$ in the documents of $D_j$ within time period $T_p$; use the "Modern Chinese general word segmentation system" to extract, from the documents of each theme's document subset in each time period, the keyword subset representing that theme's content. In this embodiment, the bag-of-words text analysis technique is used to extract the keyword subsets representing theme content. Specifically: using the "Modern Chinese general word segmentation system" developed by the Institute of Language Information Processing of Beijing Language and Culture University, segment all documents belonging to each time period in each theme's document subset, remove stop words such as modal particles, adverbs, prepositions, and conjunctions, and finally obtain the keyword subsets.
Step seven, calculate and rank the keyword weights: the weight of a keyword is the number of times it occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort all keywords by weight in descending order.
Step eight, generate word clouds: from the keyword subsets and keyword weights, use the TIARA algorithm to generate word clouds on the theme flow graph, visualizing the theme content.
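Steps six and seven amount to a stop-word filter followed by a frequency count, which can be sketched in Python as follows. The sketch assumes the documents have already been segmented into tokens (the segmentation system named in the text is not reproduced here); the function name `keyword_weights`, the tiny stop-word set, and the sample tokens are hypothetical.

```python
from collections import Counter

# Illustrative stop-word set only; a real list would be far longer.
STOP_WORDS = {"的", "了", "和", "在"}


def keyword_weights(tokens, stop_words=STOP_WORDS):
    """tokens: segmented words from the documents of one theme in one time period.
    Returns [(keyword, weight), ...] sorted by weight in descending order;
    ties are broken by the keyword itself so the result is deterministic."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


tokens = ["数据库", "的", "索引", "数据库", "和", "查询", "索引", "数据库"]
print(keyword_weights(tokens))
```

Each call corresponds to one keyword subset $W_{j,p}$, so the function would be applied once per theme and time period.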
Fig. 2 shows the result of theme visualization of the Chinese document set using the method of this embodiment.
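The interpolate-then-stack construction described in step five can be sketched in Python as below. This is not the ThemeRiver implementation: cosine interpolation is assumed here as one simple interpolant whose derivative is zero at every sample point (and therefore at every extreme point of the samples), the layers are assumed to be stacked symmetrically about a horizontal center line, and the function names and sample data are hypothetical.

```python
import math


def cosine_interp(values, samples_per_step=4):
    """Interpolate between consecutive samples with a cosine ease,
    which has zero slope at each original sample."""
    out = []
    for a, b in zip(values, values[1:]):
        for k in range(samples_per_step):
            t = k / samples_per_step
            w = (1 - math.cos(math.pi * t)) / 2
            out.append(a + (b - a) * w)
    out.append(values[-1])
    return out


def stack_layers(order, freq, samples_per_step=4):
    """order: theme names, first name drawn lowest (assumed convention).
    freq: {theme: [v_0, ..., v_{m-1}]}.
    Returns {theme: [(lower_y, upper_y), ...]} along the interpolated time axis."""
    curves = [cosine_interp(freq[name], samples_per_step) for name in order]
    n = len(curves[0])
    totals = [sum(c[x] for c in curves) for x in range(n)]
    lower = [-t / 2 for t in totals]          # centered baseline (assumption)
    bands = {}
    for name, curve in zip(order, curves):
        upper = [lo + v for lo, v in zip(lower, curve)]
        bands[name] = list(zip(lower, upper))
        lower = upper
    return bands


# Hypothetical two-theme example.
bands = stack_layers(["a", "b"], {"a": [2, 4], "b": [2, 0]}, samples_per_step=2)
for name, band in bands.items():
    print(name, band)
```

The thickness of each band at a time point equals the interpolated theme frequency, so widening and narrowing bands show the evolution of theme intensity.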
Embodiment 2: in the Chinese document set theme visualization method above, the themes are arranged in random order. When the theme stream is generated, if the intensity of some theme changes too sharply, the shapes of the adjacent themes are distorted, so that the result is unattractive and the relative intensities of the themes are hard to discern. Distorted themes also hinder the placement of word clouds. At the same time, among all the themes of a document set, users tend to be most interested in the specific content of the themes with the greatest intensity. The present invention therefore further improves the theme sorting step of embodiment 1 with a sorting method based on theme frequency and geometric complementarity. The sorting method is described in detail below with reference to Fig. 3:
Step 1: let the start time of theme $l_j$ be $OT_j$. When $v_{j,0}$ is not zero, take the start time $t_{start}$ of the document set as $OT_j$; when $v_{j,0}$ is zero, take as $OT_j$ the smallest left endpoint among the time periods $T_p$ for which $v_{j,p}$ is not zero. Calculate the start time of every theme;
Step 2: let the frequency sum of theme $l_j$ be the sum of its frequencies over all time points, $\sum_{p=0}^{m-1} v_{j,p}$. Calculate the frequency sum of every theme;
Step 3: create an empty list B. If $n$ is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme $l_{up}$, and write the theme with the second-largest frequency sum into the second row as the lower endpoint theme $l_{down}$. If $n$ is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper endpoint theme $l_{up}$ and the lower endpoint theme $l_{down}$;
Step 4: select a theme $l_i$ that is not in list B and calculate the mean of the summed frequencies of $l_{up}$ and $l_i$:
$$\overline{V(l_{up}+l_i)} = \frac{1}{m}\sum_{p=0}^{m-1}\left(v_{up,p}+v_{i,p}\right) \qquad (1)$$
Calculate the geometric complementarity of $l_{up}$ and $l_i$, expressed as the variance $\sigma_{up,i}$:
$$\sigma_{up,i} = \frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{up,p}+v_{i,p}\right)-\overline{V(l_{up}+l_i)}\right)^2 \qquad (2)$$
After normalizing $\sigma_{up,i}$ and $OT_i$, calculate the weighted value $D_i$:
$$D_i = s\,OT_i + (1-s)\,\sigma_{up,i} \qquad (3)$$
where $s$ is a control parameter, $0 \le s \le 1$;
Step 5: repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: select the theme with the smallest weighted value $D_i$ and insert it above the upper endpoint theme in list B, as the new upper endpoint theme $l_{up}$;
Step 7: select a theme $l_k$ that is not in list B and calculate the mean of the summed frequencies of $l_{down}$ and $l_k$:
$$\overline{V(l_{down}+l_k)} = \frac{1}{m}\sum_{p=0}^{m-1}\left(v_{down,p}+v_{k,p}\right) \qquad (4)$$
Calculate the geometric complementarity of $l_{down}$ and $l_k$, expressed as the variance $\sigma_{down,k}$:
$$\sigma_{down,k} = \frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{down,p}+v_{k,p}\right)-\overline{V(l_{down}+l_k)}\right)^2 \qquad (5)$$
After normalizing $\sigma_{down,k}$ and $OT_k$, calculate the weighted value $D_k$:
$$D_k = s\,OT_k + (1-s)\,\sigma_{down,k} \qquad (6)$$
where $s$ is a control parameter, $0 \le s \le 1$;
Step 8: repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: select the theme with the smallest weighted value and insert it below the lower endpoint theme in list B, as the new lower endpoint theme $l_{down}$;
Step 10: repeat steps 4 to 9 until all themes have been added to list B.
In this embodiment, the control parameter $s$ takes the value 0.3.
After the themes are sorted with the sorting method of this embodiment, the generated theme flow graph is as shown in Fig. 4. It can be seen that the theme flow graph is more attractive and smoother, uses space more efficiently, and is better suited to the placement of word clouds.
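As a worked illustration of equations (1) and (2), consider hypothetical frequencies (not taken from the journal data) with $m = 3$: an endpoint theme with $v_{up} = (1, 3, 5)$ and two candidate themes with $v_a = (5, 3, 1)$ and $v_b = (1, 3, 5)$.

$$v_{up} + v_a = (6, 6, 6), \quad \overline{V(l_{up}+l_a)} = 6, \quad \sigma_{up,a} = \tfrac{1}{3}\left(0 + 0 + 0\right) = 0$$
$$v_{up} + v_b = (2, 6, 10), \quad \overline{V(l_{up}+l_b)} = 6, \quad \sigma_{up,b} = \tfrac{1}{3}\left(16 + 0 + 16\right) = \tfrac{32}{3} \approx 10.67$$

Candidate $a$ falls where the endpoint theme rises, so the stacked pair has constant height and zero variance, whereas candidate $b$ amplifies the variation. With equal start times, equation (3) would therefore rank candidate $a$ ahead of candidate $b$, which is the sense in which a small variance indicates good geometric complementarity.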
Embodiment 3: for word cloud shape, the problem of layout instability in TIARA technology, word cloud is also entered by the present invention Go improvement, first theme has been divided into several subregions, then uses scalable algorithm (quoted from " Tag Cloud++- Scalable Tag Clouds for Arbitrary Layouts " literary composition) it is one group of horizontal line section collection by this region representation, It is sequentially placed key word again, generates word cloud.Visual signature is as follows: 1) weight of key word is the biggest, and font is the biggest;2) weight is more Big key word is the closer to this regional center.It is described in detail below in conjunction with Fig. 5, Fig. 6:
Step 1: select theme l on theme flow graphjCorresponding region Gj, its time started and end time respectively equal to literary composition The time started t of shelves collectionstartWith end time tend, by region GjTime period [tstart,tend] it is divided into m-1 section, Mei Geshi Between section a length of Obtain decile time point tstarT+p Δ t, wherein, p=1,2 ..., m-2;
Step 2: as it is shown in figure 5, successively with decile time point tstartCentered by+p Δ t, according to Δ t at region GjUpper intercepting Subregion Rj,p;Rj,pIt is by a setWith Curve And line segmentThe closed space constituted;
Step 3: place the keywords of keyword subset W_{j,p} on each sub-region R_{j,p} to generate the word cloud of theme l_j, including:
3.1 connect the points of point set N in sequence with line segments to obtain an approximate polygon of sub-region R_{j,p};
3.2 let y_max be the y-coordinate of the vertex with the largest y-coordinate among the vertices of the approximate polygon of R_{j,p}, and let y_min be the y-coordinate of the vertex with the smallest y-coordinate;
3.3 as shown in Fig. 6, intersect a group of horizontal lines H = {y = c | y_min ≤ c ≤ y_max, c ∈ Z} with sub-region R_{j,p} to obtain a number of intersection segments; take the sub-segments of each intersection segment that lie inside the polygon, denoted r_c(i), where M is the number of sub-segments of that intersection segment lying inside R_{j,p}; R_{j,p} is thereby represented as a set of horizontal line segments L_{j,p};
3.4 select keywords from W_{j,p} one at a time in descending order of keyword weight; lay out a rectangle of height h and width w in place of the keyword, then place the keyword at the layout position, including:
A. In the set L_{j,p} corresponding to R_{j,p}, test whether a rectangle of width w and height h can be placed at position c = (y_max - y_min)/2. The test is: among the r_c(i) of the lines from c to c-h, check whether one and the same index i exists on every line such that the length of the corresponding line segment is greater than w. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, go to step B;
B. Starting from c = (y_max - y_min)/2, traverse L_{j,p} by alternately setting c = c+1 and c = c-1, testing at each position c whether the rectangle can be placed. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, continue alternating c = c+1 and c = c-1 through L_{j,p} until an r_c(i) satisfying the condition is found or all r_c(i) have been traversed. If no position c satisfying the condition is found after all r_c(i) have been traversed, discard the keyword;
3.5 repeat step 3.4 until all keywords in W_{j,p} have been placed;
Step 4: repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
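The scan-line idea of steps 3.3 and 3.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, each row c is treated as a unit strip sampled at mid-height, the search starts at the middle row of the region (which coincides with (y_max - y_min)/2 only when y_min = 0), and the fit test requires a common free x-interval across the rows of the band, which is slightly stricter than the length test described in step A.

```python
def scanline_segments(polygon, y_min, y_max):
    """Map each integer row c to the x-intervals [start, end] inside the polygon."""
    rows = {}
    n = len(polygon)
    for c in range(y_min, y_max):
        y = c + 0.5  # assumption: sample mid-row so vertices are never hit exactly
        xs = []
        for k in range(n):
            (x1, y1), (x2, y2) = polygon[k], polygon[(k + 1) % n]
            if (y1 <= y < y2) or (y2 <= y < y1):
                xs.append(x1 + (y - y1) * (x2 - x1) / (y2 - y1))
        xs.sort()
        # Consecutive pairs of crossings bound the interior sub-segments r_c(i).
        rows[c] = [[xs[k], xs[k + 1]] for k in range(0, len(xs) - 1, 2)]
    return rows


def try_place(rows, c, w, h):
    """Try to fit a w-by-h rectangle in rows c-h+1..c; return its x or None."""
    band = [rows.get(r) for r in range(c - h + 1, c + 1)]
    if any(not segs for segs in band):
        return None
    for i in range(min(len(segs) for segs in band)):
        x = max(segs[i][0] for segs in band)
        if x + w <= min(segs[i][1] for segs in band):
            for segs in band:
                segs[i][0] = x + w  # consume the used part, like s_c(i) += w
            return x
    return None


def place_words(rows, words):
    """words: (text, w, h) tuples already sorted by descending weight."""
    cs = sorted(rows)
    mid = (cs[0] + cs[-1]) // 2
    order = [mid]
    for d in range(1, len(cs)):
        order += [mid + d, mid - d]  # alternate c+1, c-1 around the middle row
    placed = []
    for text, w, h in words:
        for c in order:
            x = try_place(rows, c, w, h)
            if x is not None:
                placed.append((text, x, c))
                break  # words that fit nowhere are simply dropped
    return placed
```

Because every row only stores its remaining free intervals, placing a word reduces to a lookup and an interval update, which is the reason the text gives for not needing separate collision or boundary detection.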
Fig. 7 shows the result of generating word clouds with this method, where the theme flow graph was generated with themes in random order. A comparison with Fig. 2 shows that the word cloud placement algorithm of the present invention has the following advantages: 1) It uses space effectively. For the same region size and font size, more keywords can be placed. 2) The generated layout is stable and does not change with subsequent interactive operations. 3) The efficiency of the algorithm is significantly improved. The algorithm represents an irregular region, according to fixed rules, as a set of discrete entities; placing a word only requires traversing these entities to find one that satisfies the placement condition, with no collision detection or boundary detection, which greatly improves placement efficiency.
Embodiment 4: this embodiment combines the theme sorting method of Embodiment 2 with the improved word cloud placement method of Embodiment 3; the other steps are unchanged. Fig. 8 shows the result of this embodiment for visualizing the themes of a Chinese document set.
Embodiment 5: in TIARA, the limited region size makes it difficult to place all keywords in one region. The present invention therefore uses a detailed word cloud to further visualize the full content of each theme, or the full content of each theme in each time period. The background color of the word cloud corresponds to the color of the theme in the theme flow graph, and the size of each keyword corresponds to its weight. The present invention uses a random greedy algorithm (cited from "TIARA: A Visual Exploratory Text Analytic System") to generate the detailed word cloud, specifically:
Step 1: select the keyword set that expresses the content of theme l_j;
Step 2: set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: select a keyword from the keyword set in descending order of keyword weight, and use the random greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it inside region C;
Step 4: set the font size according to the weight of the keyword, then approximate the keyword by a rectangle r determined by the font size and the number of characters of the keyword, with the lower-left corner of r placed at the position coordinate;
Step 5: for each conflict point in P, test whether the point conflicts with r; if a conflict exists, go to step 6; if no conflict exists, go to step 7;
Step 6: update the position coordinate along a spiral path and repeat steps 4 and 5 until a position coordinate satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;
Step 7: place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: repeat steps 3 to 7 until all keywords in the keyword set have been processed.
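Steps 2 to 8 can be sketched in Python as below. This is only an illustration under assumptions that are not in the source: the function names, the Archimedean spiral with a fixed step, the 2-unit grid used to discretize placed rectangles, the choice to draw candidates from the inner half of the circle, and the extra `inside_circle` guard are all hypothetical choices, and rectangle sizes are passed in directly rather than derived from font size and character count.

```python
import math
import random


def circle_boundary(radius, n=360):
    """Step 2: discretise the boundary of circle C (centred at the origin)."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]


def rect_points(x, y, w, h, step=2.0):
    """Step 7: discretise a placed rectangle into a grid of conflict points."""
    nx, ny = max(1, math.ceil(w / step)), max(1, math.ceil(h / step))
    return [(x + w * a / nx, y + h * b / ny)
            for a in range(nx + 1) for b in range(ny + 1)]


def conflicts(points, x, y, w, h):
    """Step 5: does any conflict point fall inside the rectangle?"""
    return any(x <= px <= x + w and y <= py <= y + h for px, py in points)


def inside_circle(x, y, w, h, radius):
    """Extra guard (an assumption, not in the source): corners must lie in C."""
    return all(math.hypot(px, py) <= radius
               for px in (x, x + w) for py in (y, y + h))


def place_detail_cloud(words, radius=100.0, max_spiral=100.0, seed=0):
    """words: (text, w, h) sorted by descending weight; returns placed rectangles."""
    rng = random.Random(seed)
    P = circle_boundary(radius)
    placed = []
    for text, w, h in words:
        # Step 3: random candidate position, kept in the inner half of C.
        ang = rng.uniform(0.0, 2.0 * math.pi)
        rad = 0.5 * radius * math.sqrt(rng.random())
        x0, y0 = rad * math.cos(ang), rad * math.sin(ang)
        t = 0.0
        while 0.5 * t <= max_spiral:  # Step 6: stop once the spiral radius exceeds the limit
            x = x0 + 0.5 * t * math.cos(t)  # Archimedean spiral around the candidate
            y = y0 + 0.5 * t * math.sin(t)
            if inside_circle(x, y, w, h, radius) and not conflicts(P, x, y, w, h):
                placed.append((text, x, y, w, h))
                P.extend(rect_points(x, y, w, h))
                break
            t += 0.2
    return placed
```

A keyword for which the loop ends without a `break` is dropped, matching the rejection rule in step 6. Point-based collision testing is approximate in general; it is reliable here only when rectangles are at least as large as the grid step.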
When generating the detailed word cloud, any single keyword subset W_{j,p} of theme l_j may be selected as the keyword set expressing the content of the theme; alternatively, the keyword set formed by merging all keyword subsets W_{j,p} of theme l_j may be used as the keyword set expressing the content of the theme.
Fig. 9 shows the effect of adding the detailed word cloud to the theme visualization method for a Chinese document set. The detailed word cloud is placed at the lower right of the figure; clicking the color band of a theme on the theme flow graph switches the display to the detailed word cloud of that theme. The figure shows the detailed word cloud of the operating system theme. It can be seen that, because of the limited region size, the word cloud of the operating system theme on the theme flow graph does not show all of its keywords, whereas the detailed word cloud shows all keywords of the theme.
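Merging the keyword subsets of one theme can be sketched with Python's `collections.Counter`. This is a minimal sketch assuming each subset W_{j,p} is already available as a mapping from keyword to occurrence count; the function name and the sample keywords are hypothetical.

```python
from collections import Counter


def merge_keyword_subsets(subsets):
    """Merge per-period keyword counts and return (keyword, weight) pairs,
    sorted by descending weight, where weight is the total occurrence count."""
    merged = Counter()
    for subset in subsets:
        merged.update(subset)  # adds counts keyword by keyword
    return merged.most_common()
```

For example, `merge_keyword_subsets([Counter({"kernel": 3, "thread": 1}), Counter({"kernel": 2, "memory": 4})])` yields `kernel` with weight 5, then `memory` with 4, then `thread` with 1.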

Claims (6)

1. A theme visualization method for a Chinese document set, characterized by comprising:
a step of classifying the document set by theme: let the document set have n themes l_j, j = 0, 1, 2, ..., n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where the document subset corresponding to theme l_j is D_j;
a step of dividing the time span of the document set: let the start time of the document set be t_start and the end time be t_end; divide the time span [t_start, t_end] of the document set into equal parts to obtain time periods T_p = (t_start + (p-1)Δt, t_start + pΔt], where p = 1, 2, ..., m-1 and Δt = (t_end - t_start)/(m-1);
a step of calculating the theme frequency: let the theme frequency comprise v_{j,0} and v_{j,p}, where v_{j,0} is the number of documents of the document subset D_j corresponding to theme l_j at start time t_start, and v_{j,p} is the number of documents of the document subset D_j corresponding to theme l_j within time period T_p;
calculate the theme frequency of each theme separately;
a step of sorting the themes: sort all themes to obtain the sorted theme sequence list;
a step of generating the theme flow graph: generate the theme flow graph with a theme flow algorithm according to the sorted theme sequence list and the theme frequencies;
a step of extracting the keywords that represent theme content: let W_{j,p} be the keyword subset that represents the content of theme l_j in the documents of the corresponding document subset D_j within time period T_p; use a general-purpose modern Chinese word segmentation system to extract, from the documents of each theme's document subset in each time period, the keyword subset representing the content of that theme;
a step of calculating keyword weights and sorting: let the weight of a keyword be the number of times the keyword occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and within each keyword subset sort all keywords by weight in descending order;
a step of generating word clouds: generate word clouds on the theme flow graph according to the keyword subsets and keyword weights;
the step of sorting the themes is carried out with a sorting method based on theme frequency and geometric complementarity, including:
Step 1: let the onset time of theme l_j be OT_j; when v_{j,0} is not equal to zero, take the start time t_start of the document set as OT_j;
when v_{j,0} is equal to zero, take as OT_j the minimum of the left endpoints of those time periods T_p for which v_{j,p} is nonzero; calculate the onset time of each theme;
Step 2: define the frequency sum of theme l_j, and calculate the frequency sum of each theme;
Step 3: create a new empty list B; if n is even, write the theme with the largest frequency sum into the first row of the list as the upper end theme l_up, and write the theme with the second largest frequency sum into the second row of the list as the lower end theme l_down; if n is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper end theme l_up and the lower end theme l_down;
Step 4: select a theme l_i not in list B, and calculate the mean of the frequency sum of l_up and l_i:
$$\overline{V(l_{up}+l_i)} = \frac{1}{m}\sum_{p=0}^{m-1}\left(v_{up,p}+v_{i,p}\right) \tag{1}$$
Calculate the geometric complementarity of l_up and l_i, expressed by the variance σ_{up,i}:
$$\sigma_{up,i} = \frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{up,p}+v_{i,p}\right)-\overline{V(l_{up}+l_i)}\right)^{2} \tag{2}$$
After normalizing σ_{up,i} and OT_i, calculate the weighted value D_i:
$$D_i = s\,OT_i + (1-s)\,\sigma_{up,i} \tag{3}$$
where s is a control parameter, 0 ≤ s ≤ 1;
Step 5: repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: select the theme with the smallest weighted value D_i and insert it above the upper end theme of list B as the new upper end theme l_up;
Step 7: select a theme l_k not in list B, and calculate the mean of the frequency sum of l_down and l_k:
$$\overline{V(l_{down}+l_k)} = \frac{1}{m}\sum_{p=0}^{m-1}\left(v_{down,p}+v_{k,p}\right) \tag{4}$$
Calculate the geometric complementarity of l_down and l_k, expressed by the variance σ_{down,k}:
$$\sigma_{down,k} = \frac{1}{m}\sum_{p=0}^{m-1}\left(\left(v_{down,p}+v_{k,p}\right)-\overline{V(l_{down}+l_k)}\right)^{2} \tag{5}$$
After normalizing σ_{down,k} and OT_k, calculate the weighted value D_k:
$$D_k = s\,OT_k + (1-s)\,\sigma_{down,k} \tag{6}$$
where s is a control parameter, 0 ≤ s ≤ 1;
Step 8: repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: select the theme with the smallest weighted value and insert it below the lower end theme of list B as the new lower end theme l_down;
Step 10: repeat steps 4 to 9 until all themes have been added to list B.
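The ordering procedure of steps 1 to 10 can be sketched in Python as follows. This is an illustrative sketch, not the claimed method itself: the function names are hypothetical, the onset time is represented by the index of the first period with documents, and min-max scaling over the current candidates is assumed for the normalization, which the claim does not specify.

```python
def onset_index(freq):
    """Index of the first period with documents (0 means the collection start)."""
    return next((p for p, v in enumerate(freq) if v != 0), len(freq))


def _normalise(values):
    """Assumed min-max scaling to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def complementarity(a, b):
    """Variance of the summed frequencies of two themes, equations (1)-(2)."""
    total = [x + y for x, y in zip(a, b)]
    mean = sum(total) / len(total)
    return sum((t - mean) ** 2 for t in total) / len(total)


def order_themes(freqs, s=0.3):
    """freqs: {theme: [v_0, ..., v_{m-1}]}; returns themes from top to bottom."""
    by_sum = sorted(freqs, key=lambda t: sum(freqs[t]), reverse=True)
    # Step 3: seed the list with one theme (odd n) or two themes (even n).
    B = by_sum[:2] if len(freqs) % 2 == 0 else by_sum[:1]
    remaining = [t for t in freqs if t not in B]
    attach_top = True
    while remaining:
        end = B[0] if attach_top else B[-1]  # current l_up or l_down
        sig = _normalise([complementarity(freqs[end], freqs[t]) for t in remaining])
        ot = _normalise([onset_index(freqs[t]) for t in remaining])
        scores = [s * o + (1 - s) * g for o, g in zip(ot, sig)]  # equations (3)/(6)
        best = remaining.pop(scores.index(min(scores)))
        if attach_top:
            B.insert(0, best)  # Step 6: new upper end theme
        else:
            B.append(best)     # Step 9: new lower end theme
        attach_top = not attach_top
    return B
```

A low variance of the summed frequencies means the two bands together have a nearly constant thickness, so the theme that best fills the gaps of the current end theme is attached next to it.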
2. The theme visualization method for a Chinese document set as claimed in claim 1, characterized in that the control parameter s = 0.3.
3. The theme visualization method for a Chinese document set as claimed in claim 1, characterized in that the step of generating word clouds includes:
Step 1: select the region G_j corresponding to theme l_j on the theme flow graph, whose start time and end time equal the start time t_start and end time t_end of the document set respectively; divide the time span [t_start, t_end] of region G_j into m-1 segments, each of length Δt = (t_end - t_start)/(m-1), obtaining the division time points t_start + pΔt, where p = 1, 2, ..., m-2;
Step 2: taking each division time point t_start + pΔt in turn as the center, cut a sub-region R_{j,p} from region G_j according to Δt; R_{j,p} is the closed region enclosed by a point set, curves and line segments;
Step 3: place the keywords of keyword subset W_{j,p} on each sub-region R_{j,p} to generate the word cloud of theme l_j, including:
3.1 connect the points of point set N in sequence with line segments to obtain an approximate polygon of sub-region R_{j,p};
3.2 let y_max be the y-coordinate of the vertex with the largest y-coordinate among the vertices of the approximate polygon of R_{j,p}, and let y_min be the y-coordinate of the vertex with the smallest y-coordinate;
3.3 intersect a group of horizontal lines H = {y = c | y_min ≤ c ≤ y_max, c ∈ Z} with sub-region R_{j,p} to obtain a number of intersection segments; take the sub-segments of each intersection segment that lie inside the polygon, denoted r_c(i), where M is the number of sub-segments of that intersection segment lying inside R_{j,p}; R_{j,p} is thereby represented as a set of horizontal line segments L_{j,p};
3.4 select keywords from W_{j,p} one at a time in descending order of keyword weight; lay out a rectangle of height h and width w in place of the keyword, then place the keyword at the layout position, including:
A. In the set L_{j,p} corresponding to R_{j,p}, test whether a rectangle of width w and height h can be placed at position c = (y_max - y_min)/2. The test is: among the r_c(i) of the lines from c to c-h, check whether one and the same index i exists on every line such that the length of the corresponding line segment is greater than w. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, go to step B;
B. Starting from c = (y_max - y_min)/2, traverse L_{j,p} by alternately setting c = c+1 and c = c-1, testing at each position c whether the rectangle can be placed. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, continue alternating c = c+1 and c = c-1 through L_{j,p} until an r_c(i) satisfying the condition is found or all r_c(i) have been traversed. If no position c satisfying the condition is found after all r_c(i) have been traversed, discard the keyword;
3.5 repeat step 3.4 until all keywords in W_{j,p} have been placed;
Step 4: repeat steps 1 to 3 until the word cloud of every theme on the theme flow graph has been generated.
4. The theme visualization method for a Chinese document set as claimed in claim 1, characterized by further comprising a step of generating a detailed word cloud, including:
Step 1: select the keyword set that expresses the content of theme l_j;
Step 2: set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: select a keyword from the keyword set in descending order of keyword weight, and use a random greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it inside region C;
Step 4: set the font size according to the weight of the keyword, then approximate the keyword by a rectangle r determined by the font size and the number of characters of the keyword, with the lower-left corner of r placed at the position coordinate;
Step 5: for each conflict point in P, test whether the point conflicts with r; if a conflict exists, go to step 6; if no conflict exists, go to step 7;
Step 6: update the position coordinate along a spiral path and repeat steps 4 and 5 until a position coordinate satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;
Step 7: place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: repeat steps 3 to 7 until all keywords in the keyword set have been processed.
5. The theme visualization method for a Chinese document set as claimed in claim 4, characterized in that the keyword set expressing the content of theme l_j is any one keyword subset W_{j,p} of theme l_j.
6. The theme visualization method for a Chinese document set as claimed in claim 4, characterized in that the keyword set expressing the content of theme l_j is obtained by the following steps:
Step 1: merge all keyword subsets W_{j,p} of theme l_j, p = 1, 2, ..., m-1;
Step 2: calculate the weight of every keyword in the merged set, where the weight of a keyword is the number of times the keyword occurs in all keyword subsets.
CN201310488312.7A 2013-10-17 2013-10-17 Subject visualization method for Chinese document set Active CN103631856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310488312.7A CN103631856B (en) 2013-10-17 2013-10-17 Subject visualization method for Chinese document set

Publications (2)

Publication Number Publication Date
CN103631856A CN103631856A (en) 2014-03-12
CN103631856B true CN103631856B (en) 2017-01-11

Family

ID=50212898

Country Status (1)

Country Link
CN (1) CN103631856B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996234A (en) * 2009-08-17 2011-03-30 阿瓦雅公司 Word cloud audio navigation
US8402030B1 (en) * 2011-11-21 2013-03-19 Raytheon Company Textual document analysis using word cloud comparison

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8753184B2 (en) * 2012-04-04 2014-06-17 David Goldenberg System and method for interactive gameplay with song lyric database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WordStream: Visualizing Theme Summarization and Comparison in Document Collections over Time; Ting Liang et al.; Advances in Information Sciences and Service Sciences (AISS); 20130228; main text p. 975 paragraphs 3-9, p. 976 paragraphs 1-6, p. 977 paragraphs 1-6, p. 978 paragraphs 1-9, p. 979 paragraphs 1-14, p. 980 paragraphs 1-5, p. 981 paragraphs 1-3, Figs. 1-7 *

Legal Events

Code Title
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant