CN103631856B - Subject visualization method for Chinese document set - Google Patents
- Publication number
- CN103631856B (application CN201310488312.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention discloses a subject visualization method for a Chinese document set. The subject visualization method comprises the steps of classifying the document set according to subjects, dividing the time periods of the document set, calculating subject frequency, ranking the subjects, generating a subject flow graph, extracting keywords expressing the content of the subjects, calculating and ranking the weights of the keywords and generating a character cloud. The subject visualization method further comprises an ordering method based on the subject frequency and geometrical complementarity, a character cloud arrangement method and a method for generating the detailed character cloud. The subject visualization method has the advantages that the subject visualization on the Chinese document set is achieved; the subject flow graph generated through the ordering method based on the subject frequency and the geometrical complementarity is more attractive, flatter, high in space use ratio and more beneficial to character cloud arrangement; the character cloud arrangement method can effectively utilize space, and arrangement efficiency is greatly improved; the detailed character cloud is generated, and all the keyword content of the subjects can be shown.
Description
Technical field
The present invention relates to the fields of text visualization and theme analysis, and specifically to a theme visualization method for a Chinese document set.
Background art
Large document collections, such as news, scientific literature, web pages, electronic publications and bulletins, contain a great deal of information. As information digitization develops and spreads, document collections grow larger by the day. Quickly reading and understanding this vast body of information, and extracting useful knowledge from it, has become an urgent problem.
A "theme" generally comprises a core event or activity together with all directly related events and activities. Theme detection methods use techniques such as clustering, classification, retrieval and topic tracking to classify and organize a document set hierarchically by theme, which makes it easier for users to retrieve, select and browse documents. However, after the documents have been classified, users still have to spend a great deal of time reading all the documents under a theme in order to understand its main content, uncover latent knowledge and obtain the information they need.
Multi-document automatic summarization builds on theme detection: it gathers the content of a theme, removes redundancy, and generates a comprehensive and concise text, which greatly improves the efficiency of information acquisition. However, existing multi-document summaries are usually rather complex and hard for users to understand, the summary generation process is difficult to control, and friendly user interfaces and human-computer interaction are lacking. In addition, multi-document summarization often ignores attributes other than the text content, such as time and quantity. It can therefore hardly show how the themes and their content evolve over time within a document set, nor can it reflect the relationships among the themes of the same document set.
Text visualization, an important branch of information visualization, exploits the innate human ability to recognize, remember and analyze graphics. It converts textual information into graphical images to help people understand, read and analyze the content and structure of text intuitively and efficiently and, through corresponding interactive operations, helps them discover valuable knowledge and patterns.
The Word Cloud visualization technique abstracts text content into a set of words, uses font size to express the frequency of each word, and arranges the words compactly and aesthetically according to certain rules to present the characteristics of the text. However, a word cloud can only visualize a single document. For multiple documents, ThemeRiver (theme flow) visualizes the themes in a document set and shows how the intensity of each theme changes over time. The original theme flow contains only theme intensity and time information, and the themes are arranged in random order. Later, Liu Shixia et al. proposed TIARA, an improved theme flow that embeds word clouds in the theme flow to further visualize the content of each theme, helping users quickly analyze how the content of text themes changes over time.
The text visualization techniques above all lack generality and are not suitable for Chinese documents; to date there is still no visualization technique for analyzing the themes of Chinese documents. Moreover, the TIARA technique, which targets theme visualization of English documents only, has the following problems: 1) the shape and layout of the word clouds in the theme flow are unstable, which easily misleads users and impairs theme analysis; 2) because the area is limited, the generated word clouds cannot show all the key content of each theme.
Summary of the invention
The object of the present invention is to provide a theme visualization method for a Chinese document set. The method aggregates and processes the information of each theme extracted from the Chinese document set, measures the intensity of each theme and the weight of its content, and finally presents them graphically.
The technical solution that achieves the object of the invention is as follows. A theme visualization method for a Chinese document set comprises the following steps.
Classifying the document set by theme: let the document set have n themes l_j, j = 0, 1, 2, ..., n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where D_j is the document subset corresponding to theme l_j;
Dividing the time span of the document set: let the start time of the document set be t_start and the end time be t_end; divide the time span [t_start, t_end] into equal parts to obtain the time periods T_p = (t_start + (p-1)Δt, t_start + pΔt], where p = 1, 2, ..., m-1 and Δt = (t_end - t_start)/(m-1);
Calculating the theme frequencies: the theme frequencies comprise v_{j,0} and v_{j,p}, where v_{j,0} is the number of documents in the document subset D_j of theme l_j at the start time t_start, and v_{j,p} is the number of documents in D_j within time period T_p; calculate the theme frequencies of every theme;
Ordering the themes: sort all themes to obtain the ordered theme list;
Generating the theme flow graph: from the ordered theme list and the theme frequencies, generate the theme flow graph with the theme flow algorithm;
Extracting the keywords that express theme content: let W_{j,p} be the subset of keywords that express the content of theme l_j in the documents of D_j within time period T_p; use the "Modern Chinese General Word Segmentation System" to extract, for each theme and each time period, the keyword subset that expresses the theme content from the corresponding documents;
Calculating and sorting the keyword weights: the weight of a keyword is the number of times it occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort the keywords of each subset by weight in descending order;
Generating the word cloud: from the keyword subsets and keyword weights, generate the word cloud on the theme flow graph.
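The time-division and theme-frequency steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `period_index` and `theme_frequencies`, the `(theme, time)` input format, and the use of issue numbers as times are assumptions made for the example. Index 0 holds v_{j,0} (documents exactly at t_start) and index p holds v_{j,p} for the half-open period T_p.

```python
import math

def period_index(t, t_start, t_end, m):
    """Return 0 for t == t_start, otherwise the p with t in (t_start+(p-1)dt, t_start+p*dt]."""
    if t == t_start:
        return 0
    dt = (t_end - t_start) / (m - 1)          # m points give m-1 equal periods
    p = math.ceil((t - t_start) / dt)
    return min(max(p, 1), m - 1)              # clamp against rounding at the ends

def theme_frequencies(docs, themes, t_start, t_end, m):
    """docs: iterable of (theme, time). Returns {theme: [v_j0, v_j1, ..., v_j(m-1)]}."""
    freq = {theme: [0] * m for theme in themes}
    for theme, t in docs:
        freq[theme][period_index(t, t_start, t_end, m)] += 1
    return freq
```

With issues 1 to 9 of a journal as the time axis (m = 9), a document in issue 1 is counted in v_{j,0} and a document in issue 5 is counted in v_{j,4}.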
In the above technical solution, the step of ordering the themes may use an ordering method based on theme frequency and geometric complementarity, comprising:
Step 1: let the initial time of theme l_j be OT_j. When v_{j,0} is not zero, take the start time t_start of the document set as OT_j; when v_{j,0} is zero, take as OT_j the minimum of the left endpoints of those time periods T_p for which v_{j,p} is not zero. Calculate the initial time of every theme;
Step 2: calculate the frequency sum of every theme, that is, the sum of the theme frequencies v_{j,0} and v_{j,p} of theme l_j;
Step 3: create an empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper-end theme l_up, and write the theme with the second largest frequency sum into the second row as the lower-end theme l_down. If n is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper-end theme l_up and the lower-end theme l_down;
Step 4: select a theme l_i that is not in list B and calculate the mean of the frequency sums of l_up and l_i (equation (1), not preserved in this text). Calculate the geometric complementarity of l_up and l_i, expressed as the variance σ_{up,i} (equation (2), not preserved in this text). After normalizing σ_{up,i} and OT_i, calculate the weighted value D_i:
D_i = s·OT_i + (1 - s)·σ_{up,i}    (3)
where s is a control parameter, 0 ≤ s ≤ 1;
Step 5: repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: select the theme with the smallest weighted value D_i and insert it above the upper-end theme of list B as the new upper-end theme l_up;
Step 7: select a theme l_k that is not in list B and calculate the mean of the frequency sums of l_down and l_k (equation (4), not preserved in this text). Calculate the geometric complementarity of l_down and l_k, expressed as the variance σ_{down,k} (equation (5), not preserved in this text). After normalizing σ_{down,k} and OT_k, calculate the weighted value D_k:
D_k = s·OT_k + (1 - s)·σ_{down,k}    (6)
where s is a control parameter, 0 ≤ s ≤ 1;
Step 8: repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: select the theme with the smallest weighted value and insert it below the lower-end theme of list B as the new lower-end theme l_down;
Step 10: repeat steps 4 to 9 until all themes have been added to list B.
In the present invention, the control parameter s takes the value 0.3.
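The ordering procedure above can be sketched as follows. Because equations (1), (2), (4) and (5) are not preserved in this text, the sketch assumes that the geometric complementarity of two themes is the variance over time of their summed frequency series (a flat combined band gives a small variance), and that normalization is min-max scaling over the remaining candidates. Both choices, and the names `order_themes`, `initial_time` and `variance_of_sum`, are assumptions made for illustration. Initial times are measured in units of Δt from t_start.

```python
def initial_time(v):
    """OT_j in units of dt from t_start, for v = [v_j0, v_j1, ..., v_j(m-1)]."""
    if v[0] != 0:
        return 0
    for p in range(1, len(v)):
        if v[p] != 0:
            return p - 1                      # left endpoint of T_p
    return len(v) - 1                         # theme without documents: latest possible

def variance_of_sum(a, b):
    """Assumed complementarity measure: variance of the summed frequency series."""
    combined = [x + y for x, y in zip(a, b)]
    mean = sum(combined) / len(combined)
    return sum((c - mean) ** 2 for c in combined) / len(combined)

def min_max(values):
    lo, hi = min(values), max(values)
    return [0.0] * len(values) if hi == lo else [(v - lo) / (hi - lo) for v in values]

def pick(end_theme, remaining, freq, s):
    """Return the remaining theme with the smallest D = s*OT + (1-s)*sigma."""
    sigma = min_max([variance_of_sum(freq[end_theme], freq[t]) for t in remaining])
    ot = min_max([initial_time(freq[t]) for t in remaining])
    scores = [s * o + (1 - s) * g for o, g in zip(ot, sigma)]
    return remaining[scores.index(min(scores))]

def order_themes(freq, s=0.3):
    """freq: {theme: frequency list}. Returns the themes ordered from top to bottom."""
    by_total = sorted(freq, key=lambda t: sum(freq[t]), reverse=True)
    order = by_total[:2] if len(by_total) % 2 == 0 else by_total[:1]
    remaining = by_total[len(order):]
    while remaining:
        chosen = pick(order[0], remaining, freq, s)    # steps 4 to 6: extend the top
        order.insert(0, chosen)
        remaining.remove(chosen)
        if remaining:
            chosen = pick(order[-1], remaining, freq, s)   # steps 7 to 9: extend the bottom
            order.append(chosen)
            remaining.remove(chosen)
    return order
```

Under these assumptions, a theme whose peaks fill the troughs of the current end theme is preferred, which is what keeps the stacked bands flat.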
In the above technical solution, the method of generating the word cloud on the theme flow graph is as follows.
Step 1: select the region G_j corresponding to theme l_j on the theme flow graph; its start and end times equal the start time t_start and end time t_end of the document set. Divide the time span [t_start, t_end] of region G_j into m-1 segments, each of length Δt = (t_end - t_start)/(m-1), to obtain the dividing time points t_start + pΔt, where p = 1, 2, ..., m-2;
Step 2: taking each dividing time point t_start + pΔt in turn as the center, cut a subregion R_{j,p} out of region G_j according to Δt. R_{j,p} is the closed region bounded by a point set, curves and line segments (the defining expressions are not preserved in this text);
Step 3: place the keywords of keyword subset W_{j,p} on each subregion R_{j,p} to generate the word cloud of theme l_j, comprising:
3.1 Connect the N sampled points successively with line segments (the expressions of the segments and points are not preserved in this text) to obtain the approximate polygon of subregion R_{j,p};
3.2 Let y_max be the y coordinate of the vertex with the largest y coordinate among the vertices of the approximate polygon of R_{j,p}, and let y_min be the y coordinate of the vertex with the smallest y coordinate;
3.3 Intersect the group of horizontal lines H = {y = c | y_min ≤ c ≤ y_max, c ∈ Z} with subregion R_{j,p} to obtain a number of intersection segments. For each intersection line, take the sub-segments lying inside the polygon, denoted r_c(i), i = 1, ..., M, where M is the number of sub-segments of that intersection line lying inside R_{j,p} and s_c(i) is the starting x coordinate of r_c(i). R_{j,p} is thereby represented as a set of horizontal line segments L_{j,p};
3.4 Choose the keywords of W_{j,p} one at a time in descending order of weight; lay out each keyword as a rectangle of height h and width w, and then place the keyword at the laid-out position, comprising:
A. Check whether a rectangle of width w and height h can be placed at position c = (y_max - y_min)/2 in the set L_{j,p} corresponding to R_{j,p}. The check is: among the r_c(i) of the rows from c to c-h, determine whether there is a common index i for which every segment is longer than w. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, go to step B;
B. Starting from c = (y_max - y_min)/2, traverse L_{j,p} alternately with c = c+1 and c = c-1, checking whether the rectangle can be placed at position c. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, continue the alternating traversal until an r_c(i) that satisfies the condition is found or all r_c(i) have been traversed. If no suitable position c is found after traversing all r_c(i), discard the keyword;
3.5 Repeat step 3.4 until all keywords of W_{j,p} have been processed;
Step 4: repeat steps 1 to 3 until the word cloud of every theme has been generated on the theme flow graph.
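The placement loop of step 3.4 can be sketched as follows, assuming the subregion has already been converted into integer rows of free horizontal segments. Several details are interpretations rather than statements of the source: the rectangle occupies the h rows ending at row c, it starts at the largest free x among those rows, the free start of every covered row is advanced by w, and the search begins at the midpoint row (y_min + y_max) // 2, which coincides with (y_max - y_min)/2 only when y_min is 0. The names `try_place`, `alternating_rows` and `layout` are hypothetical.

```python
def try_place(rows, c, w, h):
    """Try to fit a w-by-h rectangle whose top row is c; return its x or None."""
    band = range(c - h + 1, c + 1)
    if any(r not in rows for r in band):
        return None
    for i in range(len(rows[c])):
        if any(i >= len(rows[r]) for r in band):
            continue                              # segment i must exist in every row
        x = max(rows[r][i][0] for r in band)
        right = min(rows[r][i][1] for r in band)
        if right - x >= w:
            for r in band:
                rows[r][i][0] = x + w             # consume the space in all covered rows
            return x
    return None

def alternating_rows(y_min, y_max):
    """Yield the center row, then rows alternately above and below it."""
    c0 = (y_min + y_max) // 2
    yield c0
    for d in range(1, y_max - y_min + 1):
        if c0 + d <= y_max:
            yield c0 + d
        if c0 - d >= y_min:
            yield c0 - d

def layout(rows, words):
    """rows: {c: [[x_left, x_right], ...]}; words: (text, w, h) in descending weight."""
    y_min, y_max = min(rows), max(rows)
    placed = []
    for text, w, h in words:
        for c in alternating_rows(y_min, y_max):
            x = try_place(rows, c, w, h)
            if x is not None:
                placed.append((text, x, c))
                break                             # a word that never fits is discarded
    return placed
```

Because each row only stores its remaining free interval, placing a word needs no pairwise collision test against the words already placed.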
The invention further provides a method of generating a detailed word cloud, comprising:
Step 1: select the keyword set that expresses the content of theme l_j;
Step 2: set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: choose a keyword from the keyword set in descending order of weight, and use the random greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it inside region C;
Step 4: set the font size according to the weight of the keyword; then, from the font size and the number of characters of the keyword, approximate the keyword by a rectangle r whose lower-left corner is at the candidate coordinate;
Step 5: for each conflict point in P, check whether the point conflicts with r. If a conflict exists, go to step 6; if no conflict exists, go to step 7;
Step 6: update the candidate coordinate along a spiral path and repeat steps 4 and 5 until a position satisfying the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;
Step 7: place the keyword at the coordinate (word.x, word.y), discretize the area occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: repeat steps 3 to 7 until all keywords in the keyword set have been placed.
When generating the detailed word cloud, any one keyword subset W_{j,p} of theme l_j may be selected as the keyword set expressing the theme content, or all keyword subsets W_{j,p} of theme l_j may be merged to form that keyword set.
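The detailed word cloud procedure above can be sketched as follows. The source does not give the font-size mapping, the spiral parameters or the form of the random candidate, so the linear font scale, the Archimedean spiral with pitch 0.5, the seeded random start near the center and the one-unit grid used to discretize placed rectangles are all assumptions. The sketch also adds one safeguard that is not in the source: a rectangle with a corner outside the circle is treated as a conflict. The names `place_words`, `boundary_points` and `rect_points` are hypothetical.

```python
import math
import random

def boundary_points(radius, n=720):
    """Discretize the boundary of the circular region C into conflict points."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def rect_points(x, y, w, h):
    """Grid points (spacing 1) covering an integer-sized rectangle, edges included."""
    return [(x + i, y + j) for i in range(w + 1) for j in range(h + 1)]

def conflicts(points, x, y, w, h, radius):
    corners = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
    if any(math.hypot(cx, cy) > radius for cx, cy in corners):
        return True                               # added safeguard: stay inside C
    return any(x <= px <= x + w and y <= py <= y + h for px, py in points)

def place_words(weights, radius=100, max_spiral=100, seed=0):
    """weights: {keyword: weight}. Returns a list of (keyword, x, y, w, h)."""
    rng = random.Random(seed)
    points = boundary_points(radius)
    top = max(weights.values())
    placed = []
    for text, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        font = 6 + round(14 * weight / top)       # assumed font-size mapping
        w, h = font * len(text), font             # one square cell per character
        x0 = rng.uniform(-5, 5) - w / 2           # random candidate near the center
        y0 = rng.uniform(-5, 5) - h / 2
        theta = 0.0
        while 0.5 * theta <= max_spiral:          # spiral radius limit from step 6
            r = 0.5 * theta
            x, y = x0 + r * math.cos(theta), y0 + r * math.sin(theta)
            if not conflicts(points, x, y, w, h, radius):
                placed.append((text, x, y, w, h))
                points.extend(rect_points(x, y, w, h))
                break
            theta += 0.1
    return placed
```

Treating each character as a square cell of side `font` suits Chinese keywords, where `len(text)` is the character count; a keyword whose spiral search exceeds the radius limit is simply left out of the result.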
Compared with the prior art, the present invention has the following technical effects. 1. It achieves theme visualization for Chinese document sets. 2. After the themes are ordered with the ordering method based on theme frequency and geometric complementarity, the generated theme flow graph is more attractive and flatter, uses space more efficiently, and is better suited to placing word clouds. 3. The word cloud layout method of the invention uses space effectively: for the same area and font size, more keywords can be placed. The generated layout is stable and does not change with subsequent interactive operations. The efficiency of the algorithm is also markedly improved: the algorithm represents an irregular region, according to a fixed rule, as a set of discrete entities, so placing a word only requires traversing the entities to find one that satisfies the placement condition, without collision detection or boundary detection, which greatly improves placement efficiency. 4. The detailed word cloud can show all the keyword content of a theme.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 shows, for the first embodiment of the invention, the result of generating the theme flow graph after randomly ordering the themes and then visualizing the theme content with the word cloud layout algorithm of the TIARA technique.
Fig. 3 is a flow chart of the ordering method based on theme frequency and geometric complementarity in the second embodiment of the invention.
Fig. 4 shows the theme flow graph generated in the second embodiment of the invention.
Fig. 5 is a schematic diagram of cutting out a subregion in the third embodiment of the invention.
Fig. 6 is a schematic diagram of approximating a subregion as a set of horizontal line segments in the third embodiment of the invention.
Fig. 7 shows the result of placing keywords on the theme flow graph in the third embodiment of the invention, where the theme flow graph is generated after randomly ordering the themes.
Fig. 8 shows the result of theme visualization of a Chinese document set in the fourth embodiment of the invention, where the theme flow graph is generated after ordering with the method based on theme frequency and geometric complementarity, and the word cloud is generated by the improved layout algorithm.
Fig. 9 shows the result of theme visualization of a Chinese document set with the detailed word cloud added, in the fifth embodiment of the invention.
Detailed description of the embodiments
Embodiment 1: the Chinese document theme visualization method is illustrated below with data from the journal "Journal of Software", with reference to Fig. 1.
Step 1, classify the document set by theme: let the document set have n themes l_j, j = 0, 1, 2, ..., n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where D_j is the document subset corresponding to theme l_j. Specifically, the paper data of issues 1 to 9 of "Journal of Software" are input, and the document set is classified into five themes (system software and software engineering; database technology; computer networks and information security; pattern recognition and artificial intelligence; operating systems) to obtain five document subsets.
Step 2, divide the time span of the document set: let the start time of the document set be t_start and the end time be t_end; divide the time span [t_start, t_end] into equal parts to obtain the time periods T_p = (t_start + (p-1)Δt, t_start + pΔt], where p = 1, 2, ..., m-1, Δt = (t_end - t_start)/(m-1), and m is the total number of time points counting the start time, the end time and the dividing time points. Here the start time of the document set is issue 1 and the end time is issue 9; the time span is divided into 8 time periods with an interval of one issue.
Step 3, calculate the theme frequencies: the theme frequencies comprise v_{j,0} and v_{j,p}, where v_{j,0} is the number of documents in the document subset D_j of theme l_j at the start time t_start, and v_{j,p} is the number of documents in D_j within time period T_p; calculate the theme frequencies of every theme. Here, the number of documents of each theme appearing in issue 1 and in each of the other issues is calculated, that is, the number of documents each theme contains at the start time of the document set and within each time period.
Step 4, order the themes: sort all themes to obtain the ordered theme list. In this embodiment, the themes are ordered with the traditional random arrangement method. The ordering method used directly affects the appearance of the generated theme flow graph.
Step 5, generate the theme flow graph: from the ordered theme list and the theme frequencies, generate the theme flow graph with the theme flow algorithm, thereby visualizing theme intensity. In this embodiment, the theme flow graph shown by the colored bands in Fig. 2 is generated, where the blue band is pattern recognition and artificial intelligence, the purple band is computer networks and information security, the red band is operating systems, the yellow band is database technology, and the green band is system software and software engineering. The theme flow algorithm (from "ThemeRiver: Visualizing thematic changes in large document collections") interpolates the weights of each theme at the discrete times (the interpolation function must satisfy the constraint that its derivative is zero at the extreme points) and then draws the stacked graph to produce the theme flow graph. In the theme flow graph, the horizontal axis represents time, the height of a band along the vertical axis represents theme intensity, and bands of different colors represent different themes. A band that widens or narrows over time shows how the intensity of its theme evolves.
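The interpolate-then-stack idea described above can be illustrated with the following sketch. It is not the ThemeRiver algorithm itself: it swaps in a smoothstep curve between consecutive samples, which has zero slope at every sample and therefore also at the extreme points, and it assumes that the stack is centered on the horizontal axis. The names `smooth_interpolate` and `stack_layers` are hypothetical.

```python
def smooth_interpolate(values, samples_per_gap=4):
    """Interpolate between consecutive values with a smoothstep curve (zero slope at samples)."""
    out = []
    for a, b in zip(values, values[1:]):
        for k in range(samples_per_gap):
            u = k / samples_per_gap
            w = u * u * (3 - 2 * u)               # smoothstep: derivative 0 at u = 0 and u = 1
            out.append(a + (b - a) * w)
    out.append(values[-1])
    return out

def stack_layers(order, freq, samples_per_gap=4):
    """Return {theme: [(lower, upper), ...]} with the whole stack centered on y = 0."""
    curves = {t: smooth_interpolate(freq[t], samples_per_gap) for t in order}
    n = len(curves[order[0]])
    bands = {t: [] for t in order}
    for x in range(n):
        total = sum(curves[t][x] for t in order)
        y = total / 2                             # top edge of the stack at this x
        for t in order:                           # order[0] is the topmost theme
            bands[t].append((y - curves[t][x], y))
            y -= curves[t][x]
    return bands
```

Each `(lower, upper)` pair gives the vertical extent of a band at one sample, so the band height equals the interpolated theme intensity at that time.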
Step 6, extract the keywords that express theme content: let W_{j,p} be the subset of keywords that express the content of theme l_j in the documents of D_j within time period T_p; use the "Modern Chinese General Word Segmentation System" to extract, for each theme and each time period, the keyword subset that expresses the theme content from the corresponding documents. In this embodiment, the bag-of-words text analysis technique is used to extract the keyword subsets that express theme content. Specifically, the "Modern Chinese General Word Segmentation System" developed by the Institute of Language Information Processing of Beijing Language and Culture University is used to segment all documents of each theme's document subset that belong to each time period; stop words such as modal particles, adverbs, prepositions and conjunctions are removed, and multiple keyword subsets are finally obtained.
Step 7, calculate and sort the keyword weights: the weight of a keyword is the number of times it occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort all keywords by weight in descending order.
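Steps 6 and 7 amount to a bag-of-words count, which can be sketched as follows. The sketch assumes the documents of one theme and one time period have already been segmented into word lists by an external segmenter; the function name `keyword_weights` and the tie-breaking by keyword text are assumptions made for the example.

```python
from collections import Counter

def keyword_weights(tokenized_docs, stop_words=frozenset()):
    """Count keyword occurrences over the documents of one theme and time period.

    tokenized_docs: list of documents, each a list of already segmented words.
    Returns (keyword, weight) pairs sorted by descending weight.
    """
    counts = Counter(
        token
        for doc in tokenized_docs
        for token in doc
        if token not in stop_words                # drop particles, prepositions, etc.
    )
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling the function once per pair (theme, time period) yields the sorted keyword subsets W_{j,p} with their weights.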
Step 8, generate the word cloud: from the keyword subsets and keyword weights, use the TIARA algorithm to generate the word cloud on the theme flow graph, thereby visualizing the theme content.
The result of visualizing the themes of the Chinese document set with the method of this embodiment is shown in Fig. 2.
Embodiment 2: in the Chinese document set theme visualization method above, the themes are arranged in random order. When the theme flow is generated, if the intensity of one theme changes too sharply, the shapes of the adjacent themes are distorted, which makes the result less attractive and makes the relative intensity of the themes hard to judge. The distorted themes also hinder the placement of word clouds. At the same time, among all themes of a document set, users tend to care most about the specific content of the theme with the greatest intensity. The invention therefore further improves the theme ordering step of Embodiment 1 by ordering the themes with a method based on theme frequency and geometric complementarity. The ordering method is described in detail below with reference to Fig. 3:
Step 1: let the initial time of theme l_j be OT_j. When v_{j,0} is not zero, take the start time t_start of the document set as OT_j; when v_{j,0} is zero, take as OT_j the minimum of the left endpoints of those time periods T_p for which v_{j,p} is not zero. Calculate the initial time of every theme;
Step 2: calculate the frequency sum of every theme, that is, the sum of the theme frequencies v_{j,0} and v_{j,p} of theme l_j;
Step 3: create an empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper-end theme l_up, and write the theme with the second largest frequency sum into the second row as the lower-end theme l_down. If n is odd, write the theme with the largest frequency sum into the first row of the list, where it serves as both the upper-end theme l_up and the lower-end theme l_down;
Step 4: select a theme l_i that is not in list B and calculate the mean of the frequency sums of l_up and l_i (equation (1), not preserved in this text). Calculate the geometric complementarity of l_up and l_i, expressed as the variance σ_{up,i} (equation (2), not preserved in this text). After normalizing σ_{up,i} and OT_i, calculate the weighted value D_i:
D_i = s·OT_i + (1 - s)·σ_{up,i}    (3)
where s is a control parameter, 0 ≤ s ≤ 1;
Step 5: repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: select the theme with the smallest weighted value D_i and insert it above the upper-end theme of list B as the new upper-end theme l_up;
Step 7: select a theme l_k that is not in list B and calculate the mean of the frequency sums of l_down and l_k (equation (4), not preserved in this text). Calculate the geometric complementarity of l_down and l_k, expressed as the variance σ_{down,k} (equation (5), not preserved in this text). After normalizing σ_{down,k} and OT_k, calculate the weighted value D_k:
D_k = s·OT_k + (1 - s)·σ_{down,k}    (6)
where s is a control parameter, 0 ≤ s ≤ 1;
Step 8: repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: select the theme with the smallest weighted value and insert it below the lower-end theme of list B as the new lower-end theme l_down;
Step 10: repeat steps 4 to 9 until all themes have been added to list B.
In this embodiment, the control parameter s takes the value 0.3.
The theme flow graph generated after ordering the themes with the method of this embodiment is shown in Fig. 4. It can be seen that the theme flow graph is more attractive and flatter, uses space more efficiently, and is better suited to placing word clouds.
Embodiment 3: to address the unstable shape and layout of word clouds in the TIARA technique, the invention also improves the word cloud. The theme is first divided into several subregions; a scalable algorithm (from "Tag Cloud++ - Scalable Tag Clouds for Arbitrary Layouts") is then used to represent each region as a set of horizontal line segments; finally the keywords are placed one after another to generate the word cloud. The visual characteristics are: 1) the greater the weight of a keyword, the larger its font; 2) the greater the weight of a keyword, the closer it lies to the center of the region. The method is described in detail below with reference to Fig. 5 and Fig. 6:
Step 1: select the region G_j corresponding to theme l_j on the theme flow graph; its start and end times equal the start time t_start and end time t_end of the document set. Divide the time span [t_start, t_end] of region G_j into m-1 segments, each of length Δt = (t_end - t_start)/(m-1), to obtain the dividing time points t_start + pΔt, where p = 1, 2, ..., m-2;
Step 2: as shown in Fig. 5, taking each dividing time point t_start + pΔt in turn as the center, cut a subregion R_{j,p} out of region G_j according to Δt. R_{j,p} is the closed region bounded by a point set, curves and line segments (the defining expressions are not preserved in this text);
Step 3: place the keywords of keyword subset W_{j,p} on each subregion R_{j,p} to generate the word cloud of theme l_j, comprising:
3.1 Connect the N sampled points successively with line segments (the expressions of the segments and points are not preserved in this text) to obtain the approximate polygon of subregion R_{j,p};
3.2 Let y_max be the y coordinate of the vertex with the largest y coordinate among the vertices of the approximate polygon of R_{j,p}, and let y_min be the y coordinate of the vertex with the smallest y coordinate;
3.3 As shown in Fig. 6, intersect the group of horizontal lines H = {y = c | y_min ≤ c ≤ y_max, c ∈ Z} with subregion R_{j,p} to obtain a number of intersection segments. For each intersection line, take the sub-segments lying inside the polygon, denoted r_c(i), i = 1, ..., M, where M is the number of sub-segments of that intersection line lying inside R_{j,p} and s_c(i) is the starting x coordinate of r_c(i). R_{j,p} is thereby represented as a set of horizontal line segments L_{j,p};
3.4 Choose the keywords of W_{j,p} one at a time in descending order of weight; lay out each keyword as a rectangle of height h and width w, and then place the keyword at the laid-out position, comprising:
A. Check whether a rectangle of width w and height h can be placed at position c = (y_max - y_min)/2 in the set L_{j,p} corresponding to R_{j,p}. The check is: among the r_c(i) of the rows from c to c-h, determine whether there is a common index i for which every segment is longer than w. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, go to step B;
B. Starting from c = (y_max - y_min)/2, traverse L_{j,p} alternately with c = c+1 and c = c-1, checking whether the rectangle can be placed at position c. If so, place the keyword at position (c, s_c(i)) and update s_c(i) to s_c(i) = s_c(i) + w; if not, continue the alternating traversal until an r_c(i) that satisfies the condition is found or all r_c(i) have been traversed. If no suitable position c is found after traversing all r_c(i), discard the keyword;
3.5 Repeat step 3.4 until all keywords of W_{j,p} have been processed;
Step 4: repeat steps 1 to 3 until the word cloud of every theme has been generated on the theme flow graph.
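Steps 3.2 and 3.3 above, which turn the approximate polygon into horizontal line segments, can be sketched with a standard scanline intersection. The half-open edge rule used here to avoid counting a vertex twice is a common convention and an assumption of this sketch, not something the source specifies; one consequence is that the topmost row of a flat-topped polygon comes out empty. The function name `scanline_segments` is hypothetical.

```python
import math

def scanline_segments(polygon):
    """polygon: list of (x, y) vertices in order. Returns {c: [[x_left, x_right], ...]}."""
    ys = [y for _, y in polygon]
    y_min, y_max = math.ceil(min(ys)), math.floor(max(ys))
    edges = list(zip(polygon, polygon[1:] + polygon[:1]))   # close the polygon
    rows = {}
    for c in range(y_min, y_max + 1):                       # integer lines y = c
        xs = []
        for (x1, y1), (x2, y2) in edges:
            # Half-open rule: an edge covers its lower endpoint but not its upper one.
            if (y1 <= c < y2) or (y2 <= c < y1):
                xs.append(x1 + (c - y1) * (x2 - x1) / (y2 - y1))
        xs.sort()
        # Consecutive pairs of crossings bound the sub-segments inside the polygon.
        rows[c] = [[xs[k], xs[k + 1]] for k in range(0, len(xs) - 1, 2)]
    return rows
```

For a concave region one row can contain several sub-segments, which corresponds to M greater than 1 in step 3.3.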
Fig. 7 shows the result of generating the word cloud with this method, where the theme flow graph is generated with the themes ordered randomly. A comparison with Fig. 2 shows that the word cloud layout algorithm of the invention has the following advantages: 1) It uses space effectively: for the same area and font size, more keywords can be placed. 2) The generated layout is stable and does not change with subsequent interactive operations. 3) The efficiency of the algorithm is markedly improved. The algorithm represents an irregular region, according to a fixed rule, as a set of discrete entities, so placing a word only requires traversing the entities to find one that satisfies the placement condition, without collision detection or boundary detection, which greatly improves placement efficiency.
Embodiment 4: this embodiment combines the theme ordering method of Embodiment 2 with the improved word cloud placement method of Embodiment 3; the other steps are unchanged. Fig. 8 shows the result of theme visualization of the Chinese document set in this embodiment.
Embodiment 5: In TIARA, the limited region size makes it difficult to place all keywords in one region. The present invention therefore uses a detailed word cloud to further visualize the full content of each theme, or the full content of each theme in each time period. The background color of the word cloud corresponds to the color of the theme in the theme flow graph, and the size of a keyword corresponds to its weight. The present invention uses a randomized greedy algorithm (cited from "TIARA: A Visual Exploratory Text Analytic System") to generate the detailed word cloud, specifically:
Step 1: Select the keyword set that expresses the content of theme l_j;
Step 2: Set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: Choose a keyword from the keyword set in descending order of weight, and use the randomized greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it inside region C;
Step 4: Set the font size according to the weight of the keyword; then, from the font size and the number of characters in the keyword, approximate the keyword by a rectangle r whose lower-left corner is at the position coordinate;
Step 5: For each conflict point in P, test whether the point conflicts with r. If a conflict exists, go to step 6; if no conflict exists, go to step 7;
Step 6: Update the position coordinate along a spiral path, then repeat steps 4 and 5 until a position coordinate that satisfies the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;
Step 7: Place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: Repeat steps 3 to 7 until all keywords in the keyword set have been placed.
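Steps 1 to 8 can be sketched in Python as below. This is a hedged illustration rather than the TIARA algorithm or the patented code: the linear weight-to-font mapping, the Archimedean spiral with r = theta, the uniform random candidate generator, the grid step used to discretize placed rectangles, and every function name are assumptions. The sketch also adds one guard that is not in the text (all four corners of r must lie inside C), because a rectangle lying wholly outside the circle contains no boundary point.

```python
import math
import random

def circle_boundary(radius, n=720):
    """Step 2: discretize the boundary of region C into conflict points."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def rect_points(x, y, w, h, step=1.0):
    """Step 7: discretize an occupied rectangle into grid points, edges included."""
    nx = max(1, math.ceil(w / step))
    ny = max(1, math.ceil(h / step))
    return [(x + w * i / nx, y + h * j / ny)
            for i in range(nx + 1) for j in range(ny + 1)]

def has_conflict(x, y, w, h, points, radius):
    """Step 5: does rectangle r (lower-left corner at x, y) hit a conflict point?"""
    corners = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
    if any(math.hypot(cx, cy) > radius for cx, cy in corners):
        return True  # extra guard (assumption): r must stay inside C
    return any(x <= px <= x + w and y <= py <= y + h for px, py in points)

def random_in_circle(radius, rng):
    """Hypothetical candidate generator: a uniform random point inside C."""
    r = radius * math.sqrt(rng.random())
    a = 2 * math.pi * rng.random()
    return r * math.cos(a), r * math.sin(a)

def detailed_word_cloud(keywords, radius=100.0, fonts=(4.0, 10.0),
                        max_spiral=100.0, candidate=None, seed=0):
    """keywords: list of (word, weight). Returns {word: (x, y, w, h)}."""
    rng = random.Random(seed)
    if candidate is None:
        candidate = lambda: random_in_circle(radius, rng)
    conflict = circle_boundary(radius)
    placed = {}
    if not keywords:
        return placed
    top = max(weight for _, weight in keywords)
    for word, weight in sorted(keywords, key=lambda kw: -kw[1]):  # step 3
        font = fonts[0] + (fonts[1] - fonts[0]) * weight / top    # step 4
        w, h = font * len(word), font  # one square cell per character
        x0, y0 = candidate()
        theta = 0.0
        while theta <= max_spiral:  # step 6: spiral radius equals theta
            x = x0 + theta * math.cos(theta)
            y = y0 + theta * math.sin(theta)
            if not has_conflict(x, y, w, h, conflict, radius):
                placed[word] = (x, y, w, h)
                conflict.extend(rect_points(x, y, w, h))
                break
            theta += 0.2
        # Leaving the loop without a break means the keyword is discarded.
    return placed
```

Passing a fixed `candidate` function makes a run reproducible, which is convenient when comparing layouts.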
When generating the detailed word cloud, any one keyword subset W_{j,p} of theme l_j may be selected as the keyword set expressing the content of the theme; alternatively, the set formed by merging all keyword subsets W_{j,p} of theme l_j may be selected as the keyword set expressing the content of the theme.
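When the merged set is used, the keyword weights have to be recombined; the claims define the merged weight of a keyword as the number of times it occurs across all keyword subsets. A minimal Python sketch, assuming each subset W_{j,p} is held as a `collections.Counter` that maps a keyword to its occurrence count (the function name is hypothetical):

```python
from collections import Counter

def merged_keyword_weights(subsets):
    """Merge per-period keyword counts and return (keyword, weight) pairs
    sorted by weight in descending order."""
    total = Counter()
    for counts in subsets:
        total.update(counts)  # Counter.update adds counts instead of replacing
    return total.most_common()

# Two time periods of one theme
periods = [Counter({"内核": 3, "进程": 1}), Counter({"进程": 4, "线程": 2})]
print(merged_keyword_weights(periods))
# [('进程', 5), ('内核', 3), ('线程', 2)]
```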
Fig. 9 shows the effect of adding the detailed word cloud to the theme visualization of the Chinese document set. The detailed word cloud is placed at the lower right of the figure; clicking the color band of a theme on the theme flow graph switches the display to the detailed word cloud of that theme. The figure shows the detailed word cloud of the operating system theme. It can be seen that, because of the region size limit, the word cloud of the operating system theme on the theme flow graph does not show all of its keywords, whereas the detailed word cloud shows all keywords of the theme.
Claims (6)
1. A theme visualization method for a Chinese document set, characterized by comprising:
a step of classifying the document set by theme: let the document set have n themes l_j, j = 0, 1, 2, ..., n-1; classify all documents in the document set by theme to obtain n document subsets D_j, j = 0, 1, 2, ..., n-1, where D_j is the document subset corresponding to theme l_j;
a step of dividing the time period of the document set: let the start time of the document set be t_start and the end time be t_end; divide the time period [t_start, t_end] into equal parts to obtain the time periods T_p = (t_start + (p-1)Δt, t_start + pΔt], where p = 1, 2, ..., m-1 and Δt = (t_end - t_start)/(m-1);
a step of calculating the theme frequency: let the theme frequency comprise v_{j,0} and v_{j,p}, where v_{j,0} is the number of documents of the document subset D_j of theme l_j at the start time t_start, and v_{j,p} is the number of documents of D_j within time period T_p; calculate the theme frequency of each theme;
a step of ordering the themes: sort all themes to obtain a sorted theme sequence table;
a step of generating the theme flow graph: from the sorted theme sequence table and the theme frequencies, generate the theme flow graph using a theme flow algorithm;
a step of extracting keywords that represent theme content: let W_{j,p} be the keyword subset that represents the content of theme l_j in the documents of D_j within time period T_p; use a general-purpose Modern Chinese word segmentation system to extract, from the documents of each theme's document subset in each time period, the keyword subset representing the content of that theme;
a step of calculating and sorting keyword weights: the weight of a keyword is the number of times the keyword occurs in a keyword subset; calculate the weight of each keyword in each keyword subset, and sort all keywords in each keyword subset by weight in descending order;
a step of generating the word cloud: from the keyword subsets and keyword weights, generate the word cloud on the theme flow graph;
wherein the step of ordering the themes is performed according to an ordering method based on theme frequency and geometric complementarity, comprising:
Step 1: Let the initial time of theme l_j be OT_j. When v_{j,0} is not zero, take the start time t_start of the document set as OT_j; when v_{j,0} is zero, take as OT_j the minimum left endpoint among the time periods T_p for which v_{j,p} is not zero. Calculate the initial time of each theme;
Step 2: Define the frequency sum of theme l_j [formula not reproduced]; calculate the frequency sum of each theme;
Step 3: Create a new empty list B. If n is even, write the theme with the largest frequency sum into the first row of the list as the upper endpoint theme l_up, and write the theme with the second-largest frequency sum into the second row of the list as the lower endpoint theme l_down. If n is odd, write the theme with the largest frequency sum into the first row of the list as both the upper endpoint theme l_up and the lower endpoint theme l_down;
Step 4: Select a theme l_i that is not in list B and calculate the mean of the frequency sum of l_up and l_i [formula (1) not reproduced]; calculate the geometric complementarity of l_up and l_i, expressed as the variance σ_{up,i} [formula (2) not reproduced]; after normalizing σ_{up,i} and OT_i, calculate the weighted value D_i:
D_i = s·OT_i + (1-s)·σ_{up,i}   (3)
where s is a control parameter, 0 ≤ s ≤ 1;
Step 5: Repeat step 4 until the weighted value of every theme not in list B has been obtained;
Step 6: Select the theme with the smallest weighted value D_i and insert it above the upper endpoint theme of list B as the new upper endpoint theme l_up;
Step 7: Select a theme l_k that is not in list B and calculate the mean of the frequency sum of l_down and l_k [formula (4) not reproduced]; calculate the geometric complementarity of l_down and l_k, expressed as the variance σ_{down,k} [formula (5) not reproduced]; after normalizing σ_{down,k} and OT_k, calculate the weighted value D_k:
D_k = s·OT_k + (1-s)·σ_{down,k}   (6)
where s is a control parameter, 0 ≤ s ≤ 1;
Step 8: Repeat step 7 until the weighted value of every theme not in list B has been obtained;
Step 9: Select the theme with the smallest weighted value and insert it below the lower endpoint theme of list B as the new lower endpoint theme l_down;
Step 10: Repeat steps 4 to 9 until all themes have been added to list B.
2. The theme visualization method for a Chinese document set according to claim 1, characterized in that the control parameter s = 0.3.
3. The theme visualization method for a Chinese document set according to claim 1, characterized in that the step of generating the word cloud comprises:
Step 1: On the theme flow graph, select the region G_j corresponding to theme l_j, whose start time and end time equal the start time t_start and end time t_end of the document set respectively; divide the time period [t_start, t_end] of region G_j into m-1 segments, each of length Δt = (t_end - t_start)/(m-1), to obtain the equal-division time points t_start + pΔt, where p = 1, 2, ..., m-2;
Step 2: Taking each equal-division time point t_start + pΔt as a center in turn, cut a subregion R_{j,p} of width Δt out of region G_j; R_{j,p} is the closed space bounded by a point set, curves and line segments [formulas not reproduced];
Step 3: Place the keywords of keyword subset W_{j,p} on each subregion R_{j,p} to generate the word cloud of theme l_j, comprising:
3.1 Connect the points of point set N with line segments [formula not reproduced] to obtain an approximate polygon of subregion R_{j,p};
3.2 Let ymax be the y-coordinate of the vertex with the largest y-coordinate among the vertices of the approximate polygon of subregion R_{j,p}, and let ymin be the y-coordinate of the vertex with the smallest y-coordinate;
3.3 Intersect the set of horizontal lines H = {y = c | ymin ≤ c ≤ ymax, c ∈ Z} with subregion R_{j,p} to obtain a number of intersection segments; take the sub-segments of each intersection segment that lie inside the polygon [formula not reproduced], where M is the number of sub-segments of that intersection segment lying inside R_{j,p}; represent R_{j,p} as a set of horizontal segments L_{j,p} [formula not reproduced];
3.4 Choose keywords from W_{j,p} one at a time in descending order of weight; lay out each keyword as a rectangle of height h and width w, then place the keyword at the layout position, comprising:
A. Test whether, in the L_{j,p} corresponding to R_{j,p}, a rectangle of width w and height h can be placed at row c = (ymax - ymin)/2. The test is: check whether there is a single index i such that, for every row from c to c-h, the segment r_c(i) of that row is longer than w. If so, place the keyword at position (c, s_c(i)) and update s_c(i) = s_c(i) + w; if not, go to step B;
B. Starting from the center c = (ymax - ymin)/2, traverse L_{j,p} by alternately stepping c = c+1 and c = c-1, and test whether the rectangle can be placed at row c. If it can, place the keyword at position (c, s_c(i)) and update s_c(i) = s_c(i) + w. If it cannot, continue alternating c = c+1 and c = c-1 through L_{j,p} until a segment r_c(i) that satisfies the condition is found or all r_c(i) have been traversed. If all r_c(i) have been traversed without finding a row c that satisfies the condition, discard the keyword;
3.5 Repeat step 3.4 until every keyword in W_{j,p} has been placed;
Step 4: Repeat steps 1 to 3 until the word cloud of every theme has been generated on the theme flow graph.
4. The theme visualization method for a Chinese document set according to claim 1, characterized by further comprising a step of generating a detailed word cloud, comprising:
Step 1: Select the keyword set that expresses the content of theme l_j;
Step 2: Set up a circular region C and discretize the boundary of C into a set of conflict points P;
Step 3: Choose a keyword from the keyword set in descending order of weight, and use a randomized greedy algorithm to generate a candidate position coordinate (word.x, word.y) for it inside region C;
Step 4: Set the font size according to the weight of the keyword; then, from the font size and the number of characters in the keyword, approximate the keyword by a rectangle r whose lower-left corner is at the position coordinate;
Step 5: For each conflict point in P, test whether the point conflicts with r. If a conflict exists, go to step 6; if no conflict exists, go to step 7;
Step 6: Update the position coordinate along a spiral path, then repeat steps 4 and 5 until a position coordinate that satisfies the condition is found or the spiral radius exceeds 100; when the spiral radius exceeds 100, the keyword is discarded;
Step 7: Place the keyword at position coordinate (word.x, word.y), discretize the region occupied by the keyword into conflict points, and add them to the conflict point set P;
Step 8: Repeat steps 3 to 7 until all keywords in the keyword set have been placed.
5. The theme visualization method for a Chinese document set according to claim 4, characterized in that the keyword set expressing the content of theme l_j is any one keyword subset W_{j,p} of theme l_j.
6. The theme visualization method for a Chinese document set according to claim 4, characterized in that the keyword set expressing the content of theme l_j is obtained by the following steps:
Step 1: Merge all keyword subsets W_{j,p} of theme l_j, p = 1, 2, ..., m-1;
Step 2: Calculate the weight of every keyword in the merged set, where the weight of a keyword is the number of times the keyword occurs in all keyword subsets.
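The theme ordering of claim 1 (steps 1 to 10) can be sketched in Python as follows. Because formulas (1), (2), (4) and (5) are not reproduced in this text, the sketch rests on labeled assumptions: the frequency sum of a theme is taken as the sum of its frequencies over all periods, geometric complementarity is taken as the population variance of the period-by-period sum of two themes' frequencies, normalization is min-max over the remaining candidates, and time is measured in units of Δt with t_start = 0. All function names are hypothetical.

```python
from statistics import pvariance

def onset(freq):
    """OT_j in units of Δt: 0 if the theme exists at t_start, otherwise the
    left endpoint (p - 1) of the first period T_p with a nonzero frequency."""
    if freq[0] != 0:
        return 0
    for p in range(1, len(freq)):
        if freq[p] != 0:
            return p - 1
    return len(freq) - 1  # the theme never appears

def normalise(values):
    """Min-max normalization (assumption); a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def pick(end, rest, freqs, s):
    """Return the theme in `rest` with the smallest D = s*OT + (1-s)*sigma."""
    sigma = normalise([pvariance([a + b for a, b in zip(freqs[end], freqs[i])])
                       for i in rest])
    ot = normalise([onset(freqs[i]) for i in rest])
    scores = [s * o + (1 - s) * g for o, g in zip(ot, sigma)]
    return rest[scores.index(min(scores))]

def order_themes(freqs, s=0.3):
    """freqs[j] = [v_j0, v_j1, ...]; returns theme indices from top to bottom."""
    by_sum = sorted(range(len(freqs)), key=lambda j: -sum(freqs[j]))
    # Step 3: seed list B with two themes if n is even, with one if n is odd.
    order = by_sum[:2] if len(freqs) % 2 == 0 else by_sum[:1]
    rest = [j for j in by_sum if j not in order]
    while rest:
        up = pick(order[0], rest, freqs, s)         # steps 4 to 6
        order.insert(0, up)
        rest.remove(up)
        if rest:
            down = pick(order[-1], rest, freqs, s)  # steps 7 to 9
            order.append(down)
            rest.remove(down)
    return order
```

With s = 0.3, as in claim 2, a theme whose frequencies rise where the current endpoint theme's fall is preferred, since their stacked sum is flatter and its variance smaller.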
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310488312.7A CN103631856B (en) | 2013-10-17 | 2013-10-17 | Subject visualization method for Chinese document set |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310488312.7A CN103631856B (en) | 2013-10-17 | 2013-10-17 | Subject visualization method for Chinese document set |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103631856A CN103631856A (en) | 2014-03-12 |
CN103631856B true CN103631856B (en) | 2017-01-11 |
Family
ID=50212898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310488312.7A Active CN103631856B (en) | 2013-10-17 | 2013-10-17 | Subject visualization method for Chinese document set |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103631856B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320683A (en) * | 2014-07-24 | 2016-02-10 | 贾新志 | Graphical display method of literature theme content analysis |
CN105989090A (en) * | 2015-02-12 | 2016-10-05 | 中兴通讯股份有限公司 | Critical data processing method and device as well as critical data display method and system |
CN105373579B (en) * | 2015-08-18 | 2018-08-03 | 天津大学 | A kind of news competitiveness analysis method and its visualization device based on regression analysis |
CN106250512B (en) * | 2016-08-04 | 2019-07-26 | 国家基础地理信息中心 | A kind of subject network information collecting method for taking time intention into account |
CN106681983A (en) * | 2016-11-25 | 2017-05-17 | 北京掌行通信息技术有限公司 | Station name participle display method and device |
CN106909381B (en) * | 2017-02-24 | 2020-01-03 | 西南交通大学 | Interactive theme river visualization method |
CN109144504A (en) * | 2017-06-26 | 2019-01-04 | 华东师范大学 | Data visualization image generation method and storage medium based on D3 |
CN107622132B (en) * | 2017-10-09 | 2020-07-03 | 四川大学 | Online question-answer community oriented association analysis visualization method |
CN109783616A (en) * | 2018-12-03 | 2019-05-21 | 广东蔚海数问大数据科技有限公司 | A kind of text subject extracting method, system and storage medium |
CN109933702B (en) * | 2019-03-11 | 2022-12-16 | 智慧芽信息科技(苏州)有限公司 | Retrieval display method, device, equipment and storage medium |
CN110189393B (en) * | 2019-06-05 | 2021-04-23 | 山东大学 | Shape word cloud generation method and device |
CN111737523B (en) * | 2020-04-22 | 2023-11-14 | 聚好看科技股份有限公司 | Video tag, generation method of search content and server |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101996234A (en) * | 2009-08-17 | 2011-03-30 | 阿瓦雅公司 | Word cloud audio navigation |
US8402030B1 (en) * | 2011-11-21 | 2013-03-19 | Raytheon Company | Textual document analysis using word cloud comparison |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8753184B2 (en) * | 2012-04-04 | 2014-06-17 | David Goldenberg | System and method for interactive gameplay with song lyric database |
Non-Patent Citations (1)
Title |
---|
WordStream: Visualizing Theme Summarization and Comparison in Document Collections over Time; Ting Liang et al.; Advances in Information Sciences and Service Sciences (AISS); 2013-02-28; main text: page 975, paragraphs 3-9; page 976, paragraphs 1-6; page 977, paragraphs 1-6; page 978, paragraphs 1-9; page 979, paragraphs 1-14; page 980, paragraphs 1-5; page 981, paragraphs 1-3; Figs. 1-7 * |
Also Published As
Publication number | Publication date |
---|---|
CN103631856A (en) | 2014-03-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||