CN101984435A - Method and device for distributing texts - Google Patents


Info

Publication number
CN101984435A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010549183
Other languages
Chinese (zh)
Other versions
CN101984435B (en)
Inventor
蔡勋梁
彭学政
王广彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201010549183A
Publication of CN101984435A
Application granted
Publication of CN101984435B
Legal status: Active
Anticipated expiration


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and device for distributing texts, applied to a column framework comprising at least two levels of columns. The method comprises the following steps: A, for each crawled text, performing the following distribution step: matching the keywords of the text currently to be distributed against the center vector of each column by similarity and, according to the matching result, distributing the text to the columns that satisfy a distribution matching policy, wherein the center vector of a column is generated from seed words set for that column in advance; and B, according to the hierarchical relationship among the columns, distributing all or part of the texts under designated columns to the upper-level parent column or to lower-level sub-columns. The method and device reduce the workload and cost of text distribution, shorten the distribution time, and make it easy to add and remove columns flexibly.

Description

A method and device for distributing texts
[Technical Field]
The present invention relates to the field of Internet technology, and in particular to a method and device for distributing texts.
[Background Art]
With the worldwide spread of the Internet and the continuous development of Internet applications, the amount of text on web pages is growing explosively. How to make full and effective use of this text, and how to organize it effectively and present it to users, has gradually become an important research direction in data mining with high industrial value. Text classification is already applied in many fields, for example retrieving news pages for each column, delivering e-mail, and generating user interest profiles.
Text classification distributes a large number of texts under different columns, where the columns may belong to different categories or to different subcategories of the same category. Existing text distribution is based on training samples: a collection of manually classified documents is prepared, and distribution is carried out by a model trained on these samples. This training-sample-based approach has the following drawbacks:
First, building the training samples involves stages such as corpus collection and model training, which require a very large amount of work. Corpus collection in particular needs extensive manual annotation by domain specialists, so the workload and cost of text distribution become excessive.
Second, training takes a long time; building the training samples usually brings the distribution time to the order of weeks.
In addition, because the training samples correspond to the column framework, the samples must be redefined whenever the framework changes. Since training samples are hard to obtain and take a long time to prepare, this further raises the cost of text distribution, lengthens the distribution time, and prevents columns from being added or removed flexibly.
[Summary of the Invention]
The present invention provides a method and device for distributing texts that reduce the cost of text distribution, shorten the distribution time, and make it easy to add and remove columns flexibly.
The specific technical solution is as follows:
A method for distributing texts, applied to a column framework comprising at least two levels of columns, the method comprising:
A. performing the following distribution step for each crawled text:
Distribution step: matching the keywords of the text currently to be distributed against the center vector of each column by similarity and, according to the matching result, distributing the text under the columns that satisfy the distribution matching policy, wherein the center vector of a column is generated from seed words set for that column in advance;
B. according to the hierarchical relationship among the columns, distributing all or part of the texts under designated columns to the upper-level parent column or to next-level sub-columns.
The distribution matching policy of a column comprises at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for that column; or,
the similarity between the keywords of the text to be distributed and the center vector of the column, minus the similarity between those keywords and the opposite vector of the same column, exceeds the similarity threshold set for that column, wherein the opposite vector of a column is generated from reverse words set for that column in advance.
Preferably, step B comprises one or any combination of the following modes:
the columns to which texts are distributed in the manner of step A are sub-columns, and all texts under each such sub-column, or the top N1 texts in its ranking, are aggregated to the upper-level parent column, where N1 is a preset positive integer; or,
the columns to which texts are distributed in the manner of step A are parent columns, and all texts under each such parent column are distributed to its next-level sub-columns; or,
the columns to which texts are distributed in the manner of step A comprise both parent columns and sub-columns, and part of the texts under such a parent column is distributed to those next-level sub-columns that have not been given texts in the manner of step A.
Further, the columns may comprise ordinary columns, which have the text display attribute, and hidden columns, which do not have the text display attribute.
Preferably, the method further comprises: extracting keywords from the texts distributed under a column for which seed words are set, and combining the extracted keywords with the seed words of that column to form a new center vector for the column.
Further, after step B, the following steps are performed for each column:
C1. clustering the texts under the column to form one or more clusters under the column;
C2. according to a preset headline selection policy, selecting a headline text in each cluster to represent that cluster.
After step C2, the method further comprises:
computing the weight of each text under the column from the text's attributes, determining the weight of each cluster from the weights of the texts in it, and ranking the clusters under the column by cluster weight; or,
according to a preset hot-text selection policy, selecting hot texts from the texts under each column and displaying them under that column.
The headline selection policy comprises one or any combination of the following: selecting texts whose publication time falls within a set range, selecting texts whose titles meet set requirements, selecting texts whose similarity to the cluster's center vector falls within a set range, and selecting texts whose quality meets preset requirements.
Specifically, the weight W_page of each text is computed by the following formula:
[Formula image BSA00000350904700031; not reproduced in this text]
where α is a preset inverse time-decay factor, Δt is the time elapsed between the text's publication time and the current time, δ(site) is the function that computes the text quality factor, and φ(segcount) is the function that computes the reprint-rate factor.
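The cluster-ranking option described above can be illustrated with a minimal Python sketch. It takes the per-text weights as already computed and does not reproduce the weight formula. The source only says that a cluster's weight is determined from the weights of its texts; summing them is an assumption made here, and the name rank_clusters is hypothetical.

```python
def rank_clusters(clusters, text_weight):
    """Return cluster ids ordered from highest to lowest cluster weight.

    clusters: dict mapping cluster id -> list of text ids in that cluster.
    text_weight: dict mapping text id -> precomputed text weight.
    Assumption: a cluster's weight is the sum of its texts' weights.
    """
    totals = {
        cid: sum(text_weight[t] for t in texts)
        for cid, texts in clusters.items()
    }
    # Sort cluster ids by their total weight, largest first.
    return sorted(totals, key=totals.get, reverse=True)
```

Other aggregates, such as the maximum or the mean of the text weights, would fit the description equally well and only change the line that builds totals.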
A device for distributing texts, applied to a column framework comprising at least two levels of columns, the device comprising a text acquisition unit, a first distribution unit and a second distribution unit;
the text acquisition unit is configured to deliver each crawled text to the first distribution unit as a text to be distributed;
the first distribution unit is configured to match the keywords of the text currently to be distributed against the center vector of each column by similarity and, according to the matching result, to distribute the text under the columns that satisfy the distribution matching policy, wherein the center vector of a column is generated from seed words set for that column in advance;
the second distribution unit is configured to, after the first distribution unit has finished distributing all texts to be distributed, distribute all or part of the texts under designated columns to the upper-level parent column or to next-level sub-columns according to the hierarchical relationship among the columns.
The distribution matching policy of a column comprises at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for that column; or,
the similarity between the keywords of the text to be distributed and the center vector of the column, minus the similarity between those keywords and the opposite vector of the same column, exceeds the similarity threshold set for that column, wherein the opposite vector of a column is generated from reverse words set for that column in advance.
Where the columns served by the first distribution unit are sub-columns, the second distribution unit aggregates all texts under each such sub-column, or the top N1 texts in its ranking, to the upper-level parent column, where N1 is a preset positive integer; or,
where the columns served by the first distribution unit are parent columns, the second distribution unit distributes all texts under each such parent column to its next-level sub-columns; or,
where the columns served by the first distribution unit comprise both parent columns and sub-columns, the second distribution unit distributes part of the texts under such a parent column to those next-level sub-columns that have not been given texts.
Specifically, the columns comprise ordinary columns, which have the text display attribute, and hidden columns, which do not have the text display attribute.
Preferably, the device further comprises a keyword extraction unit configured to extract keywords from the texts distributed under a column for which seed words are set, to combine the extracted keywords with the seed words of that column to form a new center vector for the column, and to provide the new center vector to the first distribution unit.
Further, the device also comprises a text clustering unit and a headline selection unit;
the text clustering unit is configured to cluster the texts under each column, according to the distribution results of the first and second distribution units, to form one or more clusters under each column;
the headline selection unit is configured to select, according to a preset headline selection policy, a headline text in each cluster to represent that cluster.
Preferably, the device further comprises one or both of a cluster ranking unit and a hot-text selection unit;
the cluster ranking unit is configured to compute the weight of each text under a column from the text's attributes, to determine the weight of each cluster from the weights of the texts in it, and to rank the clusters under the column by cluster weight;
the hot-text selection unit is configured to select, according to the distribution results of the first and second distribution units and a preset hot-text selection policy, hot texts from the texts under each column and to display them under that column.
The headline selection policy comprises one or any combination of the following: selecting texts whose publication time falls within a set range, selecting texts whose titles meet set requirements, selecting texts whose similarity to the cluster's center vector falls within a set range, and selecting texts whose quality meets preset requirements.
Specifically, the weight W_page of each text is computed by the following formula:
[Formula image BSA00000350904700051; not reproduced in this text]
where α is a preset inverse time-decay factor, Δt is the time elapsed between the text's publication time and the current time, δ(site) is the function that computes the text quality factor, and φ(segcount) is the function that computes the reprint-rate factor.
As the above technical solution shows, the present invention distributes texts to columns using center vectors generated from the columns' seed words, combined with distribution between levels of the hierarchy. This keeps the distribution time on the order of seconds and greatly improves the efficiency of text classification. In addition, the method and device avoid the complicated process of building training samples. When the column framework changes, it is only necessary to set suitable seed words and inter-level distribution rules for added columns and to revise the inter-level distribution rules for deleted columns. Compared with the prior art, in which the training samples must be redefined, this reduces the cost of text distribution and allows columns to be added and removed more flexibly.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of the main method provided by the present invention;
Fig. 2 is a flowchart of news distribution for each column provided by Embodiment 1 of the present invention;
Fig. 3a shows the first news page distribution mode provided by Embodiment 1 of the present invention;
Fig. 3b shows the second news page distribution mode provided by Embodiment 1 of the present invention;
Fig. 3c shows the third news page distribution mode provided by Embodiment 1 of the present invention;
Fig. 4 is a schematic diagram of a mixed news page distribution mode provided by Embodiment 1 of the present invention;
Fig. 5 is a flowchart of forming news clusters provided by Embodiment 2 of the present invention;
Fig. 6 is a schematic structural diagram of the device provided by the present invention.
[Detailed Description]
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of the main method provided by the present invention. As shown in Fig. 1, the method mainly comprises the following steps:
Step 101: perform the following distribution step for each crawled text:
Distribution step: match the keywords of the text currently to be distributed against the center vector of each column by similarity and, according to the matching result, distribute the text under the columns that satisfy the distribution matching policy, wherein the center vector of a column is generated from seed words set for that column in advance.
In this step, the distribution matching policy of a column can be configured flexibly and comprises at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for that column. The policy may further include, without limitation, one or any combination of the following: the similarity between the keywords of the text and the center vector of the column is the highest among all columns; the source website of the text meets the column's website requirements; the author of the text meets the column's author requirements; the text meets the column's requirements regarding pictures or video; the title of the text matches the title regular expression required by the column; or the uniform resource locator (URL) type of the text meets the column's URL type requirements.
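The basic threshold policy of step 101 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the source does not specify how the center vector is weighted or which similarity measure is used, so term-frequency vectors and cosine similarity are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def to_vector(words):
    """Build a sparse term-frequency vector from a list of words."""
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse vectors (word -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_column(text_keywords, seed_words, threshold):
    """True if the text's keywords are similar enough to the column.

    The center vector is built directly from the column's seed words
    (assumption: each seed word has equal weight).
    """
    center = to_vector(seed_words)
    return cosine(to_vector(text_keywords), center) > threshold
```

For example, a text with keywords "stock", "index" and "market" shares two of three terms with the seed words "stock", "market" and "shares", giving a cosine similarity of 2/3, which passes a threshold of 0.5.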
Step 102: according to the hierarchical relationship among the columns, distribute all or part of the texts under designated columns to the upper-level parent column or to next-level sub-columns, thereby completing text distribution for every column in the column framework.
Certain columns in the column framework can be designated in advance; after texts have been distributed to them in the manner of step 101 or by other existing means, the texts under these columns are distributed to the upper-level parent column or to next-level sub-columns. Through this step, texts can be distributed to columns for which no seed words are set. This is described in detail in Embodiment 1.
The method provided by the present invention is described below through specific embodiments, all of which take news pages as the example of texts to be distributed. Embodiment 1 first describes in detail the news page distribution flow for each column.
Embodiment 1
Fig. 2 is a flowchart of news distribution for each column provided by Embodiment 1 of the present invention. As shown in Fig. 2, the flow comprises the following steps:
Step 201: set seed words in advance for columns in the column framework, and form a center vector for each column for which seed words are set.
In the column framework, seed words are usually set manually, and a column for which seed words are set may be a root column or a sub-column. One or more seed words can be set for a column to form a group of seed words.
Because manually set seed words are limited and cannot exhaust all possible keywords of a column, a center vector that relies only on them may fail to recall some news pages (here, "recall" means distributing a news page under a column). Preferably, therefore, once some news pages have been recalled under a column, keywords are extracted from those pages and combined with the column's seed words to form a new center vector for the column. The resulting center vector describes the column's content orientation more accurately and improves both the precision and the recall rate of the news recalled by the column. This corresponds to step 206 below. The number of rounds of keyword extraction from recalled news pages can be set empirically, for example to 3.
Step 202: perform steps 203 to 204 on each crawled news page, one page at a time.
After the search engine has crawled a batch of news pages, the crawled pages can be distributed one by one.
Preferably, the crawled news pages are first subjected to processing such as feature selection and deduplication, so that useless or duplicate pages are filtered out first and the efficiency of news recall is improved.
Step 203: extract the keywords of the news page currently to be distributed, and match the extracted keywords against the center vector of each candidate column by similarity.
Step 204: according to the matching result, distribute the news page under the column with the highest similarity, provided that the similarity exceeds that column's similarity threshold.
In this embodiment the distribution matching policy is, by way of example, the highest similarity that also exceeds the column's similarity threshold; any of the other policies described in step 101 may also be used and are not repeated here.
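Steps 203 to 204 can be sketched as follows. The sketch assumes cosine similarity over sparse word-weight vectors, which the source does not prescribe, and reads the policy literally: the column with the highest similarity is found first, and the page is distributed to it only if that similarity exceeds the column's own threshold. The name best_column and the data layout are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (word -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_column(text_vec, columns):
    """Pick the target column for one news page, or None.

    columns: dict mapping column name -> (center_vector, threshold).
    """
    if not columns:
        return None
    sims = {name: cosine(text_vec, center)
            for name, (center, _threshold) in columns.items()}
    # Column with the highest similarity to the page's keywords.
    name = max(sims, key=sims.get)
    threshold = columns[name][1]
    return name if sims[name] > threshold else None
```

A page that matches no column stays undistributed (None) and could, for instance, be left for columns that obtain their pages from a parent column.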
In addition, because seed words are usually coarse-grained, recalling news pages under a column tends to introduce noise. Reverse words can therefore also be set for a column, and an opposite vector formed from them. During similarity matching, the similarity between the keywords of the news page and the opposite vector is subtracted from the similarity between those keywords and the center vector, and the result is checked against the distribution matching policy, which comprises at least judging whether the result exceeds the similarity threshold set for the column.
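The subtraction of the opposite-vector similarity can be illustrated with a short Python sketch. Cosine similarity and equal word weights are assumptions made for the example, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors (word -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def net_similarity(text_vec, center_vec, opposite_vec):
    """Similarity to the center vector minus similarity to the opposite vector."""
    return cosine(text_vec, center_vec) - cosine(text_vec, opposite_vec)


def passes_policy(text_vec, center_vec, opposite_vec, threshold):
    """True if the net similarity exceeds the column's threshold."""
    return net_similarity(text_vec, center_vec, opposite_vec) > threshold
```

With a center vector built from "market" and "index" and an opposite vector built from "hong" and "kong", a page about a market index rally passes a threshold of 0.4, while a page about the Hong Kong market gets a negative net similarity and is rejected.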
In the column framework, the way each column obtains news can be configured in the column's attributes. Specifically, a column can be configured to obtain news pages by means of a center vector based on seed words (such a column draws from the full set of news pages crawled by the web crawler), to obtain news pages from its parent column or sub-columns (such a column draws from the set of news pages already obtained by its parent column or sub-columns), or to obtain news pages in some other way. For example, columns configured with seed words can recall news pages in the manner of steps 203 to 205, while columns without seed words can obtain news pages from other columns. Obtaining news pages from a parent column or sub-columns is described in the following step.
Step 205: according to the hierarchical relationship among the columns, distribute all or part of the news under a column to the upper-level parent column or to next-level sub-columns.
There is usually a hierarchical relationship among columns. The following three ways of recalling news pages can be used here, without limitation:
First mode: each sub-column recalls its news pages in the manner of steps 203 to 204, and the news pages under the sub-columns are then aggregated to the upper-level parent column. In Fig. 3a, shaded nodes denote columns for which seed words are set, and arrows indicate the direction of news page distribution. This mode suits cases where the sub-columns differ considerably and their seed words overlap little. For example, the parent column is "Entertainment" and the sub-columns are "Domestic Entertainment", "Hong Kong, Macao and Taiwan Entertainment", "Japanese and Korean Entertainment", "European and American Entertainment" and so on, with the names of artists from the corresponding region set as each sub-column's seed words. Because the seed words of the sub-columns overlap little, each sub-column recalls news pages in the manner of steps 203 to 204, and the pages are then aggregated to the parent column "Entertainment".
Either all news pages under each sub-column or only the top-ranked pages of each sub-column may be aggregated to the upper-level parent column. Within a sub-column, news pages can be ranked by the similarity between their keywords and the column's center vector, or by the weight of the news cluster they belong to together with their relevance to that cluster; the ranking criterion can be configured flexibly. The formation of news clusters under a column is described in Embodiment 2.
The total number of news pages aggregated to the parent column can be limited. For example, if the total news quota of the parent column is N and it has m sub-columns, the number of news pages each sub-column contributes to the parent column can be limited to at most 2 × N/m.
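The first mode with the 2 × N/m cap can be sketched as below. Several details are assumptions made for the example rather than statements from the source: the cap is rounded down to an integer, pages are ranked by a single precomputed score, duplicates across sub-columns are kept once, and the merged list is finally trimmed to N pages. The function name is hypothetical.

```python
def aggregate_to_parent(children, parent_total):
    """Aggregate top-ranked pages of each sub-column into the parent column.

    children: dict mapping sub-column name -> list of (page_id, score).
    parent_total: N, the total number of pages wanted in the parent column.
    Returns a list of page ids for the parent column.
    """
    m = len(children)
    if m == 0:
        return []
    cap = (2 * parent_total) // m  # at most 2 * N / m pages per sub-column
    merged, seen = [], set()
    for pages in children.values():
        ranked = sorted(pages, key=lambda p: p[1], reverse=True)
        for page_id, score in ranked[:cap]:
            if page_id not in seen:  # keep each page once
                seen.add(page_id)
                merged.append((page_id, score))
    merged.sort(key=lambda p: p[1], reverse=True)
    return [page_id for page_id, _ in merged[:parent_total]]
```

The factor of 2 in the cap lets a productive sub-column contribute more than an even share while still bounding how far it can dominate the parent column.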
Second mode: the parent column recalls its news pages in the manner of steps 203 to 204, and the news pages under the parent column are then distributed to its next-level sub-columns. In Fig. 3b, the shaded node indicates that the parent node recalls news pages by similarity matching against a center vector formed from seed words, and arrows indicate the direction of news page distribution. This mode suits cases where the sub-columns differ little and their seed words overlap heavily. For example, the parent node is "Electronic Products" and the sub-columns are "New Products" and "Shopping Guide". Because these two sub-columns differ little and their seed words overlap heavily (both might have seed words such as "new model" and "electronics"), seed words can be configured on the parent column and its pages then distributed to the next-level sub-columns.
A next-level sub-column may itself use similarity matching against a center vector formed from seed words, as in steps 203 to 204, to recall part of the news pages distributed from the parent column. Other matching methods may also be used, for example matching by source website, author, picture or video requirements, or the URL type of the news page.
News pages issued by the parent column that belong to none of the existing sub-columns can be distributed to a separate sub-column. If there are m existing sub-columns, there are then m+1 sub-columns in total; if the parent node distributes N news pages, the number of news pages entering each sub-column can be limited to at most 2 × N/(m+1).
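The second mode, including the separate sub-column for unmatched pages and the 2 × N/(m+1) cap, can be sketched as follows. The matching rules are represented as plain predicate functions, which stands in for any of the matching methods mentioned above. Sending each page to the first matching sub-column, rounding the cap down, dropping pages beyond the cap, and the name "other" for the extra sub-column are all assumptions made for the example.

```python
def distribute_to_children(parent_pages, child_rules, other="other"):
    """Distribute a parent column's pages to its sub-columns.

    parent_pages: list of page records (any objects the rules understand).
    child_rules: dict mapping sub-column name -> predicate(page) -> bool.
    Pages matching no rule go to the extra sub-column named by `other`.
    """
    n = len(parent_pages)
    cap = (2 * n) // (len(child_rules) + 1)  # at most 2 * N / (m + 1) each
    result = {name: [] for name in child_rules}
    result[other] = []
    for page in parent_pages:
        # First sub-column whose rule accepts the page, else the extra one.
        target = next(
            (name for name, rule in child_rules.items() if rule(page)),
            other,
        )
        if len(result[target]) < cap:
            result[target].append(page)
    return result
```

A usage example: with three pages titled "new phone released", "phone buying guide" and "company earnings", and rules that look for "new" and "guide" in the title, each page lands in a different sub-column and the third goes to the extra one.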
Third mode: the parent column and some of its sub-columns recall their news pages in the manner of steps 203 to 204, and the remaining sub-columns obtain matching news pages from the parent column. In Fig. 3c, shaded nodes denote columns for which seed words are set, and arrows indicate the direction of news page distribution. This mode suits cases where some sub-columns under a parent column are hard to distinguish while others are relatively distinctive. For example, the parent column is "Society" and the sub-columns are "Society and Law" and "Social Miscellany". Because "Society and Law" is highly distinctive while "Social Miscellany" is not, seed words can be configured for the parent column "Society" and the sub-column "Society and Law", which recall news pages by means of center vectors, while the sub-column "Social Miscellany" obtains part of its news pages from the parent column.
It should be noted that, because a column framework may contain several levels of columns, more than one of the above modes can be used together, and they can even be mixed with existing recall methods within one column framework. An example is shown in Fig. 4, where arrows indicate the direction of news distribution, dashed boxes are hidden columns (described below) and solid boxes are non-hidden, that is, ordinary, columns. In this example, seed words are configured on first-level columns 2, 3, 4 and 5 and on second-level columns a, b and e, which are given news pages by means of center vectors. Columns a and b aggregate their news pages to their upper-level parent column, column 1, corresponding to the first mode. Column 2 further distributes its news pages to its next-level sub-columns, columns c and d, corresponding to the second mode. Column 3 distributes part of its news pages to its next-level sub-columns other than column e, namely columns f and g, corresponding to the third mode.
A column may be incompletely specified when it is set up. For example, a "Market Index" column is meant to carry index information for the domestic stock market, but because Hong Kong stocks and US stocks are not distinguished, noise from news pages about Hong Kong and US stocks is introduced. In this case hidden columns for Hong Kong stocks and US stocks can be set up; because hidden columns are not displayed, the news pages about Hong Kong and US stocks are filtered out. As another example, hidden columns for pornographic or reactionary content can be set up under the column framework, so that such news pages are recalled from the crawled news and hidden from display. Hidden columns also recall news pages by means of seed-word-based center vectors, as in steps 203 to 204. Similarly, keywords can be extracted from the news pages already recalled by a hidden column to expand its seed words, which gives a better filtering effect than configuring reverse words.
Step 206: Extract keywords from the news pages under a column and combine the extracted keywords with the seed words of the column to form a new center vector. The new center vector can then be used when distributing the news pages crawled in the next round.
Keywords can be extracted from news pages according to term frequency, word-sense weight, part-of-speech weight and so on. The specific keyword extraction methods are prior art and are not described further here.
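The center vector update in step 206 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes whitespace tokenization, keyword extraction by raw term frequency only, and a sparse term-to-weight dictionary as the center vector. The function names and the weights `seed_weight` and `keyword_weight` are hypothetical.

```python
from collections import Counter


def extract_keywords(pages, top_k=5, stopwords=frozenset()):
    """Return the top_k most frequent tokens across the pages of one column.

    Assumption: term frequency is the only signal; the source also allows
    word-sense and part-of-speech weights, which are not modelled here.
    """
    counts = Counter(
        token
        for page in pages
        for token in page.lower().split()
        if token not in stopwords
    )
    return [word for word, _ in counts.most_common(top_k)]


def build_center_vector(seed_words, keywords=(), seed_weight=1.0, keyword_weight=0.5):
    """Combine seed words and extracted keywords into a sparse center vector.

    Assumption: seed words keep a higher weight than extracted keywords.
    """
    vector = {word: seed_weight for word in seed_words}
    for word in keywords:
        # A keyword that is already a seed word keeps the seed weight.
        vector.setdefault(word, keyword_weight)
    return vector


pages = ["court rules on fraud case", "court hears appeal in fraud trial"]
keywords = extract_keywords(pages, top_k=2)
new_center = build_center_vector(["law", "court"], keywords)
```

Giving extracted keywords a lower weight than seed words is one way to keep the manually configured seed words dominant while still letting the vector follow the column's actual content.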
As the above flow shows, the following can be configured for each column node in the column framework: the distribution matching strategy of the column, the node structure of the column (i.e. its parent node at the upper level and its child nodes at the lower level), its display attribute (whether it is a hidden column), and so on.
This concludes the flow of Embodiment 1. The large number of news pages retrieved under each column cannot all be displayed under that column, so hot news must be selected for display. This process is described in detail in Embodiment 2 below.
Embodiment 2
Fig. 5 is a flowchart of forming news clusters provided by Embodiment 2 of the present invention. As shown in Fig. 5, the following steps are performed on the news pages under each column:
Step 501: Cluster the news pages under the column to form one or more news clusters.
Because a large number of news pages are retrieved under each column, and a column is too coarse a granularity for classifying news pages, the news pages under each column can be divided into multiple news clusters by clustering. News pages in the same news cluster are highly similar to one another.
The embodiments of the present invention may use, without limitation, hierarchical clustering, agglomerative clustering, divisive clustering, density-based clustering, grid-based clustering and so on. Specifically, if this embodiment uses hierarchical clustering, the clustering termination condition may be set as: the similarity falls below a preset similarity threshold, or the number of news clusters falls below a preset threshold.
Clustering the news pages under each column directly may give poor results. Because the news pages under the same column are all documents highly similar to the same center vector, a large number of news pages may be merged into a single class while the remaining pages form many small groups. Therefore, when clustering the news pages under a column, the weight of the column's center vector in the clustering calculation can first be reduced. This highlights the content of each news page outside the center vector, so that pages are aggregated according to the differences in that content.
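The effect of reducing the center vector's weight before clustering can be illustrated with a small sketch. It assumes sparse term-to-weight vectors, cosine similarity, and a simple agglomerative procedure that merges the most similar pair of clusters until the best similarity falls below a threshold. The damping factor, the threshold and all function names are hypothetical choices, not values given in the source.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term->weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def damp_center_terms(doc_vec, center_vec, factor):
    """Scale down terms that also occur in the column's center vector."""
    return {t: (w * factor if t in center_vec else w) for t, w in doc_vec.items()}


def cluster_pages(doc_vecs, center_vec, sim_threshold=0.5, factor=0.1):
    """Agglomerative clustering over damped vectors; returns lists of page indices."""
    damped = [damp_center_terms(v, center_vec, factor) for v in doc_vecs]
    clusters = [[i] for i in range(len(damped))]

    def centroid(members):
        # Unnormalized sum; only the direction matters for cosine similarity.
        total = {}
        for i in members:
            for t, w in damped[i].items():
                total[t] = total.get(t, 0.0) + w
        return total

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cosine(centroid(clusters[x]), centroid(clusters[y]))
                if best is None or s > best[0]:
                    best = (s, x, y)
        if best[0] < sim_threshold:
            break  # termination: best similarity is below the preset threshold
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


center = {"stock": 1.0}
docs = [
    {"stock": 3, "bank": 2, "rate": 2},
    {"stock": 3, "bank": 2, "loan": 1},
    {"stock": 3, "oil": 2, "price": 2},
    {"stock": 3, "oil": 2, "export": 1},
]
with_damping = cluster_pages(docs, center, factor=0.1)
without_damping = cluster_pages(docs, center, factor=1.0)
```

In this toy data every page mentions the seed term "stock". Without damping, that shared term dominates and all four pages end up in one cluster; with damping, the banking pages and the oil pages separate.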
More preferably, the news pages under each column may be screened before step 501 is executed, for example by keeping only the top M news pages with the highest similarity to the column's center vector, where M is a preset positive integer.
Step 502: According to a preset headline selection strategy, select a headline in each news cluster to represent that cluster.
The headline selection strategy for a news cluster can be set flexibly and may include, without limitation, one or any combination of the following: selecting news pages whose publication time is within a set range, selecting news pages whose title meets set requirements, selecting news pages whose similarity to the center vector of the news cluster is within a set range, and selecting news pages whose news quality meets a preset requirement. For example, a news page that is recently published, has a long title, and is highly similar to the center vector of the news cluster may be selected as the headline. News quality may depend on one or any combination of: the weight of the website, the traffic of the news page, the response speed of the news page, the degree of clutter, and so on. It should be noted that this embodiment takes news pages as the example text; for other kinds of text, a form of text quality suited to the specific text may be used.
An example of selecting the headline in a news cluster: obtain the 3 news pages in the cluster with the highest similarity to the center vector of the cluster, and select from them one whose title is readable as the headline. If none of the titles is readable, take the next 3 news pages by similarity to the center vector of the cluster and select one with a readable title from them, and so on until a news page with a readable title is selected.
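The batch-by-batch headline selection in the example above can be sketched as follows. The source does not define how title readability is measured, so `toy_readability` is a purely hypothetical stand-in (it scores titles by word count), and `min_score` is an assumed cut-off.

```python
def toy_readability(title):
    """Hypothetical readability score in [0, 1]: very short titles score 0."""
    words = len(title.split())
    return 0.0 if words < 3 else min(1.0, words / 8)


def pick_headline(pages, readability=toy_readability, min_score=0.5, batch_size=3):
    """Select a headline from one news cluster.

    pages: list of dicts with "title" and "sim" (similarity to the cluster's
    center vector). Pages are examined in batches of batch_size in order of
    descending similarity; the most readable title in the first batch that
    contains an acceptable title is returned.
    """
    ranked = sorted(pages, key=lambda p: p["sim"], reverse=True)
    for start in range(0, len(ranked), batch_size):
        batch = ranked[start:start + batch_size]
        best = max(batch, key=lambda p: readability(p["title"]))
        if readability(best["title"]) >= min_score:
            return best
    return None  # no page in the cluster has a readable title


cluster = [
    {"title": "Update", "sim": 0.90},
    {"title": "Breaking", "sim": 0.85},
    {"title": "More", "sim": 0.80},
    {"title": "Central bank cuts rates to support growth", "sim": 0.70},
    {"title": "Bank cuts rates again", "sim": 0.60},
]
headline = pick_headline(cluster)
```

Here the three most similar pages all have one-word titles, so the selection falls through to the second batch.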
Step 503: Calculate the weight of each news page under the column from the attributes of the news page, determine the weight of each news cluster from the weights of the news pages in it, and rank the news clusters under the column by cluster weight.
The attributes of a news page mentioned in this step may include, without limitation, one or any combination of the following: publication time, news quality, and reprint rate. As an example of calculating the weight of a news page, formula (1) may be used to calculate the weight W_page of a news page:
$$W_{page} = \frac{\alpha}{\Delta t + \alpha} \times \delta(site) \times \varphi(segcount) \qquad (1)$$
where α is a preset time-decay inverse-proportion factor, Δt is the time elapsed between the news publication time and the current time, δ(site) is the function that computes the news quality factor, and φ(segcount) is the function that computes the reprint rate factor.
The weight of a news cluster can be determined in several ways. For example, the sum of the weights of the news pages in the cluster may be used directly as the weight of the cluster, or the average of the weights of the news pages in the cluster may be used as the weight of the cluster, and so on.
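The weighting and ranking in step 503 can be sketched as follows, assuming the time-decay term of formula (1) takes the form α/(Δt + α), so that a freshly published page has decay 1 and a page of age α has decay 0.5. The quality and reprint factors are passed in as already computed numbers because the source does not define δ and φ; the default `alpha` and the function names are hypothetical.

```python
def page_weight(age_hours, quality_factor, reprint_factor, alpha=24.0):
    """Weight of one page: time decay times quality factor times reprint factor.

    Assumption: decay = alpha / (age + alpha), with age and alpha in hours.
    """
    return alpha / (age_hours + alpha) * quality_factor * reprint_factor


def cluster_weight(page_weights, mode="sum"):
    """Weight of a cluster as the sum or the mean of its page weights."""
    if mode == "sum":
        return sum(page_weights)
    if mode == "mean":
        return sum(page_weights) / len(page_weights)
    raise ValueError(f"unknown mode: {mode}")


def rank_clusters(clusters, mode="sum"):
    """clusters: dict name -> list of page weights; names sorted by weight, descending."""
    return sorted(clusters, key=lambda n: cluster_weight(clusters[n], mode), reverse=True)


clusters = {"a": [0.2, 0.2, 0.2], "b": [0.5]}
by_sum = rank_clusters(clusters, "sum")
by_mean = rank_clusters(clusters, "mean")
```

The two aggregation modes can rank clusters differently: summing favors large clusters of moderately weighted pages, while averaging favors small clusters of highly weighted pages, as the toy data shows.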
Step 504: According to a preset hot-news selection strategy, select hot news from the news pages under the column and display it under the column.
The hot-news selection strategy can be set flexibly. For example, several news pages may be selected from each news cluster as the hot news of the column; or, according to the ranking of the news clusters, K2 news pages may be selected from each of the top K1 news clusters as the hot news of the column, where K1 and K2 are positive integers. Other strategies are possible and are not listed exhaustively here.
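The "top K1 clusters, K2 pages each" strategy can be sketched as follows. The source does not say how the K2 pages are chosen within a cluster; this sketch assumes they are the pages with the highest page weight. The function name and data layout are hypothetical.

```python
def select_hot_pages(ranked_clusters, k1, k2):
    """Pick hot news for one column.

    ranked_clusters: list of clusters already sorted by cluster weight,
    descending; each cluster is a list of (page_id, page_weight) tuples.
    Returns page ids from the top k1 clusters, up to k2 per cluster.
    """
    hot = []
    for cluster in ranked_clusters[:k1]:
        # Assumption: within a cluster, prefer the pages with the highest weight.
        top_pages = sorted(cluster, key=lambda p: p[1], reverse=True)[:k2]
        hot.extend(page_id for page_id, _ in top_pages)
    return hot


ranked = [
    [("p1", 0.4), ("p2", 0.9), ("p3", 0.1)],
    [("p4", 0.7)],
    [("p5", 0.3), ("p6", 0.2)],
]
hot = select_hot_pages(ranked, k1=2, k2=2)
```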
Steps 502, 503 and 504 have no fixed order; this flow is only one possible implementation.
It should be noted that, in each column, whether hot news is displayed and whether each news cluster displays its headline are both configurable. In other words, the text content to be displayed and the specific display manner can be configured in the display attributes of the column.
This concludes the flow of Embodiment 2.
The above describes the method provided by the present invention; the device provided by the present invention is described in detail below. As shown in Fig. 6, the device may include: a text acquisition unit 601, a first distribution unit 602, and a second distribution unit 603.
The text acquisition unit 601 is configured to deliver each crawled text to the first distribution unit 602 as a text to be distributed.
The first distribution unit 602 is configured to match the keywords of the current text to be distributed against the center vector of each column by similarity and, according to the matching result, distribute the current text to be distributed to the columns that satisfy the distribution matching strategy. The center vector of a column is generated from the seed words set for that column in advance.
The second distribution unit 603 is configured to, after the first distribution unit 602 has finished distributing all texts to be distributed, distribute all or part of the texts under set columns to the parent column at the upper level or the sub-columns at the lower level, according to the hierarchical relationship among the columns.
The distribution matching strategy of a column includes at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for the column; or, the similarity between the keywords of the text to be distributed and the center vector of the column, minus the similarity between the keywords of the text to be distributed and the negative vector of the same column, exceeds the similarity threshold set for the column, where the negative vector of a column is generated from the negative words set for that column in advance.
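The two threshold rules above can be sketched as follows, assuming cosine similarity over sparse term-to-weight vectors (the source does not fix a particular similarity measure). The column names, thresholds and the dictionary layout of `columns` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term->weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def matches_column(text_vec, center_vec, threshold, negative_vec=None):
    """True if the text satisfies the column's similarity rule.

    Without a negative vector: sim(text, center) > threshold.
    With one: sim(text, center) - sim(text, negative) > threshold.
    """
    score = cosine(text_vec, center_vec)
    if negative_vec:
        score -= cosine(text_vec, negative_vec)
    return score > threshold


def distribute(text_vec, columns):
    """Return the names of all columns whose rule the text satisfies."""
    return [
        name
        for name, cfg in columns.items()
        if matches_column(text_vec, cfg["center"], cfg["threshold"], cfg.get("negative"))
    ]


columns = {
    "market_index": {
        "center": {"index": 1.0, "stock": 1.0},
        "negative": {"hongkong": 1.0, "nasdaq": 1.0},
        "threshold": 0.3,
    },
}
domestic = distribute({"index": 1.0, "shanghai": 1.0}, columns)
foreign = distribute({"index": 1.0, "nasdaq": 1.0}, columns)
```

In the second call the text is as similar to the negative vector as to the center vector, so the difference drops to zero and the text is not distributed to the column.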
In addition, the distribution matching strategy may further include, without limitation, one or any combination of the following: the similarity between the keywords of the text to be distributed and the center vector of the column is the highest; the source website of the text to be distributed meets the website requirement of the column; the author of the text to be distributed meets the author requirement of the column; the text to be distributed meets the column's requirement for pictures or videos; the title of the text to be distributed matches the title regular expression required by the column; or the URL type of the text to be distributed meets the URL type requirement of the column.
Specifically, if the columns to which the first distribution unit 602 distributes are sub-columns, the second distribution unit 603 may aggregate all texts, or the top N1 ranked texts, under each sub-column distributed to by the first distribution unit 602 into the parent column at the upper level, where N1 is a preset positive integer.
If the columns to which the first distribution unit 602 distributes are parent columns, the second distribution unit 603 may distribute all texts under each parent column distributed to by the first distribution unit 602 to the sub-columns at the lower level.
If the columns to which the first distribution unit 602 distributes include both parent columns and sub-columns, the second distribution unit 603 may distribute part of the texts under the parent column distributed to by the first distribution unit 602 to the lower-level sub-columns that have not yet been distributed any text.
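The three hierarchy-based redistribution cases can be sketched as follows. How a parent's texts are split among its sub-columns is not specified in this passage, so the per-child acceptance predicates below are hypothetical placeholders, as are the function names and column names.

```python
def aggregate_to_parent(children_texts, n1=None):
    """Case 1: merge all texts, or the top n1 of each sub-column, into the parent.

    children_texts: dict child name -> list of texts, already ranked.
    Duplicates across sub-columns are kept only once.
    """
    parent = []
    for texts in children_texts.values():
        for text in (texts if n1 is None else texts[:n1]):
            if text not in parent:
                parent.append(text)
    return parent


def push_to_children(parent_texts, child_filters, already_distributed=()):
    """Cases 2 and 3: hand the parent's texts down to its sub-columns.

    child_filters: dict child name -> predicate deciding whether the child
    accepts a text (hypothetical; e.g. a source-site or title rule).
    already_distributed: sub-columns that received texts directly from the
    first distribution pass and are therefore skipped (case 3).
    """
    return {
        child: [t for t in parent_texts if accept(t)]
        for child, accept in child_filters.items()
        if child not in already_distributed
    }


merged = aggregate_to_parent({"a": ["t1", "t2", "t3"], "b": ["t2", "t4"]}, n1=2)
pushed = push_to_children(
    ["law case", "odd story"],
    {"society_and_law": lambda t: "law" in t, "social_misc": lambda t: "law" not in t},
    already_distributed={"society_and_law"},
)
```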
The columns involved in the present invention may include common columns, which have the text display attribute, and hidden columns, which do not. Hidden columns can be used to implement text filtering.
The device may further include a keyword extraction unit 604, configured to extract keywords from the texts distributed under a column configured with seed words, combine the extracted keywords with the seed words of the column to form a new center vector for the column, and provide it to the first distribution unit 602. By updating the column's center vector through the keyword extraction unit 604, the updated center vector describes the content orientation of the column more accurately, which improves the accuracy with which texts are distributed to the column.
Further, the device may also include a text clustering unit 605 and a headline selection unit 606.
The text clustering unit 605 is configured to cluster the texts under each column according to the distribution results of the first distribution unit 602 and the second distribution unit 603, forming one or more clusters under each column.
The headline selection unit 606 is configured to select, according to a preset headline selection strategy, a headline text in each cluster formed by the text clustering unit 605 to represent that cluster.
More preferably, the device may also include one or both of a cluster ranking unit 607 and a hot-text selection unit 608 (Fig. 6 takes the case where both units are included as an example).
The cluster ranking unit 607 is configured to, after the text clustering unit 605 has formed the clusters under each column, calculate the weight of each text under the column from the text attributes, determine the weight of each cluster from the weights of the texts in it, and rank the clusters under the column by cluster weight.
The hot-text selection unit 608 is configured to, based on the distribution results of the first distribution unit 602 and the second distribution unit 603 and according to a preset hot-text selection strategy, select hot texts from the texts under each column and display them under that column.
The headline selection strategy may include one or any combination of the following: selecting texts whose publication time is within a set range, selecting texts whose title meets set requirements, selecting texts whose similarity to the center vector of the cluster is within a set range, and selecting texts whose text quality meets a preset requirement.
More preferably, the weight W_page of each text may be calculated with the following formula:
$$W_{page} = \frac{\alpha}{\Delta t + \alpha} \times \delta(site) \times \varphi(segcount)$$
where α is a preset time-decay inverse-proportion factor, Δt is the time elapsed between the text's publication time and the current time, δ(site) is the function that computes the text quality factor, and φ(segcount) is the function that computes the reprint rate factor.
As can be seen from the above technical solutions, the method and device provided by the present invention can have the following advantages:
1) The present invention distributes texts to columns based on center vectors generated from the columns' seed words, combined with text distribution between levels. This keeps the time needed for text distribution at the level of seconds and greatly improves the efficiency of text distribution. In addition, the method and device of the present invention avoid the complicated process of building training samples. When the column framework changes, it is only necessary to set suitable seed words and inter-level text distribution rules for added columns, and to revise the inter-level text distribution rules for deleted columns. Compared with the prior art, which requires the training samples to be redefined, this clearly reduces the cost of text distribution and allows columns to be added and removed more flexibly.
2) In the present invention, text filtering can be implemented by setting negative words in a column or by setting hidden columns, which improves the accuracy of the texts displayed in a column.
3) In the present invention, keywords can be extracted from the texts distributed to a column and combined with the seed words of the column to form a new center vector, so that the center vector describes the content orientation of the column more accurately, thereby improving the precision and recall with which texts are distributed to the column.
4) The present invention provides multiple distribution matching strategies, so the text recall rate of a column can be controlled flexibly as required.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (18)

1. A text distribution method, applied to a column framework comprising at least two levels of columns, characterized in that the method comprises:
A. performing the following distribution step for each crawled text:
Distribution step: matching the keywords of the current text to be distributed against the center vector of each column by similarity and, according to the matching result, distributing the current text to be distributed to the columns that satisfy a distribution matching strategy; wherein the center vector of a column is generated from seed words set for the column in advance;
B. according to the hierarchical relationship among the columns, distributing all or part of the texts under set columns to the parent column at the upper level or the sub-columns at the lower level.
2. The method according to claim 1, characterized in that the distribution matching strategy of a column includes at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for the column; or,
the similarity between the keywords of the text to be distributed and the center vector of the column, minus the similarity between the keywords of the text to be distributed and the negative vector of the same column, exceeds the similarity threshold set for the column, wherein the negative vector of the column is generated from negative words set for the column in advance.
3. The method according to claim 1, characterized in that step B specifically comprises one or any combination of the following modes:
the columns to which texts are distributed in step A are sub-columns, and all texts, or the top N1 ranked texts, under each sub-column to which texts are distributed in step A are aggregated into the parent column at the upper level, wherein N1 is a preset positive integer; or,
the columns to which texts are distributed in step A are parent columns, and all texts under the parent columns to which texts are distributed in step A are distributed to the sub-columns at the lower level; or,
the columns to which texts are distributed in step A include parent columns and sub-columns, and part of the texts under the parent columns to which texts are distributed in step A is distributed to the lower-level sub-columns that have not been distributed any text.
4. The method according to claim 1, characterized in that the columns include: common columns, which have the text display attribute, and hidden columns, which do not have the text display attribute.
5. The method according to any one of claims 1 to 4, characterized in that the method further comprises: extracting keywords from the texts distributed under a column configured with seed words, and combining the extracted keywords with the seed words of the column to form a new center vector for the column.
6. The method according to any one of claims 1 to 4, characterized in that, after step B, the following steps are performed for each column:
C1. clustering the texts under the column to form one or more clusters under the column;
C2. according to a preset headline selection strategy, selecting a headline text in each cluster to represent that cluster.
7. The method according to claim 6, characterized in that, after step C2, the method further comprises:
calculating the weight of each text under the column from the text attributes, determining the weight of each cluster from the weights of the texts in it, and ranking the clusters under the column by cluster weight; or,
according to a preset hot-text selection strategy, selecting hot texts from the texts under each column and displaying them under that column.
8. The method according to claim 6, characterized in that the headline selection strategy includes one or any combination of the following: selecting texts whose publication time is within a set range, selecting texts whose title meets set requirements, selecting texts whose similarity to the center vector of the cluster is within a set range, and selecting texts whose text quality meets a preset requirement.
9. The method according to claim 7, characterized in that the weight W_page of each text is calculated as:
$$W_{page} = \frac{\alpha}{\Delta t + \alpha} \times \delta(site) \times \varphi(segcount)$$
where α is a preset time-decay inverse-proportion factor, Δt is the time elapsed between the text's publication time and the current time, δ(site) is the function that computes the text quality factor, and φ(segcount) is the function that computes the reprint rate factor.
10. A text distribution device, applied to a column framework comprising at least two levels of columns, characterized in that the device comprises: a text acquisition unit, a first distribution unit and a second distribution unit;
the text acquisition unit is configured to deliver each crawled text to the first distribution unit as a text to be distributed;
the first distribution unit is configured to match the keywords of the current text to be distributed against the center vector of each column by similarity and, according to the matching result, distribute the current text to be distributed to the columns that satisfy a distribution matching strategy; wherein the center vector of a column is generated from seed words set for the column in advance;
the second distribution unit is configured to, after the first distribution unit has finished distributing all texts to be distributed, distribute all or part of the texts under set columns to the parent column at the upper level or the sub-columns at the lower level, according to the hierarchical relationship among the columns.
11. The device according to claim 10, characterized in that the distribution matching strategy of a column includes at least: the similarity between the keywords of the text to be distributed and the center vector of the column exceeds the similarity threshold set for the column; or,
the similarity between the keywords of the text to be distributed and the center vector of the column, minus the similarity between the keywords of the text to be distributed and the negative vector of the same column, exceeds the similarity threshold set for the column, wherein the negative vector of the column is generated from negative words set for the column in advance.
12. The device according to claim 10, characterized in that the columns to which the first distribution unit distributes are sub-columns, and the second distribution unit aggregates all texts, or the top N1 ranked texts, under each sub-column distributed to by the first distribution unit into the parent column at the upper level, wherein N1 is a preset positive integer; or,
the columns to which the first distribution unit distributes are parent columns, and the second distribution unit distributes all texts under each parent column distributed to by the first distribution unit to the sub-columns at the lower level; or,
the columns to which the first distribution unit distributes include parent columns and sub-columns, and the second distribution unit distributes part of the texts under the parent columns distributed to by the first distribution unit to the lower-level sub-columns that have not been distributed any text.
13. The device according to claim 10, characterized in that the columns include: common columns, which have the text display attribute, and hidden columns, which do not have the text display attribute.
14. The device according to any one of claims 10 to 13, characterized in that the device further comprises: a keyword extraction unit, configured to extract keywords from the texts distributed under a column configured with seed words, combine the extracted keywords with the seed words of the column to form a new center vector for the column, and provide it to the first distribution unit.
15. The device according to any one of claims 10 to 13, characterized in that the device further comprises: a text clustering unit and a headline selection unit;
the text clustering unit is configured to cluster the texts under each column according to the distribution results of the first distribution unit and the second distribution unit, forming one or more clusters under each column;
the headline selection unit is configured to select, according to a preset headline selection strategy, a headline text in each cluster to represent that cluster.
16. The device according to claim 15, characterized in that the device further comprises one or both of a cluster ranking unit and a hot-text selection unit;
the cluster ranking unit is configured to calculate the weight of each text under the column from the text attributes, determine the weight of each cluster from the weights of the texts in it, and rank the clusters under the column by cluster weight;
the hot-text selection unit is configured to, based on the distribution results of the first distribution unit and the second distribution unit and according to a preset hot-text selection strategy, select hot texts from the texts under each column and display them under that column.
17. The device according to claim 15, characterized in that the headline selection strategy includes one or any combination of the following: selecting texts whose publication time is within a set range, selecting texts whose title meets set requirements, selecting texts whose similarity to the center vector of the cluster is within a set range, and selecting texts whose text quality meets a preset requirement.
18. The device according to claim 16, characterized in that the weight W_page of each text is calculated as:
$$W_{page} = \frac{\alpha}{\Delta t + \alpha} \times \delta(site) \times \varphi(segcount)$$
where α is a preset time-decay inverse-proportion factor, Δt is the time elapsed between the text's publication time and the current time, δ(site) is the function that computes the text quality factor, and φ(segcount) is the function that computes the reprint rate factor.
CN201010549183A 2010-11-17 2010-11-17 Method and device for distributing texts Active CN101984435B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010549183A CN101984435B (en) 2010-11-17 2010-11-17 Method and device for distributing texts

Publications (2)

Publication Number Publication Date
CN101984435A true CN101984435A (en) 2011-03-09
CN101984435B CN101984435B (en) 2012-10-10

Family

ID=43641604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010549183A Active CN101984435B (en) 2010-11-17 2010-11-17 Method and device for distributing texts

Country Status (1)

Country Link
CN (1) CN101984435B (en)

Also Published As

Publication number Publication date
CN101984435B (en) 2012-10-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant