CN105528421B - Method for mining search dimensions of query words in massive data

Info

Publication number: CN105528421B
Application number: CN201510890422.5A
Authority: CN
Inventors: 窦志成; 文继荣; 李谨秀
Original assignee: Renmin University of China
Current assignee: BEIJING YILANQUNZHI DATA TECHNOLOGY Co.,Ltd.
Priority date: 2015-12-07
Filing date: 2015-12-07
Publication date: 2018-09-04
Anticipated expiration: 2035-12-07
Also published as: CN105528421A

Abstract

The invention discloses a method for mining search dimensions of query words in massive data. The method comprises the following steps: 1) extracting term lists (Lists) from each web page in the crawled data set based on patterns such as text, HTML tags, and repeated regions; 2) adding extraction mechanisms to effectively expand the Lists extracted in step 1); 3) assessing the importance of each extracted List; 4) term list clustering: merging similar term lists to form a query dimension; 5) ranking query dimensions and term lists: computing the importance of the different query facets and terms. The invention obtains more effective term lists. After the supplemented term lists are obtained, the new term lists are scored, similar term lists are merged, and the importance of the different query facets and term lists is computed, so that the mined query dimensions are more complete and the user can obtain more complete information.

Description

A method for mining search dimensions of query words in massive data

Technical field

The present invention relates to a method for mining search dimensions of query words in massive data.

Background art

In our prior research work, the method for mining search dimensions of query words in massive data mainly comprised the following four steps: (1) extracting term lists (Lists) from web pages according to patterns such as text, HTML tags, and repeated regions; (2) scoring the term lists to assess their importance; (3) merging similar term lists to form a query dimension; (4) computing the importance of the different query facets and term lists. This scheme has the following main problem: many web pages (news data, microblog posts, and so on) contain no repeated regions or HTML tag structure, and the existing method is not suitable for such data. For news data in particular, very few term lists can be extracted, or none at all.

Therefore, how to solve the above problem is a technical problem that those skilled in the art urgently need to solve.

Summary of the invention

In view of the problem described in the background, the purpose of the present invention is to provide a method for mining search dimensions of query words in massive data. The method obtains more effective term lists. After the supplemented term lists are obtained, the new term lists are scored, similar term lists are merged, and the importance of the different query facets and term lists is computed, so that the mined query dimensions are more complete and the user can obtain more complete information.

The purpose of the present invention is achieved through the following technical solution:

A method for mining search dimensions of query words in massive data, the method comprising the following steps:

1) term list extraction: extracting Lists from each web page in the crawled data set based on text, HTML tag, or repeated-region patterns;

2) adding extraction mechanisms to effectively expand the Lists extracted in step 1);

3) term list scoring: assessing the importance of each extracted List;

4) term list clustering: merging similar term lists to form a query dimension;

5) ranking of query dimensions and term lists: computing the importance of the different query facets and terms.

Further, step 2) is specifically:

(1) for each news search word, crawling K relevant news items from a search engine as the data set;

(2) extracting the body text of each crawled document;

(3) processing the data of each document: the person names appearing in the same sentence, the same paragraph, or the same article are extracted as one List, the place names as one List, and the organization names as one List;

(4) filtering the Lists extracted in step (3).

Further, in step (3), to extract Chinese person names, place names, and organization names, the Chinese text is first segmented with the NLPIR Chinese word segmentation system, after which the person names, place names, and organization names can be obtained; to extract English person names, place names, and organization names, the Stanford University named entity recognizer is used.

Further, step (4) is specifically:

a) crawling the Wikipedia page of each term in the List extracted in step (3), and obtaining the "category" attribute set of each term in the List;

b) taking the union of the "category" attribute sets of all terms in the List to obtain a large category attribute set C;

c) traversing each category in C and, for each category, grouping together the terms in the List that belong to that category; if the category contains more than three terms, a new List is formed, and a List with fewer than three terms is discarded;

d) after the loop in step c) ends, a series of Lists is obtained, each derived from one category attribute;

e) for each new List in the Lists, scoring the extracted List using idf information;

f) selecting the highest-scoring List as the final List.

Further, the idf in step e) is calculated as: idf = (N - n + 0.5) / (n + 0.5), where N is the total number of entries in Wikipedia and n is the total number of Wikipedia entries that belong to the category attribute on which the List is based.

Further, the formula for scoring the extracted List using idf information in step e) is: Score = length * idf, where length is the length of the List.

Further, step 2) is specifically: the entity words appearing in the same sentence, the same paragraph, or the same news article are extracted as one List; the extracted List is then filtered using Wikipedia.

The present invention has the following positive technical effects:

The method of the present invention obtains more effective term lists. After the supplemented term lists are obtained, the new term lists are scored, similar term lists are merged, and the importance of the different query facets and term lists is computed, so that the mined query dimensions are more complete and the user can obtain more complete information.

Brief description of the drawings

Fig. 1 is an example of the news data used in the embodiment of the present invention;

Fig. 2a shows the category attribute information of the term "Beijing" in Wikipedia;

Fig. 2b shows the category attribute information of the term "Shanghai" in Wikipedia;

Fig. 2c shows the category attribute information of the term "China" in Wikipedia;

Fig. 3 shows the category attribute information of the search term "Cheng Long" in Wikipedia.

Detailed description

The application is further described below with reference to the accompanying drawings.

With the rapid development of the Internet, the amount of information on it keeps growing. Faced with such a variety of information, users often find it difficult to obtain the information they want quickly. To help users find the desired information quickly, we process a large amount of retrieved information, classify it according to the query dimensions of the information, and then present it to the user. A query dimension is a series of words that describes one important aspect of a query word; this series of words is a group of semantically related, parallel terms, referred to in the present invention as a term list (List). For example, for the query word "watch", the retrieved information can be classified by dimensions such as brand, feature, performance, and model; for the TV series "Lost", by dimensions such as the episodes of each season, the actors, the characters, and the plot; for the query word "flower", by dimensions such as use, type, and color. Table 1 gives examples of the query dimensions of some query words. If the information on the Internet related to a query word can be classified by dimension, the user can easily and quickly find the corresponding information according to the dimensions of the query word. The work described here is precisely to mine the query dimensions of query words.

In classifying retrieved information by dimension, the current approach to obtaining query dimensions for query words on the web has the following four processing steps: (1) extracting term lists (Lists) from web pages according to patterns such as text, HTML tags, and repeated regions; (2) scoring the term lists to assess their importance; (3) merging similar term lists to form a query dimension; (4) computing the importance of the different query facets and term lists. In the first step, the original method extracts Lists from web data according to patterns such as text, HTML tags, and repeated regions. However, many web pages (news data, microblog posts, and so on) have no repeated regions or HTML tag structure, and the original method is not suitable for such data, news data in particular. Taking news data as an example, most of it is plain text, so the original extraction method can hardly extract suitable term lists. The present application therefore takes the characteristics of news data into account, improves the original term list extraction method, and adds several extraction mechanisms for news data that effectively expand the term lists extracted by the original method.

The present invention mainly considers the characteristics of news data and makes improvements in the following three aspects: (1) person names, place names, and organization names: nouns such as persons and places occur frequently in news data and are very important there; the person names, place names, and organization names that occur in the same sentence, the same paragraph, or the same news article are likely to be related and can be used as term lists (Lists) to expand the original Lists; (2) Wikipedia filtering: the person names, place names, and organization names from aspect (1) are filtered using Wikipedia; the terms in the same paragraph that are better suited to describing a query dimension form a new List, and unsuitable words are deleted from the List; (3) entity linking: in news data, the entity words in the same paragraph (an entity word here is a term that can be found in Wikipedia) are likely to be related in meaning and can probably describe the same query dimension, so the entity words in the same paragraph are taken as one List, which is then filtered using Wikipedia to obtain new Lists. Each of these three aspects is tested in turn. After the new Lists are extracted, they are scored with the original scoring method, similar Lists are then merged to form a query dimension, and finally the importance of the different query facets and terms is computed.

In news data, structured sentences and sentences containing repeated-region patterns are rare. Extraction based on structured sentences therefore yields very little or nothing; for example, for the data in Fig. 1, the original extraction approach extracts no List at all. Considering that persons and places are important information in news and occur frequently, the present embodiment extracts the person names in news data as one List, the place names as one List, and the organization names as one List, thereby expanding the term lists extracted by the original method.

The present invention mainly considers the following three schemes:

Scheme one: the person names in the same sentence are extracted as one List, the place names as one List, and the organization names as one List.

Scheme two: the person names in the same paragraph are extracted as one List, the place names as one List, and the organization names as one List.

Scheme three: the person names in the same news article are extracted as one List, the place names as one List, and the organization names as one List.

The present embodiment mainly describes the processing for scheme two; schemes one and three are handled similarly.

For scheme two, the person names, place names, and organization names that appear in the same paragraph are likely to be closely related. Taking Fig. 1 as an example, "outer dragon, Zheng Yourong, rice Zorovic" appear together in the first paragraph, and "Martinez, Gao Lin" appear together in the second paragraph. They are all football players, that is, semantically related parallel terms that are well suited to a query dimension, so this closely related information can be extracted as a List. In the present invention, the person names, place names, and organization names of the same paragraph are each put together as a List. Table 1 shows the Lists extracted from this text after person name, place name, and organization name extraction is added. Only Lists whose length exceeds 3 are retained, so the Lists finally extracted are the first two.

The specific extraction method is as follows:

(1) For each news search word, K relevant news items are crawled from a search engine as the data set.

(2) The body text of each crawled document is extracted.

(3) Each paragraph of each document is processed: the person names in each paragraph are extracted as one List, the place names as one List, and the organization names as one List.

To extract Chinese person names, place names, and organization names, we segment the Chinese text with the existing NLPIR Chinese word segmentation system; after segmentation, the person names, place names, and organization names in the same paragraph are readily obtained.

For English, the Stanford University named entity recognizer can be used to identify person names, place names, and organization names.
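The paragraph-level grouping in step (3) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the named entities have already been recognized by some tagger, and the input format, the tag names `PER`/`LOC`/`ORG`, and the function name `lists_from_paragraphs` are hypothetical. The length threshold follows the statement that only Lists whose length exceeds 3 are retained.

```python
# Hypothetical tag names for person, place, and organization entities.
ENTITY_TYPES = ("PER", "LOC", "ORG")


def lists_from_paragraphs(tagged_paragraphs, min_length=3):
    """Build one candidate List per (paragraph, entity type).

    tagged_paragraphs: list of paragraphs, each a list of (term, tag) pairs
    produced by an upstream tagger (assumed, not shown here).
    Only Lists with more than `min_length` distinct terms are kept.
    """
    candidate_lists = []
    for paragraph in tagged_paragraphs:
        # A dict per type keeps first-seen order and removes duplicates.
        by_type = {tag: {} for tag in ENTITY_TYPES}
        for term, tag in paragraph:
            if tag in by_type:
                by_type[tag][term] = None
        for tag in ENTITY_TYPES:
            terms = list(by_type[tag])
            if len(terms) > min_length:
                candidate_lists.append((tag, terms))
    return candidate_lists


# Illustrative input: two paragraphs of already-tagged entities.
example = [
    [("China", "LOC"), ("Beijing", "LOC"), ("Shanghai", "LOC"),
     ("Tianjin", "LOC"), ("Chongqing", "LOC"), ("Beijing", "LOC")],
    [("Beijing", "LOC"), ("Shanghai", "LOC")],
]
result = lists_from_paragraphs(example)
```

In this example the first paragraph yields one place name List with five distinct terms, while the second paragraph yields nothing because its List is too short.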

Expanding the term lists directly with the Lists obtained this way has some shortcomings, so the Lists must first be processed with Wikipedia.

Lists extracted directly by the above method are somewhat coarse: the terms in some Lists are not closely related, and merging them into the same query dimension is inappropriate. Take place names as an example. If "China, Beijing, Shanghai, Tianjin, Chongqing" all occur in the same paragraph, then "Beijing, Shanghai, Tianjin, Chongqing" are clearly four municipalities directly under the central government, whereas "China" is the country that contains these four cities. Putting a country and cities in one List is inappropriate because they are not at the same level. If "China" is filtered out of the list, the List becomes better suited to describing a query dimension. To solve this problem, we use the data in Wikipedia to filter the term lists obtained by extracting person names, place names, and organization names.

For each term in each List, if no corresponding entry can be found in Wikipedia, the term is deleted from the List directly, since the noun may have been extracted incorrectly. If a corresponding entry is found in Wikipedia, we fetch the entry and filter using its "category" attribute. Category attributes are shown in Figs. 2a-2c.

If the category information of two entries overlaps substantially, the entries are closely related. For example, the nouns "Beijing" and "Shanghai" both have "the provincial administrative area of the People's Republic of China" and "Chinese megalopolis" among their category attributes; as the figures show, they share two attributes, so the two nouns can appear in the same list describing a query dimension. If the overlap is small, for example "Beijing" and "China" share no category, the terms are not suitable for describing one query dimension together. In this application, we filter the extracted person names, place names, and organization names according to the "category" information of their Wikipedia entries. The procedure is described below for a place name List extracted from one paragraph; person name and organization name Lists are filtered in the same way. The detailed process is as follows:

(1) The Wikipedia page of each term in the List is crawled, and the "category" attribute set of each term in the List is obtained.

(2) The union of the "category" attribute sets of all terms in the List is taken to obtain a large category attribute set C.

(3) Each category in C is traversed. For each category, the terms in the List that belong to that category are grouped together; if the category contains more than three terms, a new List is formed, and a List with fewer than three terms is discarded.

(4) After the loop in step (3) ends, a series of Lists (zero, one, or more) is obtained, each derived from one category attribute.

(5) Each new List in the Lists is scored using idf information (Formula 1) according to Formula 2, where N is the total number of entries in Wikipedia, n is the total number of Wikipedia entries that belong to the category attribute on which the List is based (this number can be obtained by opening the category page), and length is the length of the List (we prefer longer Lists because a longer List contains more terms and supplements the query dimension better).

idf = (N - n + 0.5) / (n + 0.5)    Formula 1

Score = length * idf    Formula 2

(6) The highest-scoring List is selected.

The highest-scoring List finally selected is the new List obtained by filtering the extracted List according to the information in Wikipedia.
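The six filtering steps can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the category sets and category sizes are passed in as plain dictionaries instead of being crawled from Wikipedia, the function names are hypothetical, and all numbers in the example (category sizes, the total entry count, and the category memberships of Tianjin, Chongqing, and China) are made up for demonstration.

```python
def idf(total_entries, category_entries):
    """idf = (N - n + 0.5) / (n + 0.5), as given in the text."""
    return (total_entries - category_entries + 0.5) / (category_entries + 0.5)


def filter_list(terms, term_categories, category_sizes, total_entries, min_terms=3):
    """Return the highest-scoring category-based sub-List, or [] if none qualifies.

    term_categories: term -> set of category names (stand-in for crawled pages).
    category_sizes:  category name -> number of entries in that category (n).
    total_entries:   total number of entries (N).
    """
    # Terms with no entry are dropped (possible extraction errors).
    known = [t for t in terms if t in term_categories]

    # Union of the category sets of all remaining terms: the set C.
    all_categories = set()
    for term in known:
        all_categories |= term_categories[term]

    best_score, best_list = None, []
    for category in sorted(all_categories):  # sorted for deterministic ties
        members = [t for t in known if category in term_categories[t]]
        if len(members) <= min_terms:
            continue  # keep only Lists with more than `min_terms` terms
        score = len(members) * idf(total_entries, category_sizes[category])
        if best_score is None or score > best_score:
            best_score, best_list = score, members
    return best_list


# Illustrative data only; "Xyzzy" stands for a term with no entry.
term_categories = {
    "China": {"country"},
    "Beijing": {"provincial administrative area", "Chinese megalopolis"},
    "Shanghai": {"provincial administrative area", "Chinese megalopolis"},
    "Tianjin": {"provincial administrative area"},
    "Chongqing": {"provincial administrative area"},
}
category_sizes = {
    "country": 200,
    "provincial administrative area": 34,
    "Chinese megalopolis": 10,
}
filtered = filter_list(
    ["China", "Beijing", "Shanghai", "Tianjin", "Chongqing", "Xyzzy"],
    term_categories, category_sizes, total_entries=1_000_000,
)
```

With this data only one category groups more than three terms, so "China" and the unknown term are removed and the four municipalities remain.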

Here we explain why idf is used. If a Wikipedia category contains a very large number of entries, the semantics of the category are very broad, and a List derived from it is likely to contain unrelated terms, to be poorly suited to describing a query dimension, and to be a poor supplement to the original Lists. Take the search term "Cheng Long", whose category attributes in Wikipedia are shown in Fig. 3. Opening the category "alive personage" shows that it contains 62,205 entries. The terms of a List generated on the basis of "alive personage" are therefore likely to be unrelated. We want a List derived from such a category to score lower, so we use idf to penalize such Lists.
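The effect of the idf term can be checked numerically. In the sketch below, the category size 62,205 comes from the text, while the total entry count N, the size of the narrow category, and both list lengths are hypothetical values chosen only to illustrate the comparison.

```python
def idf(total_entries, category_entries):
    """idf = (N - n + 0.5) / (n + 0.5)."""
    return (total_entries - category_entries + 0.5) / (category_entries + 0.5)


def score(list_length, total_entries, category_entries):
    """Score = length * idf."""
    return list_length * idf(total_entries, category_entries)


TOTAL_ENTRIES = 1_000_000  # hypothetical N, not taken from the source

# A long List built from a very broad category (62,205 entries, from the text).
broad_score = score(10, TOTAL_ENTRIES, 62_205)

# A shorter List built from a narrow category (hypothetical 500 entries).
narrow_score = score(4, TOTAL_ENTRIES, 500)

print(f"broad category:  {broad_score:.1f}")
print(f"narrow category: {narrow_score:.1f}")
```

Under these assumed numbers the shorter List from the narrow category scores well above the longer List from the broad category, which is the penalizing behavior the paragraph describes.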

The List filtered with Wikipedia no longer contains the unrelated terms present in the Lists obtained by directly extracting person names, place names, and organization names, so the terms in the final List are more closely related and parallel, and the Lists are supplemented more effectively.

After the original method's Lists have been supplemented with the filtered Lists obtained by extracting person names, place names, and organization names from the same paragraph, we further consider that the entities in the same paragraph of news data are also very likely to be related, because nouns in news data often carry a special meaning. If highly related entities can be added to the Lists effectively, the mining of query dimensions is expanded further. Another method of the present application is therefore to use Wikipedia Miner to find entity words as the initial List, filter it with the method described above, and use the resulting new List for expansion.

The application mainly considers the following three schemes:

Scheme one: the entity words in the same sentence are extracted as one List.

Scheme two: the entity words in the same paragraph are extracted as one List.

Scheme three: the entity words in the same news article are extracted as one List.

For scheme two, we use Wikipedia Miner to find all entities in each paragraph of the text. However, a paragraph contains many nouns, some of which are probably unrelated. To ensure that the terms of each List are related and suitable for describing one query dimension, we filter the List extracted from each paragraph using Wikipedia and add the filtered List to the Lists obtained by the original method. Schemes one and three are handled similarly to scheme two.
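The construction of the initial entity List for scheme two can be sketched as follows. This is a simplified stand-in, not the Wikipedia Miner API: it relies only on the definition given earlier that an entity word is a term that can be found in Wikipedia, and it assumes a precomputed set of entry titles and pre-split candidate terms. The function name `entity_lists` and the sample data are hypothetical.

```python
def entity_lists(paragraph_terms, wikipedia_titles):
    """Return one initial entity List per paragraph.

    paragraph_terms:  list of paragraphs, each a list of candidate terms.
    wikipedia_titles: set of entry titles, a stand-in for an entity lookup.
    A term counts as an entity word if it has an entry; duplicates within a
    paragraph are dropped and first-seen order is kept.
    """
    lists = []
    for terms in paragraph_terms:
        entities = []
        for term in terms:
            if term in wikipedia_titles and term not in entities:
                entities.append(term)
        if entities:
            lists.append(entities)
    return lists


# Illustrative data only.
titles = {"Beijing", "Shanghai", "football"}
paragraphs = [
    ["Beijing", "yesterday", "football", "Beijing"],
    ["today", "tomorrow"],
    ["Shanghai"],
]
initial_lists = entity_lists(paragraphs, titles)
```

Each initial List produced this way would then go through the category-based Wikipedia filtering described above before being added to the Lists of the original method.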

In conclusion the application can obtain more effective lexical item lists, after the lexical item list after being supplemented, We give a mark to new lexical item list, and similar lexical item list is then merged classification, calculate different inquiry point Face, lexical item list importance, it is final so that the inquiry dimension excavated is more perfect so that user can obtain more complete Information.

It is described above simply to illustrate that of the invention, it is understood that the present invention is not limited to the above embodiments, meets The various variants of inventive concept are within protection scope of the present invention.

Claims

1. A method for mining search dimensions of query words in massive data, characterized in that the method comprises the following steps:

1) term list extraction: extracting Lists from each web page in the crawled data set based on text, HTML tag, and repeated-region patterns;

2) adding extraction mechanisms to effectively expand the Lists extracted in step 1):

(1) for each news search word, crawling K relevant news items from a search engine as the data set;

(2) extracting the body text of each crawled document;

(3) processing the data of each document, taking the same sentence, the same paragraph, or the same article as the unit for extracting a List; the person names in the same sentence, the same paragraph, or the same article are extracted as one List, the place names as one List, and the organization names as one List;

(4) filtering the Lists extracted in step (3);

3) term list scoring: assessing the importance of each extracted List;

4) term list clustering: merging similar term lists to form a query dimension;

5) ranking of query dimensions and term lists: computing the importance of the different query facets and terms.

2. The method for mining search dimensions of query words in massive data according to claim 1, characterized in that, in step (3), to extract Chinese person names, place names, and organization names, the Chinese text is first segmented with the NLPIR Chinese word segmentation system, after which the person names, place names, and organization names therein can be obtained; to extract English person names, place names, and organization names, the Stanford University named entity recognizer is used.

3. The method for mining search dimensions of query words in massive data according to claim 1, characterized in that step (4) is specifically:

a) crawling the Wikipedia page of each term in the List extracted in step (3), and obtaining the "category" attribute set of each term in the List;

b) taking the union of the "category" attribute sets of all terms in the List to obtain a large category attribute set C;

c) traversing each category in C and, for each category, grouping together the terms in the List that belong to that category; if the category contains more than three terms, a new List is formed, and a List with fewer than three terms is discarded;

d) after the loop in step c) ends, a series of Lists is obtained, each derived from one category attribute;

e) for each new List in the Lists, scoring the extracted List using idf information;

f) selecting the highest-scoring List as the final List.

4. The method for mining search dimensions of query words in massive data according to claim 3, characterized in that the idf in step e) is calculated as: idf = (N - n + 0.5) / (n + 0.5), where N is the total number of entries in Wikipedia and n is the total number of Wikipedia entries that belong to the category attribute on which the List is based.

5. The method for mining search dimensions of query words in massive data according to claim 3, characterized in that the formula for scoring the extracted List using idf information in step e) is: Score = length * idf, where length is the length of the List.

6. The method for mining search dimensions of query words in massive data according to claim 1, characterized in that step 2) is specifically: the entity words in the same sentence, the same paragraph, or the same news article are extracted as one List; the extracted List is then filtered using Wikipedia.