CN107908712A - Cross-language information matching process based on term extraction - Google Patents
- Publication number
- CN107908712A CN107908712A CN201711101619.1A CN201711101619A CN107908712A CN 107908712 A CN107908712 A CN 107908712A CN 201711101619 A CN201711101619 A CN 201711101619A CN 107908712 A CN107908712 A CN 107908712A
- Authority
- CN
- China
- Prior art keywords
- word
- term
- text
- english
- chinese
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a cross-language information matching method based on term extraction. Chinese text is preprocessed with natural language processing techniques, and each sentence is part-of-speech tagged. The preprocessed results are filtered with the word-building rules of terms, and word boundaries are determined by information entropy. The domain relevance of a term is measured by the IDF value of the word in a domain corpus; the two groups of values are combined by weighting, a threshold is set, and candidate terms are accepted or rejected according to their scores. On the basis of the extracted domain terms, the Chinese and English terms are aligned to obtain the translation of each term specific to this field. Finally, a retrieval expression is built from the Chinese-English term alignment result, establishing the connection between Chinese and English; the expression is run as a full-text search over the English texts, and the best-matching English text is determined from the matching result, so that cross-language matching is achieved through the domain term alignment result.
Description
Technical field
The present invention relates to a cross-language information matching method.
Background technology
1. term extraction
A term is the designation of a specific concept in a specific field. The view of domain terms now generally accepted by scholars is that a domain term (Domain Term) belongs to the set of designations used to denote specific concepts within a specific subject area. Domestic researchers generally accept that a domain term is a noun or a nominal phrase, but the "noun" meant here is distinct from the noun of linguistics.
Terms and concepts stand in a 1:1 relation: each domain term denotes exactly one specific concept, and conversely each specific concept has exactly one domain term as its unique designation. This reflects the two key properties of domain terms, namely monosemy and mononymy. If a term fails to satisfy monosemy and mononymy, synonymy and polysemy easily arise within one subject or across several subjects.
Automatic term extraction methods fall broadly into three kinds: rule-based methods, statistics-based methods, and methods combining rules with statistics. Since rule-based and statistics-based methods each have advantages and disadvantages, the two can be combined to complement each other, and most recent research combines statistical and linguistic methods.
2. across language matching
In a Chinese-English cross-language information retrieval system, the user poses a question in Chinese (or English) to retrieve related English (or Chinese) documents. Cross-language information retrieval has four common matching approaches: homologous matching (Homologous matching), document translation (Document translation), interlingua technology (Interlingua Technology), and query translation (Query translation).
Homologous matching judges the meaning of a term in one language from the orthographic or phonetic similarity of the two languages, without any translation. For example, French can be treated as English with spelling variants, so information retrieval is performed directly on English documents without translating between the two.
Document translation is the converse of query translation: it first converts the multilingual source collection into the language of the query, and then performs monolingual information retrieval.
Methods such as latent semantic indexing (Latent Semantic Indexing, LSI) and the generalized vector space model are now common ways of completing cross-language information retrieval without translation. Cross-language latent semantic indexing (Cross-Language Latent Semantic Indexing) retrieves one language with another according to the correspondence between Chinese and English in a semantic space built from the corpus.
Query translation translates the user's question (in the source language) into a language supported by the system (the target language), then submits the target-language question to the matching module to perform monolingual information retrieval.
The content of the invention
The object of the invention is to provide a cross-language information matching method based on term extraction that improves accuracy under conditions of comparable efficiency.
The object of the present invention is achieved as follows:
Step 1: Take the sentence as the unit and basis of term extraction, and extract the data set from the text through sentence segmentation, word segmentation and stop-word filtering;
Step 2: Obtain the set of words to be filtered after Chinese word segmentation;
Step 3: Load the stop-word list, read one word at a time from the word set, and look it up in the stop-word list; if it is found, filter it out, otherwise do not filter it;
Step 4: Perform word segmentation and stop-word filtering sentence by sentence;
Step 5: If word.nature is a noun, retain the noun; if word.nature is an adjective, judge the part of speech of the next word and, if it is a noun, retain that noun; if word.nature is a verb, judge the two neighbouring words and, if one is a noun, retain that noun;
Step 6: Filter the text collection by the method of steps 3-5 to obtain the filtered text set List, and return the rule-filtered set List;
Step 7: Suppose a character string s has length L(s) > 1 and s is both the left and right word boundary of some word; this string s is then taken as a complete word;
Step 8: If s is the left word boundary of a word w, write w as w = sx, where x is an arbitrary string; if s is the right word boundary of w, write w as w = ys, where y is an arbitrary string; if s is both the left and right boundary of w, then necessarily either (1) s is identical with w, or (2) w = sxs, where x is an arbitrary string; thus whenever a string s can serve as both boundaries of a word, the current string can be regarded as one complete word;
Step 9: Obtain the candidate term set from the result of term-rule filtering;
Step 10: Compute the left and right information entropy of each word in the candidate term set, then compute the total entropy from them; add together the entropies of identical words, sort by total entropy, and retain the words whose left and right entropy satisfies H(s) > IE_min, where IE_min is a user-defined value;
Step 11: Sort the words by score; if two words have the same score, sort them further by the size of the information entropy and then by the size of the inverse document frequency IDF;
Step 12: From the sorted set, retain the words whose score exceeds the predetermined threshold, i.e. Score > Score_min, where Score_min is a manually chosen value;
Step 13: Machine-translate each term Term to obtain the translation set Term_Translate;
Step 14: If the translation set Term_Translate was produced by dictionary translation, add it to the dictionary translation set List_EnTerm; otherwise add it to the aligned Chinese-English result set Map_Result;
Step 15: Compute the Cartesian product List_Descartes of the sets in the dictionary translation set List_EnTerm;
Step 16: Traverse each tuple in the Cartesian product List_Descartes and, where co-occurrence values exist, sum them; insert unseen words into the table of out-of-vocabulary words; find the English sequence with the maximum sum Sum and store it in the set List_Max; return the final English result Map_Result;
Step 17: Build an inverted index over the English information texts with the English search engine Lucene;
Step 18: Compute text scores with the ranking built into the English search engine Lucene, and return the results ordered by score;
Step 19: As soon as the search engine receives the keyword information, it begins retrieval over the resource texts;
Step 20: According to the scores of the retrieved texts, sort the retrieved texts and deliver them to the user.
The technical problem solved by the invention is this: because of the characteristics exhibited by Chinese, existing foreign views and methods of term extraction cannot be applied directly when extracting terms from Chinese text. Methods and techniques suited to Chinese term extraction must be explored for the structural features of the Chinese language; exploring the relations between terms can also help scholars discover the domain knowledge contained in texts.
A Chinese-English alignment may be obtained from a bilingual corpus, but a word may well carry different meanings in different fields, so the translation of a Chinese word must be determined from its context. A human can largely do this, but a machine cannot complete this work alone; the method currently in use is to determine the appropriate translation of a word by statistical means.
The characteristics of the technical scheme are mainly reflected in:
(1) For the difficulty of boundary determination in current domain term extraction, the present invention takes information entropy as its basis: the entropy between words is computed on the result of term-rule filtering to determine whether a word is complete, and IDF values are used as the measure of domain relevance, giving the multi-feature-fusion term extraction method proposed in this invention.
Statistics and rules are blended: the set obtained after stop-word filtering is filtered again by word order and part of speech, and the information entropy of each word is then computed with the entropy method. Since information entropy cannot extract some low-frequency words, the texts in the domain corpus are first filtered by term rules, and the IDF value of each word in the corpus is then computed to measure its domain relevance; the two values are combined by a weighted average to obtain the most suitable terms.
Word frequency statistics extract some words that are high-frequency but are not terms, whereas information entropy and mutual information can judge whether a term is complete. Information entropy shows a better effect when extracting high-frequency terms from an article; the log-likelihood ratio works better for low-frequency words but is usually combined with other methods. The entropy method extracts the frequent words of a text well, but it does not extract strings that combine tightly inside the text yet are not terms. For these reasons the invention uses the information entropy method to compute the entropy of words and, by constraining a threshold, can judge whether a term is complete, so that term extraction is realized in combination with the term rules.
(2) For the extremely complex problem of matching patterns between Chinese and English vocabulary, the invention applies the bag-of-words notion of word co-occurrence to Chinese-English term alignment: by computing the probability that English words occur together, the most suitable translation of a Chinese term in this field is obtained. The method considers not only the direct translation of a word in a Chinese-English dictionary but also the co-occurrence of terms, so it applies well to the processing of domain texts.
Co-occurrence belongs to the applications of the bag-of-words model (a word set that allows repetition, the bag-of-words model) and is commonly used at present to improve the performance of information retrieval. In general, the vocabulary appearing in a text helps the author express its theme and can therefore serve as theme-related words, differing only in their degree of relevance to the theme. The purpose of document modeling is precisely to find the words closely related to the theme and then to represent the main features of the document with them. Like objectively existing things, words also carry field labels; words of the same field or theme therefore appear together in one document with relatively high probability, and the co-occurrence phenomenon of words can be used to judge the relevance between words and themes.
In a document collection, the appearance of a word within a fixed-size window is usually accompanied by the appearance of certain collocated words, which indicates that a certain semantic relation exists between the word and these collocates; this phenomenon of simultaneous appearance is called term co-occurrence. The fixed-size window may be a document, a natural paragraph, a sentence, or a range of k words centred on some word.
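The fixed-window co-occurrence count described above can be sketched as follows. This is an illustrative reading, not the patent's implementation; the token list, window size, and function name are assumptions made for the example.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=3):
    """Count unordered pairs of distinct words that appear within
    `window` tokens of each other (the fixed-size window of the text)."""
    counts = Counter()
    for i, w in enumerate(tokens):
        # look only ahead so each pair inside the window is counted once
        for v in tokens[i + 1 : i + 1 + window]:
            if w != v:
                counts[tuple(sorted((w, v)))] += 1
    return counts

tokens = "latent semantic indexing maps terms latent topics".split()
pairs = cooccurrence_counts(tokens, window=3)
```

Counts such as `pairs[('indexing', 'latent')]` grow when two words repeatedly fall in the same window, which is the evidence of a semantic relation that the text appeals to.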
(3) For the problem that the question language and the information-text language differ in cross-language information retrieval, the question expression of a text is built with terms as feature values; Chinese-English term alignment is then applied to the features, and a retrieval expression is built from the alignment result, so that the query language and the retrieved information are mapped to the same language by this method; finally, the matching relation is obtained from the results returned by full-text search.
The Lucene full-text index engine, which currently performs comparatively well, is used. For convenience of retrieval, an inverted index (recording which words are contained in which documents) is first built over the English information texts with Lucene; text scores are then computed with the ranking built into Lucene, and the results are returned ordered by score.
Normally, once the user keys in the query keywords, the search engine receives the keyword information and begins retrieval over the resource texts. The retrieved texts are then sorted by score and delivered to the user. Lucene measures the matching degree between a Document and a Query with the score Score.
The main innovations of the present invention are embodied in:
(1) A domain term extraction method based on multi-feature fusion is proposed. Because of the distinctive word-formation patterns of Chinese terms, the text is first rule-filtered, and term boundaries are then determined by the information entropy method. Since entropy cannot extract low-frequency words, the domain relevance of a term is measured with the IDF value of the word in the domain corpus, and the values obtained by the two parts are combined. Terms were extracted from domain texts by the method of the invention and compared with several common extraction methods; the test results show that the multi-feature domain term extraction algorithm proposed in the invention is comparatively satisfactory.
(2) A Chinese-English term alignment algorithm based on the bag-of-words model is proposed for the complex problem of association patterns between Chinese and English words. On the basis of the extracted terms, the terms are first translated; depending on the resource used for translation, different situations take different alignment approaches, and the best term alignment is obtained from the co-occurrence rate of words in the bag-of-words model.
(3) For the problems existing in cross-language information retrieval, the term alignment result is taken as the basis of this step: the English terms aligned to the Chinese are recombined to build the question expression, full-text search is carried out over the information resources, and the information text best matching the text to be matched is obtained from the retrieval result.
The cross-language information matching method of the invention extracts terms by combining statistics and rules. The statistical methods used in term extraction are first analysed; the analysis reveals the special status of information entropy in term extraction and, for the shortcoming of this method when extracting terms, proposes measuring the domain relevance of terms with IDF values; finally, the domain term extraction algorithm based on multi-feature fusion is derived according to the term rules.
Analysis and verification of the technical effect of the invention:
For term extraction, the method of the invention was compared with a rule-based method and a mutual-information-based method and evaluated by the standard indices; the results show that the proposed algorithm performs comparatively well.
The invention first extracts terms from a corpus of the Chinese natural language processing field with the multi-feature-fusion extraction method described in the previous step, then machine-translates these terms and saves all the translation results in the terminology table of the database; the probability that any two terms occur in the same article is obtained with the bag-of-words model. In this step 5,398 term words were obtained; computing over these terms with the bag-of-words model produced about 7,500,000 correspondence records, which were stored in a database table. The database was then indexed and optimized so that retrieval time is greatly reduced, providing the foundation for the term alignment of the next step.
This part extracts terms from the texts of the corpus. The common evaluation indices of term extraction, precision (P), recall (R) and the F-value (F), are used to evaluate the experimental results. Computed on the corpus, the results are as follows: P = 83.57%, R = 81%, F = 82.26%.
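The three indices can be computed as below; the extracted and gold term sets are made-up examples, not data from the patent's corpus.

```python
def prf(extracted, gold):
    """Precision, recall and F-value for a set of extracted terms
    against a gold-standard set."""
    tp = len(extracted & gold)                 # correctly extracted terms
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f

p, r, f = prf({"neural network", "parsing", "stack"},
              {"neural network", "parsing", "corpus", "token"})
```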
For the two important processes of Chinese-English correspondence and cross-language information retrieval, the invention proposes a Chinese-English correspondence algorithm based on the bag-of-words model and retrieves with the corresponding English in the corpus; the results are also fairly satisfactory.
The invention selects a large number of texts for querying, computes the recall and precision of each query and then averages them, and contrasts the method with traditional CLIR and with a method of ontology-based full-text retrieval, obtaining a higher average recall and precision:
The efficiency of the method of the invention is no worse than the effect of the other methods, and when all the information texts in the corpus relevant to the text to be matched need to be retrieved, the accuracy reached by the method of the invention is about 66.67%, the best of the compared effects, which verifies the effectiveness of the cross-language information matching technique based on term extraction and its better applicability.
Brief description of the drawings
Fig. 1 Flow chart of vocabulary filtering.
Fig. 2 Flow chart of Chinese-English cross-language information matching.
Fig. 3 Module diagram of the cross-language information matching system based on term extraction.
Embodiment
The present invention is described in more detail below.
1. Text preprocessing
Input: the text information to be analysed
Output: the word set after segmentation
Step 1: For the characteristics between words in the traditional segmentation model and in the text, the invention takes the sentence as the unit and preprocesses sentences as the basis of term extraction; in this stage the data set is extracted from the text through sentence segmentation, word segmentation and stop-word filtering.
2. Vocabulary filtering: stop-word filtering
After Chinese word segmentation is applied to the text, single character strings are obtained one by one. It can be seen that the words with the greatest influence on semantic expression in a sentence are mostly nouns and verbs, while modifiers and similar vocabulary affect sentence semantics little; the nouns and verbs that do influence sentence semantics must therefore be retained, and these are called the retained words. However, in the segmented sequences some words occur very frequently in the text yet have little real influence on text analysis. These words consist mainly of modal particles, prepositions and adverbs; they carry no definite meaning of their own and serve a purpose only as part of a sentence. Words of this kind are called stop words. If the unimportant words in the domain texts can be filtered out completely, the storage space of the system is greatly saved and the workload and computation of its middle and later stages are reduced. The design of the preprocessing part must therefore not only select a suitable segmentation algorithm and improve it appropriately, but also design the vocabulary filtering process properly. The flow chart of the vocabulary filtering step is shown in Figure 1.
Precondition: the word segmentation operation has been performed
Input: the character set to be filtered
Output: the text word set after vocabulary filtering
Step 2: The word set to be filtered is obtained after Chinese word segmentation.
Step 3: Load the stop-word text, read one word from the set, and look the word up in the stop-word text. If it is found, the word must be filtered out; otherwise it is kept.
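Step 3 amounts to a membership test against the loaded stop list. A minimal sketch, assuming a toy in-memory stop list rather than the stop-word text file the patent loads:

```python
# toy stop list; a real system would load one from a stop-word file
STOPWORDS = {"的", "了", "在", "of", "the", "and"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not found in the stop list,
    preserving their order (Steps 2-3)."""
    return [t for t in tokens if t not in stopwords]

kept = remove_stopwords(["extraction", "of", "the", "domain", "terms"])
```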
3. Domain term extraction
Statistics and rules are blended: the set obtained after stop-word filtering is filtered again by word order and part of speech, and the information entropy of each word is then computed with the entropy method. Since information entropy cannot extract some low-frequency words, the texts in the domain corpus are first filtered by term rules, and the IDF value of each word in the corpus is then computed to measure its domain relevance; the two values are combined by a weighted average to obtain the most suitable terms.
Precondition: the vocabulary filtering operation has been completed
Input: the Chinese text
Output: the filtered set List
Step 4: Perform word segmentation and stop-word filtering sentence by sentence.
Step 5: If word.nature is a noun, retain the noun; if word.nature is an adjective, judge the part of speech of the next word and, if it is a noun, retain that noun; if word.nature is a verb, judge the two neighbouring words and, if one is a noun, retain that noun.
Step 6: Return the rule-filtered set List;
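Step 5 can be read as follows. Note that since a noun is always retained by the first clause, the adjective and verb clauses only re-confirm the neighbouring noun; the POS tags ('n', 'a', 'v') and the tagged sample are assumptions made for illustration.

```python
def rule_filter(tagged):
    """Apply the POS rules of Step 5 to (word, pos) pairs: a noun is
    retained; the adjective and verb branches retain the adjacent noun,
    which the noun branch already covers, so the survivors are the nouns."""
    kept = []
    for i, (word, pos) in enumerate(tagged):
        if pos == "n":
            kept.append(word)            # noun: retain
        elif pos == "a" and i + 1 < len(tagged) and tagged[i + 1][1] == "n":
            pass                         # following noun kept by the branch above
        elif pos == "v":
            pass                         # neighbouring nouns likewise kept above
    return kept

kept = rule_filter([("跨语言", "n"), ("检索", "v"), ("高效", "a"), ("术语", "n")])
```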
4. Application of information entropy to terms
Word frequency statistics extract some words that are high-frequency but are not terms, whereas information entropy and mutual information can judge whether a term is complete. Information entropy shows a better effect when extracting high-frequency terms from an article; the log-likelihood ratio works better for low-frequency words but is usually combined with other methods. The entropy method extracts the frequent words of a text well, but it does not extract strings that combine tightly inside the text yet are not terms. For these reasons the invention uses the information entropy method to compute the entropy of words and, by constraining a threshold, can judge whether a term is complete, so that term extraction is realized in combination with the term rules.
Precondition: domain term extraction has been performed.
Input: the rule-filtered set List
Output: the terms and their corresponding information entropies Map_Entropy
Step 7: Suppose a character string s has length L(s) > 1 and s is both the left and right word boundary of some word; the string s can then be regarded as one complete word.
Step 8: If s is the left word boundary of a word w, then w can be written as w = sx, where x is an arbitrary string; if s is the right word boundary of w, then w can be written as w = ys, where y is an arbitrary string; if s is both the left and right boundary of w, then necessarily either (1) s is identical with w, or (2) w = sxs, where x is an arbitrary string.
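Steps 7-8 decide word completeness from the variety of a string's boundaries; left/right neighbour entropy is one standard way to quantify this. A sketch under the assumption of a pre-tokenised corpus; the sample sequence is invented:

```python
import math
from collections import Counter

def boundary_entropy(corpus_tokens, target):
    """Left and right neighbour entropy of `target`: the more varied the
    neighbours on a side, the more likely that side is a true word boundary."""
    left, right = Counter(), Counter()
    for i, t in enumerate(corpus_tokens):
        if t == target:
            if i > 0:
                left[corpus_tokens[i - 1]] += 1
            if i + 1 < len(corpus_tokens):
                right[corpus_tokens[i + 1]] += 1

    def entropy(c):
        n = sum(c.values())
        return -sum(v / n * math.log2(v / n) for v in c.values()) if n else 0.0

    return entropy(left), entropy(right)

h_left, h_right = boundary_entropy(["a", "X", "b", "c", "X", "d", "e", "X", "b"], "X")
```

Retaining only words with H(s) > IE_min then reduces to comparing these entropies against the user-defined threshold.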
5. Domain term extraction by multi-feature fusion
TF-IDF is a statistical method that evaluates the importance of a word to one text, or to any text in a whole corpus. The importance of a word is proportional to the number of times it occurs in the text and inversely proportional to the frequency with which it occurs across the corpus.
Precondition: the computation of information entropy has been completed.
Input: the HashMap Map_Entropy obtained by the method fusing rules and information entropy, and the HashMap Map_Idf computed on the domain corpus.
Output: the HashMap Map_Result extracted by the term extraction method proposed in the invention
Step 9: Obtain the candidate term set from the result of term-rule filtering.
Step 10: Compute the left and right information entropy of each word in the set, then compute the total entropy from them; the entropies of identical words are added together, the total entropies are sorted, and the words satisfying H(s) > IE_min are retained, where IE_min is a user-defined value.
Step 11: Sort the words by score; if two scores are equal, sort further by the size of the information entropy and then by the size of the IDF;
Step 12: From the sorted set, retain the words with Score > Score_min, where Score_min is a manually chosen value.
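Steps 10-12 combine entropy with IDF. A sketch of the IDF computation and the weighted average follows; the weight `alpha` and the tiny corpus are assumptions (the patent does not fix the weights):

```python
import math

def idf(term, documents):
    """Inverse document frequency over a list of token lists: rarer across
    the domain corpus means more domain-discriminating."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

def term_score(entropy_value, idf_value, alpha=0.5):
    """Weighted average of boundary entropy and IDF (Steps 10-12)."""
    return alpha * entropy_value + (1 - alpha) * idf_value

docs = [["术语", "抽取"], ["术语", "对齐"], ["信息", "检索"]]
score = term_score(1.2, idf("检索", docs))
```

Candidates whose `score` falls below the threshold Score_min would then be discarded.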
6. Chinese-English alignment based on the bag-of-words model
Term alignment finds the one-to-one correspondence between the Chinese words and the English words of this field from the domain texts. Formally, suppose e_a and e_b are English words; the bag-of-words model built beforehand yields the co-occurrence value of (e_a, e_b) in the English corpus, which expresses the probability that e_a and e_b occur simultaneously in the corpus.
Precondition: the vocabulary merging process has been completed
Input: the term set List corresponding to the Chinese text
Output: the aligned Chinese-English HashMap Map_Result
Step 13: Machine-translate Term to obtain Term_Translate;
Step 14: If Term_Translate was produced by dictionary translation, add it to List_EnTerm; otherwise add it to Map_Result.
Step 15: Compute the Cartesian product List_Descartes of the sets in List_EnTerm.
Step 16: Traverse each tuple in the List_Descartes set and, where co-occurrence values exist, accumulate Sum = Sum + Value; insert unseen words into the table of out-of-vocabulary words; find the English sequence with the maximum Sum and store it in the set List_Max; return Map_Result;
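Steps 15-16 can be read as: enumerate the Cartesian product of per-term translation candidates and keep the combination whose pairwise co-occurrence sum is largest. The candidate lists and the co-occurrence table below are invented for illustration:

```python
from itertools import combinations, product

def best_alignment(candidate_translations, cooccur):
    """Over the Cartesian product of translation candidates (List_Descartes),
    return the combination maximising the summed pairwise co-occurrence Sum."""
    best, best_sum = None, -1.0
    for combo in product(*candidate_translations):
        s = sum(cooccur.get(tuple(sorted(pair)), 0)
                for pair in combinations(combo, 2))
        if s > best_sum:
            best, best_sum = combo, s
    return best, best_sum

candidates = [["bag", "sack"], ["model", "pattern"]]
cooccur = {("bag", "model"): 7, ("model", "sack"): 1, ("bag", "pattern"): 2}
combo, total = best_alignment(candidates, cooccur)
```

The winning combination plays the role of the sequence stored in List_Max.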
7. Cross-language information retrieval
At retrieval time, the English translation results obtained from the Chinese are read first; the English is combined accordingly to construct the retrieval expression, which is then handed to the search engine to perform retrieval and ranking, and the best-matching text is obtained. The flow chart of this step is shown in Figure 2.
Precondition: the bag-of-words-based Chinese-English alignment has been completed
Input: the Map_Result of the Chinese-English alignment
Output: ranking results
Step 17: Build an inverted index over the English information texts with Lucene (the inverted index records which
words are contained in which documents);
Step 18: Compute the text scores with Lucene's built-in ranking, and return the results sorted by
score.
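Step 17 uses Lucene, a Java library, to build the index; the sketch below only illustrates the parenthetical idea, which documents contain which words, in plain Python. The whitespace tokenisation, the document texts, and the conjunctive `search` helper are assumptions for the example and are not Lucene's actual implementation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids containing it (Step 17's idea)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query_terms):
    """Return the doc ids matching all query terms, i.e. a minimal
    conjunctive retrieval expression."""
    postings = [index.get(t, set()) for t in query_terms]
    return set.intersection(*postings) if postings else set()

docs = [
    "cross language information retrieval",
    "term extraction from domain text",
    "cross language term alignment",
]
index = build_inverted_index(docs)
hits = search(index, ["cross", "language"])
```

A real Lucene index additionally stores term frequencies and positions so that Step 18's scoring can rank the matching documents rather than merely find them.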
8. Ranking of retrieval results
Precondition: the cross-language information retrieval has been completed
Input: the cross-language retrieval results for the texts
Output: the ranking of the retrieved texts
Step 19: As soon as the search engine receives the keyword information, it starts retrieving over the resource texts.
Step 20: According to the scores of the retrieved texts, the retrieved texts are ranked and then returned to the user.
The overall technical route is shown in Figure 3.
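Steps 19–20 amount to sorting the retrieved texts by their scores before returning them to the user. A minimal sketch, with made-up document names and scores:

```python
def rank_results(scored_docs):
    """Step 20: order the retrieved texts by score, highest first."""
    return sorted(scored_docs, key=lambda d: d["score"], reverse=True)

# Illustrative retrieval output (scores as a search engine might assign them):
results = rank_results([
    {"doc": "doc_b", "score": 1.2},
    {"doc": "doc_a", "score": 3.4},
    {"doc": "doc_c", "score": 0.7},
])
```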
Claims (1)
1. A cross-language information matching process based on term extraction, characterized in that:
Step 1: Taking the sentence as the basic unit of term extraction, extract the data set from the text through sentence
splitting, word segmentation and stop-word filtering;
Step 2: Obtain the set of words to be filtered after Chinese word segmentation;
Step 3: Load the stop-word text; read one word from the word set and look it up in the stop-word text; if it is
found, filter the word out, otherwise keep it;
Step 4: Perform word segmentation and stop-word filtering sentence by sentence;
Step 5: If word.nature is a noun, retain the noun; if word.nature is an adjective, examine the part of speech of the
next word and, if the next word is a noun, retain that noun; if word.nature is a verb, examine the two neighbouring
words and, if either is a noun, retain that noun;
Step 6: Filter the text collection by the method of steps 3-5 to obtain the filtered text set List, and return the
rule-filtered set List;
Step 7: Assume that the string s has length L(s) > 1 and that s is both the left and the right word boundary of some
word; treat s as a complete word;
Step 8: If s is the left word boundary of a word w, write w as w = sx, where x is an arbitrary string; if s is the right
word boundary of w, write w as w = ys, where y is an arbitrary string; if s is both the left and the right word
boundary of w, then necessarily either (1) s is identical to w, or (2) w = sxs, where x is an arbitrary string;
Step 9: Obtain the candidate term set from the result of the rule-based term filtering;
Step 10: Compute the left and right information entropies of the words in the candidate term set and, from them,
the total information entropy; sum the entropies of identical words, sort the set by total information entropy, and
retain the words satisfying H(s) > IE_min, where IE_min is a user-defined value;
Step 11: Sort the words by score; if two words have the same score, break the tie first by the size of the information
entropy and then by the size of the inverse document frequency (IDF);
Step 12: From the sorted set, retain the words whose score exceeds the predetermined threshold, i.e. Score > Score_min,
where Score_min is a manually chosen value;
Step 13: Machine-translate each term Term to obtain the term translation set Term_Translate;
Step 14: If the term translation set Term_Translate is a translation produced from the dictionary, add it to the
dictionary-translation set Term_entrem; otherwise add it to the aligned English set Map_Result;
Step 15: Compute the Cartesian product List_Descartes of the sets in the dictionary-translation set List_EnTerm;
Step 16: Traverse each set in the Cartesian product List_Descartes; whenever a co-occurrence value exists,
accumulate Sum = Sum + value, and otherwise insert the word into the data table of unknown words; find the
English word sequence with the largest Sum and store it in the set List_Max; return the final English result Map_Result;
Step 17: Build an inverted index over the English information texts with the English search engine Lucene;
Step 18: Compute the text scores with the built-in ranking of the English search engine Lucene, and return the
ranking results sorted by score;
Step 19: As soon as the search engine receives the keyword information, it starts retrieving over the resource texts;
Step 20: According to the scores of the retrieved texts, rank the retrieved texts and then return them to the user.
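The part-of-speech retention rules of Step 5 above can be sketched as follows. The tag strings "n", "a" and "v" and the (word, tag) tuple representation are assumptions made for the example; the claim's word.nature attribute does not specify a tagger or tag set.

```python
def filter_by_pos(tagged_words):
    """Step 5 retention rules (sketch): keep nouns; keep the noun that
    follows an adjective; keep nouns adjacent to a verb.
    Assumed tags: 'n' = noun, 'a' = adjective, 'v' = verb."""
    kept = set()
    for i, (word, tag) in enumerate(tagged_words):
        if tag == "n":
            kept.add(word)                                   # nouns are kept
        elif tag == "a":
            if i + 1 < len(tagged_words) and tagged_words[i + 1][1] == "n":
                kept.add(tagged_words[i + 1][0])             # noun after adjective
        elif tag == "v":
            for j in (i - 1, i + 1):                         # both neighbours
                if 0 <= j < len(tagged_words) and tagged_words[j][1] == "n":
                    kept.add(tagged_words[j][0])             # noun next to verb
    return kept

# "快速(a) 算法(n) 提取(v) 术语(n)" -> the two nouns survive the filter.
kept = filter_by_pos([("快速", "a"), ("算法", "n"), ("提取", "v"), ("术语", "n")])
```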
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711101619.1A CN107908712A (en) | 2017-11-10 | 2017-11-10 | Cross-language information matching process based on term extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107908712A true CN107908712A (en) | 2018-04-13 |
Family
ID=61843958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711101619.1A Pending CN107908712A (en) | 2017-11-10 | 2017-11-10 | Cross-language information matching process based on term extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107908712A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5301109A (en) * | 1990-06-11 | 1994-04-05 | Bell Communications Research, Inc. | Computerized cross-language document retrieval using latent semantic indexing |
Non-Patent Citations (1)
Title |
---|
SUN Suyan: "Research on Cross-Language Information Matching Technology Based on Term Extraction", Wanfang Data * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299480A (en) * | 2018-09-04 | 2019-02-01 | 上海传神翻译服务有限公司 | Terminology Translation method and device based on context of co-text |
CN109299480B (en) * | 2018-09-04 | 2023-11-07 | 上海传神翻译服务有限公司 | Context-based term translation method and device |
CN111222328A (en) * | 2018-11-26 | 2020-06-02 | 百度在线网络技术(北京)有限公司 | Label extraction method and device and electronic equipment |
CN111222328B (en) * | 2018-11-26 | 2023-06-16 | 百度在线网络技术(北京)有限公司 | Label extraction method and device and electronic equipment |
CN109918658A (en) * | 2019-02-28 | 2019-06-21 | 云孚科技(北京)有限公司 | A kind of method and system obtaining target vocabulary from text |
CN110472047B (en) * | 2019-07-15 | 2022-12-13 | 昆明理工大学 | Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method |
CN110472047A (en) * | 2019-07-15 | 2019-11-19 | 昆明理工大学 | A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method |
CN110795541A (en) * | 2019-08-23 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Text query method and device, electronic equipment and computer readable storage medium |
CN110795541B (en) * | 2019-08-23 | 2023-05-26 | 腾讯科技(深圳)有限公司 | Text query method, text query device, electronic equipment and computer readable storage medium |
CN110928989A (en) * | 2019-11-01 | 2020-03-27 | 暨南大学 | Language model-based annual newspaper corpus construction method |
CN112992303A (en) * | 2019-12-15 | 2021-06-18 | 苏州市爱生生物技术有限公司 | Human phenotype standard expression extraction method |
CN111797621A (en) * | 2020-06-04 | 2020-10-20 | 语联网(武汉)信息技术有限公司 | Method and system for replacing terms |
CN111797621B (en) * | 2020-06-04 | 2024-05-14 | 语联网(武汉)信息技术有限公司 | Term replacement method and system |
CN112507060A (en) * | 2020-12-14 | 2021-03-16 | 福建正孚软件有限公司 | Domain corpus construction method and system |
CN112464634A (en) * | 2020-12-23 | 2021-03-09 | 中译语通科技股份有限公司 | Cross-language entity automatic alignment method and system based on mutual information entropy |
CN112464634B (en) * | 2020-12-23 | 2023-09-05 | 中译语通科技股份有限公司 | Cross-language entity automatic alignment method and system based on mutual information entropy |
CN115204190A (en) * | 2022-09-13 | 2022-10-18 | 中科聚信信息技术(北京)有限公司 | Device and method for converting financial field terms into English |
CN115204190B (en) * | 2022-09-13 | 2022-11-22 | 中科聚信信息技术(北京)有限公司 | Device and method for converting financial field terms into English |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107908712A (en) | Cross-language information matching process based on term extraction | |
KR20160060253A (en) | Natural Language Question-Answering System and method | |
Baeza Yates et al. | Cassa: A context-aware synonym simplification algorithm | |
Afzal et al. | Semantically enhanced concept search of the Holy Quran: Qur’anic English WordNet | |
Lahbari et al. | Arabic question classification using machine learning approaches | |
Dorji et al. | Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary | |
Hättasch et al. | It's ai match: A two-step approach for schema matching using embeddings | |
Biltawi et al. | Arabic question answering systems: gap analysis | |
Gupta | Hybrid algorithm for multilingual summarization of Hindi and Punjabi documents | |
Lahbari et al. | Toward a new arabic question answering system. | |
Basirat et al. | Lexical and morpho-syntactic features in word embeddings-a case study of nouns in swedish | |
Kovář et al. | Finding definitions in large corpora with Sketch Engine | |
Mollaei et al. | Question classification in Persian language based on conditional random fields | |
Ventura et al. | Mining concepts from texts | |
Badaro et al. | A link prediction approach for accurately mapping a large-scale Arabic lexical resource to English WordNet | |
Muhammad et al. | A comparison between conditional random field and structured support vector machine for Arabic named entity recognition | |
Atlam et al. | A new approach for Arabic text classification using Arabic field‐association terms | |
Randhawa et al. | Study of spell checking techniques and available spell checkers in regional languages: a survey | |
CN111209737B (en) | Method for screening out noise document and computer readable storage medium | |
Qu et al. | Cross-language information extraction and auto evaluation for OOV term translations | |
Kamal et al. | Enhancing arabic question answering system | |
CN106681982B (en) | English novel abstraction generating method | |
Liu et al. | Cross-Language Information Matching Technology Based on Term Extraction | |
Qu et al. | Information extraction for thai celebrities from free text | |
Pishartoy et al. | Extending capabilities of English to Marathi machine translator |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180413 |
|