CN107908712A - Cross-language information matching process based on term extraction


Info

Publication number
CN107908712A
Authority
CN
China
Prior art keywords
word
term
text
English
Chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711101619.1A
Other languages
Chinese (zh)
Inventor
刘刚 (Liu Gang)
胡昱临 (Hu Yulin)
孙素艳 (Sun Suyan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201711101619.1A
Publication of CN107908712A
Current legal status: Pending


Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F16/00 Information retrieval; Database structures therefor; File system structures therefor > G06F16/30 of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures > G06F16/316 Indexing structures > G06F16/319 Inverted lists
    • G06F16/33 Querying > G06F16/3331 Query processing > G06F16/334 Query execution
    • G06F16/3331 Query processing > G06F16/3332 Query translation > G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/3332 Query translation > G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G06F16/33 Querying > G06F16/338 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a cross-language information matching method based on term extraction. A Chinese text is preprocessed with natural language processing techniques and each sentence is part-of-speech tagged; candidate words are filtered with the word-formation rules of terms, and word boundaries are determined by information entropy; the domain relevance of a term is measured by the IDF value of the word in a domain corpus, the two scores are combined by weighting, and a threshold is finally set so that candidate terms are accepted or rejected according to their scores. On the basis of the extracted domain terms, Chinese and English terms are aligned to obtain the in-domain translation of each term. Finally, a retrieval expression is built from the Chinese-English term alignment result, establishing the link between Chinese and English; the retrieval expression is used for full-text retrieval over the English texts, and the best-matching English text is determined from the retrieval results, so that cross-language matching is achieved by means of domain term alignment.

Description

Cross-language information matching process based on term extraction
Technical field
The present invention relates to a cross-language information matching method.
Background technology
1. Term extraction
A term is the designation of a specific concept in a specific field. The view of domain terms now generally accepted by scholars is that a domain term (Domain Term) is an appellation used to denote a specific concept within a specific subject area, and the domain terms of a field form a set. Domestic researchers generally agree that domain terms are nouns or scientific and technical terms, but the "noun" mentioned here is distinct from the noun of linguistics.
There is a 1:1 relation between terms and concepts. Each domain term may represent only one specific concept; conversely, a specific concept has only one unique domain term denoting it. This reflects two key properties of domain terms, namely monosemy and mononymy. If a term does not satisfy monosemy and mononymy, ambiguity and polysemy easily arise within one subject or across several subjects.
Automatic term extraction methods fall into three broad categories: rule-based methods, statistics-based methods, and methods combining rules with statistics. Since rule-based and statistics-based methods each have their own strengths and weaknesses, the two can be combined to complement each other, and most recent research combines statistical and linguistic methods.
2. Cross-language matching
In a Chinese-English cross-language information retrieval system, the user poses a query in Chinese (or English) to retrieve related English (or Chinese) documents. There are four common matching approaches in cross-language information retrieval: cognate matching (homologous matching), document translation, interlingua (intermediate language) techniques and query translation.
Cognate matching infers the meaning of a term in one language from the orthographic or phonetic similarity of words in the two languages, without any translation. For example, a French word can be treated as an English word with a spelling variation, so that information retrieval is performed directly on the English documents without translating between the two languages.
In contrast to query translation, document translation first converts the multilingual document collection into the language of the query and then performs monolingual information retrieval.
Latent semantic indexing (Latent Semantic Indexing, LSI) and the generalized vector space model are existing methods that accomplish cross-language information retrieval without translation. Cross-language latent semantic indexing (Cross-Language Latent Semantic Indexing) retrieves documents in one language with a query in another by exploiting the correspondence between Chinese and English in a semantic space built from the corpus.
Query translation translates the user's query (in the source language) into a language supported by the system (the target language), submits the target-language query to the matching module, and then performs monolingual information retrieval.
The content of the invention
It is an object of the present invention to provide a cross-language information matching method based on term extraction that improves accuracy while keeping comparable efficiency.
The object of the present invention is achieved as follows:
Step 1: Take the sentence as the basic unit of term extraction and extract the data set from the text through sentence segmentation, word segmentation and stop-word filtering;
Step 2: Obtain the word set to be filtered after Chinese word segmentation;
Step 3: Load the stop-word list, read one word from the word set and look it up in the stop-word list; if it is found, filter the word out, otherwise keep it;
Step 4: Perform word segmentation and stop-word filtering sentence by sentence;
Step 5: If word.nature is a noun, retain the noun; if word.nature is an adjective, examine the part of speech of the next word and, if the next word is a noun, retain that noun; if word.nature is a verb, examine the two adjacent words and, if one of them is a noun, retain that noun;
Step 6: Filter the text collection by the method of steps 3-5 to obtain the filtered text set List, and return the set List after rule-based filtering;
Step 7: If the length of a character string s satisfies L(s) > 1 and s is both the left and the right word boundary of some word, this character string s is regarded as a complete word;
Step 8: If s is the left word boundary of the corresponding word w, write w as w = sx, where x is an arbitrary string; if s is the right word boundary of the corresponding word w, write w as w = ys, where y is an arbitrary string; if s is simultaneously the left and right word boundary of w, then necessarily either (1) s is identical to w, or (2) w = sxs, where x is an arbitrary string; thus, if a character string s can serve as both the left and the right boundary of a word, the current character string can be regarded as a complete word.
Step 9: Obtain the candidate term set from the result of term-rule filtering;
Step 10: Calculate the left and right information entropy of each word in the candidate term set, compute the total information entropy from the left and right entropies, add up the entropies of identical words, sort by total information entropy, and retain the words whose left and right information entropy satisfies H(s) > IE_min, where IE_min is a user-defined threshold;
Step 11: Sort the words by score; if two words have the same score, sort the set by the information entropy and then by the inverse document frequency IDF;
Step 12: From the sorted set, retain the words whose score exceeds the predetermined threshold, that is Score > Score_min, where Score_min is a manually chosen value;
Step 13: Apply machine translation to each term Term to obtain the term translation set Term_Translate;
Step 14: If the translation in Term_Translate was produced from the dictionary, add it to the dictionary translation set List_EnTerm; otherwise put it into the aligned Chinese-English result set Map_Result;
Step 15: Compute the Cartesian product List_Descartes of the sets in the dictionary translation set List_EnTerm;
Step 16: Traverse each tuple in the Cartesian product List_Descartes and, for every pair that co-occurs, accumulate Sum = Sum + Value; insert words that are not found into the data table of unknown (out-of-vocabulary) words; find the tuple with the maximum Sum and store the corresponding English sequence in the set List_Max; return the final English result Map_Result;
Step 17: Build an inverted index over the English information texts with the English search engine Lucene;
Step 18: Compute the text scores with Lucene's built-in ranking and return the results ranked by score;
Step 19: As soon as the search engine receives the keyword information, it starts retrieval over the resource texts;
Step 20: Rank the retrieved texts according to their retrieval scores and return them to the user.
The technical problem solved by the present invention is the following. Because of the characteristics of the Chinese language, the term extraction views and methods that already exist abroad cannot be applied directly when extracting terms from Chinese texts; methods and techniques suited to the structural features of Chinese must be explored for Chinese term extraction. Studying the relations between terms also helps scholars discover the domain knowledge contained in the texts.
The Chinese-English alignment can be obtained from a bilingual corpus, but a word may well take different meanings in different fields, so its translation has to be determined from the context. A human can largely do this, but a machine cannot complete this part of the work on its own; the commonly used approach is therefore to determine the appropriate translation of a word by statistical methods.
The characteristics of the technical scheme of the present invention are mainly reflected in the following:
(1) For the difficulty of determining term boundaries in existing domain term extraction, the present invention takes information entropy as the basis: the information entropy between words is computed on the result of term-rule filtering to decide whether a word is complete, and the IDF value is used as the measure of domain relevance, which yields the multi-feature-fusion term extraction method proposed in the present invention.
Statistics and rules are combined: the set obtained after stop-word filtering is further filtered by word order and part of speech, and the information entropy of each word is then computed with the entropy method. Since information entropy cannot extract some low-frequency words, term-rule filtering is first applied to the texts in the domain corpus, the IDF values of the words in the corpus are then computed to measure their domain relevance, and the two values are combined by a weighted average to obtain the most suitable terms.
Word-frequency statistics may extract words that are high-frequency but are not terms, whereas information entropy and mutual information can judge whether a term is complete. Information entropy shows a better effect when extracting high-frequency terms from an article; the log-likelihood ratio works better for low-frequency words but needs to be used in combination with other methods. The entropy method extracts the frequent words in a text very well, and it does not extract combinations that are tightly bound in the text but are not terms. For these reasons, the present invention uses the information entropy method to compute the entropy of words and, through a threshold, judges whether a term is complete, so that term extraction is realized in combination with term rules.
(2) For the extremely complex problem of matching patterns between Chinese and English vocabulary, the present invention applies the concept of word co-occurrence in the bag-of-words model to Chinese-English term alignment: by computing the probability that English words occur together, the most suitable in-domain translation of a Chinese term is obtained. This method considers not only the direct dictionary translation of a word between Chinese and English but also the co-occurrence of terms, so it is better suited to processing domain texts.
Co-occurrence is one application of the bag-of-words model (a set-of-words model that allows repetition) and is currently often used to improve the performance of information retrieval. In general, the vocabulary appearing in a text helps the author express its theme and can therefore serve as theme-related words, differing only in their degree of relevance to the theme. The purpose of document modeling is to find the vocabulary closely related to the theme and then use it to represent the main features of the document. Just like objectively existing entities, words also carry field labels. Words from the same field or theme therefore have a relatively high probability of appearing together in the same document, and the co-occurrence of words can be used to judge the degree of relevance between words and a theme.
In the document collection space, the appearance of a word within a window of fixed size is usually accompanied by the appearance of certain collocated words, which indicates that a semantic relation exists between the word and these collocations; this phenomenon of words appearing together is called word co-occurrence. The fixed-size window may be a document, a natural paragraph, a sentence, or a range of k words centered on a given word.
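As an illustration of the window-based co-occurrence just described, the following Python sketch counts how often two words appear within a window of k words of each other; the function name, the window size and the toy corpus are illustrative assumptions and not code from the patent.

from collections import Counter

def cooccurrence_counts(docs, k=5):
    """Count how often two words occur within a window of k words of each other.
    docs is a list of already tokenized documents (lists of words)."""
    counts = Counter()
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + k]:   # the k words to the right of position i
                if w != v:
                    counts[tuple(sorted((w, v)))] += 1
    return counts

# Toy usage: two short English "documents" standing in for a domain corpus.
docs = [
    "latent semantic indexing maps terms into a semantic space".split(),
    "cross language retrieval uses latent semantic indexing".split(),
]
print(cooccurrence_counts(docs, k=3)[("indexing", "semantic")])   # prints 2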
(3) For the problem in cross-language information retrieval that the language of the query differs from the language of the information texts, the query of a text is built with terms as feature values, Chinese-English term alignment is applied to these feature values, and the retrieval expression is constructed from the alignment result. In this way the query language and the information to be retrieved are mapped to the same language, and the matching relationship is finally obtained from the results returned by full-text retrieval.
The Lucene full-text indexing engine, which currently performs well, is used. For convenience of retrieval, Lucene first builds an inverted index over the English information texts (recording which words are contained in which documents), then computes the text scores with Lucene's built-in ranking and returns the results ranked by score.
Normally, when the user types in the keywords to be queried, the search engine starts retrieving over the resource texts as soon as it receives the keyword information. The retrieved texts are then ranked by their retrieval scores and sent to the user. Lucene measures the matching degree between a Document and a Query with the score Score.
The main innovations of the present invention are embodied in the following:
(1) A domain term extraction method based on multi-feature fusion is proposed. Because of the unique word-formation patterns of Chinese terms, rule-based filtering is first applied to the text, and the information entropy method is then used to determine the term boundaries. Since information entropy cannot extract low-frequency words, the IDF value of a word in the domain corpus is used to measure its domain relevance, and the results of the two parts are processed together. Terms are extracted from the domain texts with the method of the present invention and compared with several common term extraction methods. The experimental results show that the multi-feature domain term extraction algorithm proposed in the present invention is comparatively satisfactory.
(2) A Chinese-English term alignment algorithm based on the bag-of-words model is proposed. For the complex problem of correspondence patterns between Chinese and English words, the present invention proposes a Chinese-English term alignment algorithm. On the basis of the extracted terms, the terms are first translated; then, depending on the resources used for translation, different cases are aligned in different ways, and the best Chinese term alignment is obtained from the co-occurrence rate of words in the bag-of-words model.
(3) For the problems existing in cross-language information retrieval, the term alignment result is taken as the basis of this step: the aligned English terms are recombined to build the query, full-text retrieval is carried out over the information resources, and the information text that best matches the text to be matched is obtained from the retrieval results.
The cross-language information matching method based on term extraction of the present invention extracts terms by combining statistics and rules. The statistical methods used in term extraction are first analysed; based on the analysis, the unique position of information entropy in term extraction is identified and, given the shortcomings of this method when extracting terms, the IDF value is proposed to measure the domain relevance of terms. Finally, the multi-feature-fusion domain term extraction algorithm is proposed on the basis of term rules.
Analysis and verification of the technical effect of the present invention:
For term extraction, the method of the present invention is compared with the rule-based method and the mutual-information-based method and evaluated with the usual evaluation metrics; the results show that the algorithm proposed by the present invention performs comparatively well.
The present invention first extracts terms from a corpus in the field of Chinese natural language processing with the multi-feature-fusion term extraction method described above, then machine-translates these terms and saves the translation results in the terminology table of the database, and obtains from the bag-of-words model the probability that any two terms occur in the same article. In this step 5,398 term words were obtained; computing the bag-of-words statistics over these terms produced about 7,500,000 corresponding records, which were stored in a database table. The database was then indexed and optimized, which greatly reduces retrieval time and provides the foundation for the term alignment of the next step.
This part requires term extraction over the texts in the corpus; the experimental results are evaluated with the common term extraction metrics of precision (P), recall (R) and F-measure (F). The results computed on the corpus are: P = 83.57%, R = 81%, F = 82.26%.
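For reference, P, R and F above follow the standard definitions (the patent does not restate them); writing TP, FP and FN for the numbers of correctly extracted, spuriously extracted and missed terms, the reported F value is consistent with the harmonic mean of P and R:

P = TP/(TP + FP), R = TP/(TP + FN), F = 2PR/(P + R) = 2 × 0.8357 × 0.81/(0.8357 + 0.81) ≈ 0.8226, i.e. 82.26%.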
For the two important processes of Chinese-English correspondence and cross-language information retrieval, the present invention proposes a Chinese-English correspondence algorithm based on the bag-of-words model and retrieves with the corresponding English terms over the corpus; the results are also comparatively satisfactory.
The present invention selects a large number of texts to query, computes the recall and precision of each query and then averages them, and compares the method with traditional CLIR and with the ontology-based full-text retrieval method; by this comparison the method obtains a higher average recall and precision.
The efficiency of the method of the present invention is not much worse than that of the other methods, and when all the information texts in the corpus that are relevant to the text to be matched need to be retrieved, the accuracy reached by the method of the present invention is about 66.67%, which is the best result; this verifies the validity of the cross-language information matching technique based on term extraction and its wider applicability.
Brief description of the drawings
Fig. 1 Flow chart of vocabulary filtering.
Fig. 2 Flow chart of Chinese-English cross-language information matching.
Fig. 3 Module diagram of the cross-language information matching system based on term extraction.
Embodiment
The present invention is described in more detail below.
1. Text preprocessing
Input: the text to be analysed
Output: the word set after segmentation
Step 1: Considering the characteristics of traditional word segmentation models and of the relations between words in a text, the present invention takes the sentence as the unit and preprocesses each sentence as the basis of term extraction. In this stage the data set is extracted from the text through sentence segmentation, word segmentation and stop-word filtering.
2. Vocabulary filtering and stop-word filtering
After Chinese word segmentation has been applied to the text, individual character strings are obtained. Within a sentence it is mostly the nouns and verbs that strongly affect the expression of meaning, while modifiers and similar words have little semantic effect, so the nouns and verbs that influence the sentence semantics need to be retained; they are called the retained words. However, in the segmented sequences some words occur very frequently in the text yet have little influence on the analysis of the text. These words are mainly modal particles, prepositions and adverbs; they have no clear meaning of their own and become useful only as parts of a sentence. Such words are called stop words. If the unimportant words in the domain texts can be filtered out completely, the storage space of the system is greatly reduced, as are the workload and the amount of computation in the middle and later stages of the system. Therefore the design of the preprocessing part must not only select suitable word segmentation and tagging algorithms and improve them appropriately, but also design the vocabulary filtering process properly. The flow chart of the vocabulary filtering step is shown in Fig. 1.
Precondition: the word segmentation operation has been performed
Input: the character set to be filtered
Output: the text word set after vocabulary filtering
Step 2: Obtain the word set to be filtered after Chinese word segmentation.
Step 3: Load the stop-word list, read one word from the set and look it up in the stop-word list. If it is found, filter the word out; otherwise keep it.
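A minimal Python sketch of steps 1-3 (sentence segmentation, word segmentation and stop-word filtering) is given below; the sentence-splitting regular expression, the stop-word list and the externally supplied segmenter are illustrative assumptions, and a real system would plug in a Chinese word segmenter where the toy whitespace splitter is used here.

import re

def preprocess(text, segment, stop_words):
    """Steps 1-3: split the text into sentences, segment each sentence into words,
    and drop every word that appears in the stop-word list."""
    sentences = [s for s in re.split(r"[。！？!?.]", text) if s.strip()]
    return [[w for w in segment(s) if w not in stop_words] for s in sentences]

# Toy usage with a whitespace "segmenter"; a real Chinese segmenter would be supplied instead.
stop_words = {"的", "了", "是", "the", "a", "of"}
print(preprocess("术语 抽取 的 方法。 Term extraction of a domain.", str.split, stop_words))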
3. Domain term extraction
Statistics and rules are combined: the set obtained after stop-word filtering is further filtered by word order and part of speech, and the information entropy of each word is then computed with the entropy method. Since information entropy cannot extract some low-frequency words, term-rule filtering is first applied to the texts in the domain corpus, the IDF values of the words in the corpus are then computed to measure their domain relevance, and the two values are combined by a weighted average to obtain the most suitable terms.
Precondition: the vocabulary filtering operation has been completed
Input: the Chinese text text
Output: the filtered set List
Step 4: Perform word segmentation and stop-word filtering sentence by sentence.
Step 5: If word.nature is a noun, retain the noun; if word.nature is an adjective, examine the part of speech of the next word and, if the next word is a noun, retain that noun. If word.nature is a verb, examine the two adjacent words and, if one of them is a noun, retain that noun.
Step 6: Return the set List after rule-based filtering.
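The part-of-speech rules of steps 4-6 can be sketched as the scan below over (word, nature) pairs such as a POS tagger would produce; the tag strings 'n', 'a' and 'v' for noun, adjective and verb and the function name are assumptions for illustration.

def rule_filter(tagged_sentence):
    """Steps 4-6: keep nouns, keep a noun that follows an adjective,
    and keep a noun adjacent to a verb; tagged_sentence is a list of (word, nature) pairs."""
    kept = []
    for i, (word, nature) in enumerate(tagged_sentence):
        if nature == "n":
            kept.append(word)
        elif nature == "a" and i + 1 < len(tagged_sentence) and tagged_sentence[i + 1][1] == "n":
            kept.append(tagged_sentence[i + 1][0])
        elif nature == "v":
            for j in (i - 1, i + 1):
                if 0 <= j < len(tagged_sentence) and tagged_sentence[j][1] == "n":
                    kept.append(tagged_sentence[j][0])
    return list(dict.fromkeys(kept))   # drop duplicates while keeping order

# Toy usage: "提取 领域 术语" tagged as verb, noun, noun.
print(rule_filter([("提取", "v"), ("领域", "n"), ("术语", "n")]))   # ['领域', '术语']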
4. Application of information entropy to terms
Word-frequency statistics may extract words that are high-frequency but are not terms, whereas information entropy and mutual information can judge whether a term is complete. Information entropy shows a better effect when extracting high-frequency terms from an article; the log-likelihood ratio works better for low-frequency words but needs to be used in combination with other methods. The entropy method extracts the frequent words in a text very well, and it does not extract combinations that are tightly bound in the text but are not terms. For these reasons, the present invention uses the information entropy method to compute the entropy of words and, through a threshold, judges whether a term is complete, so that term extraction is realized in combination with term rules.
Precondition: the domain term extraction (rule filtering) has been completed.
Input: the set List after rule-based filtering
Output: the terms and their corresponding information entropies Map_Entropy
Step 7: If the length of a character string s satisfies L(s) > 1 and s is both the left and the right word boundary of some word, this character string s can be regarded as a complete word.
Step 8: If s is the left word boundary of the corresponding word w, w can be written as w = sx, where x is an arbitrary string; if s is the right word boundary of the corresponding word w, w can be written as w = ys, where y is an arbitrary string; if s is simultaneously the left and right word boundary of w, then necessarily either (1) s is identical to w, or (2) w = sxs, where x is an arbitrary string.
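Steps 7-8 and the entropy of step 10 can be made concrete with the sketch below: for a candidate string s, the entropy of the characters observed immediately to its left (right) in the corpus is taken as its left (right) information entropy, and a candidate whose left and right entropies both exceed IE_min has free boundaries and is treated as a complete word. The function names and the toy corpus are illustrative assumptions.

import math
from collections import Counter

def boundary_entropy(corpus, candidate):
    """Return (left entropy, right entropy) of candidate over the neighbouring
    characters observed in corpus (one long string)."""
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(candidate)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(candidate, start + 1)
    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total, 2) for c in counter.values()) if total else 0.0
    return entropy(left), entropy(right)

corpus = "信息熵可以确定词边界，信息熵越大词边界越自由，利用信息熵抽取术语。"
print(boundary_entropy(corpus, "信息熵"))   # left entropy 1.0, right entropy about 1.58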
5. Domain term extraction based on multi-feature fusion
TF-IDF is a statistical method used to evaluate how important a word is to one text, or to any text in the whole corpus. The importance of a word is proportional to the number of times it appears in the text and inversely proportional to the frequency with which it appears in the corpus.
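A minimal sketch of the IDF value used here as the domain-relevance measure follows; the formula is the standard inverse document frequency, assumed because the patent does not spell out its exact variant.

import math

def idf(word, docs):
    """idf = log(N / df), where N is the number of documents in the domain corpus
    and df is the number of documents that contain the word."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df) if df else 0.0

# Toy usage over three tiny "documents".
docs = [["术语", "抽取"], ["术语", "对齐"], ["信息", "检索"]]
print(idf("术语", docs), idf("信息", docs))   # about 0.405 and 1.099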
Precondition: the information entropy calculation has been completed.
Input: the HashMap Map_Entropy obtained by fusing rules and information entropy, and the HashMap Map_Idf computed from the domain corpus.
Output: the HashMap Map_Result produced by the term extraction method proposed in the present invention.
Step 9: Obtain the candidate term set from the result of term-rule filtering.
Step 10: Calculate the left and right information entropy of each word in the set, compute the total information entropy from the left and right entropies, add up the entropies of identical words, sort by total information entropy, and retain the words satisfying H(s) > IE_min, where IE_min is a user-defined value.
Step 11: Sort the words by score; if two words have the same score, sort the set by the information entropy and then by the IDF value.
Step 12: From the sorted set, retain the words with Score > Score_min, where Score_min is a manually chosen value.
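Steps 10-12 combine the information entropy and the IDF value into one score; the sketch below uses a weighted average with weight alpha and then applies the two thresholds. The weight, the threshold values and the toy inputs are assumptions, since the patent only states that the two values are averaged with weights.

def fuse_scores(entropy_map, idf_map, ie_min=1.0, score_min=0.8, alpha=0.5):
    """Keep candidates whose left and right entropy both exceed ie_min, score them as
    alpha * total entropy + (1 - alpha) * IDF, keep those with score > score_min,
    and sort by score, then entropy, then IDF."""
    results = []
    for term, (h_left, h_right) in entropy_map.items():
        if min(h_left, h_right) <= ie_min:
            continue
        h = h_left + h_right                              # total information entropy
        score = alpha * h + (1 - alpha) * idf_map.get(term, 0.0)
        if score > score_min:
            results.append((term, round(score, 3), h, idf_map.get(term, 0.0)))
    return sorted(results, key=lambda t: (t[1], t[2], t[3]), reverse=True)

entropy_map = {"信息熵": (1.5, 1.6), "术语抽取": (2.0, 1.8), "的方法": (0.3, 2.1)}
idf_map = {"信息熵": 1.2, "术语抽取": 1.5, "的方法": 0.1}
print(fuse_scores(entropy_map, idf_map))   # 术语抽取 scores above 信息熵; 的方法 is rejected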
6. Chinese-English alignment based on the bag-of-words model
Term alignment finds, from the domain texts, the one-to-one correspondence between the Chinese words and the English words of the field. Formally, assume e_a and e_b are English words; the co-occurrence value of (e_a, e_b) in the English corpus, obtained beforehand from the bag-of-words model, expresses the probability that e_a and e_b occur together in the corpus.
Precondition: the vocabulary fusion process has been completed
Input: the term set List corresponding to the Chinese text
Output: the HashMap Map_Result after Chinese-English alignment
Step 13: Apply machine translation to each term Term to obtain the translation set Term_Translate.
Step 14: If the translation in Term_Translate was produced from the dictionary, add it to the dictionary translation set List_EnTerm; otherwise put it into Map_Result.
Step 15: Compute the Cartesian product List_Descartes of the sets in List_EnTerm.
Step 16: Traverse each tuple in the set List_Descartes and, for every pair that co-occurs, accumulate Sum = Sum + Value; insert words that are not found into the data table of unknown (out-of-vocabulary) words; find the tuple with the maximum Sum and store the corresponding English sequence in the set List_Max; return Map_Result.
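Steps 13-16 choose, among the candidate dictionary translations of the Chinese terms, the combination of English words that co-occurs most often in the English corpus. A simplified Python sketch follows; the candidate translations, the co-occurrence table and the function name are illustrative, and the unknown-word table of step 16 is omitted.

from itertools import product

def align_terms(candidate_translations, cooccur):
    """candidate_translations: one list of candidate English translations per Chinese term.
    cooccur: dict mapping sorted English word pairs to co-occurrence values.
    Return the combination (one translation per term) with the highest summed co-occurrence."""
    best_combo, best_sum = None, -1
    for combo in product(*candidate_translations):        # the Cartesian product of step 15
        s = sum(cooccur.get(tuple(sorted((a, b))), 0)      # step 16: Sum = Sum + Value
                for i, a in enumerate(combo) for b in combo[i + 1:])
        if s > best_sum:
            best_combo, best_sum = combo, s
    return best_combo, best_sum

candidates = [["term", "terminology"], ["extraction", "drawing"]]
cooccur = {("extraction", "term"): 12, ("extraction", "terminology"): 30, ("drawing", "term"): 1}
print(align_terms(candidates, cooccur))   # (('terminology', 'extraction'), 30)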
7. Cross-language information retrieval
At retrieval time, the English translation results obtained for the Chinese terms are read first; the corresponding English terms are combined to construct the retrieval expression, which is then handed to the search engine to perform retrieval and ranking and thus obtain the best-matching text. The flow chart of this step is shown in Fig. 2.
Precondition: the Chinese-English alignment based on the bag-of-words model has been completed
Input: the Map_Result of the Chinese-English alignment
Output: the ranking results
Step 17: Build an inverted index (recording which words are contained in which documents) over the English information texts with Lucene.
Step 18: Compute the text scores with Lucene's built-in ranking and return the results ranked by score.
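Steps 17-18 rely on Lucene. To keep this description self-contained, the sketch below builds a toy inverted index and scores each document by the number of distinct query terms it contains, standing in for Lucene's index and ranking; it is not the Lucene API.

from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Return word -> set of doc_ids (a toy inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query_terms):
    """Score each document by the number of distinct query terms it contains and rank by score."""
    scores = defaultdict(int)
    for term in set(query_terms):
        for doc_id in index.get(term.lower(), ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {"d1": "term extraction for cross language retrieval",
        "d2": "image retrieval with deep features"}
print(search(build_inverted_index(docs), ["term", "extraction", "retrieval"]))   # d1 before d2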
8. Ranking of the retrieval results
Precondition: the cross-language information retrieval has been completed
Input: the cross-language retrieval results of the text
Output: the ranking of the retrieved texts
Step 19: As soon as the search engine receives the keyword information, it starts retrieval over the resource texts.
Step 20: Rank the retrieved texts according to their retrieval scores and return them to the user.
The overall technical route is shown in Fig. 3.

Claims (1)

1. A cross-language information matching method based on term extraction, characterized by the following steps:
Step 1: take the sentence as the basic unit of term extraction and extract the data set from the text through sentence segmentation, word segmentation and stop-word filtering;
Step 2: obtain the word set to be filtered after Chinese word segmentation;
Step 3: load the stop-word list, read one word from the word set and look it up in the stop-word list; if it is found, filter the word out, otherwise keep it;
Step 4: perform word segmentation and stop-word filtering sentence by sentence;
Step 5: if word.nature is a noun, retain the noun; if word.nature is an adjective, examine the part of speech of the next word and, if the next word is a noun, retain that noun; if word.nature is a verb, examine the two adjacent words and, if one of them is a noun, retain that noun;
Step 6: filter the text collection by the method of steps 3-5 to obtain the filtered text set List, and return the set List after rule-based filtering;
Step 7: if the length of a character string s satisfies L(s) > 1 and s is both the left and the right word boundary of some word, regard this character string s as a complete word;
Step 8: if s is the left word boundary of the corresponding word w, write w as w = sx, where x is an arbitrary string; if s is the right word boundary of the corresponding word w, write w as w = ys, where y is an arbitrary string; if s is simultaneously the left and right word boundary of w, then necessarily either (1) s is identical to w, or (2) w = sxs, where x is an arbitrary string;
Step 9: obtain the candidate term set from the result of term-rule filtering;
Step 10: calculate the left and right information entropy of each word in the candidate term set, compute the total information entropy from the left and right entropies, add up the entropies of identical words, sort by total information entropy, and retain the words whose left and right information entropy satisfies H(s) > IE_min, where IE_min is a user-defined threshold;
Step 11: sort the words by score; if two words have the same score, sort the set by the information entropy and then by the inverse document frequency IDF;
Step 12: from the sorted set, retain the words whose score exceeds the predetermined threshold, that is Score > Score_min, where Score_min is a manually chosen value;
Step 13: apply machine translation to each term Term to obtain the term translation set Term_Translate;
Step 14: if the translation in Term_Translate was produced from the dictionary, add it to the dictionary translation set List_EnTerm; otherwise put it into the aligned Chinese-English result set Map_Result;
Step 15: compute the Cartesian product List_Descartes of the sets in the dictionary translation set List_EnTerm;
Step 16: traverse each tuple in the Cartesian product List_Descartes and, for every pair that co-occurs, accumulate Sum = Sum + Value; insert words that are not found into the data table of unknown (out-of-vocabulary) words; find the tuple with the maximum Sum and store the corresponding English sequence in the set List_Max; return the final English result Map_Result;
Step 17: build an inverted index over the English information texts with the English search engine Lucene;
Step 18: compute the text scores with Lucene's built-in ranking and return the results ranked by score;
Step 19: as soon as the search engine receives the keyword information, start retrieval over the resource texts;
Step 20: rank the retrieved texts according to their retrieval scores and return them to the user.
CN201711101619.1A 2017-11-10 2017-11-10 Cross-language information matching process based on term extraction Pending CN107908712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711101619.1A CN107908712A (en) 2017-11-10 2017-11-10 Cross-language information matching process based on term extraction


Publications (1)

Publication Number Publication Date
CN107908712A (en) 2018-04-13

Family

ID=61843958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711101619.1A Pending CN107908712A (en) 2017-11-10 2017-11-10 Cross-language information matching process based on term extraction

Country Status (1)

Country Link
CN (1) CN107908712A (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5301109A (en) * 1990-06-11 1994-04-05 Bell Communications Research, Inc. Computerized cross-language document retrieval using latent semantic indexing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙素艳 (Sun Suyan): "Research on cross-language information matching technology based on term extraction" (基于术语提取的跨语言信息匹配技术研究), Wanfang Data (万方数据) *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299480A (en) * 2018-09-04 2019-02-01 上海传神翻译服务有限公司 Terminology Translation method and device based on context of co-text
CN109299480B (en) * 2018-09-04 2023-11-07 上海传神翻译服务有限公司 Context-based term translation method and device
CN111222328A (en) * 2018-11-26 2020-06-02 百度在线网络技术(北京)有限公司 Label extraction method and device and electronic equipment
CN111222328B (en) * 2018-11-26 2023-06-16 百度在线网络技术(北京)有限公司 Label extraction method and device and electronic equipment
CN109918658A (en) * 2019-02-28 2019-06-21 云孚科技(北京)有限公司 A kind of method and system obtaining target vocabulary from text
CN110472047B (en) * 2019-07-15 2022-12-13 昆明理工大学 Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method
CN110472047A (en) * 2019-07-15 2019-11-19 昆明理工大学 A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method
CN110795541A (en) * 2019-08-23 2020-02-14 腾讯科技(深圳)有限公司 Text query method and device, electronic equipment and computer readable storage medium
CN110795541B (en) * 2019-08-23 2023-05-26 腾讯科技(深圳)有限公司 Text query method, text query device, electronic equipment and computer readable storage medium
CN110928989A (en) * 2019-11-01 2020-03-27 暨南大学 Language model-based annual newspaper corpus construction method
CN112992303A (en) * 2019-12-15 2021-06-18 苏州市爱生生物技术有限公司 Human phenotype standard expression extraction method
CN111797621A (en) * 2020-06-04 2020-10-20 语联网(武汉)信息技术有限公司 Method and system for replacing terms
CN111797621B (en) * 2020-06-04 2024-05-14 语联网(武汉)信息技术有限公司 Term replacement method and system
CN112507060A (en) * 2020-12-14 2021-03-16 福建正孚软件有限公司 Domain corpus construction method and system
CN112464634A (en) * 2020-12-23 2021-03-09 中译语通科技股份有限公司 Cross-language entity automatic alignment method and system based on mutual information entropy
CN112464634B (en) * 2020-12-23 2023-09-05 中译语通科技股份有限公司 Cross-language entity automatic alignment method and system based on mutual information entropy
CN115204190A (en) * 2022-09-13 2022-10-18 中科聚信信息技术(北京)有限公司 Device and method for converting financial field terms into English
CN115204190B (en) * 2022-09-13 2022-11-22 中科聚信信息技术(北京)有限公司 Device and method for converting financial field terms into English

Similar Documents

Publication Publication Date Title
CN107908712A (en) Cross-language information matching process based on term extraction
KR20160060253A (en) Natural Language Question-Answering System and method
Baeza Yates et al. Cassa: A context-aware synonym simplification algorithm
Afzal et al. Semantically enhanced concept search of the Holy Quran: Qur’anic English WordNet
Lahbari et al. Arabic question classification using machine learning approaches
Dorji et al. Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary
Hättasch et al. It's ai match: A two-step approach for schema matching using embeddings
Biltawi et al. Arabic question answering systems: gap analysis
Gupta Hybrid algorithm for multilingual summarization of Hindi and Punjabi documents
Lahbari et al. Toward a new arabic question answering system.
Basirat et al. Lexical and morpho-syntactic features in word embeddings-a case study of nouns in swedish
Kovář et al. Finding definitions in large corpora with Sketch Engine
Mollaei et al. Question classification in Persian language based on conditional random fields
Ventura et al. Mining concepts from texts
Badaro et al. A link prediction approach for accurately mapping a large-scale Arabic lexical resource to English WordNet
Muhammad et al. A comparison between conditional random field and structured support vector machine for Arabic named entity recognition
Atlam et al. A new approach for Arabic text classification using Arabic field‐association terms
Randhawa et al. Study of spell checking techniques and available spell checkers in regional languages: a survey
CN111209737B (en) Method for screening out noise document and computer readable storage medium
Qu et al. Cross-language information extraction and auto evaluation for OOV term translations
Kamal et al. Enhancing arabic question answering system
CN106681982B (en) English novel abstraction generating method
Liu et al. Cross-Language Information Matching Technology Based on Term Extraction
Qu et al. Information extraction for thai celebrities from free text
Pishartoy et al. Extending capabilities of English to Marathi machine translator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180413)