CN107908712A - Cross-language information matching process based on term extraction - Google Patents
- Publication number
- CN107908712A CN107908712A CN201711101619.1A CN201711101619A CN107908712A CN 107908712 A CN107908712 A CN 107908712A CN 201711101619 A CN201711101619 A CN 201711101619A CN 107908712 A CN107908712 A CN 107908712A
- Authority
- CN
- China
- Prior art keywords
- word
- term
- text
- english
- chinese
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a cross-language information matching method based on term extraction. Chinese text is preprocessed with natural language processing techniques, and each sentence is part-of-speech tagged. The preprocessed results are filtered with the word-building rules of terms, and word boundaries are determined by information entropy. The domain relevance of a term is measured by the IDF value of the word in a domain corpus; the two groups of values are combined by weighting, a threshold is set, and candidate terms are accepted or rejected according to their scores. On the basis of the extracted domain terms, the Chinese and English terms are aligned to obtain the translation of each term specific to this field. Finally, a retrieval expression is built from the Chinese-English term alignment result, establishing the connection between Chinese and English; the expression is run as a full-text search over the English texts, and the best-matching English text is determined from the matching result, so that cross-language matching is achieved through the domain term alignment result.
Description
Technical field
The present invention relates to a cross-language information matching method.
Background technology
1. term extraction
A term is the designation of a specific concept in a specific field. The view of domain terms now generally accepted by scholars is that a domain term (Domain Term) belongs to the set of designations used to denote specific concepts within a specific subject area. Domestic researchers generally accept that a domain term is a noun or a nominal phrase, but the "noun" meant here is distinct from the noun of linguistics.
Terms and concepts stand in a 1:1 relation: each domain term denotes exactly one specific concept, and conversely each specific concept has exactly one domain term as its unique designation. This reflects the two key properties of domain terms, namely monosemy and mononymy. If a term fails to satisfy monosemy and mononymy, synonymy and polysemy easily arise within one subject or across several subjects.
Automatic term extraction methods fall broadly into three kinds: rule-based methods, statistics-based methods, and methods combining rules with statistics. Since rule-based and statistics-based methods each have advantages and disadvantages, the two can be combined to complement each other, and most recent research combines statistical and linguistic methods.
2. across language matching
In a Chinese-English cross-language information retrieval system, the user poses a question in Chinese (or English) to retrieve related English (or Chinese) documents. Cross-language information retrieval has four common matching approaches: homologous matching (Homologous matching), document translation (Document translation), interlingua technology (Interlingua Technology), and query translation (Query translation).
Homologous matching judges the meaning of a term in one language from the orthographic or phonetic similarity of the two languages, without any translation. For example, French can be treated as English with spelling variants, so information retrieval is performed directly on English documents without translating between the two.
Document translation is the converse of query translation: it first converts the multilingual source collection into the language of the query, and then performs monolingual information retrieval.
Methods such as latent semantic indexing (Latent Semantic Indexing, LSI) and the generalized vector space model are now common ways of completing cross-language information retrieval without translation. Cross-language latent semantic indexing (Cross-Language Latent Semantic Indexing) retrieves one language with another according to the correspondence between Chinese and English in a semantic space built from the corpus.
Query translation translates the user's question (in the source language) into a language supported by the system (the target language), then submits the target-language question to the matching module to perform monolingual information retrieval.
The content of the invention
The object of the invention is to provide a cross-language information matching method based on term extraction that improves accuracy under conditions of comparable efficiency.
The object of the present invention is achieved as follows:
Step 1: Take the sentence as the unit and basis of term extraction, and extract the data set from the text through sentence segmentation, word segmentation and stop-word filtering;
Step 2: Obtain the set of words to be filtered after Chinese word segmentation;
Step 3: Load the stop-word list, read one word at a time from the word set, and look it up in the stop-word list; if it is found, filter it out, otherwise do not filter it;
Step 4: Perform word segmentation and stop-word filtering sentence by sentence;
Step 5: If word.nature is a noun, retain the noun; if word.nature is an adjective, judge the part of speech of the next word and, if it is a noun, retain that noun; if word.nature is a verb, judge the two neighbouring words and, if one is a noun, retain that noun;
Step 6: Filter the text collection by the method of steps 3-5 to obtain the filtered text set List, and return the rule-filtered set List;
Step 7: Suppose a character string s has length L(s) > 1 and s is both the left and right word boundary of some word; this string s is then taken as a complete word;
Step 8: If s is the left word boundary of a word w, write w as w = sx, where x is an arbitrary string; if s is the right word boundary of w, write w as w = ys, where y is an arbitrary string; if s is both the left and right boundary of w, then necessarily either (1) s is identical with w, or (2) w = sxs, where x is an arbitrary string; thus whenever a string s can serve as both boundaries of a word, the current string can be regarded as one complete word;
Step 9: Obtain the candidate term set from the result of term-rule filtering;
Step 10: Compute the left and right information entropy of each word in the candidate term set, then compute the total entropy from them; add together the entropies of identical words, sort by total entropy, and retain the words whose left and right entropy satisfies H(s) > IE_min, where IE_min is a user-defined value;
Step 11: Sort the words by score; if two words have the same score, sort them further by the size of the information entropy and then by the size of the inverse document frequency IDF;
Step 12: From the sorted set, retain the words whose score exceeds the predetermined threshold, i.e. Score > Score_min, where Score_min is a manually chosen value;
Step 13: Machine-translate each term Term to obtain the translation set Term_Translate;
Step 14: If the translation set Term_Translate was produced by dictionary translation, add it to the dictionary translation set List_EnTerm; otherwise add it to the aligned Chinese-English result set Map_Result;
Step 15: Compute the Cartesian product List_Descartes of the sets in the dictionary translation set List_EnTerm;
Step 16: Traverse each tuple in the Cartesian product List_Descartes and, where co-occurrence values exist, sum them; insert unseen words into the table of out-of-vocabulary words; find the English sequence with the maximum sum Sum and store it in the set List_Max; return the final English result Map_Result;
Step 17: Build an inverted index over the English information texts with the English search engine Lucene;
Step 18: Compute text scores with the ranking built into the English search engine Lucene, and return the results ordered by score;
Step 19: As soon as the search engine receives the keyword information, it begins retrieval over the resource texts;
Step 20: According to the scores of the retrieved texts, sort the retrieved texts and deliver them to the user.
The technical problem solved by the invention is this: because of the characteristics exhibited by Chinese, existing foreign views and methods of term extraction cannot be applied directly when extracting terms from Chinese text. Methods and techniques suited to Chinese term extraction must be explored for the structural features of the Chinese language; exploring the relations between terms can also help scholars discover the domain knowledge contained in texts.
A Chinese-English alignment may be obtained from a bilingual corpus, but a word may well carry different meanings in different fields, so the translation of a Chinese word must be determined from its context. A human can largely do this, but a machine cannot complete this work alone; the method currently in use is to determine the appropriate translation of a word by statistical means.
The characteristics of the technical scheme are mainly reflected in:
(1) For the difficulty of boundary determination in current domain term extraction, the present invention takes information entropy as its basis: the entropy between words is computed on the result of term-rule filtering to determine whether a word is complete, and IDF values are used as the measure of domain relevance, giving the multi-feature-fusion term extraction method proposed in this invention.
Statistics and rules are blended: the set obtained after stop-word filtering is filtered again by word order and part of speech, and the information entropy of each word is then computed with the entropy method. Since information entropy cannot extract some low-frequency words, the texts in the domain corpus are first filtered by term rules, and the IDF value of each word in the corpus is then computed to measure its domain relevance; the two values are combined by a weighted average to obtain the most suitable terms.
Word frequency statistics extract some words that are high-frequency but are not terms, whereas information entropy and mutual information can judge whether a term is complete. Information entropy shows a better effect when extracting high-frequency terms from an article; the log-likelihood ratio works better for low-frequency words but is usually combined with other methods. The entropy method extracts the frequent words of a text well, but it does not extract strings that combine tightly inside the text yet are not terms. For these reasons the invention uses the information entropy method to compute the entropy of words and, by constraining a threshold, can judge whether a term is complete, so that term extraction is realized in combination with the term rules.
(2) For the extremely complex problem of matching patterns between Chinese and English vocabulary, the invention applies the bag-of-words notion of word co-occurrence to Chinese-English term alignment: by computing the probability that English words occur together, the most suitable translation of a Chinese term in this field is obtained. The method considers not only the direct translation of a word in a Chinese-English dictionary but also the co-occurrence of terms, so it applies well to the processing of domain texts.
Co-occurrence belongs to the applications of the bag-of-words model (a word set that allows repetition, the bag-of-words model) and is commonly used at present to improve the performance of information retrieval. In general, the vocabulary appearing in a text helps the author express its theme and can therefore serve as theme-related words, differing only in their degree of relevance to the theme. The purpose of document modeling is precisely to find the words closely related to the theme and then to represent the main features of the document with them. Like objectively existing things, words also carry field labels; words of the same field or theme therefore appear together in one document with relatively high probability, and the co-occurrence phenomenon of words can be used to judge the relevance between words and themes.
In a document collection, the appearance of a word within a fixed-size window is usually accompanied by the appearance of certain collocated words, which indicates that a certain semantic relation exists between the word and these collocates; this phenomenon of simultaneous appearance is called term co-occurrence. The fixed-size window may be a document, a natural paragraph, a sentence, or a range of k words centred on some word.
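The fixed-window co-occurrence count described above can be sketched as follows. This is an illustrative reading, not the patent's implementation; the token list, window size, and function name are assumptions made for the example.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=3):
    """Count unordered pairs of distinct words that appear within
    `window` tokens of each other (the fixed-size window of the text)."""
    counts = Counter()
    for i, w in enumerate(tokens):
        # look only ahead so each pair inside the window is counted once
        for v in tokens[i + 1 : i + 1 + window]:
            if w != v:
                counts[tuple(sorted((w, v)))] += 1
    return counts

tokens = "latent semantic indexing maps terms latent topics".split()
pairs = cooccurrence_counts(tokens, window=3)
```

Counts such as `pairs[('indexing', 'latent')]` grow when two words repeatedly fall in the same window, which is the evidence of a semantic relation that the text appeals to.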
(3) For the problem that the question language and the information-text language differ in cross-language information retrieval, the question expression of a text is built with terms as feature values; Chinese-English term alignment is then applied to the features, and a retrieval expression is built from the alignment result, so that the query language and the retrieved information are mapped to the same language by this method; finally, the matching relation is obtained from the results returned by full-text search.
The Lucene full-text index engine, which currently performs comparatively well, is used. For convenience of retrieval, an inverted index (recording which words are contained in which documents) is first built over the English information texts with Lucene; text scores are then computed with the ranking built into Lucene, and the results are returned ordered by score.
Normally, once the user keys in the query keywords, the search engine receives the keyword information and begins retrieval over the resource texts. The retrieved texts are then sorted by score and delivered to the user. Lucene measures the matching degree between a Document and a Query with the score Score.
The main innovations of the present invention are embodied in:
(1) A domain term extraction method based on multi-feature fusion is proposed. Because of the distinctive word-formation patterns of Chinese terms, the text is first rule-filtered, and term boundaries are then determined by the information entropy method. Since entropy cannot extract low-frequency words, the domain relevance of a term is measured with the IDF value of the word in the domain corpus, and the values obtained by the two parts are combined. Terms were extracted from domain texts by the method of the invention and compared with several common extraction methods; the test results show that the multi-feature domain term extraction algorithm proposed in the invention is comparatively satisfactory.
(2) A Chinese-English term alignment algorithm based on the bag-of-words model is proposed for the complex problem of association patterns between Chinese and English words. On the basis of the extracted terms, the terms are first translated; depending on the resource used for translation, different situations take different alignment approaches, and the best term alignment is obtained from the co-occurrence rate of words in the bag-of-words model.
(3) For the problems existing in cross-language information retrieval, the term alignment result is taken as the basis of this step: the English terms aligned to the Chinese are recombined to build the question expression, full-text search is carried out over the information resources, and the information text best matching the text to be matched is obtained from the retrieval result.
The cross-language information matching method of the invention extracts terms by combining statistics and rules. The statistical methods used in term extraction are first analysed; the analysis reveals the special status of information entropy in term extraction and, for the shortcoming of this method when extracting terms, proposes measuring the domain relevance of terms with IDF values; finally, the domain term extraction algorithm based on multi-feature fusion is derived according to the term rules.
Analysis and verification of the technical effect of the invention:
For term extraction, the method of the invention was compared with a rule-based method and a mutual-information-based method and evaluated by the standard indices; the results show that the proposed algorithm performs comparatively well.
The invention first extracts terms from a corpus of the Chinese natural language processing field with the multi-feature-fusion extraction method described in the previous step, then machine-translates these terms and saves all the translation results in the terminology table of the database; the probability that any two terms occur in the same article is obtained with the bag-of-words model. In this step 5,398 term words were obtained; computing over these terms with the bag-of-words model produced about 7,500,000 correspondence records, which were stored in a database table. The database was then indexed and optimized so that retrieval time is greatly reduced, providing the foundation for the term alignment of the next step.
This part extracts terms from the texts of the corpus. The common evaluation indices of term extraction, precision (P), recall (R) and the F-value (F), are used to evaluate the experimental results. Computed on the corpus, the results are as follows: P = 83.57%, R = 81%, F = 82.26%.
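The three indices can be computed as below; the extracted and gold term sets are made-up examples, not data from the patent's corpus.

```python
def prf(extracted, gold):
    """Precision, recall and F-value for a set of extracted terms
    against a gold-standard set."""
    tp = len(extracted & gold)                 # correctly extracted terms
    p = tp / len(extracted) if extracted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f

p, r, f = prf({"neural network", "parsing", "stack"},
              {"neural network", "parsing", "corpus", "token"})
```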
For the two important processes of Chinese-English correspondence and cross-language information retrieval, the invention proposes a Chinese-English correspondence algorithm based on the bag-of-words model and retrieves with the corresponding English in the corpus; the results are also fairly satisfactory.
The invention selects a large number of texts for querying, computes the recall and precision of each query and then averages them, and contrasts the method with traditional CLIR and with a method of ontology-based full-text retrieval, obtaining a higher average recall and precision:
The efficiency of the method of the invention is no worse than the effect of the other methods, and when all the information texts in the corpus relevant to the text to be matched need to be retrieved, the accuracy reached by the method of the invention is about 66.67%, the best of the compared effects, which verifies the effectiveness of the cross-language information matching technique based on term extraction and its better applicability.
Brief description of the drawings
Fig. 1 Flow chart of vocabulary filtering.
Fig. 2 Flow chart of Chinese-English cross-language information matching.
Fig. 3 Module diagram of the cross-language information matching system based on term extraction.
Embodiment
The present invention is described in more detail below.
1. Text preprocessing
Input: the text information to be analysed
Output: the word set after segmentation
Step 1: For the characteristics between words in the traditional segmentation model and in the text, the invention takes the sentence as the unit and preprocesses sentences as the basis of term extraction; in this stage the data set is extracted from the text through sentence segmentation, word segmentation and stop-word filtering.
2. Vocabulary filtering: stop-word filtering
After Chinese word segmentation is applied to the text, single character strings are obtained one by one. It can be seen that the words with the greatest influence on semantic expression in a sentence are mostly nouns and verbs, while modifiers and similar vocabulary affect sentence semantics little; the nouns and verbs that do influence sentence semantics must therefore be retained, and these are called the retained words. However, in the segmented sequences some words occur very frequently in the text yet have little real influence on text analysis. These words consist mainly of modal particles, prepositions and adverbs; they carry no definite meaning of their own and serve a purpose only as part of a sentence. Words of this kind are called stop words. If the unimportant words in the domain texts can be filtered out completely, the storage space of the system is greatly saved and the workload and computation of its middle and later stages are reduced. The design of the preprocessing part must therefore not only select a suitable segmentation algorithm and improve it appropriately, but also design the vocabulary filtering process properly. The flow chart of the vocabulary filtering step is shown in Figure 1.
Precondition: the word segmentation operation has been performed
Input: the character set to be filtered
Output: the text word set after vocabulary filtering
Step 2: The word set to be filtered is obtained after Chinese word segmentation.
Step 3: Load the stop-word text, read one word from the set, and look the word up in the stop-word text. If it is found, the word must be filtered out; otherwise it is kept.
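Step 3 amounts to a membership test against the loaded stop list. A minimal sketch, assuming a toy in-memory stop list rather than the stop-word text file the patent loads:

```python
# toy stop list; a real system would load one from a stop-word file
STOPWORDS = {"的", "了", "在", "of", "the", "and"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not found in the stop list,
    preserving their order (Steps 2-3)."""
    return [t for t in tokens if t not in stopwords]

kept = remove_stopwords(["extraction", "of", "the", "domain", "terms"])
```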
3. Domain term extraction
Statistics and rules are blended: the set obtained after stop-word filtering is filtered again by word order and part of speech, and the information entropy of each word is then computed with the entropy method. Since information entropy cannot extract some low-frequency words, the texts in the domain corpus are first filtered by term rules, and the IDF value of each word in the corpus is then computed to measure its domain relevance; the two values are combined by a weighted average to obtain the most suitable terms.
Precondition: the vocabulary filtering operation has been completed
Input: the Chinese text
Output: the filtered set List
Step 4: Perform word segmentation and stop-word filtering sentence by sentence.
Step 5: If word.nature is a noun, retain the noun; if word.nature is an adjective, judge the part of speech of the next word and, if it is a noun, retain that noun; if word.nature is a verb, judge the two neighbouring words and, if one is a noun, retain that noun.
Step 6: Return the rule-filtered set List;
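Step 5 can be read as follows. Note that since a noun is always retained by the first clause, the adjective and verb clauses only re-confirm the neighbouring noun; the POS tags ('n', 'a', 'v') and the tagged sample are assumptions made for illustration.

```python
def rule_filter(tagged):
    """Apply the POS rules of Step 5 to (word, pos) pairs: a noun is
    retained; the adjective and verb branches retain the adjacent noun,
    which the noun branch already covers, so the survivors are the nouns."""
    kept = []
    for i, (word, pos) in enumerate(tagged):
        if pos == "n":
            kept.append(word)            # noun: retain
        elif pos == "a" and i + 1 < len(tagged) and tagged[i + 1][1] == "n":
            pass                         # following noun kept by the branch above
        elif pos == "v":
            pass                         # neighbouring nouns likewise kept above
    return kept

kept = rule_filter([("跨语言", "n"), ("检索", "v"), ("高效", "a"), ("术语", "n")])
```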
4. Application of information entropy to terms
Word frequency statistics extract some words that are high-frequency but are not terms, whereas information entropy and mutual information can judge whether a term is complete. Information entropy shows a better effect when extracting high-frequency terms from an article; the log-likelihood ratio works better for low-frequency words but is usually combined with other methods. The entropy method extracts the frequent words of a text well, but it does not extract strings that combine tightly inside the text yet are not terms. For these reasons the invention uses the information entropy method to compute the entropy of words and, by constraining a threshold, can judge whether a term is complete, so that term extraction is realized in combination with the term rules.
Precondition: domain term extraction has been performed.
Input: the rule-filtered set List
Output: the terms and their corresponding information entropies Map_Entropy
Step 7: Suppose a character string s has length L(s) > 1 and s is both the left and right word boundary of some word; the string s can then be regarded as one complete word.
Step 8: If s is the left word boundary of a word w, then w can be written as w = sx, where x is an arbitrary string; if s is the right word boundary of w, then w can be written as w = ys, where y is an arbitrary string; if s is both the left and right boundary of w, then necessarily either (1) s is identical with w, or (2) w = sxs, where x is an arbitrary string.
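Steps 7-8 decide word completeness from the variety of a string's boundaries; left/right neighbour entropy is one standard way to quantify this. A sketch under the assumption of a pre-tokenised corpus; the sample sequence is invented:

```python
import math
from collections import Counter

def boundary_entropy(corpus_tokens, target):
    """Left and right neighbour entropy of `target`: the more varied the
    neighbours on a side, the more likely that side is a true word boundary."""
    left, right = Counter(), Counter()
    for i, t in enumerate(corpus_tokens):
        if t == target:
            if i > 0:
                left[corpus_tokens[i - 1]] += 1
            if i + 1 < len(corpus_tokens):
                right[corpus_tokens[i + 1]] += 1

    def entropy(c):
        n = sum(c.values())
        return -sum(v / n * math.log2(v / n) for v in c.values()) if n else 0.0

    return entropy(left), entropy(right)

h_left, h_right = boundary_entropy(["a", "X", "b", "c", "X", "d", "e", "X", "b"], "X")
```

Retaining only words with H(s) > IE_min then reduces to comparing these entropies against the user-defined threshold.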
5. Domain term extraction by multi-feature fusion
TF-IDF is a statistical method that evaluates the importance of a word to one text, or to any text in a whole corpus. The importance of a word is proportional to the number of times it occurs in the text and inversely proportional to the frequency with which it occurs across the corpus.
Precondition: the computation of information entropy has been completed.
Input: the HashMap Map_Entropy obtained by the method fusing rules and information entropy, and the HashMap Map_Idf computed on the domain corpus.
Output: the HashMap Map_Result extracted by the term extraction method proposed in the invention
Step 9: Obtain the candidate term set from the result of term-rule filtering.
Step 10: Compute the left and right information entropy of each word in the set, then compute the total entropy from them; the entropies of identical words are added together, the total entropies are sorted, and the words satisfying H(s) > IE_min are retained, where IE_min is a user-defined value.
Step 11: Sort the words by score; if two scores are equal, sort further by the size of the information entropy and then by the size of the IDF;
Step 12: From the sorted set, retain the words with Score > Score_min, where Score_min is a manually chosen value.
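Steps 10-12 combine entropy with IDF. A sketch of the IDF computation and the weighted average follows; the weight `alpha` and the tiny corpus are assumptions (the patent does not fix the weights):

```python
import math

def idf(term, documents):
    """Inverse document frequency over a list of token lists: rarer across
    the domain corpus means more domain-discriminating."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

def term_score(entropy_value, idf_value, alpha=0.5):
    """Weighted average of boundary entropy and IDF (Steps 10-12)."""
    return alpha * entropy_value + (1 - alpha) * idf_value

docs = [["术语", "抽取"], ["术语", "对齐"], ["信息", "检索"]]
score = term_score(1.2, idf("检索", docs))
```

Candidates whose `score` falls below the threshold Score_min would then be discarded.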
6. Chinese-English alignment based on the bag-of-words model
Term alignment finds the one-to-one correspondence between the Chinese words and the English words of this field from the domain texts. Formally, suppose e_a and e_b are English words; the bag-of-words model built beforehand yields the co-occurrence value of (e_a, e_b) in the English corpus, which expresses the probability that e_a and e_b occur simultaneously in the corpus.
Precondition: the vocabulary merging process has been completed
Input: the term set List corresponding to the Chinese text
Output: the aligned Chinese-English HashMap Map_Result
Step 13: Machine-translate Term to obtain Term_Translate;
Step 14: If Term_Translate was produced by dictionary translation, add it to List_EnTerm; otherwise add it to Map_Result.
Step 15: Compute the Cartesian product List_Descartes of the sets in List_EnTerm.
Step 16: Traverse each tuple in the List_Descartes set and, where co-occurrence values exist, accumulate Sum = Sum + Value; insert unseen words into the table of out-of-vocabulary words; find the English sequence with the maximum Sum and store it in the set List_Max; return Map_Result;
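Steps 15-16 can be read as: enumerate the Cartesian product of per-term translation candidates and keep the combination whose pairwise co-occurrence sum is largest. The candidate lists and the co-occurrence table below are invented for illustration:

```python
from itertools import combinations, product

def best_alignment(candidate_translations, cooccur):
    """Over the Cartesian product of translation candidates (List_Descartes),
    return the combination maximising the summed pairwise co-occurrence Sum."""
    best, best_sum = None, -1.0
    for combo in product(*candidate_translations):
        s = sum(cooccur.get(tuple(sorted(pair)), 0)
                for pair in combinations(combo, 2))
        if s > best_sum:
            best, best_sum = combo, s
    return best, best_sum

candidates = [["bag", "sack"], ["model", "pattern"]]
cooccur = {("bag", "model"): 7, ("model", "sack"): 1, ("bag", "pattern"): 2}
combo, total = best_alignment(candidates, cooccur)
```

The winning combination plays the role of the sequence stored in List_Max.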
7. Cross-language information retrieval
At retrieval time, the English translation results obtained from the Chinese are read first; the English is combined accordingly to construct the retrieval expression, which is then handed to the search engine to perform retrieval and ranking, and the best-matching text is obtained. The flow chart of this step is shown in Figure 2.
Precondition: the bag-of-words-based Chinese-English alignment has been completed
Input: the Map_Result of the Chinese-English alignment
Output: ranking results
Step 17: Build an inverted index over the English information texts with Lucene (the inverted index records which
words are contained in which documents);
Step 18: Compute the text scores with Lucene's built-in ranking, and return the results sorted by
score.
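Step 17 uses Lucene, a Java library, to build the index; the sketch below only illustrates the parenthetical idea, which documents contain which words, in plain Python. The whitespace tokenisation, the document texts, and the conjunctive `search` helper are assumptions for the example and are not Lucene's actual implementation.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids containing it (Step 17's idea)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query_terms):
    """Return the doc ids matching all query terms, i.e. a minimal
    conjunctive retrieval expression."""
    postings = [index.get(t, set()) for t in query_terms]
    return set.intersection(*postings) if postings else set()

docs = [
    "cross language information retrieval",
    "term extraction from domain text",
    "cross language term alignment",
]
index = build_inverted_index(docs)
hits = search(index, ["cross", "language"])
```

A real Lucene index additionally stores term frequencies and positions so that Step 18's scoring can rank the matching documents rather than merely find them.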
8. Ranking of retrieval results
Precondition: the cross-language information retrieval has been completed
Input: the cross-language retrieval results for the texts
Output: the ranking of the retrieved texts
Step 19: As soon as the search engine receives the keyword information, it starts retrieving over the resource texts.
Step 20: According to the scores of the retrieved texts, the retrieved texts are ranked and then returned to the user.
The overall technical route is shown in Figure 3.
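Steps 19–20 amount to sorting the retrieved texts by their scores before returning them to the user. A minimal sketch, with made-up document names and scores:

```python
def rank_results(scored_docs):
    """Step 20: order the retrieved texts by score, highest first."""
    return sorted(scored_docs, key=lambda d: d["score"], reverse=True)

# Illustrative retrieval output (scores as a search engine might assign them):
results = rank_results([
    {"doc": "doc_b", "score": 1.2},
    {"doc": "doc_a", "score": 3.4},
    {"doc": "doc_c", "score": 0.7},
])
```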
Claims (1)
1. A cross-language information matching process based on term extraction, characterized in that:
Step 1: Taking the sentence as the basic unit of term extraction, extract the data set from the text through sentence
splitting, word segmentation and stop-word filtering;
Step 2: Obtain the set of words to be filtered after Chinese word segmentation;
Step 3: Load the stop-word text; read one word from the word set and look it up in the stop-word text; if it is
found, filter the word out, otherwise keep it;
Step 4: Perform word segmentation and stop-word filtering sentence by sentence;
Step 5: If word.nature is a noun, retain the noun; if word.nature is an adjective, examine the part of speech of the
next word and, if the next word is a noun, retain that noun; if word.nature is a verb, examine the two neighbouring
words and, if either is a noun, retain that noun;
Step 6: Filter the text collection by the method of steps 3-5 to obtain the filtered text set List, and return the
rule-filtered set List;
Step 7: Assume that the string s has length L(s) > 1 and that s is both the left and the right word boundary of some
word; treat s as a complete word;
Step 8: If s is the left word boundary of a word w, write w as w = sx, where x is an arbitrary string; if s is the right
word boundary of w, write w as w = ys, where y is an arbitrary string; if s is both the left and the right word
boundary of w, then necessarily either (1) s is identical to w, or (2) w = sxs, where x is an arbitrary string;
Step 9: Obtain the candidate term set from the result of the rule-based term filtering;
Step 10: Compute the left and right information entropies of the words in the candidate term set and, from them,
the total information entropy; sum the entropies of identical words, sort the set by total information entropy, and
retain the words satisfying H(s) > IE_min, where IE_min is a user-defined value;
Step 11: Sort the words by score; if two words have the same score, break the tie first by the size of the information
entropy and then by the size of the inverse document frequency (IDF);
Step 12: From the sorted set, retain the words whose score exceeds the predetermined threshold, i.e. Score > Score_min,
where Score_min is a manually chosen value;
Step 13: Machine-translate each term Term to obtain the term translation set Term_Translate;
Step 14: If the term translation set Term_Translate is a translation produced from the dictionary, add it to the
dictionary-translation set Term_entrem; otherwise add it to the aligned English set Map_Result;
Step 15: Compute the Cartesian product List_Descartes of the sets in the dictionary-translation set List_EnTerm;
Step 16: Traverse each set in the Cartesian product List_Descartes; whenever a co-occurrence value exists,
accumulate Sum = Sum + value, and otherwise insert the word into the data table of unknown words; find the
English word sequence with the largest Sum and store it in the set List_Max; return the final English result Map_Result;
Step 17: Build an inverted index over the English information texts with the English search engine Lucene;
Step 18: Compute the text scores with the built-in ranking of the English search engine Lucene, and return the
ranking results sorted by score;
Step 19: As soon as the search engine receives the keyword information, it starts retrieving over the resource texts;
Step 20: According to the scores of the retrieved texts, rank the retrieved texts and then return them to the user.
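The part-of-speech retention rules of Step 5 above can be sketched as follows. The tag strings "n", "a" and "v" and the (word, tag) tuple representation are assumptions made for the example; the claim's word.nature attribute does not specify a tagger or tag set.

```python
def filter_by_pos(tagged_words):
    """Step 5 retention rules (sketch): keep nouns; keep the noun that
    follows an adjective; keep nouns adjacent to a verb.
    Assumed tags: 'n' = noun, 'a' = adjective, 'v' = verb."""
    kept = set()
    for i, (word, tag) in enumerate(tagged_words):
        if tag == "n":
            kept.add(word)                                   # nouns are kept
        elif tag == "a":
            if i + 1 < len(tagged_words) and tagged_words[i + 1][1] == "n":
                kept.add(tagged_words[i + 1][0])             # noun after adjective
        elif tag == "v":
            for j in (i - 1, i + 1):                         # both neighbours
                if 0 <= j < len(tagged_words) and tagged_words[j][1] == "n":
                    kept.add(tagged_words[j][0])             # noun next to verb
    return kept

# "快速(a) 算法(n) 提取(v) 术语(n)" -> the two nouns survive the filter.
kept = filter_by_pos([("快速", "a"), ("算法", "n"), ("提取", "v"), ("术语", "n")])
```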
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711101619.1A CN107908712A (en) | 2017-11-10 | 2017-11-10 | Cross-language information matching process based on term extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107908712A true CN107908712A (en) | 2018-04-13 |
Family
ID=61843958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711101619.1A Pending CN107908712A (en) | 2017-11-10 | 2017-11-10 | Cross-language information matching process based on term extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107908712A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5301109A (en) * | 1990-06-11 | 1994-04-05 | Bell Communications Research, Inc. | Computerized cross-language document retrieval using latent semantic indexing |
Non-Patent Citations (1)
Title |
---|
SUN Suyan: "Research on Cross-Language Information Matching Technology Based on Term Extraction", Wanfang Data * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299480A (en) * | 2018-09-04 | 2019-02-01 | 上海传神翻译服务有限公司 | Terminology Translation method and device based on context of co-text |
CN109299480B (en) * | 2018-09-04 | 2023-11-07 | 上海传神翻译服务有限公司 | Context-based term translation method and device |
CN111222328A (en) * | 2018-11-26 | 2020-06-02 | 百度在线网络技术(北京)有限公司 | Label extraction method and device and electronic equipment |
CN111222328B (en) * | 2018-11-26 | 2023-06-16 | 百度在线网络技术(北京)有限公司 | Label extraction method and device and electronic equipment |
CN109918658A (en) * | 2019-02-28 | 2019-06-21 | 云孚科技(北京)有限公司 | A kind of method and system obtaining target vocabulary from text |
CN110472047B (en) * | 2019-07-15 | 2022-12-13 | 昆明理工大学 | Multi-feature fusion Chinese-Yue news viewpoint sentence extraction method |
CN110472047A (en) * | 2019-07-15 | 2019-11-19 | 昆明理工大学 | A kind of Chinese of multiple features fusion gets over news viewpoint sentence abstracting method |
CN110795541A (en) * | 2019-08-23 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Text query method and device, electronic equipment and computer readable storage medium |
CN110795541B (en) * | 2019-08-23 | 2023-05-26 | 腾讯科技(深圳)有限公司 | Text query method, text query device, electronic equipment and computer readable storage medium |
CN110928989A (en) * | 2019-11-01 | 2020-03-27 | 暨南大学 | Language model-based annual newspaper corpus construction method |
CN112992303A (en) * | 2019-12-15 | 2021-06-18 | 苏州市爱生生物技术有限公司 | Human phenotype standard expression extraction method |
CN111797621A (en) * | 2020-06-04 | 2020-10-20 | 语联网(武汉)信息技术有限公司 | Method and system for replacing terms |
CN111797621B (en) * | 2020-06-04 | 2024-05-14 | 语联网(武汉)信息技术有限公司 | Term replacement method and system |
CN112507060A (en) * | 2020-12-14 | 2021-03-16 | 福建正孚软件有限公司 | Domain corpus construction method and system |
CN112464634A (en) * | 2020-12-23 | 2021-03-09 | 中译语通科技股份有限公司 | Cross-language entity automatic alignment method and system based on mutual information entropy |
CN112464634B (en) * | 2020-12-23 | 2023-09-05 | 中译语通科技股份有限公司 | Cross-language entity automatic alignment method and system based on mutual information entropy |
CN115204190A (en) * | 2022-09-13 | 2022-10-18 | 中科聚信信息技术(北京)有限公司 | Device and method for converting financial field terms into English |
CN115204190B (en) * | 2022-09-13 | 2022-11-22 | 中科聚信信息技术(北京)有限公司 | Device and method for converting financial field terms into English |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107908712A (en) | Cross-language information matching process based on term extraction | |
KR20160060253A (en) | Natural Language Question-Answering System and method | |
Baeza Yates et al. | Cassa: A context-aware synonym simplification algorithm | |
Afzal et al. | Semantically enhanced concept search of the Holy Quran: Qur’anic English WordNet | |
Lahbari et al. | Arabic question classification using machine learning approaches | |
Dorji et al. | Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary | |
Hättasch et al. | It's ai match: A two-step approach for schema matching using embeddings | |
Biltawi et al. | Arabic question answering systems: gap analysis | |
Gupta | Hybrid algorithm for multilingual summarization of Hindi and Punjabi documents | |
Lahbari et al. | Toward a new arabic question answering system. | |
Basirat et al. | Lexical and morpho-syntactic features in word embeddings-a case study of nouns in swedish | |
Kovář et al. | Finding definitions in large corpora with Sketch Engine | |
Mollaei et al. | Question classification in Persian language based on conditional random fields | |
Ventura et al. | Mining concepts from texts | |
Badaro et al. | A link prediction approach for accurately mapping a large-scale Arabic lexical resource to English WordNet | |
Muhammad et al. | A comparison between conditional random field and structured support vector machine for Arabic named entity recognition | |
Atlam et al. | A new approach for Arabic text classification using Arabic field‐association terms | |
Randhawa et al. | Study of spell checking techniques and available spell checkers in regional languages: a survey | |
CN111209737B (en) | Method for screening out noise document and computer readable storage medium | |
Qu et al. | Cross-language information extraction and auto evaluation for OOV term translations | |
Kamal et al. | Enhancing arabic question answering system | |
CN106681982B (en) | English novel abstraction generating method | |
Liu et al. | Cross-Language Information Matching Technology Based on Term Extraction | |
Qu et al. | Information extraction for thai celebrities from free text | |
Pishartoy et al. | Extending capabilities of English to Marathi machine translator |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180413 |
|