CN108287825A

CN108287825A - Term identification and extraction method and system

Info

Publication number: CN108287825A
Application number: CN201810009626.7A
Authority: CN
Inventors: 王建华; 程国艮
Original assignee: Chinese Translation Language Through Polytron Technologies Inc
Current assignee: Chinese Translation Language Through Polytron Technologies Inc
Priority date: 2018-01-05
Filing date: 2018-01-05
Publication date: 2018-07-17

Abstract

The invention belongs to the technical field of language recognition and discloses a term identification and extraction method and system, comprising: performing multiple rounds of identification and extraction on terms; identifying terms composed of multiple adjacent terms; matching translations; and performing term extraction. To ease the work of localization translators and improve translation efficiency, the invention provides a term identification and extraction method in which a program automatically analyzes a document, extracts technical terms, and quickly matches their translations, improving both the working efficiency of localization translators and the accuracy of translation. The invention performs multiple rounds of identification and extraction on terms, which improves accuracy, and it can accurately identify terms composed of multiple terms.

Description

Term identification and extraction method and system

Technical field

The invention belongs to the technical field of language recognition, and in particular relates to a term identification and extraction method and system.

Background art

In practice, when translating a document, a localization translator must manually screen the technical terms in the document and then translate them. This procedure is cumbersome, time-consuming, and laborious, and above all it involves a great deal of repetitive work. In prior-art localization translation work, translation efficiency is low and accuracy is poor.

In summary, the problem with the prior art is as follows. Its most important defect in term extraction arises when a combination of adjacent terms is itself one whole term: after processing, this larger term is split into multiple terms even though it is really a single term. The cause of the defect is that terminological analysis is performed only on the individual words produced by word segmentation, without considering the term relationship between neighboring words. The difficulty lies in judging, by computing the relationship between neighboring words, whether a combination of adjacent words is a term. The prior art cannot use a term extraction algorithm that computes the weights of adjacent terms to judge whether a character string composed of adjacent terms is one term.

Summary of the invention

In view of the problems of the prior art, the present invention provides a term identification and extraction method and system.

The invention is realized as follows. A term identification and extraction method comprises: performing multiple rounds of identification and extraction on terms; identifying terms composed of multiple adjacent terms; matching translations; and performing term extraction.

Further, the technical term identification and extraction comprises:

A) Preparation: compile a term bank for each domain of each language, together with the corresponding translations, languages, and domains;

B) Division into domains;

C) Domain operation and word segmentation: the segmented words are checked by a part-of-speech tagging algorithm (each word produced by segmentation is tagged with its part of speech, and numerals, measure words, adverbs, prepositions, conjunctions, auxiliary words, interjections, and the like are removed), and the probability that each word is a term is judged; a word with a low probability is ignored, and a word with a high probability is retained;

D) The words produced in step C) are matched against the term bank for the given language and domain (each word is looked up in the term bank in turn); a word that matches is identified as a term, and the remaining words go on to the next step;

E) The words remaining after step D) are filtered against the non-term vocabulary list (each remaining word is looked up in the non-term vocabulary list in turn); a word found in the non-term vocabulary list is identified as not being a term;

F) Matching against the term bank and the non-term vocabulary list determines two groups of data: terms and non-terms;

G) The term and non-term data of the document are subjected to a further round of term extraction by the term extraction method.
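The dictionary-matching pass in steps D) to F) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `match_against_banks`, the in-memory dict and set used for the term bank and the non-term vocabulary list, and the sample entries (which reuse the "Apple" example from the description) are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def match_against_banks(
    candidates: Iterable[str],
    term_bank: Dict[str, str],
    non_term_list: Set[str],
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Split candidate words into terms, non-terms, and undecided words."""
    terms: Dict[str, str] = {}   # term -> translation found in the term bank
    non_terms: List[str] = []    # words found in the non-term vocabulary list
    undecided: List[str] = []    # words found in neither resource
    seen: Set[str] = set()
    for word in candidates:
        if word in seen:         # look each distinct word up only once
            continue
        seen.add(word)
        if word in term_bank:            # step D): term bank lookup
            terms[word] = term_bank[word]
        elif word in non_term_list:      # step E): non-term list lookup
            non_terms.append(word)
        else:                            # left for the second extraction round
            undecided.append(word)
    return terms, non_terms, undecided


# Hypothetical per-domain term banks (assumed data, for illustration only).
TERM_BANKS = {
    "computer": {"Apple": "Apple Technology"},
    "food": {"Apple": "apple fruit"},
}
NON_TERM_LIST = {"especially"}

terms, non_terms, undecided = match_against_banks(
    ["Apple", "especially", "boundary", "Apple"],
    TERM_BANKS["computer"],
    NON_TERM_LIST,
)
print(terms, non_terms, undecided)
```

Keeping one bank per domain means the same surface word can resolve to different translations depending on the domain chosen in step B).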

Further, the term extraction method comprises:

1) Split the given text T into complete sentences (sentence boundaries are determined by punctuation marks): $T = [S_1, S_2, \ldots, S_m]$;

2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words of the specified parts of speech (each segmented word is tagged by a part-of-speech tagging algorithm, and numerals, measure words, adverbs, prepositions, conjunctions, auxiliary words, interjections, and the like are removed): $S_i = [t_{i,1}, t_{i,2}, \ldots, t_{i,m}]$, where $t_{i,j} \in S_i$ are the candidate terms retained;

3) Construct a candidate term graph $G = (V, E)$, where V is the node set, consisting of the candidate terms generated in step 2); then use the co-occurrence relation to construct the edge between any two nodes (a window is built around the current word; for example, extending two words to the left and two to the right gives a window containing 5 words). An edge exists between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, that is, at most K words co-occur;

4) According to the formula [formula image not reproduced] and in combination with a large corpus, iteratively propagate the weight of each node (within the window of each word, compute in turn the weight relationship between that word and every other word in the window) until convergence;

5) Sort the node weights in descending order (largest weight first) to obtain the T most important words as candidate terms;

6) Mark the T most important words obtained in step 5) in the original text; if they form adjacent phrases, combine them into multi-word terms and add them to the term sequence;

7) Determine two groups of data: terms and non-terms;

8) Integrate the term data generated by the two rounds of term processing, deduplicate, and merge (the two rounds yield two sets of results; these are merged and only one copy of any repeated word is kept) to obtain all terms.
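Steps 3) to 5) read like a graph-ranking procedure in the style of TextRank. The sketch below assumes the standard TextRank update with a damping factor of 0.85 on an unweighted co-occurrence graph. The text here does not reproduce its formula, so that choice, the function names, the convergence tolerance, and the toy input are all assumptions rather than the patented method.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_cooccurrence_graph(
    sentences: List[List[str]], window: int = 5
) -> Dict[str, Set[str]]:
    """Link two candidate words when they fall inside the same K-word window."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            graph[word]  # touch the key so isolated words still become nodes
            for other in tokens[i + 1 : i + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)
    return dict(graph)


def rank_nodes(
    graph: Dict[str, Set[str]],
    damping: float = 0.85,   # assumed value, not given in the source
    tol: float = 1e-6,
    max_iter: int = 200,
) -> Dict[str, float]:
    """Iteratively propagate node weights until they stop changing."""
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = {}
        for node, neighbours in graph.items():
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[n] / len(graph[n]) for n in neighbours)
            new_scores[node] = (1.0 - damping) + damping * incoming
        delta = max((abs(new_scores[n] - scores[n]) for n in graph), default=0.0)
        scores = new_scores
        if delta < tol:
            break
    return scores


def top_candidates(scores: Dict[str, float], top_n: int) -> List[str]:
    """Sort by weight, largest first, and keep the top_n words."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered[:top_n]]


# Toy input: already segmented and filtered candidate words per sentence.
sentences = [["a", "b", "c"], ["b", "d"]]
graph = build_cooccurrence_graph(sentences, window=2)
scores = rank_nodes(graph)
print(top_candidates(scores, 1))
```

With a window of 2, only directly adjacent candidates are linked, so the word that neighbours the most other candidates accumulates the largest weight.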

Another object of the present invention is to provide a language translation system that uses the term identification and extraction method.

To ease the work of localization translators and improve translation efficiency, the invention provides a term identification and extraction method in which a program automatically analyzes a document, extracts technical terms, and quickly matches their translations, improving both the working efficiency of localization translators and the accuracy of translation. Previously a translator needed 3 days to translate an article; the same work can now be completed in 1 day.

The invention performs multiple rounds of identification and extraction on terms, which improves accuracy. The invention can accurately identify terms composed of multiple terms.

Description of the drawings

Fig. 1 is a flowchart of the term identification and extraction method provided by an embodiment of the present invention.

Detailed description

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here merely explain the invention and are not intended to limit it.

The most important defect of the prior art in term extraction arises when a combination of adjacent terms is itself one whole term: after processing, this larger term is split into multiple terms even though it is really a single term. The prior art cannot use a term extraction algorithm that computes the weights of adjacent terms to judge whether a character string composed of adjacent terms is one term.

The application principle of the present invention is explained in detail below in conjunction with the accompanying drawings.

As shown in Fig. 1, the term identification and extraction method provided by an embodiment of the present invention comprises:

I) Technical term identification and extraction:

For example, for the term "Apple" in the computer technology domain, the corresponding translation may be "Apple Technology".

B) In the food domain, for the term "Apple", the corresponding translation may be "apple fruit";

The program processes the document and performs word segmentation on its text content (punctuation marks are removed and the text is segmented into words).

C) Domain operation and word segmentation: the segmented words are checked by a part-of-speech tagging algorithm, and the probability that each word is a term is judged; a word with a low probability is ignored, and a word with a high probability is retained.

D) The words produced in step C) are matched against the term bank for the given language and domain; a word that matches is identified as a term, and the remaining words go on to the next step.

E) The words remaining after step D) are filtered against the non-term vocabulary list; a word found in the non-term vocabulary list is identified as not being a term.

F) The preceding steps, together with matching against the term bank and the non-term vocabulary list, determine two groups of data: terms and non-terms;

G) The data of the document are subjected to a further round of term extraction by the term extraction algorithm, as follows:

Term extraction algorithm：

1) Split the given text T into complete sentences, i.e. $T = [S_1, S_2, \ldots, S_m]$.

2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words of the specified parts of speech, i.e. $S_i = [t_{i,1}, t_{i,2}, \ldots, t_{i,m}]$, where $t_{i,j} \in S_i$ are the candidate terms retained.

3) Construct a candidate term graph $G = (V, E)$, where V is the node set, consisting of the candidate terms generated in step 2); then use the co-occurrence relation to construct the edge between any two nodes. An edge exists between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, that is, at most K words co-occur.

4) According to the formula, and in combination with a large corpus, iteratively propagate the weight of each node until convergence.

5) Sort the node weights in descending order to obtain the T most important words as candidate terms.

6) Mark the T most important words obtained in step 5) in the original text; if they form adjacent phrases, combine them into multi-word terms. For example, if the text contains the sentence "Matlab code for plotting ambiguity function" and both "Matlab" and "code" are candidate terms, they are combined into "Matlab code", which is added to the term sequence.

7) Two groups of data can thus be determined: terms and non-terms, where the term group includes terms composed of multiple words or multiple terms.

8) Integrate the term data generated by the two rounds of term processing, deduplicate, and merge, finally obtaining all the terms in the text.
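The merging rule in step 6) can be illustrated with the "Matlab code" sentence. The sketch below is one plausible reading, assuming the text is already tokenized and that any unbroken run of candidate words becomes one term (a run of length one stays a single-word term); the function name `merge_adjacent` is hypothetical.

```python
from typing import List, Set


def merge_adjacent(tokens: List[str], candidates: Set[str]) -> List[str]:
    """Join consecutive candidate words in the original text into one term."""
    terms: List[str] = []
    run: List[str] = []          # current unbroken run of candidate words
    for token in tokens:
        if token in candidates:
            run.append(token)
        else:
            if run:              # a non-candidate word closes the run
                terms.append(" ".join(run))
            run = []
    if run:                      # flush a run that reaches the end of the text
        terms.append(" ".join(run))
    return terms


tokens = "Matlab code for plotting ambiguity function".split()
print(merge_adjacent(tokens, {"Matlab", "code"}))  # ['Matlab code']
```

Scanning the original token order, rather than the ranked list, is what lets adjacency be detected: the ranking alone carries no position information.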

The invention is further described below with reference to a specific embodiment.

The term identification and extraction method provided by an embodiment of the present invention comprises:

1. Test data: "A programmer is a professional engaged in program development and maintenance. Programmers are generally divided into program designers and program coders, but the boundary between the two is not very clear, particularly in China. Software practitioners are divided into four major categories: junior programmers, senior programmers, systems analysts, and project managers."

2. First, this passage is segmented into words: [programmer, is, engaged in, program, development, ,, maintenance, professional, personnel, ., generally, will, programmer, divided into, program, design, personnel, and, program, coding, personnel, ,, but, the two, boundary, is, not, very, clear, ,, especially, in, China, ., software, practitioner, personnel, divided into, junior, programmer, ,, senior, programmer, ,, system, analyst, and, project, manager, four, major, categories, .].

3. Remove punctuation marks, adjectives, verbs, interjections, and the like.

4. The words remaining after processing are: [programmer, English, program, development, maintenance, profession, personnel, programmer, program, design, personnel, program, coding, personnel, boundary, especially, China, software, personnel, divided into, programmer, senior, programmer, system, analyst, project, manager].

5. Using the term extraction algorithm of the present invention, compute the weights of adjacent words and carry out the first round of term extraction.

6. The array after extraction is: [professional, programmer, program coding personnel, Chinese software personnel, senior programmer, systems analyst, project manager].

7. Match the remaining words against the term list and reject non-terms; this completes the extraction of terms.

8. The final result is: [programmer, English, program, professional, programmer, program coding personnel, Chinese software personnel, senior programmer, systems analyst, project manager].

9. Perform term bank matching again on the words generated in step 4 to obtain the matched terms, then deduplicate and merge these terms with the terms generated in step 8 to obtain the final term result.
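The deduplicate-and-merge operation of step 9 can be sketched as an order-preserving union of the two result lists. The first list below is the step 8 result from this example; the second list, standing in for the term bank matches, is invented for illustration, and the function name `merge_and_deduplicate` is hypothetical.

```python
from itertools import chain
from typing import Iterable, List


def merge_and_deduplicate(*term_lists: Iterable[str]) -> List[str]:
    """Merge several term lists, keeping the first occurrence of each term."""
    # dict preserves insertion order, so duplicates collapse without reordering.
    return list(dict.fromkeys(chain.from_iterable(term_lists)))


step8_terms = [
    "programmer", "English", "program", "professional", "programmer",
    "program coding personnel", "Chinese software personnel",
    "senior programmer", "systems analyst", "project manager",
]
bank_matches = ["programmer", "software", "project manager"]  # assumed data

final_terms = merge_and_deduplicate(step8_terms, bank_matches)
print(final_terms)
```

Note that the step 8 list itself contains "programmer" twice, so deduplication matters even before the second list is merged in.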

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A term identification and extraction method, characterized in that the term identification and extraction method comprises: performing multiple rounds of identification and extraction on terms; identifying terms composed of multiple adjacent terms; matching translations; and performing term extraction.

2. The term identification and extraction method according to claim 1, characterized in that the technical term identification and extraction method comprises:

B) division into domains;

C) domain operation and word segmentation: the segmented words are checked by a part-of-speech tagging algorithm, and the probability that each word is a term is judged; a word with a low probability is ignored, and a word with a high probability is retained;

D) the words produced in step C) are matched against the term bank for the given language and domain; a word that matches is identified as a term, and the remaining words go on to the next step;

E) the words remaining after step D) are filtered against the non-term vocabulary list; a word found in the non-term vocabulary list is identified as not being a term;

3. The term identification and extraction method according to claim 1, characterized in that the technical term identification and extraction method further comprises:

1) splitting the given text T into complete sentences, $T = [S_1, S_2, \ldots, S_m]$;

2) for each sentence, performing word segmentation and part-of-speech tagging, filtering out stop words, and retaining only words of the specified parts of speech, $S_i = [t_{i,1}, t_{i,2}, \ldots, t_{i,m}]$, where $t_{i,j} \in S_i$ are the candidate terms retained;

3) constructing a candidate term graph $G = (V, E)$, where V is the node set, consisting of the candidate terms generated; then using the co-occurrence relation to construct the edge between any two nodes, an edge existing between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, that is, at most K words co-occur;

4) according to the formula, and in combination with a large corpus, iteratively propagating the weight of each node until convergence;

5) sorting the node weights in descending order to obtain the T most important words as candidate terms;

6) marking the T most important words obtained in step 5) in the original text and, if they form adjacent phrases, combining them into multi-word terms and adding them to the term sequence;

7) determining two groups of data: terms and non-terms;

8) integrating the term data generated by the two rounds of term processing, deduplicating, and merging, finally obtaining all terms.

4. The term identification and extraction method according to claim 2, characterized in that the method of checking the segmented words is: each word produced by segmentation is tagged by a part-of-speech tagging algorithm, and numerals, measure words, adverbs, prepositions, conjunctions, auxiliary words, interjections, and the like are removed;

the method of matching against the term bank for the given language and domain is: each word produced is looked up in the term bank in turn;

the method of filtering against the non-term vocabulary list is: each remaining word is looked up in the non-term vocabulary list in turn, and a word found there is identified as not being a term.

5. The term identification and extraction method according to claim 3, characterized in that the method of splitting the given text T into complete sentences is: splitting into sentences at punctuation marks;

the method of performing word segmentation and part-of-speech tagging on each sentence, filtering out stop words, and retaining only words of the specified parts of speech is: each word produced by segmentation is tagged by a part-of-speech tagging algorithm, and numerals, measure words, adverbs, prepositions, conjunctions, auxiliary words, interjections, and the like are removed;

the method of using the co-occurrence relation to construct the edge between any two nodes is: a window is constructed centered on the current word according to the co-occurrence relation;

the method of iteratively propagating the weight of each node is: within the window of each word, the weight relationship between that word and every other word in the window is computed in turn;

the method of sorting the node weights in descending order is: the nodes are sorted by weight, largest first;

the method of integrating, deduplicating, and merging the term data generated by the two rounds of term processing is: the two rounds of term processing yield two sets of results, which are merged, keeping only one copy of any repeated word.

6. A language translation system using the term identification and extraction method according to any one of claims 1 to 5.