CN108287825A - A kind of term identification abstracting method and system - Google Patents
A kind of term identification abstracting method and system
- Publication number
- CN108287825A
- Authority
- CN
- China
- Prior art keywords
- term
- word
- vocabulary
- word segmentation
- extraction method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of language recognition and discloses a term recognition and extraction method and system. The method comprises: performing multiple rounds of recognition and extraction on terms; recognizing compound terms composed of multiple terms; matching translations; and extracting the terms. To ease the work of localization translators and improve translation efficiency, the invention provides a term recognition and extraction method in which a program automatically analyzes a document, extracts its technical terms and quickly matches their translations, improving both the translators' working efficiency and the accuracy of the translation. Because terms are recognized and extracted in multiple rounds, accuracy improves, and compound terms composed of multiple terms can be recognized accurately.
Description
Technical field
The invention belongs to the technical field of language recognition, and in particular relates to a term recognition and extraction method and system.
Background art
In practice, when translating a document a localization translator must manually screen the technical terms in the document and then translate them. This workflow proves cumbersome, time-consuming and labor-intensive, and above all involves a great deal of repetitive work. In existing localization translation work, translation efficiency is low and accuracy is poor.
In summary, the problems of the prior art are as follows. The main defect of prior-art term extraction arises when a combination of adjacent terms forms a single whole term: after processing, this larger term is split into several terms although it is really one. The cause is that term analysis is applied only to the individual words produced by segmentation, without considering the term relationship between neighboring words. The difficulty lies in judging, by computing the relationship between neighboring words, whether a combination of adjacent words is a term. The prior art cannot use a term extraction algorithm that computes the weights of adjacent terms to judge whether a string composed of adjacent terms is itself a term.
Summary of the invention
In view of the problems in the prior art, the present invention provides a term recognition and extraction method and system.
The invention is realized as follows. A term recognition and extraction method comprises: performing multiple rounds of recognition and extraction on terms; recognizing compound terms composed of multiple terms; matching translations; and extracting the terms.
Further, the technical term recognition and extraction comprises:
A) Preparation: compile a term library for each domain in each language, together with the corresponding translation content, language and domain;
B) Division into domains;
C) Domain operation and word segmentation: each segmented word is checked with a part-of-speech tagging algorithm (every word is tagged after segmentation, and numerals, measure words, adverbs, prepositions, conjunctions, particles, interjections and words of similar parts of speech are removed), and the probability that the word is a term is estimated; words with a low probability are discarded and words with a high probability are retained;
D) The vocabulary produced in step C) is matched against the term library for that language and domain (each generated word is looked up in the term library in turn); words that match are identified as terms, and the remaining words go on to the next step;
E) The words remaining after step D) are filtered against a non-term vocabulary list (each remaining word is looked up in the non-term list in turn); a word found in the non-term list is judged not to be a term;
F) Matching against the term library and the non-term library yields two groups of data: terms and non-terms;
G) The document's term and non-term data then undergo a further round of term extraction using the term extraction method.
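Steps C) to F) can be read as a chain of lookups. The following Python sketch illustrates that reading under stated assumptions: the tag names in `DROPPED_POS`, the `(language, domain)` key layout of the term library, and the function name `classify_words` are all hypothetical, and the probability estimate mentioned in step C) is left out because the text does not say how it is computed.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical tag names for the parts of speech that step C) removes
# (numerals, measure words, adverbs, prepositions, conjunctions,
# particles, interjections). A real tagger will use its own tag set.
DROPPED_POS: Set[str] = {"NUM", "MEASURE", "ADV", "PREP", "CONJ", "PART", "INTJ"}

# Assumed layout: (language, domain) -> {term: translation}.
TermLibrary = Dict[Tuple[str, str], Dict[str, str]]


def classify_words(
    tagged_words: Iterable[Tuple[str, str]],
    library: TermLibrary,
    non_terms: Set[str],
    language: str,
    domain: str,
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Return (terms with translations, rejected non-terms, undecided words)."""
    domain_terms = library.get((language, domain), {})
    terms: Dict[str, str] = {}
    rejected: List[str] = []
    undecided: List[str] = []
    for word, pos in tagged_words:
        if pos in DROPPED_POS:        # step C): part-of-speech filter
            continue
        if word in domain_terms:      # step D): term library match
            terms[word] = domain_terms[word]
        elif word in non_terms:       # step E): non-term list match
            rejected.append(word)
        else:                         # left for the second round, step G)
            undecided.append(word)
    return terms, rejected, undecided


# Illustrative data only: the same word maps to different translations
# depending on the domain.
library: TermLibrary = {
    ("en", "computing"): {"Apple": "Apple Technology"},
    ("en", "food"): {"Apple": "apple fruit"},
}
tagged = [("Apple", "NOUN"), ("very", "ADV"), ("the", "DET"), ("compiler", "NOUN")]
terms, rejected, undecided = classify_words(tagged, library, {"the"}, "en", "food")
print(terms, rejected, undecided)
```

Keying the library by language and domain keeps the translation lookup in step D) a single dictionary access, which is one simple way to make the same word resolve differently per domain.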
Further, the term extraction method comprises:
1) Split the given text T into complete sentences (segmenting at punctuation marks): T = [S1, S2, ..., Sm];
2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words and keep only words of the specified parts of speech (each word is tagged after segmentation by a part-of-speech tagging algorithm, and numerals, measure words, adverbs, prepositions, conjunctions, particles, interjections and words of similar parts of speech are removed): Si = [ti,1, ti,2, ..., ti,n], where ti,j ∈ Si are the candidate terms retained;
3) Build the candidate-term graph G = (V, E), where V is the node set made up of the candidate terms generated in step 2); then construct the edge between any two nodes from the co-occurrence relation (a window is built around the current word; for example, extending two words to the left and two to the right gives a window containing 5 words): an edge exists between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur;
4) According to the formula (which is not reproduced in this text), and in combination with a large corpus, iteratively propagate the weight of each node (within each word's window, compute in turn the weight relationship between that word and each other word in the window) until convergence;
5) Sort the node weights in descending order (largest weight first) and take the T most important words as candidate terms;
6) Mark the T most important words obtained in step 5) in the original text; where they form adjacent phrases, combine them into multi-word terms and add these to the term sequence;
7) Two groups of data are determined: terms and non-terms;
8) Integrate the term data produced by the two rounds of term processing, deduplicate and merge them (the two rounds yield two result sets, which are merged, keeping only one copy of each repeated word) to obtain all the terms.
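Steps 3) to 5) describe a graph-ranking procedure whose formula does not survive in this text. The sketch below is therefore an assumption, not the patent's formula: it substitutes the standard TextRank-style update, score(v) = (1 - d) + d * Σ score(u) / degree(u) over the neighbours u of v, on an unweighted co-occurrence graph. The damping factor 0.85, the tolerance, the sliding window of K consecutive candidate words (rather than a window centred on each word), and all function names are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_graph(sentences: List[List[str]], window: int) -> Dict[str, Set[str]]:
    """Link two candidate words if they fall within `window` consecutive
    candidate words of the same sentence (step 3)."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for words in sentences:
        for i, word in enumerate(words):
            graph[word]  # touch the key so isolated words are still nodes
            for other in words[i + 1 : i + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)
    return graph


def rank_nodes(
    graph: Dict[str, Set[str]],
    damping: float = 0.85,
    tol: float = 1e-6,
    max_iter: int = 200,
) -> Dict[str, float]:
    """Iterate the assumed TextRank-style update until the largest change
    in any score drops below `tol` (step 4)."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) + damping * incoming
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores


def top_candidates(scores: Dict[str, float], count: int) -> List[str]:
    """Sort by descending score (ties broken alphabetically) and keep the
    `count` highest-scoring words (step 5)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:count]]


sentences = [["matlab", "code", "plotting", "ambiguity", "function"]]
graph = build_graph(sentences, window=2)
scores = rank_nodes(graph)
print(top_candidates(scores, 2))
```

With a window of 2 the example sentence forms a simple chain, so the two words next to the chain ends collect the most weight. The "large corpus" that step 4) mentions is not modelled here.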
Another object of the present invention is to provide a language translation system that uses the term recognition and extraction method.
To ease the work of localization translators and improve translation efficiency, the invention provides a term recognition and extraction method in which a program automatically analyzes a document, extracts its technical terms and quickly matches their translations, improving both the translators' working efficiency and the accuracy of the translation: an article that formerly took a translator 3 days can now be completed in 1 day.
The invention performs multiple rounds of recognition and extraction on terms, which improves accuracy, and it can accurately recognize compound terms composed of multiple terms.
Description of the drawings
Fig. 1 is a flow chart of the term recognition and extraction method provided by an embodiment of the present invention.
Detailed description
To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.
The main defect of prior-art term extraction arises when a combination of adjacent terms forms a single whole term: after processing, this larger term is split into several terms although it is really one. Moreover, the prior art cannot use a term extraction algorithm that computes the weights of adjacent terms to judge whether a string composed of adjacent terms is itself a term.
The application principle of the present invention is explained in detail below with reference to the accompanying drawing.
As shown in Fig. 1, the term recognition and extraction method provided by an embodiment of the present invention comprises:
One) Technical term recognition and extraction:
A) Preparation: compile a term library for each domain in each language, together with the corresponding translation content, language and domain.
For example, for the term "Apple" in the computer technology domain, the corresponding translation content may be "Apple Technology".
B) Division into domains: for the term "Apple" in the food domain, the corresponding translation content may be "apple fruit".
The program processes the document and segments its text content into words (removing all punctuation marks before segmenting).
C) Domain operation and word segmentation: each segmented word is checked with a part-of-speech tagging algorithm, and the probability that the word is a term is estimated; words with a low probability are discarded and words with a high probability are retained.
D) The vocabulary produced in step C) is matched against the term library for that language and domain; words that match are identified as terms, and the remaining words go on to the next step.
E) The words remaining after step D) are filtered against a non-term vocabulary list; a word found in the non-term list is judged not to be a term.
F) The preceding steps of matching against the term library and the non-term library yield two groups of data: terms and non-terms.
G) The document's data then undergo a further round of term extraction using the term extraction algorithm, with the following steps:
Term extraction algorithm:
1) Split the given text T into complete sentences, i.e. T = [S1, S2, ..., Sm].
2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words and keep only words of the specified parts of speech, i.e. Si = [ti,1, ti,2, ..., ti,n], where ti,j ∈ Si are the candidate terms retained.
3) Build the candidate-term graph G = (V, E), where V is the node set made up of the candidate terms generated in step 2); then construct the edge between any two nodes from the co-occurrence relation: an edge exists between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.
4) According to the formula above, and in combination with a large corpus, iteratively propagate the weight of each node until convergence.
5) Sort the node weights in descending order and take the T most important words as candidate terms.
6) Mark the T most important words obtained in step 5) in the original text; where they form adjacent phrases, combine them into multi-word terms. For example, if the text contains the sentence "Matlab code for plotting ambiguity function" and "Matlab" and "code" are both candidate terms, they are combined into "Matlab code", which is added to the term sequence.
7) Two groups of data are determined: terms and non-terms, where the term group includes terms made up of multiple words or multiple terms.
8) Integrate the term data produced by the two rounds of term processing, deduplicate and merge them to obtain all the terms in the text.
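Step 6) can be illustrated with a short Python sketch that scans the original token sequence and joins each run of adjacent candidate words into one multi-word term. The function name `merge_adjacent` and the whitespace tokenization are assumptions made for the example; for Chinese text the pieces would presumably be joined without a space.

```python
from typing import List, Set


def merge_adjacent(tokens: List[str], candidates: Set[str]) -> List[str]:
    """Join each run of consecutive candidate tokens into a single term."""
    terms: List[str] = []
    run: List[str] = []
    for token in tokens:
        if token in candidates:
            run.append(token)              # extend the current run
        else:
            if run:                        # a non-candidate closes the run
                terms.append(" ".join(run))
            run = []
    if run:                                # flush a run that ends the sentence
        terms.append(" ".join(run))
    return terms


tokens = "Matlab code for plotting ambiguity function".split()
print(merge_adjacent(tokens, {"Matlab", "code"}))  # prints ['Matlab code']
```

A candidate with no candidate neighbour simply stays a single-word term, so the same pass produces both single-word and multi-word entries for the term sequence.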
The invention is further described below with reference to a specific embodiment.
The term recognition and extraction method provided by an embodiment of the present invention comprises:
1. Test data: "A programmer is a professional engaged in program development and maintenance. Programmers are generally divided into program designers and program coders, but the boundary between the two is not very clear, especially in China. Software practitioners fall into four categories: junior programmers, senior programmers, systems analysts and project managers."
2. This passage is first segmented into words: [programmer is to be engaged in, program, exploitation,, it safeguards, profession, personnel,., generally, will, programmer is divided into, program, design, personnel, and, program, coding, personnel,, but, the two, boundary, It is and no, very, clear,, especially, be, China,., software, working, personnel are divided into, primary, programmer,, advanced, programmer,, system, analyst, and, project, manager, four, big/a, class,.].
3. Punctuation marks, adjectives, verbs, interjections and the like are removed.
4. The words remaining after this processing are: [programmer, English, program, are developed, and are safeguarded, profession, personnel, programmer, program, design, personnel, program, coding, personnel, boundary, especially, China, software, personnel are divided into, and programmer is advanced, programmer, System, analyst, project, manager].
5. Using the term extraction algorithm of the invention, the weights of adjacent words are computed and the first round of term extraction is carried out.
6. The array after extraction is: [professional, programmer, program coding personnel, Chinese software personnel, senior programmer, systems analyst, project manager].
7. The remaining words are matched against the term list and non-terms are rejected, which completes the extraction of terms.
8. The final result is: [programmer, English, program, professional, programmer, program coding personnel, Chinese software personnel, senior programmer, systems analyst, project manager].
9. The words generated in step 4 are matched against the term library once more; the terms matched in this way are deduplicated and merged with the term phrases generated in step 8 to give the final term result.
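The deduplicate-and-merge operation in step 9 can be sketched in a few lines of Python. The function name `merge_results` and the way the example lists are split between the two rounds are illustrative assumptions; only the individual terms come from the example above.

```python
from typing import Iterable, List


def merge_results(first_round: Iterable[str], second_round: Iterable[str]) -> List[str]:
    """Merge two term lists, keeping one copy of each term in first-seen order."""
    merged: List[str] = []
    seen = set()
    for term in list(first_round) + list(second_round):
        if term not in seen:      # skip anything already collected
            seen.add(term)
            merged.append(term)
    return merged


# Hypothetical split of the example terms across the two rounds.
round_one = ["professional", "programmer", "senior programmer", "programmer"]
round_two = ["programmer", "systems analyst", "project manager"]
print(merge_results(round_one, round_two))
# prints ['professional', 'programmer', 'senior programmer', 'systems analyst', 'project manager']
```

Preserving first-seen order is a design choice made for the sketch; the text only requires that each repeated term be kept once.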
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (6)
1. A term recognition and extraction method, characterized in that the method comprises: performing multiple rounds of recognition and extraction on terms; recognizing compound terms composed of multiple terms; matching translations; and extracting the terms.
2. The term recognition and extraction method according to claim 1, characterized in that the technical term recognition and extraction comprises:
A) Preparation: compile a term library for each domain in each language, together with the corresponding translation content, language and domain;
B) Division into domains;
C) Domain operation and word segmentation: each segmented word is checked with a part-of-speech tagging algorithm, and the probability that the word is a term is estimated; words with a low probability are discarded and words with a high probability are retained;
D) The vocabulary produced in step C) is matched against the term library for that language and domain; words that match are identified as terms, and the remaining words go on to the next step;
E) The words remaining after step D) are filtered against a non-term vocabulary list; a word found in the non-term list is judged not to be a term;
F) Matching against the term library and the non-term library yields two groups of data: terms and non-terms;
G) The document's term and non-term data then undergo a further round of term extraction using the term extraction method.
3. The term recognition and extraction method according to claim 1, characterized in that the technical term recognition and extraction method further comprises:
1) Split the given text T into complete sentences: T = [S1, S2, ..., Sm];
2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words and keep only words of the specified parts of speech: Si = [ti,1, ti,2, ..., ti,n], where ti,j ∈ Si are the candidate terms retained;
3) Build the candidate-term graph G = (V, E), where V is the node set made up of the candidate terms generated; then construct the edge between any two nodes from the co-occurrence relation: an edge exists between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur;
4) According to the formula above, and in combination with a large corpus, iteratively propagate the weight of each node until convergence;
5) Sort the node weights in descending order and take the T most important words as candidate terms;
6) Mark the T most important words obtained in step 5) in the original text; where they form adjacent phrases, combine them into multi-word terms and add these to the term sequence;
7) Two groups of data are determined: terms and non-terms;
8) Integrate the term data produced by the two rounds of term processing, deduplicate and merge them to obtain all the terms.
4. The term recognition and extraction method according to claim 2, characterized in that the method of checking the segmented words is: after segmentation, tag each word with a part-of-speech tagging algorithm and remove numerals, measure words, adverbs, prepositions, conjunctions, particles, interjections and words of similar parts of speech;
the method of matching against the term library for that language and domain is: look up each generated word in the term library in turn to see whether it exists;
the method of filtering against the non-term vocabulary list is: look up each remaining word in the non-term list in turn; if it exists, the word is not a term.
5. The term recognition and extraction method according to claim 3, characterized in that the method of splitting the given text T into complete sentences is: segment at punctuation marks;
the method of performing word segmentation and part-of-speech tagging on each sentence, filtering out stop words and keeping only words of the specified parts of speech is: after segmentation, tag each word with a part-of-speech tagging algorithm and remove numerals, measure words, adverbs, prepositions, conjunctions, particles, interjections and words of similar parts of speech;
the method of constructing the edge between any two nodes from the co-occurrence relation is: build a window centered on the current word according to the co-occurrence relation;
the method of iteratively propagating the weight of each node is: within each word's window, compute in turn the weight relationship between that word and each other word in the window;
the method of sorting the node weights in descending order is: sort by weight, with the largest weight first;
the method of integrating, deduplicating and merging the term data produced by the two rounds of term processing is: the two rounds yield two result sets, which are merged, keeping only one copy of each repeated word.
6. A language translation system using the term recognition and extraction method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810009626.7A CN108287825A (en) | 2018-01-05 | 2018-01-05 | A kind of term identification abstracting method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108287825A true CN108287825A (en) | 2018-07-17 |
Family
ID=62834962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810009626.7A Pending CN108287825A (en) | 2018-01-05 | 2018-01-05 | A kind of term identification abstracting method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108287825A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902290A (en) * | 2019-01-23 | 2019-06-18 | 广州杰赛科技股份有限公司 | A kind of term extraction method, system and equipment based on text information |
CN111046660A (en) * | 2019-11-21 | 2020-04-21 | 深圳无域科技技术有限公司 | Method and device for recognizing text professional terms |
CN114328826A (en) * | 2021-12-20 | 2022-04-12 | 青岛檬豆网络科技有限公司 | Method for extracting key words and abstracts of technical achievements and technical requirements |
CN115204190A (en) * | 2022-09-13 | 2022-10-18 | 中科聚信信息技术(北京)有限公司 | Device and method for converting financial field terms into English |
CN116702786A (en) * | 2023-08-04 | 2023-09-05 | 山东大学 | Chinese professional term extraction method and system integrating rules and statistical features |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103778243A (en) * | 2014-02-11 | 2014-05-07 | 北京信息科技大学 | Domain term extraction method |
US20140278359A1 (en) * | 2013-03-15 | 2014-09-18 | Luminoso Technologies, Inc. | Method and system for converting document sets to term-association vector spaces on demand |
CN105760368A (en) * | 2016-03-11 | 2016-07-13 | 张广睿 | Deep processing method for characters of document |
CN106951414A (en) * | 2017-03-30 | 2017-07-14 | 万迅 | A kind of academic text vocabulary identification of function method sorted based on machine learning |
Non-Patent Citations (3)
Title |
---|
WEI LIU,ET AL: "A genetic algorithm enabled ensemble for unsupervised medical term extraction from clinical letters", 《HEALTH INFORMATION SCIENCE AND SYSTEMS》 * |
马佩勋,等: "基于TF* PDF 的热点关键短语提取", 《计算机应用研究》 * |
黄政豪: "基于术语自动抽取的科技文献翻译辅助***的设计与实现", 《万方学位论文》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902290A (en) * | 2019-01-23 | 2019-06-18 | 广州杰赛科技股份有限公司 | A kind of term extraction method, system and equipment based on text information |
CN109902290B (en) * | 2019-01-23 | 2023-06-30 | 广州杰赛科技股份有限公司 | Text information-based term extraction method, system and equipment |
CN111046660A (en) * | 2019-11-21 | 2020-04-21 | 深圳无域科技技术有限公司 | Method and device for recognizing text professional terms |
CN111046660B (en) * | 2019-11-21 | 2023-05-09 | 深圳无域科技技术有限公司 | Method and device for identifying text professional terms |
CN114328826A (en) * | 2021-12-20 | 2022-04-12 | 青岛檬豆网络科技有限公司 | Method for extracting key words and abstracts of technical achievements and technical requirements |
CN114328826B (en) * | 2021-12-20 | 2024-06-11 | 青岛檬豆网络科技有限公司 | Method for extracting keywords and abstracts of technical achievements and technical demands |
CN115204190A (en) * | 2022-09-13 | 2022-10-18 | 中科聚信信息技术(北京)有限公司 | Device and method for converting financial field terms into English |
CN115204190B (en) * | 2022-09-13 | 2022-11-22 | 中科聚信信息技术(北京)有限公司 | Device and method for converting financial field terms into English |
CN116702786A (en) * | 2023-08-04 | 2023-09-05 | 山东大学 | Chinese professional term extraction method and system integrating rules and statistical features |
CN116702786B (en) * | 2023-08-04 | 2023-11-17 | 山东大学 | Chinese professional term extraction method and system integrating rules and statistical features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180717 |