CN108287825A - Term recognition and extraction method and system - Google Patents

Term recognition and extraction method and system

Info

Publication number
CN108287825A
Authority
CN
China
Prior art keywords
term
word
vocabulary
word segmentation
extraction method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810009626.7A
Other languages
Chinese (zh)
Inventor
王建华
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Translation Language Through Polytron Technologies Inc
Original Assignee
Chinese Translation Language Through Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Translation Language Through Polytron Technologies Inc
Priority to CN201810009626.7A
Publication of CN108287825A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of language recognition and discloses a term recognition and extraction method and system. The method comprises: performing multiple rounds of recognition and extraction on terms; recognizing compound terms formed from multiple terms; matching translations; and extracting the terms. To ease the work of localization translators and improve translation efficiency, the program automatically analyzes a document, extracts its technical terms, and quickly matches their translations, improving both the translators' efficiency and the accuracy of the translation. Performing recognition and extraction multiple times improves accuracy, and compound terms formed from multiple terms can be recognized accurately.

Description

Term recognition and extraction method and system
Technical field
The invention belongs to the technical field of language recognition, and in particular relates to a term recognition and extraction method and system.
Background art
In practice, when translating a document, a localization translator must manually screen the technical terms in it and then translate them. This manual workflow is cumbersome, time-consuming, and laborious, and above all it involves a great deal of repetitive work. In prior-art localization translation, translation efficiency is low and accuracy is poor.
In summary, the problem with the prior art is as follows. Its main defect in term extraction arises when a combination of adjacent terms is itself a single whole term: after processing, the larger term is split into several terms, although it is really one term. The cause is that term analysis is applied only to the individual words produced by word segmentation, without considering the term relationship between neighbouring words. The difficulty lies in computing the relationship between neighbouring words in order to judge whether a combination of adjacent words is a term. The prior art has no term extraction algorithm that judges, by computing the weights of adjacent terms, whether the character string formed by adjacent terms is a single term.
Summary of the invention
In view of the problems of the prior art, the present invention provides a term recognition and extraction method and system.
The invention is realized as follows. A term recognition and extraction method comprises: performing multiple rounds of recognition and extraction on terms; recognizing compound terms formed from multiple terms; matching translations; and extracting the terms.
Further, the technical term recognition and extraction comprises:
a) Preparation: compile a termbase for each domain of each language, recording the corresponding translation content, the language, and the domain;
b) Division into domains;
c) Domain operation and word segmentation: check each segmented word with a part-of-speech tagging algorithm (each word produced by segmentation is tagged with its part of speech, and words tagged as numerals, quantifiers, adverbs, prepositions, conjunctions, auxiliary words, interjections, and the like are removed) and judge the probability that the word is a term; a word with a low probability is ignored, and a word with a high probability is retained;
d) Match the words produced in step c) against the termbase for the language and domain (each word in turn is looked up in the termbase); a word that matches is identified as a term, and the remaining words go on to the next step;
e) Filter the words remaining after step d) against a non-term vocabulary list (each remaining word in turn is looked up in the non-term list); a word found in the non-term list is determined not to be a term;
f) Through matching against the termbase and the non-term list, two groups of data are determined: terms and non-terms;
g) Apply the term extraction method to the document's term and non-term data to perform a further round of term extraction.
Further, the term extraction method comprises:
1) Split the given text T into complete sentences (splitting at punctuation marks): T = [S1, S2, ..., Sm];
2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words of the specified parts of speech (each word produced by segmentation is tagged with its part of speech, and words tagged as numerals, quantifiers, adverbs, prepositions, conjunctions, auxiliary words, interjections, and the like are removed): Si = [ti,1, ti,2, ..., ti,n], where ti,j ∈ Si are the candidate terms retained;
3) Build a candidate term graph G = (V, E), where V is the node set made up of the candidate terms generated in step 2); then use the co-occurrence relation to construct the edge between any two nodes (a window is constructed around the current word; for example, extending two words to the left and two to the right gives a window containing 5 words). An edge exists between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur;
4) According to the formula (not reproduced in this text), and in combination with a large corpus, iteratively propagate the weight of each node (within the window of each word, the weight relation between that word and each other word in the window is computed in turn) until convergence;
5) Sort the node weights in descending order (largest weight first) to obtain the T most important words as candidate terms;
6) Mark the T most important words obtained in step 5) in the original text; if they form adjacent phrases, combine them into multi-word terms and add these to the term sequence;
7) Two groups of data are determined: terms and non-terms;
8) Integrate the term data produced by the two term-processing passes, then deduplicate and merge (the two passes yield two result sets; these are merged, and repeated words are reduced to a single occurrence) to obtain all the terms.
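Steps 3) to 5) can be sketched as follows. The weighting formula is not reproduced in this text, so the update rule below is an assumption: a TextRank-style weighted update with a damping factor of 0.85. The function names, the damping value, and the convergence tolerance are all hypothetical, and the role of the large corpus is not modelled.

```python
from collections import defaultdict


def build_graph(sentences, window=5):
    """Step 3): connect candidate words that co-occur within `window` words."""
    weights = defaultdict(lambda: defaultdict(float))
    for words in sentences:
        for i, word in enumerate(words):
            for j in range(i + 1, min(i + window, len(words))):
                other = words[j]
                if other != word:
                    weights[word][other] += 1.0
                    weights[other][word] += 1.0
    return {word: dict(nbrs) for word, nbrs in weights.items()}


def rank(weights, damping=0.85, tol=1e-6, max_iter=200):
    """Step 4): propagate node weights until they stop changing (assumed rule)."""
    if not weights:
        return {}
    scores = {node: 1.0 for node in weights}
    out_sum = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node, nbrs in weights.items():
            incoming = sum(weights[nbr][node] / out_sum[nbr] * scores[nbr]
                           for nbr in nbrs)
            new_scores[node] = (1 - damping) + damping * incoming
        delta = max(abs(new_scores[n] - scores[n]) for n in scores)
        scores = new_scores
        if delta < tol:
            break
    return scores


def top_candidates(scores, t):
    """Step 5): sort by weight, largest first, and keep the top t words."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered[:t]]


sentences = [["program", "design", "personnel"],
             ["program", "coding", "personnel"]]
scores = rank(build_graph(sentences, window=3))
print(top_candidates(scores, 2))
```

In this toy input "program" and "personnel" co-occur in both sentences, so they end up with higher weights than "design" and "coding".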
Another object of the present invention is to provide a language translation system that uses the term recognition and extraction method.
To ease the work of localization translators and improve translation efficiency, the present invention provides a term recognition and extraction method in which a program automatically analyzes a document, extracts its technical terms, and quickly matches their translations. This improves both the translators' efficiency and the accuracy of the translation: an article that previously took a translator 3 days to translate can now be completed in 1 day.
The present invention performs recognition and extraction on terms multiple times, which improves accuracy, and it can accurately recognize compound terms formed from multiple terms.
Description of the drawings
Fig. 1 is a flow chart of the term recognition and extraction method provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The main defect of the prior art in term extraction arises when a combination of adjacent terms is itself a single whole term: after processing, the larger term is split into several terms, although it is really one term. The prior art has no term extraction algorithm that judges, by computing the weights of adjacent terms, whether the character string formed by adjacent terms is a single term.
The application principle of the present invention is explained in detail below with reference to the accompanying drawing.
As shown in Fig. 1, the term recognition and extraction method provided by an embodiment of the present invention comprises:
I) Technical term recognition and extraction:
a) Preparation: compile a termbase for each domain of each language, recording the corresponding translation content, the language, and the domain.
For example, in the field of computer technology the term "Apple" may have the corresponding translation content "Apple Technology".
b) Domain division: in the food domain, the same term "Apple" may instead have the corresponding translation content "apple fruit".
The program processes the document and performs word segmentation on its text content (punctuation marks are stripped and the text is segmented into words).
c) Domain operation and word segmentation: check each segmented word with a part-of-speech tagging algorithm and judge the probability that the word is a term; a word with a low probability is ignored, and a word with a high probability is retained.
d) Match the words produced in step c) against the termbase for the language and domain; a word that matches is identified as a term, and the remaining words go on to the next step.
e) Filter the words remaining after step d) against the non-term vocabulary list; a word found in the non-term list is determined not to be a term.
f) Through the preceding steps and the matching against the termbase and the non-term list, two groups of data are determined: terms and non-terms.
g) Apply the term extraction algorithm to the document's data to perform a further round of term extraction, with the following steps:
Term extraction algorithm:
1) Split the given text T into complete sentences, i.e. T = [S1, S2, ..., Sm].
2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words of the specified parts of speech, i.e. Si = [ti,1, ti,2, ..., ti,n], where ti,j ∈ Si are the candidate terms retained.
3) Build a candidate term graph G = (V, E), where V is the node set made up of the candidate terms generated in step 2); then use the co-occurrence relation to construct the edge between any two nodes. An edge exists between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur.
4) According to the formula (not reproduced in this text), and in combination with a large corpus, iteratively propagate the weight of each node until convergence.
5) Sort the node weights in descending order to obtain the T most important words as candidate terms.
6) Mark the T most important words obtained in step 5) in the original text; if they form adjacent phrases, combine them into multi-word terms. For example, if the text contains the sentence "Matlab code for plotting ambiguity function" and both "Matlab" and "code" are candidate terms, they are combined into "Matlab code" and added to the term sequence.
7) Two groups of data are determined: terms and non-terms. The term group includes terms made up of multiple words or multiple terms.
8) Integrate the term data produced by the two term-processing passes, then deduplicate and merge, to obtain all the terms in the document.
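Step 6), combining adjacent candidate words into multi-word terms, can be sketched as below. This is an illustrative sketch only: the function name `merge_adjacent` and the `joiner` parameter are hypothetical, and the source does not say how runs longer than two words are handled, so the sketch simply joins every unbroken run of candidates.

```python
def merge_adjacent(tokens, candidates, joiner=" "):
    """Join each run of adjacent candidate words into one multi-word term.

    tokens: the words of the original sentence, in order.
    candidates: the set of top-ranked candidate words from step 5).
    joiner: separator used when joining; an empty string would suit text
            written without spaces between words.
    """
    terms, run = [], []
    for token in list(tokens) + [None]:  # the sentinel flushes the last run
        if token is not None and token in candidates:
            run.append(token)
        else:
            if run:
                terms.append(joiner.join(run))
            run = []
    return terms


sentence = "Matlab code for plotting ambiguity function".split()
print(merge_adjacent(sentence, {"Matlab", "code"}))
```

With the sentence from the example above and "Matlab" and "code" as candidates, the two adjacent words are joined into the single term "Matlab code".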
The invention is further described below with reference to a specific embodiment.
The term recognition and extraction method provided by an embodiment of the present invention comprises:
1. Test data: "A programmer is a professional engaged in program development and maintenance. Programmers are generally divided into program designers and program coders, but the boundary between the two is not very clear, particularly in China. Software practitioners are divided into four major categories: junior programmers, senior programmers, systems analysts, and project managers."
2. First, this passage is segmented into words: [programmer, is, engaged in, program, development, ",", maintenance, of, professional, personnel, ".", generally, will, programmer, divided into, program, design, personnel, and, program, coding, personnel, ",", but, the two, boundary, is, not, very, clear, ",", especially, in, China, ".", software, working, personnel, divided into, junior, programmer, ",", senior, programmer, ",", system, analyst, and, project, manager, four, major, class, "."].
3. Punctuation marks, adjectives, verbs, interjections, and the like are removed.
4. The words remaining after processing are: [programmer, English, program, development, maintenance, professional, personnel, programmer, program, design, personnel, program, coding, personnel, boundary, especially, China, software, personnel, divided into, programmer, senior, programmer, system, analyst, project, manager].
5. Using the term extraction algorithm of the present invention, the weights of adjacent words are calculated and a first round of term extraction is carried out.
6. The array after extraction is: [professional, programmer, program coding personnel, Chinese software personnel, senior programmer, systems analyst, project manager].
7. The remaining words are matched against the term list and non-terms are rejected, which completes the extraction of terms.
8. The final result is: [programmer, English, program, professional, programmer, program coding personnel, Chinese software personnel, senior programmer, systems analyst, project manager].
9. The words generated in step 4 are then matched against the termbase to find matching terms; these terms and the terms generated in step 8 are deduplicated and merged to obtain the final term result.
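The final deduplicate-and-merge step can be sketched in a few lines of Python. The function name `merge_term_lists` and the choice to preserve first-seen order are assumptions; the source only requires that repeated words be reduced to a single occurrence.

```python
def merge_term_lists(first_pass, second_pass):
    """Merge two term lists, keeping one copy of each term in first-seen order."""
    # dict.fromkeys keeps insertion order and drops later duplicates.
    return list(dict.fromkeys(list(first_pass) + list(second_pass)))


graph_terms = ["programmer", "senior programmer", "project manager"]
termbase_terms = ["programmer", "systems analyst"]
print(merge_term_lists(graph_terms, termbase_terms))
```

Preserving order keeps the result stable from run to run, which makes the merged list easier to review than an unordered set.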
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. A term recognition and extraction method, characterized in that the method comprises: performing multiple rounds of recognition and extraction on terms; recognizing compound terms formed from multiple terms; matching translations; and extracting the terms.
2. The term recognition and extraction method according to claim 1, characterized in that the technical term recognition and extraction comprises:
a) Preparation: compile a termbase for each domain of each language, recording the corresponding translation content, the language, and the domain;
b) Division into domains;
c) Domain operation and word segmentation: check each segmented word with a part-of-speech tagging algorithm and judge the probability that the word is a term; a word with a low probability is ignored, and a word with a high probability is retained;
d) Match the words produced in step c) against the termbase for the language and domain; a word that matches is identified as a term, and the remaining words go on to the next step;
e) Filter the words remaining after step d) against a non-term vocabulary list; a word found in the non-term list is determined not to be a term;
f) Through matching against the termbase and the non-term list, two groups of data are determined: terms and non-terms;
g) Apply the term extraction method to the document's term and non-term data to perform a further round of term extraction.
3. The term recognition and extraction method according to claim 1, characterized in that the technical term recognition and extraction further comprises:
1) Split the given text T into complete sentences: T = [S1, S2, ..., Sm];
2) For each sentence, perform word segmentation and part-of-speech tagging, filter out stop words, and retain only words of the specified parts of speech: Si = [ti,1, ti,2, ..., ti,n], where ti,j ∈ Si are the candidate terms retained;
3) Build a candidate term graph G = (V, E), where V is the node set made up of the candidate terms generated; then use the co-occurrence relation to construct the edge between any two nodes, an edge existing between two nodes only if their corresponding words co-occur within a window of length K, where K is the window size, i.e. at most K words co-occur;
4) According to the formula (not reproduced in this text), and in combination with a large corpus, iteratively propagate the weight of each node until convergence;
5) Sort the node weights in descending order to obtain the T most important words as candidate terms;
6) Mark the T most important words obtained in step 5) in the original text; if they form adjacent phrases, combine them into multi-word terms and add these to the term sequence;
7) Two groups of data are determined: terms and non-terms;
8) Integrate the term data produced by the two term-processing passes, then deduplicate and merge, to obtain all the terms.
4. The term recognition and extraction method according to claim 2, characterized in that the segmented words are checked as follows: after word segmentation, each word is tagged with its part of speech by a part-of-speech tagging algorithm, and words tagged as numerals, quantifiers, adverbs, prepositions, conjunctions, auxiliary words, interjections, and the like are removed;
matching against the termbase for the language and domain is performed as follows: each generated word in turn is looked up in the termbase to check whether it exists there;
matching and filtering against the non-term vocabulary list is performed as follows: each remaining word in turn is looked up in the non-term list to check whether it exists there; if it does, the word is not a term.
5. The term recognition and extraction method according to claim 3, characterized in that the given text T is split into complete sentences as follows: the text is split at punctuation marks;
for each sentence, word segmentation and part-of-speech tagging are performed, stop words are filtered out, and only words of the specified parts of speech are retained, as follows: after word segmentation, each word is tagged with its part of speech by a part-of-speech tagging algorithm, and words tagged as numerals, quantifiers, adverbs, prepositions, conjunctions, auxiliary words, interjections, and the like are removed;
the edge between any two nodes is constructed from the co-occurrence relation as follows: a window is constructed around the current word according to the co-occurrence relation;
the weight of each node is iteratively propagated as follows: within the window of each word, the weight relation between that word and each other word in the window is computed in turn;
the node weights are sorted in descending order as follows: the nodes are ordered by weight, with the largest weight first;
the term data produced by the two term-processing passes are integrated, deduplicated, and merged as follows: the two passes yield two result sets; these are merged, and repeated words are reduced to a single occurrence.
6. A language translation system using the term recognition and extraction method according to any one of claims 1 to 5.
CN201810009626.7A 2018-01-05 2018-01-05 A kind of term identification abstracting method and system Pending CN108287825A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810009626.7A CN108287825A (en) 2018-01-05 2018-01-05 A kind of term identification abstracting method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810009626.7A CN108287825A (en) 2018-01-05 2018-01-05 A kind of term identification abstracting method and system

Publications (1)

Publication Number Publication Date
CN108287825A true CN108287825A (en) 2018-07-17

Family

ID=62834962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810009626.7A Pending CN108287825A (en) 2018-01-05 2018-01-05 A kind of term identification abstracting method and system

Country Status (1)

Country Link
CN (1) CN108287825A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902290A (en) * 2019-01-23 2019-06-18 广州杰赛科技股份有限公司 A kind of term extraction method, system and equipment based on text information
CN111046660A (en) * 2019-11-21 2020-04-21 深圳无域科技技术有限公司 Method and device for recognizing text professional terms
CN114328826A (en) * 2021-12-20 2022-04-12 青岛檬豆网络科技有限公司 Method for extracting key words and abstracts of technical achievements and technical requirements
CN115204190A (en) * 2022-09-13 2022-10-18 中科聚信信息技术(北京)有限公司 Device and method for converting financial field terms into English
CN116702786A (en) * 2023-08-04 2023-09-05 山东大学 Chinese professional term extraction method and system integrating rules and statistical features

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method
US20140278359A1 (en) * 2013-03-15 2014-09-18 Luminoso Technologies, Inc. Method and system for converting document sets to term-association vector spaces on demand
CN105760368A (en) * 2016-03-11 2016-07-13 张广睿 Deep processing method for characters of document
CN106951414A (en) * 2017-03-30 2017-07-14 万迅 A kind of academic text vocabulary identification of function method sorted based on machine learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140278359A1 (en) * 2013-03-15 2014-09-18 Luminoso Technologies, Inc. Method and system for converting document sets to term-association vector spaces on demand
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method
CN105760368A (en) * 2016-03-11 2016-07-13 张广睿 Deep processing method for characters of document
CN106951414A (en) * 2017-03-30 2017-07-14 万迅 A kind of academic text vocabulary identification of function method sorted based on machine learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WEI LIU,ET AL: "A genetic algorithm enabled ensemble for unsupervised medical term extraction from clinical letters", 《HEALTH INFORMATION SCIENCE AND SYSTEMS》 *
马佩勋,等: "基于TF* PDF 的热点关键短语提取", 《计算机应用研究》 *
黄政豪: "基于术语自动抽取的科技文献翻译辅助***的设计与实现", 《万方学位论文》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902290A (en) * 2019-01-23 2019-06-18 广州杰赛科技股份有限公司 A kind of term extraction method, system and equipment based on text information
CN109902290B (en) * 2019-01-23 2023-06-30 广州杰赛科技股份有限公司 Text information-based term extraction method, system and equipment
CN111046660A (en) * 2019-11-21 2020-04-21 深圳无域科技技术有限公司 Method and device for recognizing text professional terms
CN111046660B (en) * 2019-11-21 2023-05-09 深圳无域科技技术有限公司 Method and device for identifying text professional terms
CN114328826A (en) * 2021-12-20 2022-04-12 青岛檬豆网络科技有限公司 Method for extracting key words and abstracts of technical achievements and technical requirements
CN114328826B (en) * 2021-12-20 2024-06-11 青岛檬豆网络科技有限公司 Method for extracting keywords and abstracts of technical achievements and technical demands
CN115204190A (en) * 2022-09-13 2022-10-18 中科聚信信息技术(北京)有限公司 Device and method for converting financial field terms into English
CN115204190B (en) * 2022-09-13 2022-11-22 中科聚信信息技术(北京)有限公司 Device and method for converting financial field terms into English
CN116702786A (en) * 2023-08-04 2023-09-05 山东大学 Chinese professional term extraction method and system integrating rules and statistical features
CN116702786B (en) * 2023-08-04 2023-11-17 山东大学 Chinese professional term extraction method and system integrating rules and statistical features

Similar Documents

Publication Publication Date Title
CN108287825A (en) A kind of term identification abstracting method and system
Cotterell et al. Labeled morphological segmentation with semi-markov models
Poon et al. Unsupervised morphological segmentation with log-linear models
CN102214166B (en) Machine translation system and machine translation method based on syntactic analysis and hierarchical model
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
WO2017177809A1 (en) Word segmentation method and system for language text
CN106611041A (en) New text similarity solution method
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
Darwish et al. Using Stem-Templates to Improve Arabic POS and Gender/Number Tagging.
CN103678287B (en) A kind of method that keyword is unified
Jahangir et al. N-gram and gazetteer list based named entity recognition for urdu: A scarce resourced language
CN106528621A (en) Improved density text clustering algorithm
CN108959630A (en) A kind of character attribute abstracting method towards English without structure text
CN106610954A (en) Text feature word extraction method based on statistics
CN108763192B (en) Entity relation extraction method and device for text processing
CN103678288A (en) Automatic proper noun translation method
Btoush et al. Rule based approach for Arabic part of speech tagging and name entity recognition
Singh et al. Improving neural machine translation using rule-based machine translation
CN110502759A (en) The Chinese for incorporating classified dictionary gets over the outer word treatment method of hybrid network nerve machine translation set
Hládek et al. Online natural language processing of the Slovak language
Hellwig Morphological disambiguation of classical Sanskrit
Khoufi et al. Statistical-based system for morphological annotation of Arabic texts
Meselhi et al. Hybrid named entity recognition-application to Arabic language
Srinivasagan et al. An automated system for tamil named entity recognition using hybrid approach
Jafar Tafreshi et al. A novel approach to conditional random field-based named entity recognition using Persian specific features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180717
