CN107862045A - A cross-language plagiarism detection method based on multiple features - Google Patents

A cross-language plagiarism detection method based on multiple features

Info

Publication number
CN107862045A
Authority
CN
China
Prior art keywords
feature
language
translation
chinese
carried out
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711084337.5A
Other languages
Chinese (zh)
Other versions
CN107862045B (en)
Inventor
刘刚
胡昱临
李光曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201711084337.5A priority Critical patent/CN107862045B/en
Publication of CN107862045A publication Critical patent/CN107862045A/en
Application granted granted Critical
Publication of CN107862045B publication Critical patent/CN107862045B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • G06F16/337Profile generation, learning or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a cross-language plagiarism detection method based on multiple features. (1) Build a corpus. (2) Construct translation features: translation features are constructed from the Europeanization and translationese phenomena that commonly occur in translated articles, and the features are cleaned by feature selection so that effective features are kept and invalid or weakly effective features are filtered out. (3) Select features: a number of effective features are selected from the many candidate features to train a classifier, which then determines whether one or several Chinese articles contain cross-language plagiarism. (4) Detect plagiarism by feature correspondence: the Chinese features are matched precisely to their English counterparts, the plagiarism results are filtered and generated according to translation-feature and structural-feature correspondence, and the results are finally confirmed with WordNet. The invention can address cross-language plagiarism using the various features mined from translated text.

Description

A cross-language plagiarism detection method based on multiple features
Technical field
The present invention relates to a method for detecting whether an article contains plagiarism.
Background art
(1) Europeanization and translationese in English-Chinese translation are identified.
The mutual influence of English and Chinese has brought subtle changes to both languages, in pronunciation, vocabulary, grammar, rhetoric and other aspects. Although the influence is mutual, the influence of English on Chinese is far greater than that of Chinese on English. As monolingual plagiarism detection became increasingly unable to cope with the academic misconduct it encountered, cross-language plagiarism detection emerged. Monolingual plagiarism detection techniques, however, are not applicable to cross-language plagiarism detection. The two mainstream approaches to cross-language plagiarism detection at present are cross-language information retrieval (CLIR) and cross-language similarity detection (CLSD).
Correspondence between two languages can be established by means of a dictionary, a corpus or translation. Research in this area has to deal with word ambiguity and polysemy, the ranking of output results, the segmentation of query terms, dependence on multilingual resources, and other problems. CLIR has become increasingly popular in recent years: at the 2009 Cross-Language Evaluation Forum, 7 of the 10 methods directly used machine translation to convert the two languages into one.
CLSD refers to cross-language similarity detection, which has much in common with CLIR. CLSD compares the similarity between articles written in different languages. Many CLSD algorithms are in common use, chiefly those based on language syntax, machine translation, terminology dictionaries, parallel or comparable corpora, and semantic networks.
Abroad, there are many papers on cross-language plagiarism, for example for English-Arabic, English-German and English-Czech. Owing to the particularities of Chinese, however, the foreign literature does not address English-Chinese plagiarism detection, although work on other language pairs offers considerable reference value for English-Chinese detection.
(2) Feature engineering is used. Feature engineering is the process of using domain knowledge about the data to create features that allow machine learning algorithms to reach their best performance. As the saying goes, "data and features determine the upper limit of machine learning, and models and algorithms merely approach that limit." Feature engineering therefore holds an extremely important position in machine learning, and in practical applications it is the key to success.
With limited data, using too many features makes training the classification algorithm too complex and prone to overfitting. Feature selection is therefore needed once the features have been built. Its purpose is to select, from the many candidate features, the subset with the greatest statistical significance. This both screens out invalid features and reduces the dimensionality of the feature space, lowering model complexity.
Summary of the invention
The object of the present invention is to provide a cross-language plagiarism detection method based on multiple features that can address cross-language plagiarism using the various features mined from translated text.
The object of the present invention is achieved as follows:
(1) First, build a corpus
The corpus is divided into a Chinese training set and a Chinese test set, and into two classes. The first-class corpus consists of Chinese articles containing cross-language plagiarism, obtained by automatically translating English documents; the second-class corpus consists of original Chinese articles, obtained by downloading authoritative Chinese papers;
The first-class corpus is constructed as follows: a large number of English articles are crawled with a web crawler and batch-translated automatically by a program to obtain plagiarized Chinese documents. The method batch-processes PDF articles that carry specific numbers: an article numbered n yields m plain-text files named n.m, where m is the paragraph number of the article. It mainly comprises the following three steps,
1) Convert the PDF document into an XML document whose text is marked up with tags;
2) Convert according to the XML tags, which carry information about each kind of text: the content between <P> and </P> is one paragraph. Read the document sequentially; on reading a <P> tag, add a special marker before it, and remove the other tags together with the content between them. What remains is a document with a special marker in front of every paragraph;
3) Read the marked document with a program; each time a special marker is read, write the content that follows it into a plain-text document and remove the marker;
(2) Construction of translation features
Translation features are constructed from the Europeanization and translationese phenomena that commonly occur in translated articles. The features are cleaned by feature selection, so that effective features are kept and invalid or weakly effective features are filtered out;
(3) Feature selection
A number of effective features are selected from the many candidate features to train a classifier, which then determines whether one or several Chinese articles contain cross-language plagiarism;
(4) Plagiarism detection based on feature correspondence
The Chinese features are matched precisely to their English counterparts, the plagiarism results are filtered and generated according to translation-feature and structural-feature correspondence, and the results are finally confirmed with WordNet;
The detection is divided into four stages. Stage one, preprocessing of the plagiarism candidate set: the Chinese and English candidate sets are divided into paragraphs and tagged with parts of speech. Stage two, first filtering: features are matched precisely according to the translation features, and the paragraph-distance algorithm is applied. Stage three, second filtering: the plagiarism results are filtered again according to structural features. Stage four, confirmation of the final result: the plagiarism results are confirmed with WordNet to obtain the final plagiarism results.
The fact that the articles are written in different languages is the main obstacle to detecting cross-language plagiarism. Unifying the representation of the two languages is the conventional approach, but unifying the form introduces inaccuracy. Some forms of unification perform poorly. For example, translating the texts into the same language before comparing them places high demands on translation quality, and different translations affect the outcome to very different degrees, so the detection result is uncertain. There is also the word-alignment problem: during detection, words in the cross-language documents have to be aligned through a dictionary or a parallel corpus. Such a dictionary or parallel corpus is large and requires several probabilistic dictionaries, which consumes a great deal of manpower and material resources and is difficult to extend to many languages.
The present invention takes the particularities of Chinese into account. It detects whether an article contains plagiarism by looking for a writing style that does not conform to Chinese usage, thereby narrowing the scope of plagiarism detection. This avoids the problems of poor translation quality and inaccurate disambiguation and detects cross-language plagiarism by a distinctive route. The invention is significant for detecting academic misconduct in Chinese and helps to improve academic norms and the level of scientific research.
The present invention first proposes a method for dividing paragraphs automatically. It then filters the plagiarism candidate set twice, once by translation-feature correspondence and once by structural-feature correspondence, removing paragraphs that do not meet the conditions and keeping those that do to form the plagiarism results, which are finally confirmed with WordNet. Lastly, the experimental results are verified: comparing the similarity results obtained with and without the two filtering passes demonstrates the effectiveness of the two-pass filtering.
Analysis of results
After segmentation, every article is divided into a number of paragraphs. For a given test set, a fixed number designates a particular paragraph of a particular article and serves as the document name. For example, C307 denotes paragraph 07 of article 3 in the Chinese candidate set. Similarly, the English test set is denoted by E. For convenience of counting, the following experiment takes the first 1000 paragraphs of 142 articles as the test set.
Of the multiple English paragraphs compared against one Chinese paragraph, all those that do not meet the conditions are filtered out. After the 1000 paragraphs were compared, 749 paragraphs retained exactly one suspicious paragraph after the two filtering passes; 736 of these were matched accurately to the paragraph they plagiarized, and only 13 were mismatched. Of the remaining 251 paragraphs, 24 had no suspicious paragraph left after the two filtering passes, and 227 had multiple suspicious paragraphs. For these 227 paragraphs the specific plagiarized paragraph still has to be identified, and this final confirmation is done with the WordNet dictionary. According to the statistics, of the 227 paragraphs verified, 220 were matched accurately to their plagiarism result and only 7 were screened wrongly. The cause is error in the WordNet similarity calculation, so that the correct paragraph did not obtain the maximum similarity. The plagiarized paragraph was nevertheless still among the suspicious paragraphs left after filtering, which indirectly illustrates the validity of the results.
Comparison and verification over the 1000 paragraphs above gives an experimental accuracy of 74%. For confirming the final result, the present invention is more effective than a method without the two filtering passes.
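The paragraph counts reported above can be tallied in a few lines. The sketch below only re-adds the published numbers; the reading that the 74% figure corresponds to the 736 paragraphs resolved by the two filtering passes alone is an assumption, since the source does not say how that figure was computed.

```python
# Paragraph counts as reported in the result analysis.
total = 1000
single_candidate = 749   # exactly one suspicious paragraph left after two filters
single_correct = 736     # of those, matched to the right source paragraph
no_candidate = 24        # nothing left after two filters
multi_candidate = 227    # several suspicious paragraphs left
multi_correct = 220      # of those, resolved correctly by the WordNet step

# The three groups partition the test set, and the error counts agree.
assert single_candidate + no_candidate + multi_candidate == total
assert single_candidate - single_correct == 13
assert multi_candidate - multi_correct == 7

# Assumed reading: share resolved correctly by filtering alone,
# and share resolved correctly once the WordNet step is included.
after_filtering = single_correct / total
after_wordnet = (single_correct + multi_correct) / total
print(f"{after_filtering:.1%}")  # 73.6%
print(f"{after_wordnet:.1%}")    # 95.6%
```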
Brief description of the drawings
Fig. 1 shows the technical route of cross-language plagiarism detection classification.
Fig. 2 shows the method roadmap of cross-language plagiarism detection.
Fig. 3 shows the overall framework of cross-language plagiarism detection.
Fig. 4 shows the module diagram of cross-language plagiarism detection.
Fig. 5 is a detailed flowchart of automatic paragraph division.
Detailed description of the embodiments
For cross-language plagiarism, it should first be determined whether a given article contains cross-language plagiarism. Once such articles have been found, it can then be determined which paragraphs or parts of them are plagiarized. To address this, the present invention finds and selects effective translation features from Chinese articles containing cross-language plagiarism, assigns the features different weights, and builds a classification model for cross-language plagiarism. Given Chinese articles can then be classified to detect which of them may contain plagiarism and which do not.
By building a multi-feature cross-language plagiarism detection technique, the present invention aims to address cross-language plagiarism using the various features mined from translated text. The invention first analyzes and summarizes the state of research on monolingual and bilingual plagiarism and proposes a multi-feature cross-language plagiarism detection model, which comprises multi-feature cross-language plagiarism classification and feature-correspondence-based cross-language plagiarism detection. Feature selection uses the chi-square test. On that basis, the number of occurrences of a feature in a text and the stability of the feature within a class are also considered, and the weight of each feature is then calculated with a new feature-weighting method. The selected features are matched to their English forms of expression, and an algorithm for calculating the feature distance between paragraphs is proposed to compare corresponding Chinese and English paragraphs. Structural-feature correspondence means comparing the structure of the Chinese and English paragraphs, keeping structurally similar paragraphs and filtering out those that differ greatly. Finally, the detection results are verified and confirmed with a WordNet-based method, achieving cross-language detection.
The technical means of the present invention mainly include:
1. First, build a corpus, divided into a Chinese training set and a Chinese test set.
Because the corpus is to be used as a training set for supervised classification of articles, it must be divided into two major classes. One class consists of Chinese articles containing cross-language plagiarism; this material can be obtained by automatically translating English documents. The other class consists of original Chinese articles, obtained by downloading authoritative Chinese papers.
The second-class corpus is relatively easy to obtain. For the first class, hardly any such corpus exists at home or abroad, so the present invention first crawls a large number of English articles with a web crawler and batch-translates them automatically by a program to obtain plagiarized Chinese documents.
The method batch-processes PDF articles that carry specific numbers: an article numbered n yields m plain-text files named n.m, where m is the paragraph number of the article.
The three key steps are as follows.
(1) Convert the PDF document into an XML document whose text is marked up with tags. XML, the Extensible Markup Language, is mainly used to store data and structure. Its syntax is stricter than that of HTML, which is a comparatively loose Web language. After conversion to XML, the tags clearly mark each paragraph, which lays a good foundation for the paragraph division that follows. In addition, once the PDF has been converted to XML, the headers and footers of the article are removed automatically, which saves the tedium of removing them by hand.
(2) Convert according to the XML tags, which carry information about each kind of text. The content between <P> and </P> is one paragraph. Read the document sequentially; on reading a <P> tag, add a special marker before it, and remove the other tags together with the content between them. What remains is a document with a special marker in front of every paragraph.
(3) Read the marked document with a program. Each time a special marker is read, write the content that follows it into a plain-text document and remove the marker. Each such plain-text document corresponds to one paragraph of the article.
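The three steps above can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the PDF-to-XML conversion is assumed to have been done already by an external tool, and the marker string `@@PARA@@` and the function names are hypothetical, since the source names neither.

```python
import re
from pathlib import Path

MARK = "@@PARA@@"  # hypothetical special marker; the source does not name one


def mark_paragraphs(xml_text: str) -> str:
    """Keep only the text inside <P>...</P>, each paragraph prefixed by MARK."""
    bodies = re.findall(r"<P\b[^>]*>(.*?)</P>", xml_text, flags=re.S | re.I)
    marked = []
    for body in bodies:
        text = re.sub(r"<[^>]+>", "", body)  # drop tags nested in the paragraph
        text = " ".join(text.split())        # normalise whitespace
        if text:
            marked.append(MARK + text)
    return "\n".join(marked)


def split_marked(marked: str) -> list[str]:
    """Split the marked document at each marker and drop the marker itself."""
    return [part.strip() for part in marked.split(MARK) if part.strip()]


def write_paragraph_files(article_no: int, paragraphs: list[str], out_dir: Path) -> list[Path]:
    """Write paragraph m of article n to a plain-text file named n.m."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for m, text in enumerate(paragraphs, start=1):
        path = out_dir / f"{article_no}.{m}"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


sample = ("<doc><title>Heading</title><P>First paragraph.</P>"
          "<fig>x</fig><P>Second <b>one</b>.</P></doc>")
paragraphs = split_marked(mark_paragraphs(sample))
print(paragraphs)  # ['First paragraph.', 'Second one.']
```

A regular expression is used here only to keep the sketch short; a real converter's output may call for a proper XML parser.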
2. Construction of translation features. Features are essential in machine learning, and the accuracy of a learner depends largely on the quality of the features built. The present invention constructs translation features from the Europeanization and translationese phenomena that commonly occur in translated articles and cleans the features by feature selection, keeping effective features and filtering out invalid or weakly effective ones. The chi-square test (CHI) is chosen as the feature-selection method.
3. Feature selection. Not all of the constructed features are effective, so a number of effective features must be selected from the many candidates to train the classifier, which then determines whether one or several Chinese articles contain cross-language plagiarism.
4. Plagiarism detection based on feature correspondence. The Chinese features are matched precisely to their English counterparts, the plagiarism results are filtered and generated according to translation-feature and structural-feature correspondence, and the results are finally confirmed with WordNet. Chinese and English features are put into correspondence, and paragraphs that do not satisfy the correspondence are filtered out, which greatly narrows the range of plagiarism results. After the distance between paragraphs has been calculated on the basis of translation-feature correspondence, most of the paragraphs that do not meet the correspondence conditions can be filtered out for a Chinese paragraph in the plagiarism candidate set, although some suspicious English plagiarized paragraphs still remain. Structural features are then extracted to filter the plagiarism results a second time. Five structural features are chosen: the length of the sentence and the lengths of the nouns, verbs, adjectives and adverbs in the sentence. They are used to screen and filter the plagiarism candidate set further.
The detection is divided into four stages. Stage one, preprocessing of the plagiarism candidate set: the Chinese and English candidate sets are divided into paragraphs and tagged with parts of speech, which makes it easier to locate plagiarized positions precisely and to post-process them. Stage two, first filtering: features are matched precisely according to the translation features, and the paragraph-distance algorithm is applied. Stage three, second filtering: the plagiarism results are filtered again according to structural features, using the structural-feature filtering algorithm. Stage four, confirmation of the final result: the plagiarism results are confirmed with WordNet to obtain the final plagiarism results.
The main processes of the present invention are described in more detail below with reference to the accompanying drawings.
Part I. Text preprocessing
Input: the text to be analyzed
Output: the lexical set after word segmentation
With the sentence as the unit and as the basis for term extraction, this stage extracts the data set by sentence splitting, word segmentation and stop-word filtering.
Part II. Vocabulary filtering (stop-word filtering)
After Chinese word segmentation, the text becomes a sequence of individual strings. The words that most affect the meaning of a sentence are mostly nouns and verbs; other modifying words have little effect on sentence semantics. The nouns and verbs that affect sentence semantics therefore need to be retained, and they are called retained words. In the segmented sequence, however, some words occur very frequently in the text but have little actual effect on text analysis. These are mainly modal particles, prepositions, adverbs and the like. They have no definite meaning of their own and only play a role as part of a sentence. Such words are called stop words. If these unimportant words can be completely filtered out of the domain text, the storage space of the system is greatly reduced, as are the workload and computation of the later stages of the detection system. The preprocessing part must therefore not only select suitable word-segmentation and tagging algorithms and improve them appropriately, but also design the vocabulary-filtering process properly. The process is shown in Fig. 3.
Precondition: word segmentation has been performed
Input: the lexical set to be filtered
Output: the lexical set of the text after vocabulary filtering
Step 1: Obtain the lexical set to be filtered after Chinese word segmentation.
Step 2: Load the stop-word file, read a word from the set and look it up in the stop-word file. If the word is found, filter it out; otherwise keep it.
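Steps 1 and 2 amount to a set lookup per token. The following sketch assumes the text has already been segmented into tokens by some word segmenter; the stop-word entries and the token list are illustrative only and do not come from the source.

```python
def load_stopwords(lines):
    """Build a stop-word set from the lines of a stop-word file, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def filter_stopwords(tokens, stopwords):
    """Drop every token found in the stop-word set and keep all others, in order."""
    return [token for token in tokens if token not in stopwords]


# Illustrative data: in practice the lines would be read from a stop-word file
# and the tokens would come from a Chinese word segmenter.
stopwords = load_stopwords(["的", "了", "在", ""])
tokens = ["本文", "在", "实验", "中", "验证", "了", "方法", "的", "有效性"]
kept = filter_stopwords(tokens, stopwords)
print(kept)  # ['本文', '实验', '中', '验证', '方法', '有效性']
```

Holding the stop words in a set keeps each lookup at roughly constant cost, which matters when the same list is applied to a whole corpus.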
Part III. Feature construction
The present invention summarizes the following classes of text features and feature representations. These features are Europeanization phenomena that occur frequently in plagiarized text and infrequently in non-plagiarized text.
Precondition: vocabulary filtering has been completed
Output: Europeanized features
Step 3: Select the Europeanized features.
Step 4: Return the selected feature set;
Part IV. Feature selection
Precondition: feature construction has been completed.
Input: feature set
Output: feature set after selection
Step 5: Apply the chi-square test to the features.
Step 6: Score each feature with the test, find a suitable weight for each feature, and sort the features.
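Steps 5 and 6 can be illustrated with the standard chi-square statistic for a 2x2 contingency table (feature present or absent, plagiarized or original). The sketch below covers only that scoring and the sorting; the additional feature-weighting method mentioned in the description is not specified in the source and is not reproduced here. The feature names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: plagiarized documents containing the feature
    b: original documents containing the feature
    c: plagiarized documents without the feature
    d: original documents without the feature
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # feature (or class) never varies, so it carries no signal
    return n * (a * d - b * c) ** 2 / denom


def rank_features(docs, labels):
    """docs: one set of feature names per document; labels: 1 plagiarized, 0 original."""
    scores = {}
    for feature in set().union(*docs):
        a = sum(1 for doc, y in zip(docs, labels) if feature in doc and y == 1)
        b = sum(1 for doc, y in zip(docs, labels) if feature in doc and y == 0)
        c = sum(1 for doc, y in zip(docs, labels) if feature not in doc and y == 1)
        d = sum(1 for doc, y in zip(docs, labels) if feature not in doc and y == 0)
        scores[feature] = chi_square(a, b, c, d)
    # Highest score first; ties broken by name so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical feature names on a toy corpus of four documents.
docs = [{"long_attributive", "passive_bei"}, {"long_attributive"}, {"passive_bei"}, set()]
labels = [1, 1, 0, 0]
ranked = rank_features(docs, labels)
print(ranked)  # [('long_attributive', 4.0), ('passive_bei', 0.0)]
```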
Part V. SVM model construction
SVM is currently one of the most effective and most commonly used classifiers. It is based on structural risk minimization and generalizes well. Training it is a convex quadratic programming problem, so a locally optimal solution is also the globally optimal solution.
Precondition: feature selection has been completed.
Input: the selected Europeanized features.
Output: an SVM model based on the Europeanized features.
Step 7: Choose the parameter C, replace the inner product with an RBF kernel, and obtain the dual problem of the SVM.
Step 8: Update the variables that violate the constraints with the SMO algorithm until all variables satisfy the KKT conditions.
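For reference, Steps 7 and 8 correspond to the textbook soft-margin SVM dual with an RBF kernel, written out below. The notation (multipliers \alpha_i, labels y_i \in \{-1, +1\}, kernel width \gamma, bias b) is introduced here for illustration and is not taken from the source, which does not state its parameter values.

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n}\alpha_i
  - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,K(x_i,x_j) \\
\text{s.t.}\quad & \sum_{i=1}^{n}\alpha_i y_i = 0,
  \qquad 0 \le \alpha_i \le C,\quad i = 1,\dots,n \\
\text{where}\quad & K(x_i,x_j) = \exp\!\left(-\gamma\,\lVert x_i - x_j\rVert^{2}\right),
  \qquad f(x) = \sum_{j=1}^{n}\alpha_j y_j K(x_j,x) + b \\
\text{KKT:}\quad & \alpha_i = 0 \;\Rightarrow\; y_i f(x_i) \ge 1,\qquad
  0 < \alpha_i < C \;\Rightarrow\; y_i f(x_i) = 1,\qquad
  \alpha_i = C \;\Rightarrow\; y_i f(x_i) \le 1
\end{aligned}
```

SMO repeatedly picks a pair of multipliers that violate these KKT conditions, solves the two-variable subproblem in closed form while preserving the equality constraint, and stops once every multiplier satisfies the conditions within a tolerance.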
Part VI. Automatic paragraph division
Because the articles are stored as PDF, it must be considered how to convert PDF accurately into segmented TXT plain text. Many third-party programs and open-source libraries for converting PDF to TXT are available online, but their results are not satisfactory and contain many content and formatting errors, which would seriously harm the quality of the corpus and make the later stages inaccurate. The present invention therefore uses a dedicated method to achieve this.
Precondition: SVM construction has been completed
Input: text
Output: the divided paragraphs
Step 9: Convert the PDF document into an XML document whose text is marked up with tags;
Step 10: Convert according to the XML tags, which carry information about each kind of text.
Step 11: Read the marked document with a program.
Part VII. English part-of-speech tagging
Precondition: paragraph division
Input: the divided paragraphs
Output: part-of-speech tagging result
Step 12: Filtering of tables and pictures: tables and pictures carry special tags in the XML document that differ from paragraph tags, and they are not read during segmentation.
Step 13: Filtering of titles and merging of paragraphs: on the one hand, titles are sometimes marked with the same <P></P> tags as paragraphs, so titles must be identified and filtered out. On the other hand, when converting from PDF to XML, a paragraph is sometimes broken and split into two or more paragraphs, which then need to be merged.
Part VIII. First filtering of plagiarism results based on translation-feature correspondence
Precondition: part-of-speech tagging
Input: tagged text
Output: text after the first filtering
Step 14: Align Chinese and English according to the features.
Step 15: Calculate the paragraph distance of the texts.
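The source does not give the formula for the paragraph distance. As one possible reading, the sketch below uses a weighted sum of absolute differences between the counts of each aligned feature in the Chinese paragraph and in a candidate English paragraph. The feature names, weights and threshold are all hypothetical; the paragraph identifiers follow the E-prefixed naming used in the result analysis.

```python
def paragraph_distance(zh_counts, en_counts, weights):
    """Weighted L1 distance over aligned feature counts (an assumed formula)."""
    return sum(
        weight * abs(zh_counts.get(feature, 0) - en_counts.get(feature, 0))
        for feature, weight in weights.items()
    )


def first_filter(zh_counts, candidates, weights, max_distance):
    """Keep the candidate English paragraphs whose distance is within the threshold."""
    kept = {}
    for paragraph_id, en_counts in candidates.items():
        distance = paragraph_distance(zh_counts, en_counts, weights)
        if distance <= max_distance:
            kept[paragraph_id] = distance
    return kept


# Hypothetical features: counts of a Chinese construction and of its English counterpart.
weights = {"passive": 2.0, "relative_clause": 1.0}
zh_paragraph = {"passive": 2, "relative_clause": 1}
candidates = {
    "E101": {"passive": 2, "relative_clause": 1},
    "E102": {"passive": 0, "relative_clause": 4},
}
kept = first_filter(zh_paragraph, candidates, weights, max_distance=1.0)
print(kept)  # {'E101': 0.0}
```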
Part IX. Second filtering of plagiarism results
Step 16: Given a Chinese plagiarized paragraph and the plagiarism results that passed the first filtering in the previous section, compare them one by one. If a feature exceeds a specific threshold, remove that candidate from the plagiarism results. The paragraphs remaining after filtering are the plagiarism results after the second filtering.
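Step 16 can be sketched with the five structural features named in the description (lengths of the sentence and of its nouns, verbs, adjectives and adverbs). How the lengths are measured and how Chinese and English values are made comparable is not stated in the source, so the direct per-feature absolute difference, the threshold values and the sample figures below are assumptions.

```python
STRUCT_FEATURES = ("sentence", "noun", "verb", "adjective", "adverb")


def second_filter(zh_features, candidates, thresholds):
    """Drop any candidate whose difference on some structural feature exceeds its threshold."""
    kept = []
    for paragraph_id, en_features in candidates.items():
        within = all(
            abs(zh_features[name] - en_features[name]) <= thresholds[name]
            for name in STRUCT_FEATURES
        )
        if within:
            kept.append(paragraph_id)
    return kept


# Hypothetical structural feature values for one Chinese paragraph and two candidates.
zh_paragraph = {"sentence": 30, "noun": 8, "verb": 5, "adjective": 3, "adverb": 2}
candidates = {
    "E101": {"sentence": 28, "noun": 7, "verb": 5, "adjective": 3, "adverb": 1},
    "E205": {"sentence": 55, "noun": 15, "verb": 9, "adjective": 6, "adverb": 4},
}
thresholds = {"sentence": 10, "noun": 3, "verb": 3, "adjective": 2, "adverb": 2}
remaining = second_filter(zh_paragraph, candidates, thresholds)
print(remaining)  # ['E101']
```

If several candidates remain after this pass, the description leaves the final choice to the WordNet-based similarity check.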

Claims (3)

1. A cross-language plagiarism detection method based on multiple features, characterized in that:
(1) a corpus is built;
(2) construction of translation features
translation features are constructed from the Europeanization and translationese phenomena that commonly occur in translated articles, and the features are cleaned by feature selection so that effective features are kept and invalid or weakly effective features are filtered out;
(3) feature selection
a number of effective features are selected from the many candidate features to train a classifier, which then determines whether one or several Chinese articles contain cross-language plagiarism;
(4) plagiarism detection based on feature correspondence
the Chinese features are matched precisely to their English counterparts, the plagiarism results are filtered and generated according to translation-feature and structural-feature correspondence, and the plagiarism results are finally confirmed with WordNet.
2. The cross-language plagiarism detection method based on multiple features according to claim 1, characterized in that building the corpus specifically comprises:
The corpus is divided into a Chinese training set and a Chinese test set, and contains two classes of material. The first class consists of Chinese articles with cross-language plagiarism, obtained by automatically translating English documents; the second class consists of original Chinese articles, obtained by downloading authoritative Chinese papers;
The construction method of the second class of corpus is: a large number of English articles are crawled with a web crawler and automatically translated in batches by a program to obtain plagiarized Chinese documents; a given number of articles in PDF format are processed in batch, so that an article numbered n yields m plain-text files named n.m, where m is the paragraph number of the article. The procedure mainly comprises the following three steps:
1) Convert the document in PDF format into a document in XML format with text markup;
2) The XML tags distinguish each kind of text, and the content between <P> and </P> is one paragraph. Read the document sequentially; whenever a <P> tag is read, add a special marker before it, and remove the other tags and the content between them, so that what remains is a document with a special marker in front of each paragraph;
3) Read the marked document with a program; whenever a special marker is read, write the content that follows it to a plain-text file and remove the special marker.
3. The cross-language plagiarism detection method based on multiple features according to claim 1, characterized in that the plagiarism detection based on feature correspondence is divided into four stages: in the first stage, preprocessing of the plagiarism candidate set, the Chinese and English plagiarism candidate sets are divided into paragraphs and part-of-speech tagged; in the second stage, the first filtering, precise feature correspondence is performed according to the translation features and the paragraph distance algorithm is applied; in the third stage, the second filtering, the plagiarism results are filtered again according to structural features; in the fourth stage, final confirmation, the plagiarism results are confirmed with WordNet to obtain the final plagiarism results.
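Steps 2) and 3) of the corpus construction in claim 2 can be sketched in Python as below. This is an illustrative sketch, not the inventors' program: the marker string `@@PARA@@`, the function names, and the choice to keep the text of tags nested inside a <P> block are assumptions, and the PDF-to-XML conversion of step 1) is taken as already done.

```python
import re
from pathlib import Path

MARKER = "@@PARA@@"  # hypothetical special marker; the patent does not name one


def mark_paragraphs(xml_text):
    """Step 2): keep only <P>...</P> content, each paragraph prefixed by MARKER."""
    paragraphs = re.findall(r"<P>(.*?)</P>", xml_text, flags=re.S)
    # Assumption: tags nested inside a paragraph are stripped, their text kept.
    cleaned = (re.sub(r"<[^>]+>", "", p).strip() for p in paragraphs)
    return "".join(MARKER + text for text in cleaned)


def split_to_files(marked_text, article_number, out_dir):
    """Step 3): write each marked paragraph to a plain-text file named n.m."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pieces = [p for p in marked_text.split(MARKER) if p.strip()]
    paths = []
    for m, paragraph in enumerate(pieces, start=1):
        path = out_dir / f"{article_number}.{m}"
        # The marker is already gone after the split, as step 3) requires.
        path.write_text(paragraph, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `split_to_files(mark_paragraphs(xml_text), 7, "corpus")` on an article numbered 7 would produce files named 7.1, 7.2 and so on, one per paragraph.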
CN201711084337.5A 2017-11-07 2017-11-07 Cross-language plagiarism detection method based on multiple features Active CN107862045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711084337.5A CN107862045B (en) 2017-11-07 2017-11-07 Cross-language plagiarism detection method based on multiple features

Publications (2)

Publication Number Publication Date
CN107862045A (en) 2018-03-30
CN107862045B (en) 2022-01-14

Family

ID=61701211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711084337.5A Active CN107862045B (en) 2017-11-07 2017-11-07 Cross-language plagiarism detection method based on multiple features

Country Status (1)

Country Link
CN (1) CN107862045B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544326A (en) * 2013-11-14 2014-01-29 上海交通大学 Chinese and English cross-language plagiarism recognition method based on characteristics and content of translations
CN103823862A (en) * 2014-02-24 2014-05-28 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
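RAFAEL COREZOLA PEREIR et al.: "A New Approach for Cross-Language Plagiarism Analysis", CLEF2010: MULTILINGUAL AND MULTIMODAL INFORMATION ACCESS EVALUATION *
何文垒: "Research on Chinese-English Cross-Language Text Similarity Based on WordNet", China Master's Theses Full-text Database, Information Science and Technology *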

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309268A (en) * 2019-07-12 2019-10-08 中电科大数据研究院有限公司 A kind of cross-language information retrieval method based on concept map
CN110309268B (en) * 2019-07-12 2021-06-29 中电科大数据研究院有限公司 Cross-language information retrieval method based on concept graph
CN112380834A (en) * 2020-08-25 2021-02-19 中央民族大学 Tibetan language thesis plagiarism detection method and system
CN112380834B (en) * 2020-08-25 2023-10-31 中央民族大学 Method and system for detecting plagiarism of Tibetan paper
CN115828931A (en) * 2023-02-09 2023-03-21 中南大学 Chinese and English semantic similarity calculation method for paragraph-level text

Also Published As

Publication number Publication date
CN107862045B (en) 2022-01-14

Similar Documents

Publication Publication Date Title
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
Kaji et al. Building lexicon for sentiment analysis from massive collection of HTML documents
CN111897970A (en) Text comparison method, device and equipment based on knowledge graph and storage medium
CN102799577B (en) A kind of Chinese inter-entity semantic relation extraction method
CN109800310A (en) A kind of electric power O&M text analyzing method based on structuring expression
CN107039034A (en) A kind of prosody prediction method and system
CN103678684A (en) Chinese word segmentation method based on navigation information retrieval
CN104063387A (en) Device and method abstracting keywords in text
CN108984661A (en) Entity alignment schemes and device in a kind of knowledge mapping
CN109145260A (en) A kind of text information extraction method
CN102214166A (en) Machine translation system and machine translation method based on syntactic analysis and hierarchical model
CN107862045A (en) A kind of across language plagiarism detection method based on multiple features
CN105786971B (en) A kind of grammer point recognition methods towards international Chinese teaching
CN104317882A (en) Decision-based Chinese word segmentation and fusion method
CN103514151A (en) Dependency grammar analysis method and device and auxiliary classifier training method
CN114564912B (en) Intelligent document format checking and correcting method and system
CN111814476A (en) Method and device for extracting entity relationship
CN109977391B (en) Information extraction method and device for text data
CN108009187A (en) A kind of short text Topics Crawling method for strengthening Text Representation
CN106250367A (en) The method building the interdependent treebank of Vietnamese based on the Nivre algorithm improved
CN110162615A (en) A kind of intelligent answer method, apparatus, electronic equipment and storage medium
Hazman et al. An ontology based approach for automatically annotating document segments
CN114298048A (en) Named entity identification method and device
CN105930471A (en) Speech abstract generation method and apparatus
Baishya et al. Present state and future scope of Assamese text processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant