CN105550172B - Distributed text detection method and system - Google Patents
- Publication number
- CN105550172B (application CN201610020566.XA)
- Authority
- CN
- China
- Prior art keywords
- participle
- rwv
- amount
- document
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a distributed text detection method and system. A comparison database collects materials and is stored at different sites in a distributed manner. A word segmentation library collects segmented words and their parts of speech. A word segmentation module segments each material; a word feature value generation module generates word part-of-speech feature values; a word free vector dimension determining module determines the word free vector dimension; a word reduced vector dimension generation module generates the word reduced vector dimension; and a word feature vector generation module generates word feature vectors. For a document to be identified, a document word segmentation module segments the document to obtain a segmentation result, and corresponding modules determine its word free vector dimension, generate its word reduced vector dimension, and generate its word feature vector. A similarity comparison is then carried out.
Description
Technical field
The invention belongs to the field of text detection, and in particular relates to a distributed text detection system.
Background
Text detection refers to judging whether a document is suspected of plagiarizing the text content of one or more other documents. Plagiarism, however, is not entirely equivalent to copying: a document may plagiarize the text content of other documents through various means such as semantic transformation, synonym substitution, or translation of foreign-language documents.
At present there are two main text detection techniques: fingerprint-based detection, and detection based on paragraph word-frequency statistics of the text. Fingerprint-based detection extracts data feature strings, called fingerprints, from the submitted source text and judges whether a document plagiarizes other documents according to the proportion of identical fingerprints. Paragraph word-frequency detection segments the submitted text into words, counts occurrence frequencies for each paragraph of the text, sets a threshold, compares each array of the text under examination with each array of the query text, and finally decides according to this index whether plagiarism has occurred. These prior-art methods suffer, to some degree, from problems such as a low recognition rate and low efficiency.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention provides a distributed text plagiarism detection method and system.
The text plagiarism detection system includes a comparison database for collecting the materials that serve as comparison objects. The comparison database is stored at different sites in a distributed manner, and when it is accessed a particular site can be chosen according to the load of the different sites. A word segmentation library collects segmented words and their parts of speech; each word in the library is given a unique number, and W_ID denotes the unique number of a word in the library. A word segmentation module segments each material and saves the segmentation result in the comparison database. A word feature value generation module counts the number of times each word occurs in the corresponding material and generates the word part-of-speech feature value for each word. A word free vector dimension determining module determines the word free vector dimension WFV from the segmentation result of the material; WFV equals the number of distinct words obtained after segmenting a specific material. A word reduced vector dimension generation module generates the word reduced vector dimension RWV. A word feature vector generation module extracts, for each material, the feature values corresponding to the word reduced vector dimension RWV and generates the word feature vector WVE_RWV. A user access mode detection module prompts the user to upload the document to be identified. When a user detection mode determining module judges that the current user detection mode is the ordinary plagiarism identification mode, a document word segmentation module segments the document to be identified and obtains a segmentation result; a word free vector dimension determining module for the document determines the word free vector dimension WFV_TBI; a word reduced vector dimension generation module for the document generates the word reduced vector dimension RWV_TBI; a word feature vector generation module for the document generates the word feature vector WVE_RWV_TBI; and a similarity comparison is carried out. After the document to be identified has been compared with all materials, all suspected materials are extracted and the document to be identified is compared further against the suspected materials.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.
Description of the drawings
Fig. 1 shows a block diagram of a distributed text detection system according to an embodiment of the present invention;
Fig. 2 shows a sliding-window detection method according to an embodiment of the present invention.
Specific embodiment
To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, specific embodiments, features, and effects of the system and method proposed by the invention are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different references to "an embodiment" or "one embodiment" do not necessarily refer to the same embodiment. In addition, particular features, structures, or characteristics of one or more embodiments may be combined in any suitable form.
As shown in Fig. 1, the distributed text detection system of the present invention (hereinafter "the system") includes a material subsystem, a user subsystem, a suspected-material extraction subsystem, and a comparison subsystem. The material subsystem prepares the materials used for plagiarism detection comparison. The user subsystem manages user login information and determines the user's writing style. The suspected-material extraction subsystem extracts from the comparison database the materials suspected of matching the document to be identified. The comparison subsystem compares the suspected materials with the document to be identified and generates a comparison report.
According to a specific embodiment of the present invention, the material subsystem may further include one or more of: a comparison database; a word segmentation library, which contains a synonym and near-synonym library and a Chinese-foreign synonym library; a word segmentation module; a word group module; a Chinese-foreign word group module; a word part-of-speech classification module; a word group part-of-speech classification module; a Chinese-foreign word group part-of-speech classification module; a word feature value generation module; a word group feature value generation module; a Chinese-foreign word group feature value generation module; a word tightening coefficient generation module; a word group tightening coefficient generation module; a Chinese-foreign word group tightening coefficient generation module; a word tightening coefficient feature vector generation module; a word group tightening coefficient feature vector generation module; a Chinese-foreign word group tightening coefficient feature vector generation module; a word free vector dimension determining module; a word group free vector dimension determining module; a Chinese-foreign word group free vector dimension determining module; a word reduced vector dimension generation module; a word group reduced vector dimension generation module; a Chinese-foreign word group reduced vector dimension generation module; a word feature vector generation module; a word group feature vector generation module; and a Chinese-foreign word group feature vector generation module.
According to a specific embodiment of the present invention, the user subsystem may further include one or more of: a user access mode detection module; a user detection mode determining module; a user writing style test module; a test picture text description feature value generation module; a test article text description feature value generation module; a test picture text description feature vector generation module; a test article text description feature vector generation module; a test picture reference feature vector generation module; a test article reference feature vector generation module; a user test picture text description feature value generation module; a user test picture text description feature vector generation module; a user picture writing style feature vector generation module; a user test article text description feature value generation module; a user test article text description feature vector generation module; a user article writing style feature vector generation module; a user writing style feature vector generation module; a pending document feature value generation module; a pending document feature value feature vector generation module; a user writing style similarity calculation module; a user writing style judgment module; and a user writing style structural auxiliary word judgment module.
According to a specific embodiment of the present invention, the suspected-material extraction subsystem may further include one or more of the following. For the document to be identified: a word segmentation module; a word group module; a Chinese-foreign word group module; a word part-of-speech classification module; a word group part-of-speech classification module; a Chinese-foreign word group part-of-speech classification module; a word feature value generation module; a word group feature value generation module; a Chinese-foreign word group feature value generation module; a word tightening coefficient generation module; a word group tightening coefficient generation module; a Chinese-foreign word group tightening coefficient generation module; a word tightening coefficient feature vector generation module; a word group tightening coefficient feature vector generation module; a Chinese-foreign word group tightening coefficient feature vector generation module; a word free vector dimension determining module; a word group free vector dimension determining module; a Chinese-foreign word group free vector dimension determining module; a word reduced vector dimension generation module; a word group reduced vector dimension generation module; a Chinese-foreign word group reduced vector dimension generation module; a word feature vector generation module; a word group feature vector generation module; and a Chinese-foreign word group feature vector generation module. In addition: a feature vector adjustment module for the document to be identified; a material feature vector adjustment module; an ordinary plagiarism identification similarity calculation module; an extended plagiarism identification similarity calculation module; a multilingual plagiarism identification similarity calculation module; a tightening coefficient statistics module for the document to be identified; a material tightening coefficient statistics module; a formula extraction module; a formula decomposition module; and a tightening coefficient suspected-material extraction module.
According to a specific embodiment of the present invention, the comparison subsystem may further include a sliding window setting module, a sliding window comparison module, and a comparison report generation module.
In a specific embodiment of the present invention, the system includes a comparison database for collecting the materials that serve as comparison objects. The comparison database further includes sub-libraries such as a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library, and a poetry library. The book library collects publicly published books; the paper library collects journal papers, conference papers, dissertations, and the like; the patent library collects published patent documents and the like. When a material is collected, its source must also be saved, for example: the publication date, publisher, author, and book number of a book; the publication date, journal name, issue, and author of a journal paper; the conference name, venue, date, and author of a conference paper; and the school, graduation date, degree level, and author of a dissertation. From the saved source information, a person skilled in the art can uniquely locate the material. Preferably, the materials collected in the comparison database are not limited to Chinese materials and also include foreign-language materials. After the comparison database has been established, it must also be maintained, regularly or irregularly, by adding newly published books, journal papers, conference papers, dissertations, patent documents, and the like. The proverb and common-saying library collects sentences, phrases, and similar materials that circulate widely on the network or among the public. The famous-quotation library collects famous quotations, and the poetry library collects materials such as poems (shi), lyrics (ci), songs (ge), and rhapsodies (fu). The purpose of further establishing the proverb and common-saying library, the famous-quotation library, the poetry library, and the like in the comparison database is to extend the scope of comparison materials beyond traditional books, papers, and patent documents, improving the comprehensiveness of plagiarism detection. Those skilled in the art will appreciate that the comparison database may also include other types of material, which are not described further here.
Preferably, when collecting materials, the comparison database classifies them by the field to which they belong. According to a specific embodiment of the present invention, the field identifier may follow the Chinese Library Classification, which has 5 basic categories and 22 main classes and uses a mixed notation combining Hanyu Pinyin letters with Arabic numerals: one letter denotes a main class, the alphabetical order reflects the order of the main classes, and numbers follow the letter. For example, A1 denotes the works of Marx and Engels, K6 denotes the history of Oceania, and TN denotes electronic technology and communication technology. To accommodate the development of industrial technology, the second-level classes of industrial technology use two letters. Those skilled in the art will appreciate that other classification systems may also be used to assign field identifiers to materials.
Preferably, when collecting materials, the comparison database indexes each collected material separately by title, author, abstract, and body text. Associations are established among the title, author, abstract, and body text of each material, so that the remaining parts of a material can be obtained from any one of them.
Preferably, when collecting materials, the comparison database extracts and copies the formulas present in the collected materials and builds a formula library in which they are stored separately. Each formula in the formula library is associated with the material from which it was extracted, so the full text of the corresponding material can be obtained from a formula in the formula library. According to a specific embodiment of the present invention, when a formula is collected, its independent variable parameters, dependent variable parameters, and operators are extracted and saved separately. According to a specific embodiment of the present invention, after the independent and dependent variable parameters of a formula have been extracted, the specific meaning, dimension, and value range of each parameter are further extracted and saved separately. According to a specific embodiment of the present invention, after the operators of a formula have been extracted, each operator is further annotated with Chinese and foreign-language text. For every formula it collects, the formula library stores the symbolic representation of the independent and dependent variable parameters; the Chinese and foreign-language statements of the specific meaning, dimension, and value range of each independent and dependent variable; and the operators together with their Chinese and foreign-language textual annotations. The purpose of further establishing the formula library in the comparison database is to extend the scope of comparison materials to formula comparison, improving the comprehensiveness of plagiarism detection. Those skilled in the art will appreciate that the comparison database may also extract other content from materials, such as chemical formulas and gene sequences, which is not described further here.
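As an illustration only, one entry of such a formula library could be laid out as in the following Python sketch. The class names, field names, and the sample formula are assumptions made for this example and do not appear in the patent; only the kinds of information stored follow the description above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Parameter:
    """One variable of a formula (hypothetical layout)."""
    symbol: str       # symbolic representation, e.g. "F"
    meaning: str      # specific meaning of the parameter
    dimension: str    # dimension or unit
    value_range: str  # permitted range of values


@dataclass
class FormulaRecord:
    """One formula-library entry, linked back to its source material."""
    material_id: str              # association with the material it came from
    independent: List[Parameter]  # independent variable parameters
    dependent: List[Parameter]    # dependent variable parameters
    operators: Dict[str, str]     # operator symbol -> textual annotation


# Invented sample entry for F = m * a; the identifier is made up.
record = FormulaRecord(
    material_id="book-0001",
    independent=[
        Parameter("m", "mass", "kg", "m > 0"),
        Parameter("a", "acceleration", "m/s^2", "any real number"),
    ],
    dependent=[Parameter("F", "force", "N", "any real number")],
    operators={"=": "equals", "*": "multiplied by"},
)
```

In a real library each textual field would presumably be stored in both Chinese and a foreign language, as the description requires; the sketch keeps a single string per field for brevity.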
According to a specific embodiment of the present invention, the comparison database is stored at different sites in a distributed manner, and when it is accessed a particular site can be chosen according to the load of the different sites. Each site counts the quantity of material extracted from the comparison database during the current unit period, where the quantity may be the number of materials or the number of bytes of material, and from this obtains its average load. Each site periodically reports its average load to the suspected-material extraction subsystem. When the suspected-material extraction subsystem needs to extract materials from the comparison database in order to select suspected materials, it chooses the site with the lowest average load, according to the average loads most recently reported by the sites, and accesses that site. The unit period is configured by the system and may be set to 5, 10, 30, or 60 minutes according to actual needs. According to a specific embodiment of the present invention, different sub-libraries of the comparison database may be stored at different sites in a distributed manner, in which case the comparison database is accessed at the site where the relevant sub-library is stored. When the suspected-material extraction subsystem needs to extract materials from the comparison database in order to select suspected materials, it selects the comparison sub-library to access according to the field or type of the materials to be extracted.
According to a specific embodiment of the present invention, choosing a particular site according to the load of the different sites when accessing the comparison database means obtaining the load of each site before the access and then accessing the site with the lowest load.
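The site selection described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `SiteLoad` structure, the `choose_site` function, and the site names and load figures are all invented for illustration, and the patent does not say how ties between equally loaded sites are broken (here the first one reported wins).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SiteLoad:
    """Most recent load report from one site (hypothetical structure)."""
    site: str            # site identifier
    average_load: float  # materials (or bytes) extracted per unit period


def choose_site(reports: Sequence[SiteLoad]) -> str:
    """Return the site whose most recently reported average load is lowest."""
    if not reports:
        raise ValueError("no load reports available")
    return min(reports, key=lambda r: r.average_load).site


# Invented reports, e.g. materials extracted in the last 10-minute period.
reports = [
    SiteLoad("site-a", 120.0),
    SiteLoad("site-b", 45.5),
    SiteLoad("site-c", 80.0),
]
print(choose_site(reports))
```

With these sample reports the function picks `site-b`, the site with the smallest reported average load.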
According to a specific embodiment of the present invention, the system includes a word segmentation library for collecting segmented words and their parts of speech. The library is set up in advance by the system and maintained regularly, for example by adding new words. Preferably, each word in the library is given a unique number, and W_ID may be used to denote the unique number of a word in the library. The library stores the part of speech of each word, such as noun, verb, adjective, numeral, quantifier, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, or onomatopoeia. According to a specific embodiment of the present invention, segmentation results are divided by part of speech into notional words and function words: notional words include nouns, verbs, adjectives, numerals, quantifiers, and pronouns, and function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Preferably, the library further includes a synonym and near-synonym library, in which words with the same or similar meaning form a group and numbering is done per group. Multiple words with the same or similar meaning correspond to one word group number, and WG_ID may be used to denote the unique number of a word group in the library. Preferably, the library further includes a Chinese-foreign synonym and near-synonym library, in which Chinese and foreign-language words with the same or similar meaning form a group and numbering is done per group. Multiple Chinese and foreign-language words with the same or similar meaning correspond to one Chinese-foreign word group number, and WFG_ID may be used to denote the unique number of a Chinese-foreign word group in the library.
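To make the three kinds of numbering concrete, the following Python sketch models a tiny word segmentation library. The `WordEntry` layout, the `same_group` helper, and every word and number in the table are assumptions invented for illustration; the patent only specifies that W_ID, WG_ID, and WFG_ID exist.

```python
from typing import Dict, NamedTuple, Optional


class WordEntry(NamedTuple):
    w_id: int              # W_ID: unique number of the word
    pos: str               # part of speech stored with the word
    wg_id: Optional[int]   # WG_ID: synonym/near-synonym group, if any
    wfg_id: Optional[int]  # WFG_ID: Chinese-foreign synonym group, if any


# Toy library; all entries and numbers are invented.
WORD_LIBRARY: Dict[str, WordEntry] = {
    "计算机": WordEntry(1, "noun", 10, 100),
    "电脑": WordEntry(2, "noun", 10, 100),
    "computer": WordEntry(3, "noun", None, 100),
    "快速": WordEntry(4, "adjective", 11, None),
}


def same_group(a: str, b: str) -> bool:
    """True when two words share a WG_ID or a WFG_ID."""
    ea, eb = WORD_LIBRARY[a], WORD_LIBRARY[b]
    same_wg = ea.wg_id is not None and ea.wg_id == eb.wg_id
    same_wfg = ea.wfg_id is not None and ea.wfg_id == eb.wfg_id
    return same_wg or same_wfg
```

Grouping by WG_ID lets two Chinese synonyms count as the same unit, while WFG_ID additionally ties them to a foreign-language equivalent, which is what makes detection of synonym substitution and translation possible later on.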
According to a specific embodiment of the present invention, the system includes a word segmentation module for segmenting each material and saving the segmentation result in the comparison database. Preferably, the word segmentation module compares the segmentation result with the parts of speech stored in the word segmentation library to determine the part of speech of each segmented word. Preferably, the word part-of-speech classification module classifies the segmentation result according to the corresponding parts of speech.
According to a specific embodiment of the present invention, the system includes a word group module for segmenting each material and saving the word group results in the comparison database. Preferably, the word group module compares the segmentation results with the parts of speech stored in the word segmentation library to determine the part of speech of each word group result. Preferably, the word group part-of-speech classification module classifies the word group results according to the corresponding parts of speech.
According to a specific embodiment of the present invention, the system includes a Chinese-foreign word group module for segmenting each material and saving the Chinese-foreign word group results in the comparison database. Preferably, the Chinese-foreign word group module compares the Chinese-foreign segmentation results with the parts of speech stored in the word segmentation library to determine the part of speech of each Chinese-foreign word group result. Preferably, the Chinese-foreign word group part-of-speech classification module classifies the Chinese-foreign word group results according to the corresponding parts of speech.
According to a specific embodiment of the present invention, the word part-of-speech classification module, the word group part-of-speech classification module, and the Chinese-foreign word group part-of-speech classification module respectively divide the segmentation results, word group results, and Chinese-foreign word group results by part of speech into class A notional words, class B notional words, class C notional words, class D notional words, and class V function words. Class A notional words include nouns; class B notional words include verbs and adjectives; class C notional words include numerals and quantifiers; class D notional words include pronouns; and class V function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Preferably, the word segmentation library further divides nouns into technical terms and common nouns. According to a specific embodiment of the present invention, segmentation results are divided by part of speech into class A1 notional words, class A2 notional words, class B notional words, class C notional words, class D notional words, and class V function words. Class A1 notional words include technical-term nouns; class A2 notional words include common nouns; class B notional words include verbs and adjectives; class C notional words include numerals and quantifiers; class D notional words include pronouns; and class V function words include adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia. Those skilled in the art may choose a different classification scheme according to actual needs.
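The two classification schemes above amount to a lookup table with an optional refinement for nouns. The Python sketch below shows one way to express that; the English part-of-speech labels, the `classify` function, and its parameter names are assumptions for illustration.

```python
# Part-of-speech -> class, following the A/B/C/D/V scheme described above.
POS_CLASS = {
    "noun": "A",
    "verb": "B", "adjective": "B",
    "numeral": "C", "quantifier": "C",
    "pronoun": "D",
    "adverb": "V", "preposition": "V", "conjunction": "V",
    "auxiliary": "V", "interjection": "V", "onomatopoeia": "V",
}


def classify(pos: str, is_technical_term: bool = False,
             refine_nouns: bool = False) -> str:
    """Return the class of a word from its part of speech.

    With refine_nouns=True, class A is split into A1 (technical terms)
    and A2 (common nouns), as in the second scheme.
    """
    cls = POS_CLASS[pos]
    if refine_nouns and cls == "A":
        return "A1" if is_technical_term else "A2"
    return cls


print(classify("verb"))                                               # B
print(classify("noun", is_technical_term=True, refine_nouns=True))   # A1
```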
According to a specific embodiment of the present invention, the word feature value generation module counts the number of times each word occurs in the corresponding material and generates the word feature value WCV=[W_ID, W_N] for each word, where W_ID denotes the unique number of the word in the word segmentation library and W_N denotes the total number of times the word occurs in the material. Preferably, taking the part of speech of each word into account, the word feature value generation module generates the word part-of-speech feature value WCCV=[W_ID, W_N, W_CHAR], where W_CHAR denotes the part of speech of the word.
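A minimal Python sketch of this counting step follows. The function name, the dictionaries mapping words to W_ID and part of speech, and the sample tokens are assumptions; the sketch also assumes segmentation has already produced a token list, and it sorts the output by W_ID purely for readability.

```python
from collections import Counter
from typing import Dict, List, Optional


def word_feature_values(
    tokens: List[str],
    w_id: Dict[str, int],
    pos: Optional[Dict[str, str]] = None,
) -> List[list]:
    """Build WCV=[W_ID, W_N] per distinct word, or
    WCCV=[W_ID, W_N, W_CHAR] when a part-of-speech table is given."""
    counts = Counter(tokens)
    values = []
    for word, n in counts.items():
        row = [w_id[word], n]
        if pos is not None:
            row.append(pos[word])
        values.append(row)
    return sorted(values, key=lambda row: row[0])


# Invented segmentation result and library numbers.
tokens = ["text", "detect", "text", "system", "text"]
w_id = {"text": 7, "detect": 3, "system": 12}
pos = {"text": "noun", "detect": "verb", "system": "noun"}

print(word_feature_values(tokens, w_id))
# [[3, 1], [7, 3], [12, 1]]
print(word_feature_values(tokens, w_id, pos))
# [[3, 1, 'verb'], [7, 3, 'noun'], [12, 1, 'noun']]
```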
According to a specific embodiment of the present invention, the word group feature value generation module counts the number of times each word group occurs in the corresponding material and generates the word group feature value WGCV=[WG_ID, WG_N] for each word group, where WG_ID denotes the unique number of the word group in the word segmentation library and WG_N denotes the total number of times the word group occurs in the material. Preferably, taking the part of speech of each word group into account, the word group feature value generation module generates the word group part-of-speech feature value WGCCV=[WG_ID, WG_N, WG_CHAR], where WG_CHAR denotes the part of speech of the word group.
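Counting by group rather than by word means that synonyms are pooled before counting. The following Python sketch illustrates this under assumptions: the function name, the word-to-WG_ID table, and the English sample words are invented, and words without a group are simply skipped, a choice the patent does not specify.

```python
from collections import Counter
from typing import Dict, List


def group_feature_values(tokens: List[str],
                         wg_id: Dict[str, int]) -> List[List[int]]:
    """Build WGCV=[WG_ID, WG_N] for every word group found in the tokens.

    Words missing from the group table are ignored (an assumption).
    """
    counts = Counter(wg_id[t] for t in tokens if t in wg_id)
    return [[group, n] for group, n in sorted(counts.items())]


# Invented example: three synonyms share group 1, two share group 2.
tokens = ["fast", "quick", "car", "rapid", "automobile", "the"]
wg_id = {"fast": 1, "quick": 1, "rapid": 1, "car": 2, "automobile": 2}

print(group_feature_values(tokens, wg_id))
# [[1, 3], [2, 2]]
```

A text that replaces "fast" with "quick" throughout therefore yields the same group feature values as the original, whereas its word feature values would differ.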
According to a specific embodiment of the present invention, the Chinese-foreign word group feature value generation module counts the number of times each Chinese-foreign word group occurs in the corresponding material and generates the Chinese-foreign word group feature value WFGCV=[WFG_ID, WFG_N] for each Chinese-foreign word group, where WFG_ID denotes the unique number of the Chinese-foreign word group in the word segmentation library and WFG_N denotes the total number of times the Chinese-foreign word group occurs in the material. Preferably, taking the part of speech of each Chinese-foreign word group into account, the Chinese-foreign word group feature value generation module generates the Chinese-foreign word group part-of-speech feature value WFGCCV=[WFG_ID, WFG_N, WFG_CHAR], where WFG_CHAR denotes the part of speech of the Chinese-foreign word group.
According to a specific embodiment of the present invention, the word-segment closeness coefficient generation module generates word-segment closeness coefficients. A word-segment closeness coefficient is the number of word segments lying between two adjacent occurrences of the same word segment in the whole material. According to a specific embodiment of the present invention, the closeness coefficients of each word segment are expressed as WGC = [G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 represents the number of word segments between the first and second occurrences of the word segment in the material, G_W_ID_2 represents the number of word segments between its second and third occurrences, and G_W_ID_(W_N-1) represents the number of word segments between its (W_N-1)-th and W_N-th occurrences; G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the closeness coefficients of that word segment. According to a specific embodiment of the present invention, the word-segment closeness coefficient feature vector generation module generates a word-segment closeness coefficient feature vector WGCVE = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID represents the unique number of the word segment in the segmentation lexicon, W_N represents the total number of times that word segment occurs in the material, and W_CHAR represents the part of speech of the word segment. The closeness coefficients reveal how a given word segment is distributed over the corresponding material as a whole.
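The gap computation behind WGC and WGCVE can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that the "number of word segments between two occurrences" excludes both occurrences themselves, and the function names and sample values are hypothetical.

```python
def closeness_coefficients(tokens, word):
    """Gaps G_W_ID_1 .. G_W_ID_(W_N-1) between adjacent occurrences of word."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    # Assumption: the gap counts only the word segments strictly in between.
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def closeness_feature_vector(tokens, word, w_id, w_char):
    """Build [W_ID, W_N, W_CHAR, G_W_ID_1, ..., G_W_ID_(W_N-1)]."""
    gaps = closeness_coefficients(tokens, word)
    return [w_id, tokens.count(word), w_char, *gaps]

# Hypothetical segmented material; "a" occurs at positions 0, 2 and 5.
material = ["a", "b", "a", "c", "d", "a"]
wgc = closeness_coefficients(material, "a")
wgcve = closeness_feature_vector(material, "a", w_id=7, w_char="n")
```

A word segment that occurs only once has no adjacent pair, so its coefficient list is empty.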
According to a specific embodiment of the present invention, the word-segment group closeness coefficient generation module generates word-segment group closeness coefficients. A word-segment group closeness coefficient is the number of word segments lying between two adjacent occurrences of the same word-segment group in the whole material. According to a specific embodiment of the present invention, the closeness coefficients of each word-segment group are expressed as WGGC = [G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where G_WG_ID_1 represents the number of word segments between the first and second occurrences of the group in the material, G_WG_ID_2 represents the number of word segments between its second and third occurrences, and G_WG_ID_(WG_N-1) represents the number of word segments between its (WG_N-1)-th and WG_N-th occurrences; G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the closeness coefficients of that word-segment group. According to a specific embodiment of the present invention, the word-segment group closeness coefficient feature vector generation module generates a word-segment group closeness coefficient feature vector WGGCVE = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID represents the unique number of the word-segment group in the segmentation lexicon, WG_N represents the total number of times that group occurs in the material, and WG_CHAR represents the part of speech of the group. The closeness coefficients reveal how a given word-segment group is distributed over the corresponding material as a whole.
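For a group that spans several consecutive words, the gap has to be measured from the end of one occurrence to the start of the next. The following Python sketch shows one possible reading; the non-overlapping left-to-right scan, the function name, and the sample data are assumptions made for illustration and are not prescribed by the text.

```python
def group_closeness_coefficients(tokens, group):
    """Gaps G_WG_ID_1 .. G_WG_ID_(WG_N-1) for a multi-word group."""
    group = tuple(group)
    n = len(group)
    starts = []
    i = 0
    while i <= len(tokens) - n:
        if tuple(tokens[i:i + n]) == group:
            starts.append(i)
            i += n  # assumption: occurrences are matched without overlap
        else:
            i += 1
    # Words between the end of one occurrence and the start of the next.
    return [b - (a + n) for a, b in zip(starts, starts[1:])]

# Hypothetical segmented material; the group starts at positions 0, 3 and 7.
material = ["big", "data", "is", "big", "data", "and", "more", "big", "data"]
wggc = group_closeness_coefficients(material, ("big", "data"))
```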
According to a specific embodiment of the present invention, the Chinese-foreign word-segment group closeness coefficient generation module generates Chinese-foreign word-segment group closeness coefficients. A Chinese-foreign word-segment group closeness coefficient is the number of word segments lying between two adjacent occurrences of the same Chinese-foreign word-segment group in the whole material. According to a specific embodiment of the present invention, the closeness coefficients of each Chinese-foreign word-segment group are expressed as WFGGC = [G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 represents the number of word segments between the first and second occurrences of the group in the material, G_WFG_ID_2 represents the number of word segments between its second and third occurrences, and G_WFG_ID_(WFG_N-1) represents the number of word segments between its (WFG_N-1)-th and WFG_N-th occurrences; G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the closeness coefficients of that Chinese-foreign word-segment group. According to a specific embodiment of the present invention, the Chinese-foreign word-segment group closeness coefficient feature vector generation module generates a Chinese-foreign word-segment group closeness coefficient feature vector WFGGCVE = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID represents the unique number of the Chinese-foreign word-segment group in the segmentation lexicon, WFG_N represents the total number of times that group occurs in the material, and WFG_CHAR represents the part of speech of the group. The closeness coefficients reveal how a given Chinese-foreign word-segment group is distributed over the corresponding material as a whole.
According to a specific embodiment of the present invention, the word-segment free vector dimension determining module determines the word-segment free vector dimension WFV from the segmentation result of the material; WFV equals the number of distinct word segments obtained by segmenting a given material. When the material is short or yields few word segments, the resulting WFV is small; when the material is long or yields many word segments, the resulting WFV is large.
According to a specific embodiment of the present invention, the word-segment group free vector dimension determining module determines the word-segment group free vector dimension WGFV from the segmentation result of the material; WGFV equals the number of distinct word-segment groups obtained by segmenting a given material. When the material is short or yields few word-segment groups, the resulting WGFV is small; when the material is long or yields many word-segment groups, the resulting WGFV is large.
According to a specific embodiment of the present invention, the Chinese-foreign word-segment group free vector dimension determining module determines the Chinese-foreign word-segment group free vector dimension WFGFV from the segmentation result of the material; WFGFV equals the number of distinct Chinese-foreign word-segment groups obtained by segmenting a given material. When the material is short or yields few Chinese-foreign word-segment groups, the resulting WFGFV is small; when the material is long or yields many Chinese-foreign word-segment groups, the resulting WFGFV is large.
According to a specific embodiment of the present invention, the word-segment reduced vector dimension generation module reduces the word-segment free vector dimension WFV of each material and generates the word-segment reduced vector dimension RWV. RWV is specified by the system. Preferably, the system sets RWV to 500. Preferably, the system sets RWV to 800. Preferably, the system sets RWV to 1000.
According to a specific embodiment of the present invention, the word-segment reduced vector dimension generation module reduces the word-segment free vector dimension WFV by equal-interval extraction. The reduction proceeds as follows. The module determines whether WFV is greater than the word-segment reduced vector dimension RWV. If so, WFV is divided by the system-specified RWV and the quotient is rounded up to obtain the reduction coefficient REDU. One characteristic value is then extracted at every interval of REDU-1 from the characteristic values corresponding to WFV. Once extraction has passed over all characteristic values, the module determines whether the number of extracted characteristic values equals RWV. If it does, the reduction of WFV is complete. If it is less than RWV, the difference between RWV and the number of extracted characteristic values is calculated, that many characteristic values are extracted at random from those not yet extracted, and the reduction of WFV is complete.
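The equal-interval procedure can be sketched in Python as follows. The sketch assumes that "at every interval of REDU-1" means skipping REDU-1 values between extractions, that is, a stride of REDU starting from the first value; the function name, the seeded random generator, and the choice to return the input unchanged when no reduction is needed are likewise assumptions.

```python
import math
import random

def reduce_equal_interval(values, rwv, rng=None):
    """Reduce a list of characteristic values to rwv entries."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    wfv = len(values)
    if wfv <= rwv:
        return list(values)  # nothing to reduce in this sketch
    redu = math.ceil(wfv / rwv)               # reduction coefficient REDU
    picked = list(range(0, wfv, redu))        # skip REDU-1 values each time
    shortfall = rwv - len(picked)             # never negative, see note below
    if shortfall > 0:
        taken = set(picked)
        remaining = [i for i in range(wfv) if i not in taken]
        picked += rng.sample(remaining, shortfall)  # random top-up
    return [values[i] for i in sorted(picked)]

# Hypothetical example: 1200 distinct values reduced to RWV = 500.
reduced = reduce_equal_interval(list(range(1200)), 500)
```

Because REDU is at least WFV / RWV, a stride of REDU can never yield more than RWV values, so only the "equal" and "fewer" cases need handling. With WFV = 1200 and RWV = 500, REDU is 3, the stride yields 400 values, and 100 more are drawn at random.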
According to a specific embodiment of the present invention, the word-segment reduced vector dimension generation module reduces the word-segment free vector dimension WFV by part-of-speech screening. The reduction proceeds as follows. The characteristic values of the segmentation result are classified by the part of speech of the corresponding word segment; according to a specific embodiment of the present invention, they are divided into A1-class content-word characteristic values, A2-class content-word characteristic values, B-class content-word characteristic values, C-class content-word characteristic values, D-class content-word characteristic values, and V-class function-word characteristic values. Characteristic values of content words are generally considered to contribute more to similarity comparison, and technical-term nouns reflect the effective content of a material better than common nouns. The number of characteristic values in each class is counted: AMOUNT_A1 (A1-class content-word characteristic values), AMOUNT_A2 (A2-class content-word characteristic values), AMOUNT_B (B-class content-word characteristic values), AMOUNT_C (C-class content-word characteristic values), AMOUNT_D (D-class content-word characteristic values), and AMOUNT_V (V-class function-word characteristic values). First, RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated. If it is greater than 0, this reduction is exited; if it equals 0, this reduction is complete; if it is less than 0, RWV_S_D = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is calculated. If RWV_S_D is greater than 0, RWV_S_D characteristic values are extracted at random from the characteristic values counted in AMOUNT_V, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWV_S_C = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is calculated. If RWV_S_C is greater than 0, RWV_S_C characteristic values are extracted at random from the characteristic values counted in AMOUNT_D, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWV_S_B = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is calculated. If RWV_S_B is greater than 0, RWV_S_B characteristic values are extracted at random from the characteristic values counted in AMOUNT_C, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWV_S_A2 = RWV - (AMOUNT_A1 + AMOUNT_A2) is calculated. If RWV_S_A2 is greater than 0, RWV_S_A2 characteristic values are extracted at random from the characteristic values counted in AMOUNT_B, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWV_S_A1 = RWV - AMOUNT_A1 is calculated. If RWV_S_A1 is greater than 0, RWV_S_A1 characteristic values are extracted at random from the characteristic values counted in AMOUNT_A2, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWV characteristic values are extracted at random from the characteristic values counted in AMOUNT_A1, completing this reduction.
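The cascade above amounts to filling the reduced vector class by class in priority order A1, A2, B, C, D, V and sampling at random from the first class that no longer fits. The Python sketch below expresses it as a loop. It assumes that every class ranked above the sampled one is kept in full, which the text implies but does not state; the function name, the dictionary layout, and the use of `None` for the exit case are hypothetical.

```python
import random

PRIORITY = ["A1", "A2", "B", "C", "D", "V"]  # class order used in the text

def reduce_by_pos(values_by_class, rwv, rng=None):
    """Return rwv characteristic values, or None if the material is too short."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    total = sum(len(values_by_class.get(c, [])) for c in PRIORITY)
    if total < rwv:
        return None  # corresponds to RWV_S_V > 0: exit this reduction
    kept = []
    for cls in PRIORITY:
        members = values_by_class.get(cls, [])
        room = rwv - len(kept)  # the RWV_S_* difference at this step
        if len(members) <= room:
            kept.extend(members)  # the whole class fits
        else:
            kept.extend(rng.sample(members, room))  # random draw from this class
            break
    return kept

# Hypothetical characteristic values grouped by class (15 values in total).
classes = {"A1": [1, 2, 3], "A2": [10, 11], "B": [20, 21, 22],
           "C": [30], "D": [40, 41], "V": [50, 51, 52, 53]}
kept_13 = reduce_by_pos(classes, 13)  # all of A1..D plus 2 of the V class
kept_2 = reduce_by_pos(classes, 2)    # 2 drawn from A1 only
```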
When the value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0, the material is short or carries little information and is therefore not suited to comparison by characteristic values.
When the word-segment free vector dimension WFV is less than the word-segment reduced vector dimension RWV, the material itself has few dimensions, and the values in the remaining dimensions are treated as 0. Such materials are flagged directly in the system and are recorded and processed separately; folk sayings and famous quotations, for example, are used as index lookups. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the word-segment group reduced vector dimension generation module reduces the word-segment group free vector dimension WGFV of each material and generates the word-segment group reduced vector dimension RWGV. RWGV is specified by the system. Preferably, the system sets RWGV to 500. Preferably, the system sets RWGV to 800. Preferably, the system sets RWGV to 1000.
According to a specific embodiment of the present invention, the word-segment group reduced vector dimension generation module reduces the word-segment group free vector dimension WGFV by equal-interval extraction. The reduction proceeds as follows. The module determines whether WGFV is greater than the word-segment group reduced vector dimension RWGV. If so, WGFV is divided by the system-specified RWGV and the quotient is rounded up to obtain the reduction coefficient REDU. One characteristic value is then extracted at every interval of REDU-1 from the characteristic values corresponding to WGFV. Once extraction has passed over all characteristic values, the module determines whether the number of extracted characteristic values equals RWGV. If it does, the reduction of WGFV is complete. If it is less than RWGV, the difference between RWGV and the number of extracted characteristic values is calculated, that many characteristic values are extracted at random from those not yet extracted, and the reduction of WGFV is complete.
According to a specific embodiment of the present invention, the word-segment group reduced vector dimension generation module reduces the word-segment group free vector dimension WGFV by part-of-speech screening. The reduction proceeds as follows. The characteristic values are classified by the part of speech of the corresponding word-segment group; according to a specific embodiment of the present invention, they are divided into A1-class content-word characteristic values, A2-class content-word characteristic values, B-class content-word characteristic values, C-class content-word characteristic values, D-class content-word characteristic values, and V-class function-word characteristic values. Characteristic values of content words are generally considered to contribute more to similarity comparison, and technical-term nouns reflect the effective content of a material better than common nouns. The number of characteristic values in each class is counted: AMOUNT_A1 (A1-class content-word characteristic values), AMOUNT_A2 (A2-class content-word characteristic values), AMOUNT_B (B-class content-word characteristic values), AMOUNT_C (C-class content-word characteristic values), AMOUNT_D (D-class content-word characteristic values), and AMOUNT_V (V-class function-word characteristic values). First, RWGV_S_V = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated. If it is greater than 0, this reduction is exited; if it equals 0, this reduction is complete; if it is less than 0, RWGV_S_D = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is calculated. If RWGV_S_D is greater than 0, RWGV_S_D characteristic values are extracted at random from the characteristic values counted in AMOUNT_V, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWGV_S_C = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is calculated. If RWGV_S_C is greater than 0, RWGV_S_C characteristic values are extracted at random from the characteristic values counted in AMOUNT_D, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWGV_S_B = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is calculated. If RWGV_S_B is greater than 0, RWGV_S_B characteristic values are extracted at random from the characteristic values counted in AMOUNT_C, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWGV_S_A2 = RWGV - (AMOUNT_A1 + AMOUNT_A2) is calculated. If RWGV_S_A2 is greater than 0, RWGV_S_A2 characteristic values are extracted at random from the characteristic values counted in AMOUNT_B, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWGV_S_A1 = RWGV - AMOUNT_A1 is calculated. If RWGV_S_A1 is greater than 0, RWGV_S_A1 characteristic values are extracted at random from the characteristic values counted in AMOUNT_A2, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWGV characteristic values are extracted at random from the characteristic values counted in AMOUNT_A1, completing this reduction.
When the value RWGV_S_V = RWGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0, the material is short or carries little information and is therefore not suited to comparison by characteristic values.
When the word-segment group free vector dimension WGFV is less than the word-segment group reduced vector dimension RWGV, the material itself has few dimensions, and the values in the remaining dimensions are treated as 0. Such materials are flagged directly in the system and are recorded and processed separately; folk sayings and famous quotations, for example, are used as index lookups. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the Chinese-foreign word-segment group reduced vector dimension generation module reduces the Chinese-foreign word-segment group free vector dimension WFGFV of each material and generates the Chinese-foreign word-segment group reduced vector dimension RWFGV. RWFGV is specified by the system. Preferably, the system sets RWFGV to 500. Preferably, the system sets RWFGV to 800. Preferably, the system sets RWFGV to 1000.
According to a specific embodiment of the present invention, the Chinese-foreign word-segment group reduced vector dimension generation module reduces the Chinese-foreign word-segment group free vector dimension WFGFV by equal-interval extraction. The reduction proceeds as follows. The module determines whether WFGFV is greater than the Chinese-foreign word-segment group reduced vector dimension RWFGV. If so, WFGFV is divided by the system-specified RWFGV and the quotient is rounded up to obtain the reduction coefficient REDU. One characteristic value is then extracted at every interval of REDU-1 from the characteristic values corresponding to WFGFV. Once extraction has passed over all characteristic values, the module determines whether the number of extracted characteristic values equals RWFGV. If it does, the reduction of WFGFV is complete. If it is less than RWFGV, the difference between RWFGV and the number of extracted characteristic values is calculated, that many characteristic values are extracted at random from those not yet extracted, and the reduction of WFGFV is complete.
According to a specific embodiment of the present invention, the Chinese-foreign word-segment group reduced vector dimension generation module reduces the Chinese-foreign word-segment group free vector dimension WFGFV by part-of-speech screening. The reduction proceeds as follows. The characteristic values are classified by the part of speech of the corresponding word-segment group; according to a specific embodiment of the present invention, they are divided into A1-class content-word characteristic values, A2-class content-word characteristic values, B-class content-word characteristic values, C-class content-word characteristic values, D-class content-word characteristic values, and V-class function-word characteristic values. Characteristic values of content words are generally considered to contribute more to similarity comparison, and technical-term nouns reflect the effective content of a material better than common nouns. The number of characteristic values in each class is counted: AMOUNT_A1 (A1-class content-word characteristic values), AMOUNT_A2 (A2-class content-word characteristic values), AMOUNT_B (B-class content-word characteristic values), AMOUNT_C (C-class content-word characteristic values), AMOUNT_D (D-class content-word characteristic values), and AMOUNT_V (V-class function-word characteristic values). First, RWFGV_S_V = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated. If it is greater than 0, this reduction is exited; if it equals 0, this reduction is complete; if it is less than 0, RWFGV_S_D = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is calculated. If RWFGV_S_D is greater than 0, RWFGV_S_D characteristic values are extracted at random from the characteristic values counted in AMOUNT_V, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWFGV_S_C = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is calculated. If RWFGV_S_C is greater than 0, RWFGV_S_C characteristic values are extracted at random from the characteristic values counted in AMOUNT_D, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWFGV_S_B = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is calculated. If RWFGV_S_B is greater than 0, RWFGV_S_B characteristic values are extracted at random from the characteristic values counted in AMOUNT_C, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWFGV_S_A2 = RWFGV - (AMOUNT_A1 + AMOUNT_A2) is calculated. If RWFGV_S_A2 is greater than 0, RWFGV_S_A2 characteristic values are extracted at random from the characteristic values counted in AMOUNT_B, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWFGV_S_A1 = RWFGV - AMOUNT_A1 is calculated. If RWFGV_S_A1 is greater than 0, RWFGV_S_A1 characteristic values are extracted at random from the characteristic values counted in AMOUNT_A2, completing this reduction; if it equals 0, this reduction is complete; if it is less than 0, RWFGV characteristic values are extracted at random from the characteristic values counted in AMOUNT_A1, completing this reduction.
When the value RWFGV_S_V = RWFGV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0, the material is short or carries little information and is therefore not suited to comparison by characteristic values.
When the Chinese-foreign word-segment group free vector dimension WFGFV is less than the Chinese-foreign word-segment group reduced vector dimension RWFGV, the material itself has few dimensions, and the values in the remaining dimensions are treated as 0. Such materials are flagged directly in the system and are recorded and processed separately; folk sayings and famous quotations, for example, are used as index lookups. A full-text sliding window can subsequently be used for full-text comparison.
According to a specific embodiment of the present invention, the word-segment feature vector generation module extracts, according to the word-segment reduced vector dimension RWV, the characteristic values corresponding to RWV in each material and generates the word-segment feature vector WVE_RWV:
WVE_RWV = [W_ID1, W_N1, ..., W_IDi, W_Ni, ..., W_IDRWV, W_NRWV]
where W_IDi represents the unique number of the word segment in the segmentation lexicon and W_Ni represents the total number of times the word segment occurs in the material; this count is used as the characteristic value of the word segment.
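The layout of WVE_RWV, in which identifiers and counts alternate, can be shown with a short Python sketch. The function name and the sample pairs are hypothetical; the input is assumed to be the list of (W_ID, W_N) pairs that survived the reduction to RWV entries.

```python
def word_feature_vector(reduced_pairs):
    """Flatten [(W_ID, W_N), ...] into [W_ID1, W_N1, ..., W_IDRWV, W_NRWV]."""
    vector = []
    for w_id, w_n in reduced_pairs:
        vector.extend([w_id, w_n])  # identifier first, then its count
    return vector

# Hypothetical reduced result with RWV = 3.
wve_rwv = word_feature_vector([(12, 4), (57, 1), (903, 9)])
```

The resulting vector has 2 x RWV entries, so two materials reduced to the same RWV produce vectors of equal length.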
According to a specific embodiment of the present invention, the word-segment group feature vector generation module extracts, according to the word-segment group reduced vector dimension RWGV, the characteristic values corresponding to RWGV in each material and generates the word-segment group feature vector WVE_RWGV:
WVE_RWGV = [WG_ID1, WG_N1, ..., WG_IDi, WG_Ni, ..., WG_IDRWGV, WG_NRWGV]
where WG_IDi represents the unique number of the word-segment group in the segmentation lexicon and WG_Ni represents the total number of times the word-segment group occurs in the material; this count is used as the characteristic value of the word-segment group.
According to a specific embodiment of the present invention, the Chinese-foreign word-segment group feature vector generation module extracts, according to the Chinese-foreign word-segment group reduced vector dimension RWFGV, the characteristic values corresponding to RWFGV in each material and generates the Chinese-foreign word-segment group feature vector WVE_RWFGV:
WVE_RWFGV = [WFG_ID1, WFG_N1, ..., WFG_IDi, WFG_Ni, ..., WFG_IDRWFGV, WFG_NRWFGV]
where WFG_IDi represents the unique number of the Chinese-foreign word-segment group in the segmentation lexicon and WFG_Ni represents the total number of times the Chinese-foreign word-segment group occurs in the material; this count is used as the characteristic value of the Chinese-foreign word-segment group.
According to a specific embodiment of the present invention, the system provides users with multiple access modes. When a user accesses the system, the user access mode detection module detects the access mode of the current user.
In the specific embodiment of the present invention, user can access system in a manner of on probation, referred to hereinafter as on probation
The user that mode accesses is user on probation.When user's access mode detection module, which detects user, to be accessed in a manner of on probation,
Prompting is sent to user on probation, it is mode on probation to inform current accessed mode, and informs the access right of user on probation.According to this
One specific embodiment of invention, for the user accessed in a manner of on probation, system is only that user on probation provides book character
Several detections are tried out, and the predetermined number of words is set in advance by system.Another embodiment according to the present invention, for
The user that mode on probation accesses, the database that system provides part or all of scope to try out user are tried out for detection.According to this
The another embodiment of invention, for the user accessed in a manner of on probation, system is the plagiarism inspection that user on probation provides
Survey result only provides the prompting of plagiarism rate, does not provide specific plagiarism position and with being compared by the plagiarism of plagiarism document.According to
The another embodiment of the present invention, for the user accessed in a manner of on probation, system is the plagiarism that user on probation provides
Testing result provide it is specific plagiarize position, but pair with carrying out Fuzzy processing by the plagiarism comparison of plagiarism document so that try out
User is only capable of knowing that the specific of the document itself provided plagiarizes position, but None- identified is by the specifying information of plagiarism document.
According to a specific embodiment of the present invention, a user accesses the system in counting mode; a user accessing in this way is hereinafter called a counting user. When the user access mode detection module detects that a user is accessing in counting mode, it sends the counting user a prompt stating that the current access mode is counting mode and asking the counting user to upload the document to be checked for plagiarism. According to a specific embodiment of the present invention, the system counts the number of characters in the document uploaded by the counting user and calculates the fee for this plagiarism detection from that count. According to another embodiment of the present invention, the system offers the counting user databases of partial or full scope to choose from, and calculates the fee for this plagiarism detection according to the database scope the counting user selects.
According to a specific embodiment of the present invention, a user accesses the system in timing mode; a user accessing in this way is hereinafter called a timing user. When the user access mode detection module detects that a user is accessing in timing mode, it sends the timing user a prompt stating that the current access mode is timing mode and showing the timing user's remaining usage time. According to another embodiment of the present invention, the system shows the timing user a real-time countdown of the remaining usage time in the display interface during use. According to another embodiment of the present invention, the system offers the timing user databases of partial or full scope to choose from. According to a specific embodiment of the present invention, the system estimates the detection time the document requires from the database scope the timing user selects and the number of characters in the uploaded document, and tells the timing user whether the remaining usage time is enough to complete the current plagiarism detection.
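The text does not say how the required detection time is estimated. As a rough sketch only, the following Python assumes the estimate grows linearly with the character count and with a per-database rate; the rate table, the database names and both function names are hypothetical.

```python
def estimate_detection_seconds(char_count, selected_scopes, seconds_per_1000_chars):
    """Estimate the detection time for one document.

    Assumption (not from the patent): every selected database adds a fixed
    number of seconds per 1000 characters of the uploaded document.
    """
    rate = sum(seconds_per_1000_chars[scope] for scope in selected_scopes)
    return char_count * rate / 1000

def can_finish(remaining_seconds, char_count, selected_scopes, seconds_per_1000_chars):
    """Return (enough_time, needed_seconds) for the timing user's prompt."""
    needed = estimate_detection_seconds(char_count, selected_scopes, seconds_per_1000_chars)
    return needed <= remaining_seconds, needed

# Hypothetical rates: seconds of detection per 1000 characters, per database.
rates = {"journals": 2, "theses": 3}
enough, needed = can_finish(600, 120000, ["journals", "theses"], rates)
print(enough, needed)
```

The same shape would fit the counting-mode fee calculation, with a price per character in place of seconds per character.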
According to a specific embodiment of the present invention, after a timing user logs in to the system, the user detection mode determining module determines the plagiarism detection mode. According to a specific embodiment of the present invention, the system offers a self-check mode, an ordinary plagiarism identification mode, an extended plagiarism identification mode, a multilingual plagiarism identification mode and a formula plagiarism identification mode to choose from.
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current user's detection mode is the self-check mode, the user writing style test module presents one or more test pictures to the user, and the user writes online, within a prescribed time, a text description of the test pictures of no fewer than a prescribed number of characters. Preferably, the user writing style test module further presents one or more test articles to the user, and the user writes online, within the prescribed time, a commentary of no fewer than the prescribed number of characters. The test pictures and test articles are selected at random by the user writing style test module from the test picture library and the test article library. Whether test pictures or test articles are used, the user must write the description or commentary online. Because the prescribed time cannot be set too long, it is usually chosen as 30 minutes or 60 minutes, and the prescribed length of the corresponding description or commentary is usually chosen as 400 characters per 30 minutes or 800 characters per 60 minutes. Those skilled in the art may set other prescribed times or prescribed lengths as needed. Experimental data indicate that the prescribed time should not be too long, so that users do not fail to complete the test for lack of time or because of an unstable network connection; in addition, the ratio of the prescribed length to the prescribed time should not be too low, or the test will not faithfully reflect the user's writing habits. Because the prescribed time is limited, the length of the description or commentary is limited as well, and the feature values and feature vectors extracted from the online test text alone may not truly reflect the user's writing habits. It is therefore necessary to further extract test picture description baseline feature vectors and test article description baseline feature vectors, which are used to correct the feature vector deviation caused by the description or commentary being too short.
According to a specific embodiment of the present invention, every test picture in the test picture library has a test picture baseline feature vector. This vector is obtained as follows: a predetermined number of baseline testers are selected at random from groups with different backgrounds; each writes a description of the specific test picture of no fewer than the prescribed number of characters; all the descriptions are collected; the test picture text-description feature values for the same test picture are computed; feature vectors are calculated from those feature values; and the feature vectors are weighted to give the test picture baseline feature vector of the specific test picture. The weights used in the weighting operation are set by the system. Every test article in the test article library likewise has a test article baseline feature vector. This vector is obtained as follows: a predetermined number of baseline testers are selected at random from groups with different backgrounds; each writes a description of the specific test article of no fewer than the prescribed number of characters; all the descriptions are collected; the test article text-description feature values for the same test article are computed; feature vectors are calculated from those feature values; and the feature vectors are weighted to give the test article baseline feature vector of the specific test article. The weights used in the weighting operation are set by the system.
According to a specific embodiment of the present invention, when a predetermined number of baseline testers are selected at random from groups with different backgrounds, they can be selected by age level, preferably divided into an under-20 group, a 20-29 group, a 30-39 group, a 40-49 group and a 50-and-over group. In this way, descriptions of no fewer than the prescribed number of characters are collected from each age group for the same test picture or the same test article.
According to a specific embodiment of the present invention, when a predetermined number of baseline testers are selected at random from groups with different backgrounds, they can be selected by educational level, preferably divided into a below-undergraduate group, an undergraduate group, a master's student group and a doctoral student group. In this way, descriptions of no fewer than the prescribed number of characters are collected from each educational group for the same test picture or the same test article.
According to a specific embodiment of the present invention, when a predetermined number of baseline testers are selected at random from groups with different backgrounds, they can be selected by professional field (the fields can be divided according to the required test precision and are not detailed here). In this way, descriptions of no fewer than the prescribed number of characters are collected from each professional field group for the same test picture or the same test article.
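The grouping of baseline testers can be sketched in Python as follows. The age bands follow the text; the tester records, field names and function names are hypothetical, and the random selection of a predetermined number of testers per group is left out.

```python
from collections import defaultdict

# Upper bounds (exclusive) of the age bands named in the text.
AGE_BANDS = [(20, "under 20"), (30, "20-29"), (40, "30-39"), (50, "40-49")]

def age_band(age):
    """Map an age in years to one of the five age groups."""
    for upper, label in AGE_BANDS:
        if age < upper:
            return label
    return "50 and over"

def group_testers(testers, key):
    """Group tester names by an arbitrary key function (age band, education, field)."""
    groups = defaultdict(list)
    for tester in testers:
        groups[key(tester)].append(tester["name"])
    return dict(groups)

testers = [
    {"name": "A", "age": 19, "education": "undergraduate"},
    {"name": "B", "age": 34, "education": "master"},
    {"name": "C", "age": 52, "education": "doctoral"},
    {"name": "D", "age": 38, "education": "master"},
]
by_age = group_testers(testers, lambda t: age_band(t["age"]))
by_education = group_testers(testers, lambda t: t["education"])
print(by_age)
print(by_education)
```

Passing the grouping criterion as a key function lets the same helper serve the age, educational and professional-field groupings.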
According to a specific embodiment of the present invention, the test picture text-description feature value generation module obtains the test picture description texts written by the baseline testers and generates test picture text-description feature values. These feature values include, but are not limited to: Chinese character count, foreign-language character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage and part-of-speech usage. According to a specific embodiment of the present invention, these feature values are defined as follows.
Chinese character count is the number of Chinese characters, excluding punctuation marks, in each test picture description; each Chinese character counts as one character.
Foreign-language character count is the number of foreign-language characters, excluding punctuation marks, in each test picture description; each foreign-language word counts as one character.
Total word count is the total number of words obtained after segmenting each test picture description. Chinese text can be segmented with the system's built-in word-segmentation library; foreign-language text, following its writing conventions, can be split directly on the spaces between words.
Content word count is the number of content words in each test picture description, obtained by comparing the segmentation result with the parts of speech in the word-segmentation library. It can be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count.
Function word count is the number of function words in each test picture description, obtained by comparing the segmentation result with the parts of speech in the word-segmentation library. It can be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count.
Paragraph count is the number of paragraphs in each test picture description.
Paragraph length distribution is the number of words and the number of sentences contained in each paragraph of each test picture description.
Sentence count is the number of sentences in each test picture description.
Sentence length distribution is the number of words contained in each sentence of each test picture description.
Synonym and near-synonym expansion is obtained by comparing the segmentation result of each test picture description with a synonym and near-synonym library, collecting words with the same or similar meaning into one set, and counting the words in each set. This reflects the synonym and near-synonym writing habits of the author of the description: the more words a synonym or near-synonym set contains, the more the author tends to vary wording with synonyms or near-synonyms; the fewer words it contains, the less the author does so.
Function word usage is the statistics of function word use in each test picture description, including but not limited to the usage ranking of function words, the number of words between different function words and the number of words between identical function words. For example, the use of the three structural particles "的", "地" and "得" can be counted further, which shows whether the author of the description distinguishes between these three particles.
Punctuation usage is the statistics of punctuation use in each test picture description, including but not limited to the usage ranking of punctuation marks, the number of words between different punctuation marks and the number of words between identical punctuation marks.
Part-of-speech usage is the statistics of words of each part of speech in each test picture description, obtained by comparing the segmentation result with the parts of speech in the word-segmentation library; for example, the numbers of nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words, and the ratio of each number to the total word count of the description.
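Some of the simpler feature values above can be computed directly from the raw text. The following Python is a sketch under stated assumptions, not the patent's procedure: Chinese characters are taken to be code points in the basic CJK Unified Ideographs block, foreign-language words are runs of Latin letters, paragraphs are separated by line breaks, and sentences end at common Chinese or Western terminators. The function name and regular expressions are illustrative.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")               # basic CJK ideographs only
FOREIGN_WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")  # Latin-letter words
SENTENCE_END = re.compile(r"[。！？!?.]+")                 # crude sentence terminators

def basic_feature_values(text):
    """Count Chinese characters, foreign words, paragraphs and sentences."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    return {
        # Punctuation falls outside the CJK range above, so it is not counted.
        "chinese_chars": len(CJK_CHAR.findall(text)),
        # Each foreign-language word counts as one character, as in the text.
        "foreign_words": len(FOREIGN_WORD.findall(text)),
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
    }

sample = "图中有一只猫。It sleeps on the sofa.\n天气很好！"
print(basic_feature_values(sample))
```

Splitting on "." also breaks at decimal points and abbreviations, so a real system would need a more careful sentence splitter. Word-level counts such as content words and parts of speech require the word-segmentation library and are not covered here.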
According to a specific embodiment of the present invention, the test picture text-description feature value generation module generates a test picture text-description feature vector from the test picture text-description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the test picture text-description feature vector, the content of each item in the vector and the order of the items. When the dimension of the test picture text-description feature vector is n, it can be expressed as TPCVE = [TPC_1, ..., TPC_m, ..., TPC_n], where TPC_1 is the value of the first item, TPC_m is the value of the m-th item and TPC_n is the value of the n-th item of the test picture text-description feature vector.
Preferably, the test picture text-description feature vector includes one or more of the following items: the ratio of Chinese character count to total word count, the ratio of foreign-language character count to total word count, the ratio of content word count to total word count, the ratio of function word count to total word count, the ratio of total word count to paragraph count, the word count of the longest paragraph, the ratio of synonym and near-synonym expansions to total word count, the ratio of punctuation marks used to total word count, and the ratios to total word count of the numbers of nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words.
According to a specific embodiment of the present invention, the test picture baseline feature vector generation module collects the test picture text-description feature vectors for the same test picture and weights them to obtain the baseline feature vector of the specific test picture; the weights used in the weighting operation are set by the system. Preferably, the test picture baseline feature vector generation module can, for each age group, educational group and professional field group, collect a predetermined number of test picture text-description feature vectors and weight them separately, obtaining a baseline feature vector of the specific test picture for each age group, each educational group and each professional field group.
The baseline feature vector of a specific test picture can be expressed as:
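The formula itself is not reproduced in this text. Read from the symbol definitions that follow, one consistent form is a per-item weighted sum over the k baseline testers. This reconstruction is an assumption; in particular, whether the original also divides by k or requires the weights to sum to 1 is not stated.

```latex
\mathrm{TPCVE\_ID} = \Bigl[\, \sum_{i=1}^{k} W_{1,i}\,\mathrm{TPC\_1}_{i},\ \ldots,\ \sum_{i=1}^{k} W_{m,i}\,\mathrm{TPC\_m}_{i},\ \ldots,\ \sum_{i=1}^{k} W_{n,i}\,\mathrm{TPC\_n}_{i} \,\Bigr]
```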
where TPCVE_ID is the test picture baseline feature vector of the test picture numbered ID; k is the number of baseline testers; TPC_1_i is the value of the first item of the feature vector of the i-th baseline tester; TPC_m_i is the value of the m-th item of the feature vector of the i-th baseline tester; TPC_n_i is the value of the n-th item of the feature vector of the i-th baseline tester; W_1,i is the weighting coefficient of TPC_1_i; W_m,i is the weighting coefficient of TPC_m_i; and W_n,i is the weighting coefficient of TPC_n_i.
According to a specific embodiment of the present invention, the test article text-description feature value generation module obtains the test article description texts written by the baseline testers and generates test article text-description feature values. These feature values include, but are not limited to: Chinese character count, foreign-language character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation usage and part-of-speech usage. According to a specific embodiment of the present invention, these feature values are defined as follows.
Chinese character count is the number of Chinese characters, excluding punctuation marks, in each test article description; each Chinese character counts as one character.
Foreign-language character count is the number of foreign-language characters, excluding punctuation marks, in each test article description; each foreign-language word counts as one character.
Total word count is the total number of words obtained after segmenting each test article description. Chinese text can be segmented with the system's built-in word-segmentation library; foreign-language text, following its writing conventions, can be split directly on the spaces between words.
Content word count is the number of content words in each test article description, obtained by comparing the segmentation result with the parts of speech in the word-segmentation library. It can be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count.
Function word count is the number of function words in each test article description, obtained by comparing the segmentation result with the parts of speech in the word-segmentation library. It can be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count.
Paragraph count is the number of paragraphs in each test article description.
Paragraph length distribution is the number of words and the number of sentences contained in each paragraph of each test article description.
Sentence count is the number of sentences in each test article description.
Sentence length distribution is the number of words contained in each sentence of each test article description.
Synonym and near-synonym expansion is obtained by comparing the segmentation result of each test article description with a synonym and near-synonym library, collecting words with the same or similar meaning into one set, and counting the words in each set. This reflects the synonym and near-synonym writing habits of the author of the description: the more words a synonym or near-synonym set contains, the more the author tends to vary wording with synonyms or near-synonyms; the fewer words it contains, the less the author does so.
Function word usage is the statistics of function word use in each test article description, including but not limited to the usage ranking of function words, the number of words between different function words and the number of words between identical function words. For example, the use of the three structural particles "的", "地" and "得" can be counted further, which shows whether the author of the description distinguishes between these three particles.
Punctuation usage is the statistics of punctuation use in each test article description, including but not limited to the usage ranking of punctuation marks, the number of words between different punctuation marks and the number of words between identical punctuation marks.
Part-of-speech usage is the statistics of words of each part of speech in each test article description, obtained by comparing the segmentation result with the parts of speech in the word-segmentation library; for example, the numbers of nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words, and the ratio of each number to the total word count of the description.
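The function word usage statistics (usage ranking and the number of words between identical function words) can be illustrated with a short Python sketch. It assumes the description has already been segmented and that the set of function words comes from the word-segmentation library; the function name and the sample tokens are hypothetical.

```python
from collections import Counter

def function_word_usage(tokens, function_words):
    """Return (usage ranking, gaps between repeats) for function words.

    tokens:         segmented words in reading order
    function_words: set of words tagged as function words in the library
    """
    counts = Counter(t for t in tokens if t in function_words)
    last_seen = {}   # function word -> position of its previous occurrence
    gaps = {}        # function word -> words between consecutive occurrences
    for pos, tok in enumerate(tokens):
        if tok in function_words:
            if tok in last_seen:
                gaps.setdefault(tok, []).append(pos - last_seen[tok] - 1)
            last_seen[tok] = pos
    return counts.most_common(), gaps

tokens = ["他", "的", "猫", "跑", "得", "快", "我", "的", "狗", "慢慢", "地", "走"]
ranking, gaps = function_word_usage(tokens, {"的", "地", "得"})
print(ranking)
print(gaps)
```

Restricting `function_words` to the three structural particles, as in the sample call, gives the per-particle counts that indicate whether an author distinguishes "的", "地" and "得".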
According to a specific embodiment of the present invention, the test article text-description feature value generation module generates a test article text-description feature vector from the test article text-description feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the test article text-description feature vector, the content of each item in the vector and the order of the items. When the dimension of the test article text-description feature vector is n, it can be expressed as TTCVE = [TTC_1, ..., TTC_m, ..., TTC_n], where TTC_1 is the value of the first item, TTC_m is the value of the m-th item and TTC_n is the value of the n-th item of the test article text-description feature vector.
Preferably, the test article text-description feature vector includes one or more of the following items: the ratio of Chinese character count to total word count, the ratio of foreign-language character count to total word count, the ratio of content word count to total word count, the ratio of function word count to total word count, the ratio of total word count to paragraph count, the word count of the longest paragraph, the ratio of synonym and near-synonym expansions to total word count, the ratio of punctuation marks used to total word count, and the ratios to total word count of the numbers of nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeic words.
According to a specific embodiment of the present invention, the test article baseline feature vector generation module collects the test article text-description feature vectors for the same test article and weights them to obtain the baseline feature vector of the specific test article; the weights used in the weighting operation are set by the system. Preferably, the test article baseline feature vector generation module can, for each age group, educational group and professional field group, collect a predetermined number of test article text-description feature vectors and weight them separately, obtaining a baseline feature vector of the specific test article for each age group, each educational group and each professional field group.
The baseline feature vector of a specific test article can be expressed as:
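The formula itself is not reproduced in this text. Read from the symbol definitions that follow, one consistent form is a per-item weighted sum over the k baseline testers, mirroring the test picture case. This reconstruction is an assumption; whether the original also divides by k or requires the weights to sum to 1 is not stated.

```latex
\mathrm{TTCVE\_ID} = \Bigl[\, \sum_{i=1}^{k} W_{1,i}\,\mathrm{TTC\_1}_{i},\ \ldots,\ \sum_{i=1}^{k} W_{m,i}\,\mathrm{TTC\_m}_{i},\ \ldots,\ \sum_{i=1}^{k} W_{n,i}\,\mathrm{TTC\_n}_{i} \,\Bigr]
```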
where TTCVE_ID is the test article baseline feature vector of the test article numbered ID; k is the number of baseline testers; TTC_1_i is the value of the first item of the feature vector of the i-th baseline tester; TTC_m_i is the value of the m-th item of the feature vector of the i-th baseline tester; TTC_n_i is the value of the n-th item of the feature vector of the i-th baseline tester; W_1,i is the weighting coefficient of TTC_1_i; W_m,i is the weighting coefficient of TTC_m_i; and W_n,i is the weighting coefficient of TTC_n_i.
According to a specific embodiment of the present invention, the test picture text-description feature vector and the test article text-description feature vector have the same dimension, the same meaning for each feature value and the same ordering. For example, in both vectors the 1st feature value can be set to the ratio of Chinese character count to total word count; the 2nd, the ratio of foreign-language character count to total word count; the 3rd, the ratio of content word count to total word count; the 4th, the ratio of function word count to total word count; the 5th, the ratio of total word count to paragraph count; the 6th, the word count of the longest paragraph; the 7th, the ratio of synonym and near-synonym expansions to total word count; the 8th, the ratio of punctuation marks used to total word count; the 9th, the ratio of noun count to total word count; the 10th, the ratio of verb count to total word count; the 11th, the ratio of adjective count to total word count; the 12th, the ratio of numeral count to total word count; the 13th, the ratio of quantifier count to total word count; the 14th, the ratio of pronoun count to total word count; the 15th, the ratio of adverb count to total word count; the 16th, the ratio of preposition count to total word count; the 17th, the ratio of conjunction count to total word count; the 18th, the ratio of auxiliary word count to total word count; the 19th, the ratio of interjection count to total word count; and the 20th, the ratio of onomatopoeic word count to total word count.
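The twenty-item ordering in this example can be assembled from raw counts as in the following Python sketch. The dictionary keys, the part-of-speech tag names and the sample counts are hypothetical; only the order and meaning of the items follow the text.

```python
# Part-of-speech tags in the order of items 9 to 20.
POS_TAGS = [
    "noun", "verb", "adjective", "numeral", "quantifier", "pronoun",
    "adverb", "preposition", "conjunction", "auxiliary", "interjection",
    "onomatopoeia",
]

def description_feature_vector(counts):
    """Build the 20-item feature vector from a dictionary of raw counts."""
    total = counts["total_words"]
    if total == 0 or counts["paragraphs"] == 0:
        raise ValueError("description must contain at least one word and paragraph")
    vector = [
        counts["chinese_chars"] / total,       # item 1
        counts["foreign_words"] / total,       # item 2
        counts["content_words"] / total,       # item 3
        counts["function_words"] / total,      # item 4
        total / counts["paragraphs"],          # item 5
        counts["longest_paragraph_words"],     # item 6 (a raw count, not a ratio)
        counts["synonym_expansions"] / total,  # item 7
        counts["punctuation"] / total,         # item 8
    ]
    # Items 9 to 20: share of each part of speech; absent tags count as 0.
    vector.extend(counts["pos"].get(tag, 0) / total for tag in POS_TAGS)
    return vector

sample = {
    "total_words": 400, "chinese_chars": 600, "foreign_words": 20,
    "content_words": 300, "function_words": 100, "paragraphs": 4,
    "longest_paragraph_words": 150, "synonym_expansions": 8,
    "punctuation": 40, "pos": {"noun": 120, "verb": 80},
}
print(description_feature_vector(sample))
```

Because the same function would be used for picture descriptions and article descriptions, the two vectors agree in dimension, meaning and order by construction.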
According to a specific embodiment of the present invention, feature values can be further added to or deleted from the test picture text-description feature vector and the test article text-description feature vector, but after the addition or deletion the test picture text-description feature vector must still agree with the test article text-description feature vector in dimension and in the meaning and order of each feature value.
According to a specific embodiment of the present invention, the user test picture text-description feature value generation module obtains the user's test picture description text and generates user test picture text-description feature values. These feature values have the same content as the test picture text-description feature values and are not described again here. The user test picture text-description feature vector generation module calculates the user test picture text-description feature vector from the user test picture text-description feature values. When the dimension of the test picture text-description feature vector is n, the feature vector of the current user USER's description of the test picture numbered ID can be expressed as TPCVE_ID_USER = [TPC_1_USER, ..., TPC_m_USER, ..., TPC_n_USER], where TPC_1_USER is the value of the first item, TPC_m_USER is the value of the m-th item and TPC_n_USER is the value of the n-th item of the user test picture text-description feature vector of the current user USER.
The user picture writing style feature vector generation module calculates the difference between the user test picture text description feature vector TPCVE_ID_USER and the test picture reference feature vector TPCVE_ID corresponding to that test picture, and uses this difference (TPCVE_ID_USER-TPCVE_ID) as the user picture writing style feature vector TPCVE_USER.
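For illustration, the element-wise difference described above can be sketched in Python. This is a minimal sketch rather than the patented implementation: the function name style_vector and the use of plain lists for the vectors are assumptions made for the example.

```python
def style_vector(user_vec, reference_vec):
    """Return the element-wise difference user_vec - reference_vec.

    Both vectors must have the same dimension n and the same item order,
    as the description requires.
    """
    if len(user_vec) != len(reference_vec):
        raise ValueError("feature vectors must have the same dimension")
    return [u - r for u, r in zip(user_vec, reference_vec)]


# Hypothetical values for TPCVE_ID_USER and TPCVE_ID
tpcve_user = style_vector([0.5, 0.25, 0.75], [0.25, 0.25, 0.5])
```

The same helper would apply unchanged to the article vectors, since they share the dimension and item order.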
According to a specific embodiment of the present invention, the user test article text description feature value generation module obtains the user's test article description text and generates the user test article text description feature values. These feature values cover the same content as the test article text description feature values, which is not repeated here. The user test article text description feature vector generation module calculates the user test article text description feature vector from these feature values. When the dimension of the test article text description feature vector is n, the feature vector of the current user USER's text description of the article numbered ID can be expressed as TTCVE_ID_USER=[TTC_1_USER, ..., TTC_m_USER, ..., TTC_n_USER], where TTC_1_USER is the first item value in the user test article text description feature vector of the current user USER, TTC_m_USER is the m-th item value, and TTC_n_USER is the n-th item value.
The user article writing style feature vector generation module calculates the difference between the user test article text description feature vector TTCVE_ID_USER and the test article reference feature vector TTCVE_ID corresponding to that test article, and uses this difference (TTCVE_ID_USER-TTCVE_ID) as the user article writing style feature vector TTCVE_USER.
According to a specific embodiment of the present invention, when several test pictures or several test articles are used, or when one or more test pictures and one or more test articles are used together, the user test picture text description feature value generation module and the user test article text description feature value generation module generate the user test picture and/or article text description feature values from each of the user's test picture description texts and test article description texts, respectively. The user test picture text description feature vector generation module and the user test article text description feature vector generation module then generate the user test picture and/or article text description feature vectors from those feature values. The user picture writing style feature vector generation module and the user article writing style feature vector generation module calculate the difference between each user test picture and/or article text description feature vector and the corresponding test picture and/or article reference feature vector, and weight these differences to obtain the user's picture writing style feature vector TPCVE_USER and article writing style feature vector TTCVE_USER, respectively. The user writing style feature vector generation module weights the picture writing style feature vector TPCVE_USER and the article writing style feature vector TTCVE_USER to obtain the user writing style feature vector TVE_USER. The weights of the weighting operation may be chosen according to actual needs.
TVE_USER=TPCVE_USER*W_P+TTCVE_USER*W_T
where W_P is the weighting coefficient of the user picture writing style feature vector TPCVE_USER, and W_T is the weighting coefficient of the user article writing style feature vector TTCVE_USER. When the user takes only the picture writing test or only the article writing test, the weighting coefficient of the item taken may be set to 1 and the weighting coefficient of the item not taken set to 0. Preferably, the weights may be chosen to be equal.
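The weighted combination can be sketched as follows. This is only an illustrative sketch: the function name and the default of 0.5 for each weight are assumptions (the text says only that equal weights are preferred, without giving their value).

```python
def combine_style_vectors(tpcve_user, ttcve_user, w_p=0.5, w_t=0.5):
    """Compute TVE_USER = TPCVE_USER * W_P + TTCVE_USER * W_T element-wise.

    w_p and w_t default to equal weights; 0.5 is an assumed value.
    """
    if len(tpcve_user) != len(ttcve_user):
        raise ValueError("style vectors must have the same dimension")
    return [p * w_p + t * w_t for p, t in zip(tpcve_user, ttcve_user)]


# Both tests taken, equal weights
tve_both = combine_style_vectors([1.0, 2.0], [3.0, 4.0])
# Only the picture writing test taken: W_P = 1, W_T = 0
tve_picture_only = combine_style_vectors([1.0, 2.0], [0.0, 0.0], w_p=1.0, w_t=0.0)
```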
The user writing style feature vector can be expressed as TVE_USER=[TVE_1, ..., TVE_m, ..., TVE_n], where TVE_1 is the first item value in the user writing style feature vector, TVE_m is the m-th item value, and TVE_n is the n-th item value.
According to a specific embodiment of the present invention, the user detection mode determination module is further used to prompt the user to upload a pending document, and the pending document feature value generation module is used to generate the pending document feature values of that document. The pending document feature values include, but are not limited to: Chinese character count, foreign-language character count, total word count, content word count, function word count, paragraph count, paragraph length distribution, sentence count, sentence length distribution, synonym and near-synonym expansion, function word usage, punctuation mark usage, and part-of-speech usage. According to a specific embodiment of the present invention, these values are defined as follows.
The Chinese character count is the number of Chinese characters, excluding punctuation marks, in each pending document; each Chinese character counts as one character. The foreign-language character count is the number of foreign-language characters, excluding punctuation marks, in each pending document; each foreign-language word counts as one character. The word count is the total number of words obtained by segmenting each pending document; Chinese text may be segmented with the system's built-in segmentation lexicon, and foreign-language text may be segmented according to foreign-language writing conventions, directly using the spaces between words.
The content word count is the number of content words in each pending document, obtained after segmentation by comparing the segmentation result with the parts of speech in the segmentation lexicon; it may be further divided into a Chinese content word count and a foreign-language content word count, whose sum equals the content word count. The function word count is the number of function words in each pending document, obtained in the same way; it may be further divided into a Chinese function word count and a foreign-language function word count, whose sum equals the function word count.
The paragraph count is the number of paragraphs in each pending document. The paragraph length distribution is the number of words and the number of sentences in each paragraph of each pending document. The sentence count is the number of sentences in each pending document. The sentence length distribution is the number of words in each sentence of each pending document.
Synonym and near-synonym expansion is obtained by comparing the segmentation result of each pending document with a synonym and near-synonym lexicon, grouping segmented words with the same or similar meaning into sets, and counting the words in each set, which reflects the synonym and near-synonym writing habits of the document's author. The more words the synonym or near-synonym sets contain, the more the author's writing style tends toward expanding with synonyms or near-synonyms; the fewer words they contain, the less the author's writing style tends toward such expansion.
Function word usage is the statistics of function word use in each pending document, including but not limited to the usage ranking of function words, the number of words between different function words, and the number of words between identical function words. For example, the usage of the three structural auxiliary words "的", "地", "得" may also be counted, which reflects whether the author of the pending document distinguishes among these three structural auxiliary words.
Punctuation mark usage is the statistics of punctuation mark use in each pending document, including but not limited to the usage ranking of punctuation marks, the number of words between different punctuation marks, and the number of words between identical punctuation marks.
Part-of-speech usage is the statistics of segmented words of each part of speech in each pending document, obtained after segmentation by comparing the segmentation result with the parts of speech in the segmentation lexicon; for example, the numbers of nouns, verbs, adjectives, numerals, quantifiers, pronouns, adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia, and the ratio of each part-of-speech count to the total word count of the pending document.
According to a specific embodiment of the present invention, the pending document feature vector generation module generates the pending document feature vector from the pending document feature values. According to a specific embodiment of the present invention, the system specifies the dimension of the pending document feature vector and the specific content and order of each item in it; these must be consistent with the dimension, and with the meaning and order of each feature value, of the test picture reference feature vector and the test article reference feature vector. When the dimension of the pending document feature vector is n, it can be expressed as TDCVE_USER=[TDC_1, ..., TDC_m, ..., TDC_n], where TDC_1 is the first item value in the pending document feature vector, TDC_m is the m-th item value, and TDC_n is the n-th item value.
Preferably, the pending document feature vector includes: the ratio of the Chinese character count to the total word count, the ratio of the foreign-language character count to the total word count, the ratio of the content word count to the total word count, the ratio of the function word count to the total word count, the ratio of the total word count to the paragraph count, the word count of the longest paragraph, the ratio of the synonym and near-synonym expansion count to the total word count, the ratio of the punctuation mark count to the total word count, the ratio of the noun count to the total word count, the ratio of the verb count to the total word count, the ratio of the adjective count to the total word count, the ratio of the numeral count to the total word count, the ratio of the quantifier count to the total word count, the ratio of the pronoun count to the total word count, the ratio of the adverb count to the total word count, the ratio of the preposition count to the total word count, the ratio of the conjunction count to the total word count, the ratio of the auxiliary word count to the total word count, the ratio of the interjection count to the total word count, and the ratio of the onomatopoeia count to the total word count.
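As an illustration of the part-of-speech items in this vector, the following Python sketch computes the ratio of each part-of-speech count to the total word count. The input format (a list of (word, part-of-speech) pairs), the tag names in POS_ORDER and the function name are assumptions for the example; the segmentation and tagging step itself is not shown.

```python
from collections import Counter

# Assumed tag names, listed in the order the ratios appear in the feature vector
POS_ORDER = [
    "noun", "verb", "adjective", "numeral", "quantifier", "pronoun",
    "adverb", "preposition", "conjunction", "auxiliary",
    "interjection", "onomatopoeia",
]


def pos_ratio_features(tagged_words):
    """Return count(pos) / total word count for each tag in POS_ORDER."""
    total = len(tagged_words)
    if total == 0:
        # An empty document has no meaningful ratios; zeros are an assumed convention
        return [0.0] * len(POS_ORDER)
    counts = Counter(pos for _, pos in tagged_words)
    return [counts[pos] / total for pos in POS_ORDER]


sample = [("猫", "noun"), ("跑", "verb"), ("狗", "noun"), ("快", "adjective")]
ratios = pos_ratio_features(sample)
```

Fixing the tag order in one place keeps the item order of every generated vector identical, which the description requires for the vectors to be comparable.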
The user writing style similarity calculation module is used to calculate the current user's writing style similarity, which can be calculated by the following formula:
The user writing style similarity judgment module compares the current user's writing style similarity SimT(USER) with the self-review threshold preset by the system. When the user writing style similarity SimT(USER) is higher than the self-review threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when SimT(USER) is lower than the self-review threshold, the pending document submitted by the current user is considered consistent with the user's writing style.
The self-review threshold is set in advance by the system. If the self-review threshold is set too high, the pending document submitted by the current user is easily misjudged as inconsistent with the user's writing style; if it is set too low, the pending document is easily misjudged as consistent with the user's writing style. In general, the self-review threshold is selected and verified by the system in advance through experiments, and may be adjusted by the system at any time according to operating conditions.
According to a specific embodiment of the present invention, a first self-review threshold and a second self-review threshold may be set separately, the first self-review threshold being higher than the second. When the user writing style similarity SimT(USER) is higher than the first self-review threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style. When SimT(USER) is lower than the second self-review threshold, the pending document is considered consistent with the user's writing style. When SimT(USER) is greater than or equal to the second self-review threshold and less than or equal to the first self-review threshold, the user's writing style is verified further.
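The three-way decision above can be sketched as follows. The function name, the returned labels and the example threshold values are assumptions for illustration; the text does not give concrete threshold values.

```python
def classify_similarity(sim, first_threshold, second_threshold):
    """Apply the two self-review thresholds to SimT(USER).

    Returns "inconsistent", "consistent" or "verify_further" (assumed labels).
    """
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must be higher than the second")
    if sim > first_threshold:
        return "inconsistent"
    if sim < second_threshold:
        return "consistent"
    # second_threshold <= sim <= first_threshold
    return "verify_further"


# Hypothetical thresholds 0.6 and 0.3
decision = classify_similarity(0.45, first_threshold=0.6, second_threshold=0.3)
```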
The first self-review threshold and the second self-review threshold are set in advance by the system. If the first self-review threshold is set too high, the pending document submitted by the current user is easily misjudged as inconsistent with the user's writing style; if the second self-review threshold is set too low, the pending document is easily misjudged as consistent with the user's writing style; if the interval between the first and second self-review thresholds is set too wide, the user's writing style is verified again too often. In general, the first and second self-review thresholds are selected and verified by the system in advance through experiments, and may be adjusted by the system at any time according to operating conditions.
According to a specific embodiment of the present invention, further verification of the user's writing style is performed by the user writing style structural auxiliary word judgment module, which determines the usage of the three structural auxiliary words "的", "地", "得" in the pending document and in the user test picture description text and/or user test article description text, thereby reflecting the degree to which the author of the pending document and the current user distinguish among these three structural auxiliary words. To determine the usage of "的", "地", "得" in the pending document, the module counts the number of uses of "的", "地" and "得" in the full text of the pending document, denoted T1, T2 and T3 respectively. It further counts the number of times in the full text that the segmented word following "的" is a noun, denoted D1; the number of times the segmented word following "地" is a verb, denoted D2; and the number of times the segmented word following "得" is an adjective, denoted D3. It calculates the ratio D1/T1 of the number of times the segmented word following "的" is a noun to the total number of uses of "的" in the full text; the ratio D2/T2 of the number of times the segmented word following "地" is a verb to the total number of uses of "地"; and the ratio D3/T3 of the number of times the segmented word following "得" is an adjective to the total number of uses of "得". From these it calculates the distinguishing coefficient DC_TD of "的", "地", "得". The value of the distinguishing coefficient DC_TD is greater than or equal to 0 and less than or equal to 3.
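The counting described above can be sketched in Python under several stated assumptions. The three structural auxiliary words are taken to be "的", "地" and "得". The text does not give the formula for the distinguishing coefficient; because its stated range is 0 to 3, this sketch assumes it is the sum D1/T1 + D2/T2 + D3/T3, and it assumes that a word that never occurs contributes 0. The input format and tag names are likewise hypothetical.

```python
# Assumed mapping: part of speech expected after each structural auxiliary word
EXPECTED_NEXT = {"的": "noun", "地": "verb", "得": "adjective"}


def distinguishing_coefficient(tagged_words):
    """Assumed formula: sum of D_i / T_i over the three auxiliary words.

    tagged_words is a list of (word, part_of_speech) pairs in document order.
    """
    totals = {w: 0 for w in EXPECTED_NEXT}   # T1, T2, T3
    matches = {w: 0 for w in EXPECTED_NEXT}  # D1, D2, D3
    for i, (word, _) in enumerate(tagged_words):
        if word not in EXPECTED_NEXT:
            continue
        totals[word] += 1
        has_next = i + 1 < len(tagged_words)
        if has_next and tagged_words[i + 1][1] == EXPECTED_NEXT[word]:
            matches[word] += 1
    # Words that never occur are skipped to avoid division by zero
    return sum(matches[w] / totals[w] for w in EXPECTED_NEXT if totals[w] > 0)


text = [
    ("红", "adjective"), ("的", "auxiliary"), ("花", "noun"),
    ("慢慢", "adverb"), ("地", "auxiliary"), ("走", "verb"),
    ("跑", "verb"), ("得", "auxiliary"), ("快", "adjective"),
]
dc_td = distinguishing_coefficient(text)
```

The same function would be applied to the combined user test description texts to obtain DC_TPT.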
The usage of the three structural auxiliary words "的", "地", "得" in the user test picture description text and/or user test article description text is determined as follows. The module counts the number of uses of "的", "地" and "得" in the full text of the user test picture description text and/or user test article description text (if the user was tested on several pictures and/or several articles, all description texts are merged into one full text), denoted T1', T2' and T3' respectively. It further counts the number of times in that full text that the segmented word following "的" is a noun, denoted D1'; the number of times the segmented word following "地" is a verb, denoted D2'; and the number of times the segmented word following "得" is an adjective, denoted D3'. It calculates the ratio D1'/T1' of the number of times the segmented word following "的" is a noun to the total number of uses of "的" in the full text; the ratio D2'/T2' of the number of times the segmented word following "地" is a verb to the total number of uses of "地"; and the ratio D3'/T3' of the number of times the segmented word following "得" is an adjective to the total number of uses of "得". From these it calculates the distinguishing coefficient DC_TPT of "的", "地", "得". The value of the distinguishing coefficient DC_TPT is greater than or equal to 0 and less than or equal to 3.
The user writing style structural auxiliary word judgment module calculates the offset degree DC_SC between the distinguishing coefficient DC_TD and the distinguishing coefficient DC_TPT by normalizing the absolute value of the difference between DC_TD and DC_TPT.
When the value of DC_SC is less than or equal to the judgment threshold of the offset degree DC_SC, the user writing style structural auxiliary word judgment module judges that the author of the pending document and the user who wrote the test picture description text and/or test article description text have a consistent style in their use of the three structural auxiliary words "的", "地", "得". When the value of DC_SC is greater than the judgment threshold, the module judges that their styles in the use of "的", "地", "得" are inconsistent. The judgment threshold of the offset degree DC_SC is set in advance by the system and may be adjusted at any time according to actual needs. Experimental data from the system's early operation show that a DC_SC value less than or equal to 10% reflects well that the author of the pending document and the user who wrote the test picture description text and/or test article description text have a consistent style in their use of "的", "地", "得"; when the value of DC_SC is greater than 10%, their styles in the use of "的", "地", "得" may be considered inconsistent.
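A possible form of this check is sketched below. The text says only that the absolute difference is normalized; dividing by 3, the maximum value of a distinguishing coefficient, is an assumption made here so that the result lies between 0 and 1. The function names are hypothetical, and the 10% default follows the experimental figure quoted above.

```python
def offset_degree(dc_td, dc_tpt, max_coefficient=3.0):
    """Assumed normalization: |DC_TD - DC_TPT| divided by the maximum coefficient."""
    return abs(dc_td - dc_tpt) / max_coefficient


def same_auxiliary_word_style(dc_td, dc_tpt, threshold=0.10):
    """True when the offset degree DC_SC does not exceed the judgment threshold."""
    return offset_degree(dc_td, dc_tpt) <= threshold


# Hypothetical coefficients for a pending document and the user's test texts
dc_sc = offset_degree(3.0, 1.5)
consistent = same_auxiliary_word_style(3.0, 1.5)
```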
When the user writing style similarity SimT(USER) is greater than or equal to the second self-review threshold and less than or equal to the first self-review threshold, the user writing style judgment module further judges, by means of the offset degree DC_SC, whether the pending document submitted by the current user is consistent with the user's writing style. When the offset degree DC_SC is greater than its judgment threshold, the pending document submitted by the current user is considered inconsistent with the user's writing style; when the offset degree DC_SC is less than or equal to its judgment threshold, the pending document is considered consistent with the user's writing style.
According to a specific embodiment of the present invention, the user access mode detection module prompts the user to upload the document to be identified.
When the user detection mode determination module judges that the current user's detection mode is the common plagiarism identification mode, the document-to-be-identified word segmentation module segments the document to be identified and obtains the segmentation result. The document to be identified must be segmented with the same processing flow that was used to segment the materials in the comparison database.
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word part-of-speech classification module is used to further obtain the part of speech corresponding to each item of the segmentation result. The part-of-speech classification of segmented words is consistent with the classification used for the materials contained in the comparison database.
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word feature value generation module is used to generate the segmented-word feature values of the document to be identified. It counts the number of times each segmented word occurs in the corresponding document to be identified and obtains the feature value WCV_TBI=[W_ID, W_N] for each segmented word, where W_ID is the unique number of the segmented word in the segmentation lexicon and W_N is the total number of times the segmented word occurs in the document to be identified. Preferably, the part of speech of each segmented word is also taken into account, giving the segmented-word part-of-speech feature value WCCV_TBI=[W_ID, W_N, W_CHAR], where W_ID is the unique number of the segmented word in the segmentation lexicon, W_N is the total number of times that segmented word occurs in the document to be identified, and W_CHAR is the part of speech of the segmented word.
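The feature values WCV_TBI=[W_ID, W_N] can be illustrated with a short Python sketch. It assumes the document has already been segmented and each segmented word replaced by its lexicon number W_ID; the function name and the sorting by W_ID are choices made for the example.

```python
from collections import Counter


def segmented_word_feature_values(word_ids):
    """Return one [W_ID, W_N] pair per distinct segmented word, sorted by W_ID."""
    counts = Counter(word_ids)
    return [[w_id, w_n] for w_id, w_n in sorted(counts.items())]


# Hypothetical lexicon numbers of the segmented words, in document order
wcv_tbi = segmented_word_feature_values([7, 3, 7, 7, 3, 9])
```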
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word tightening coefficient generation module is used to generate the segmented-word tightening coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightening coefficients corresponding to each segmented word can be expressed as WGC_TBI=[G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where G_W_ID_1 is the number of segmented words between the first and second occurrences of the segmented word in the document to be identified, G_W_ID_2 is the number of segmented words between its second and third occurrences, and G_W_ID_(W_N-1) is the number of segmented words between its (W_N-1)-th and W_N-th occurrences. G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are the tightening coefficients corresponding to that segmented word.
According to a specific embodiment of the present invention, the tightening coefficients corresponding to each segmented word can further be expressed in vector form as the segmented-word tightening coefficient feature vector WGCVE_TBI=[W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1)], where W_ID is the unique number of the segmented word in the segmentation lexicon, W_N is the total number of times that segmented word occurs in the document to be identified, W_CHAR is the part of speech of the segmented word, and G_W_ID_1, G_W_ID_2, ..., G_W_ID_(W_N-1) are its tightening coefficients as defined above.
The tightening coefficients in the feature vector reveal the overall distribution of a specific segmented word in the corresponding document to be identified. When the document to be identified is very long, or the viewpoints it describes are scattered, this avoids omitting key segmented-word feature values when feature vectors are screened by the total occurrence count W_N or by (W_N / segmented-word free vector dimension WFV). Preferably, a specific part of a document to be identified may also be extracted for comparison according to the tightening coefficients.
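The tightening coefficients are the gaps between consecutive occurrences of a segmented word, which can be sketched as follows. The function name and the representation of the document as a list of lexicon numbers are assumptions for the example.

```python
def tightening_coefficients(word_ids, w_id):
    """Return [G_W_ID_1, ..., G_W_ID_(W_N-1)] for the segmented word w_id.

    Each coefficient is the number of segmented words lying between two
    consecutive occurrences of w_id.
    """
    positions = [i for i, current in enumerate(word_ids) if current == w_id]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


# Hypothetical document as lexicon numbers; word 7 occurs at positions 0, 2 and 5
document = [7, 3, 7, 9, 9, 7]
wgc_tbi = tightening_coefficients(document, 7)
```

A word that occurs W_N times yields W_N-1 coefficients, so a word that occurs only once yields an empty list.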
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word free vector dimension determination module is used to determine the segmented-word free vector dimension WFV_TBI from the segmentation result of the document to be identified. When the document to be identified is short, or its segmentation result contains few segmented words, the resulting free vector dimension WFV_TBI is small; when the document to be identified is long, or its segmentation result contains many segmented words, the resulting WFV_TBI is large.
When the user detection mode determination module judges that the current user's detection mode is the extended plagiarism identification mode, the document-to-be-identified segmented-word group module segments the document to be identified and obtains the segmented-word group result. Segmented words with the same or similar meaning form one group and are numbered by group, so that several segmented words with the same or similar meaning correspond to one segmented-word group number. The document to be identified must be segmented with the same processing flow that was used to segment the materials in the comparison database.
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word group part-of-speech classification module is used to further obtain the part of speech corresponding to each segmented-word group result. The part-of-speech classification of segmented-word groups is consistent with the segmented-word group classification used for the materials contained in the comparison database.
According to a specific embodiment of the present invention, the document-to-be-identified segmented-word group feature value generation module is used to generate the segmented-word group feature values of the document to be identified. It counts the number of times each segmented-word group occurs in the corresponding document to be identified and obtains the feature value WGCV_TBI=[WG_ID, WG_N] for each segmented-word group, where WG_ID is the unique number of the segmented-word group in the segmentation lexicon and WG_N is the total number of times the segmented-word group occurs in the document to be identified. Preferably, the part of speech of each segmented-word group is also taken into account, giving the segmented-word group part-of-speech feature value WGCCV_TBI=[WG_ID, WG_N, WG_CHAR], where WG_ID is the unique number of the segmented-word group in the segmentation lexicon, WG_N is the total number of times that segmented-word group occurs in the document to be identified, and WG_CHAR is the part of speech of the segmented-word group.
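Grouping by meaning and counting by group can be sketched as follows. The synonym table group_of, its group numbers and the choice to skip words missing from the table are all assumptions made for illustration; a real system would take the groups from its synonym and near-synonym lexicon.

```python
from collections import Counter


def group_feature_values(words, group_of):
    """Return one [WG_ID, WG_N] pair per segmented-word group, sorted by WG_ID.

    group_of maps a segmented word to its group number; words missing from
    the table are skipped in this sketch (an assumed policy).
    """
    group_ids = [group_of[word] for word in words if word in group_of]
    counts = Counter(group_ids)
    return [[wg_id, wg_n] for wg_id, wg_n in sorted(counts.items())]


# Hypothetical table: two near-synonyms share group 1
group_of = {"快乐": 1, "高兴": 1, "悲伤": 2}
wgcv_tbi = group_feature_values(["快乐", "高兴", "悲伤", "高兴", "今天"], group_of)
```

Because "快乐" and "高兴" map to the same group number, replacing one with the other leaves the group counts unchanged, which is the point of the extended plagiarism identification mode.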
A specific embodiment according to the present invention, document participle group tightening coefficient generation module to be identified are used to generate
Document to be identified segments tightening coefficient.A specific embodiment according to the present invention, the corresponding participle of each participle group are tight
Close coefficient can be expressed as WGGC_TBI=[G_WG_ID_1, G_WG_ID_2 ..., G_WG_ID_ (WG_N-1)], wherein, G_
WG_ID_1 represents the participle number that the participle group is spaced between occurring for the first time and occur for second in the document to be identified
Amount, G_WG_ID_2 represents the number of participles between the second and third occurrences of the participle group in the document to be identified, and G_WG_ID_(WG_N-1) represents the number of participles between the (WG_N-1)-th and WG_N-th occurrences of the participle group in the document to be identified; G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the participle group tightening coefficients corresponding to the participle group. According to a specific embodiment of the present invention, the tightening coefficients of each participle group can further be expressed in vector form as the participle group tightening coefficient feature vector WGGCVE_TBI=[WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1)], where WG_ID represents the unique number of the participle group in the participle library, WG_N represents the total number of times the participle group occurs in the document to be identified, WG_CHAR represents the part of speech of the participle group, G_WG_ID_1 represents the number of participles between the first and second occurrences of the participle group in the document to be identified, G_WG_ID_2 represents the number of participles between the second and third occurrences, and G_WG_ID_(WG_N-1) represents the number of participles between the (WG_N-1)-th and WG_N-th occurrences. Here G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_(WG_N-1) are the tightening coefficients of the participle group. From these tightening coefficients, the overall distribution of a specific participle group within the corresponding document to be identified can be known. Thus, when the document to be identified is long overall or its viewpoints are scattered, crucial participle feature values are not omitted, as they might be if participle feature vectors were screened only by the total count W_N or by the ratio W_N / WFV (where WFV is the participle free vector dimension). Preferably, a specific part of a document to be identified can also be extracted for comparison according to the tightening coefficients.
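The tightening coefficients described above can be illustrated with a short Python sketch. It assumes that the document has already been segmented into a sequence of participle group numbers, and that the number of participles "between" two occurrences excludes the occurrences themselves. The function name tightening_vector and the sample data are hypothetical and are not part of the original disclosure.

```python
from typing import List, Sequence


def tightening_vector(group_ids: Sequence[int], wg_id: int, wg_char: str) -> List:
    """Build [WG_ID, WG_N, WG_CHAR, G_1, ..., G_(WG_N-1)] for one participle group.

    group_ids is the segmented document, given as participle group numbers
    in reading order (an assumed input format).
    """
    # Positions at which the group occurs in the document.
    positions = [i for i, g in enumerate(group_ids) if g == wg_id]
    # Participles strictly between consecutive occurrences (assumption:
    # the occurrences themselves are not counted).
    gaps = [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]
    return [wg_id, len(positions), wg_char, *gaps]


document = [7, 3, 3, 7, 5, 5, 5, 7]
print(tightening_vector(document, 7, "n"))  # [7, 3, 'n', 2, 3]
```

Small gaps indicate that the group is concentrated in one passage, while large gaps indicate that it is spread across the document.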
According to a specific embodiment of the present invention, the participle group free vector dimension determining module of the document to be identified is used to determine the participle group free vector dimension WGFV_TBI from the word segmentation result of the document to be identified. When the document to be identified is short or yields few segmentation results, the resulting participle group free vector dimension WGFV_TBI is small; when the document to be identified is long or yields many segmentation results, the resulting participle group free vector dimension WGFV_TBI is large.
When the user detection mode determining module judges that the current user detection mode is the multilingual plagiarism identification mode, the Chinese-foreign participle group module of the document to be identified is used to segment the document to be identified and obtain the Chinese-foreign participle group results. Chinese and foreign-language participles with identical or similar meanings form one group, and numbering is done by group, so that multiple Chinese and foreign-language participles with identical or similar meanings correspond to one Chinese-foreign participle group number. When the document to be identified is segmented, the same processing flow must be used as for segmenting the material in the comparison database.
According to a specific embodiment of the present invention, the participle group part-of-speech classification module of the document to be identified is used to further obtain the part of speech corresponding to each participle group result. The part-of-speech classification scheme is consistent with the participle group classification scheme used for the material contained in the comparison database.
According to a specific embodiment of the present invention, the Chinese-foreign participle group feature value generation module of the document to be identified is used to generate the Chinese-foreign participle group feature values of the document to be identified. The module counts the number of times each Chinese-foreign participle group occurs in the corresponding document to be identified and obtains the feature value WFGCV_TBI=[WFG_ID, WFG_N] for each group, where WFG_ID represents the unique number of the Chinese-foreign participle group in the participle library and WFG_N represents the total number of times the group occurs in the document to be identified. Preferably, the part of speech of each Chinese-foreign participle group is also taken into account, giving the part-of-speech feature value WFGCCV_TBI=[WFG_ID, WFG_N, WFG_CHAR], where WFG_ID represents the unique number of the Chinese-foreign participle group in the participle library, WFG_N represents the total number of times the group occurs in the document to be identified, and WFG_CHAR represents the part of speech of the group.
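As a minimal sketch of how such feature values might be counted, the following Python example maps Chinese and foreign-language participles of the same meaning to one group number and tallies the occurrences per group. The lexicon contents, the group numbers, the part-of-speech tags and the function name group_feature_values are all hypothetical; the original text does not specify how the participle library is stored.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical participle library: participle -> (WFG_ID, WFG_CHAR).
# Participles with the same or a similar meaning share one group number.
GROUP_LIBRARY: Dict[str, Tuple[int, str]] = {
    "计算机": (101, "n"),
    "电脑": (101, "n"),
    "computer": (101, "n"),
    "检测": (202, "v"),
    "detect": (202, "v"),
}


def group_feature_values(participles: Sequence[str]) -> List[list]:
    """Return one [WFG_ID, WFG_N, WFG_CHAR] entry per group found in the document."""
    counts: Counter = Counter()
    part_of_speech: Dict[int, str] = {}
    for participle in participles:
        entry = GROUP_LIBRARY.get(participle)
        if entry is None:
            continue  # participles outside the library are ignored in this sketch
        group_id, tag = entry
        counts[group_id] += 1
        part_of_speech[group_id] = tag
    return [[gid, n, part_of_speech[gid]] for gid, n in sorted(counts.items())]


print(group_feature_values(["computer", "检测", "电脑", "the", "计算机"]))
```

Dropping the third element of each entry gives the two-element form [WFG_ID, WFG_N].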
According to a specific embodiment of the present invention, the Chinese-foreign participle group tightening coefficient generation module of the document to be identified is used to generate the Chinese-foreign participle group tightening coefficients of the document to be identified. According to a specific embodiment of the present invention, the tightening coefficients corresponding to each Chinese-foreign participle group can be expressed as WFGGC_TBI=[G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where G_WFG_ID_1 represents the number of participles between the first and second occurrences of the Chinese-foreign participle group in the document to be identified, G_WFG_ID_2 represents the number of participles between the second and third occurrences, and G_WFG_ID_(WFG_N-1) represents the number of participles between the (WFG_N-1)-th and WFG_N-th occurrences; G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the tightening coefficients corresponding to the Chinese-foreign participle group. According to a specific embodiment of the present invention, the tightening coefficients of each Chinese-foreign participle group can further be expressed in vector form as the Chinese-foreign participle group tightening coefficient feature vector WFGGCVE_TBI=[WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1)], where WFG_ID represents the unique number of the Chinese-foreign participle group in the participle library, WFG_N represents the total number of times the group occurs in the document to be identified, WFG_CHAR represents the part of speech of the group, G_WFG_ID_1 represents the number of participles between the first and second occurrences of the group in the document to be identified, G_WFG_ID_2 represents the number of participles between the second and third occurrences, and G_WFG_ID_(WFG_N-1) represents the number of participles between the (WFG_N-1)-th and WFG_N-th occurrences. Here G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_(WFG_N-1) are the tightening coefficients of the Chinese-foreign participle group. From these tightening coefficients, the overall distribution of a specific Chinese-foreign participle group within the corresponding document to be identified can be known.
According to a specific embodiment of the present invention, the Chinese-foreign participle group free vector dimension determining module of the document to be identified is used to determine the Chinese-foreign participle group free vector dimension WFGFV_TBI from the word segmentation result of the document to be identified. When the document to be identified is short or yields few segmentation results, the resulting Chinese-foreign participle group free vector dimension WFGFV_TBI is small; when the document to be identified is long or yields many segmentation results, the resulting Chinese-foreign participle group free vector dimension WFGFV_TBI is large.
According to a specific embodiment of the present invention, the reduced participle vector dimension generation module of the document to be identified is used to reduce the participle free vector dimension WFV_TBI of the document to be identified and to generate the reduced participle vector dimension RWV_TBI of the document to be identified. The reduced participle vector dimension RWV_TBI is specified by the system. Preferably, the system specifies a reduced participle vector dimension RWV_TBI of 500. Preferably, the system specifies a reduced participle vector dimension RWV_TBI of 800. Preferably, the system specifies a reduced participle vector dimension RWV_TBI of 1000.
According to a specific embodiment of the present invention, the reduced participle vector dimension generation module of the document to be identified uses the equal-interval extraction method to reduce the participle free vector dimension WFV_TBI of the document to be identified. The reduction proceeds as follows. Determine whether the participle free vector dimension WFV_TBI is greater than the reduced participle vector dimension RWV_TBI. If so, divide WFV_TBI by the system-specified RWV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI of the document to be identified. Then, among the feature values corresponding to WFV_TBI, extract one feature value at intervals of REDU_TBI-1 feature values. After this pass over all feature values, determine whether the number of extracted feature values equals RWV_TBI. If it does, the reduction of WFV_TBI is complete. If it is less than RWV_TBI, compute the difference between RWV_TBI and the number of extracted feature values, and randomly extract that many feature values from those not yet extracted, which completes the reduction of the participle free vector dimension WFV_TBI of the document to be identified.
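The equal-interval extraction procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of the disclosed system: extraction is assumed to start at the first feature value, "at intervals of REDU_TBI-1" is read as skipping REDU_TBI-1 feature values between extractions, and the result is returned in the original order. The function name reduce_equal_interval is hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def reduce_equal_interval(features: Sequence, rwv: int,
                          rng: Optional[random.Random] = None) -> List:
    """Reduce a list of WFV feature values to RWV values by equal-interval extraction."""
    rng = rng or random.Random()
    wfv = len(features)
    if wfv <= rwv:
        # Nothing to reduce; the text handles this case separately.
        return list(features)
    # Reduction coefficient REDU: quotient WFV / RWV rounded up.
    redu = math.ceil(wfv / rwv)
    # Take one feature value, skip REDU-1, take the next, and so on.
    picked = list(range(0, wfv, redu))
    shortfall = rwv - len(picked)
    if shortfall > 0:
        # Fill the difference at random from the values not yet extracted.
        remaining = [i for i in range(wfv) if i % redu != 0]
        picked += rng.sample(remaining, shortfall)
    return [features[i] for i in sorted(picked)]


print(reduce_equal_interval(list(range(10)), 4))  # [0, 3, 6, 9]
```

Because the quotient is rounded up, the first pass can never extract more than RWV feature values, so only the shortfall case needs the random top-up.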
According to a specific embodiment of the present invention, the reduced participle vector dimension generation module of the document to be identified uses the part-of-speech screening method to reduce the participle free vector dimension WFV_TBI of the document to be identified. The reduction proceeds as follows. The feature values are classified by the part of speech of the corresponding participle; according to a specific embodiment of the present invention, they are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values and V-class function word feature values. It is generally considered that feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class notional word feature values), AMOUNT_A2 (the number of A2-class notional word feature values), AMOUNT_B (the number of B-class notional word feature values), AMOUNT_C (the number of C-class notional word feature values), AMOUNT_D (the number of D-class notional word feature values) and AMOUNT_V (the number of V-class function word feature values). Compute the value RWV_TBI_S_V = RWV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWV_TBI_S_D = RWV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWV_TBI_S_D is greater than 0, randomly extract RWV_TBI_S_D feature values from the feature values counted in AMOUNT_V, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWV_TBI_S_C = RWV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWV_TBI_S_C is greater than 0, randomly extract RWV_TBI_S_C feature values from the feature values counted in AMOUNT_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWV_TBI_S_B = RWV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWV_TBI_S_B is greater than 0, randomly extract RWV_TBI_S_B feature values from the feature values counted in AMOUNT_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWV_TBI_S_A2 = RWV_TBI - (AMOUNT_A1+AMOUNT_A2). If RWV_TBI_S_A2 is greater than 0, randomly extract RWV_TBI_S_A2 feature values from the feature values counted in AMOUNT_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWV_TBI_S_A1 = RWV_TBI - AMOUNT_A1. If RWV_TBI_S_A1 is greater than 0, randomly extract RWV_TBI_S_A1 feature values from the feature values counted in AMOUNT_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract RWV_TBI feature values from the feature values counted in AMOUNT_A1, which completes this reduction.
When the value RWV_TBI_S_V = RWV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or contains little information, and is therefore not suitable for comparison by feature values.
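The part-of-speech screening cascade can be condensed into a loop over the classes in priority order. The Python sketch below is one reading of the procedure, not the implementation of the disclosed system: classes are kept whole in the order A1, A2, B, C, D, V for as long as they fit into RWV_TBI, and the first class that does not fit is sampled at random to fill the remaining room. The function name reduce_by_part_of_speech, the dictionary input format and the use of None to signal "exit this reduction" are assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Sequence

# Priority order of the classes named in the text: notional classes first,
# the function-word class V last.
PRIORITY = ["A1", "A2", "B", "C", "D", "V"]


def reduce_by_part_of_speech(features_by_class: Dict[str, Sequence], rwv: int,
                             rng: Optional[random.Random] = None) -> Optional[List]:
    """Select RWV feature values, preferring higher-priority classes.

    Returns None when the document has fewer than RWV feature values in
    total (the "greater than 0, exit" branch of the first difference).
    """
    rng = rng or random.Random()
    total = sum(len(features_by_class.get(cls, [])) for cls in PRIORITY)
    if rwv - total > 0:
        return None  # document too short to compare by feature values
    selected: List = []
    for cls in PRIORITY:
        room = rwv - len(selected)
        if room == 0:
            break  # a difference of exactly 0: the reduction is complete
        members = list(features_by_class.get(cls, []))
        if len(members) <= room:
            selected.extend(members)  # the whole class fits
        else:
            selected.extend(rng.sample(members, room))  # random fill, then done
    return selected


classes = {"A1": [1, 2], "A2": [3], "B": [4, 5, 6], "V": [7]}
print(reduce_by_part_of_speech(classes, 4))
```

With the sample data and RWV = 4, the A1 and A2 classes are kept whole and one B-class feature value is drawn at random, which corresponds to the branch in which RWV_TBI_S_A2 is greater than 0.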
When the participle free vector dimension WFV_TBI of the document to be identified is smaller than the reduced participle vector dimension RWV_TBI of the document to be identified, the dimension itself is small; the magnitudes in the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded and processed separately.
According to a specific embodiment of the present invention, the reduced participle group vector dimension generation module of the document to be identified is used to reduce the participle group free vector dimension WGFV_TBI of the document to be identified and to generate the reduced participle group vector dimension RWGV_TBI of the document to be identified. The reduced participle group vector dimension RWGV_TBI is specified by the system. Preferably, the system specifies a reduced participle group vector dimension RWGV_TBI of 500. Preferably, the system specifies a reduced participle group vector dimension RWGV_TBI of 800. Preferably, the system specifies a reduced participle group vector dimension RWGV_TBI of 1000.
According to a specific embodiment of the present invention, the reduced participle group vector dimension generation module of the document to be identified uses the equal-interval extraction method to reduce the participle group free vector dimension WGFV_TBI of the document to be identified. The reduction proceeds as follows. Determine whether the participle group free vector dimension WGFV_TBI is greater than the reduced participle group vector dimension RWGV_TBI. If so, divide WGFV_TBI by the system-specified RWGV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI. Then, among the feature values corresponding to WGFV_TBI, extract one feature value at intervals of REDU_TBI-1 feature values. After this pass over all feature values, determine whether the number of extracted feature values equals RWGV_TBI. If it does, the reduction of WGFV_TBI is complete. If it is less than RWGV_TBI, compute the difference between RWGV_TBI and the number of extracted feature values, and randomly extract that many feature values from those not yet extracted, which completes the reduction of the participle group free vector dimension WGFV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the reduced participle group vector dimension generation module of the document to be identified uses the part-of-speech screening method to reduce the participle group free vector dimension WGFV_TBI of the document to be identified. The reduction proceeds as follows. The feature values are classified by the part of speech of the corresponding participle group; according to a specific embodiment of the present invention, they are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values and V-class function word feature values. It is generally considered that feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class notional word feature values), AMOUNT_A2 (the number of A2-class notional word feature values), AMOUNT_B (the number of B-class notional word feature values), AMOUNT_C (the number of C-class notional word feature values), AMOUNT_D (the number of D-class notional word feature values) and AMOUNT_V (the number of V-class function word feature values). Compute the value RWGV_TBI_S_V = RWGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWGV_TBI_S_D = RWGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWGV_TBI_S_D is greater than 0, randomly extract RWGV_TBI_S_D feature values from the feature values counted in AMOUNT_V, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWGV_TBI_S_C = RWGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWGV_TBI_S_C is greater than 0, randomly extract RWGV_TBI_S_C feature values from the feature values counted in AMOUNT_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWGV_TBI_S_B = RWGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWGV_TBI_S_B is greater than 0, randomly extract RWGV_TBI_S_B feature values from the feature values counted in AMOUNT_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWGV_TBI_S_A2 = RWGV_TBI - (AMOUNT_A1+AMOUNT_A2). If RWGV_TBI_S_A2 is greater than 0, randomly extract RWGV_TBI_S_A2 feature values from the feature values counted in AMOUNT_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWGV_TBI_S_A1 = RWGV_TBI - AMOUNT_A1. If RWGV_TBI_S_A1 is greater than 0, randomly extract RWGV_TBI_S_A1 feature values from the feature values counted in AMOUNT_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract RWGV_TBI feature values from the feature values counted in AMOUNT_A1, which completes this reduction.
When the value RWGV_TBI_S_V = RWGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or contains little information, and is therefore not suitable for comparison by feature values.
When the participle group free vector dimension WGFV_TBI of the document to be identified is smaller than the reduced participle group vector dimension RWGV_TBI of the document to be identified, the dimension itself is small; the magnitudes in the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded and processed separately.
According to a specific embodiment of the present invention, the reduced Chinese-foreign participle group vector dimension generation module of the document to be identified is used to reduce the Chinese-foreign participle group free vector dimension WFGFV_TBI of the document to be identified and to generate the reduced Chinese-foreign participle group vector dimension RWFGV_TBI of the document to be identified. The reduced Chinese-foreign participle group vector dimension RWFGV_TBI is specified by the system. Preferably, the system specifies a reduced Chinese-foreign participle group vector dimension RWFGV_TBI of 500. Preferably, the system specifies a reduced Chinese-foreign participle group vector dimension RWFGV_TBI of 800. Preferably, the system specifies a reduced Chinese-foreign participle group vector dimension RWFGV_TBI of 1000.
According to a specific embodiment of the present invention, the reduced Chinese-foreign participle group vector dimension generation module of the document to be identified uses the equal-interval extraction method to reduce the Chinese-foreign participle group free vector dimension WFGFV_TBI of the document to be identified. The reduction proceeds as follows. Determine whether the Chinese-foreign participle group free vector dimension WFGFV_TBI is greater than the reduced Chinese-foreign participle group vector dimension RWFGV_TBI. If so, divide WFGFV_TBI by the system-specified RWFGV_TBI and round the quotient up to obtain the reduction coefficient REDU_TBI. Then, among the feature values corresponding to WFGFV_TBI, extract one feature value at intervals of REDU_TBI-1 feature values. After this pass over all feature values, determine whether the number of extracted feature values equals RWFGV_TBI. If it does, the reduction of WFGFV_TBI is complete. If it is less than RWFGV_TBI, compute the difference between RWFGV_TBI and the number of extracted feature values, and randomly extract that many feature values from those not yet extracted, which completes the reduction of the Chinese-foreign participle group free vector dimension WFGFV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the reduced Chinese-foreign participle group vector dimension generation module of the document to be identified uses the part-of-speech screening method to reduce the Chinese-foreign participle group free vector dimension WFGFV_TBI of the document to be identified. The reduction proceeds as follows. The feature values are classified by the part of speech of the corresponding Chinese-foreign participle group; according to a specific embodiment of the present invention, they are divided into A1-class notional word feature values, A2-class notional word feature values, B-class notional word feature values, C-class notional word feature values, D-class notional word feature values and V-class function word feature values. It is generally considered that feature values corresponding to notional words play a greater role in similarity comparison, and that technical-term nouns reflect the effective content of the document to be identified better than common nouns. The number of feature values in each class is counted: AMOUNT_A1 (the number of A1-class notional word feature values), AMOUNT_A2 (the number of A2-class notional word feature values), AMOUNT_B (the number of B-class notional word feature values), AMOUNT_C (the number of C-class notional word feature values), AMOUNT_D (the number of D-class notional word feature values) and AMOUNT_V (the number of V-class function word feature values). Compute the value RWFGV_TBI_S_V = RWFGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V). If it is greater than 0, exit this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWFGV_TBI_S_D = RWFGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D). If RWFGV_TBI_S_D is greater than 0, randomly extract RWFGV_TBI_S_D feature values from the feature values counted in AMOUNT_V, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWFGV_TBI_S_C = RWFGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C). If RWFGV_TBI_S_C is greater than 0, randomly extract RWFGV_TBI_S_C feature values from the feature values counted in AMOUNT_D, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWFGV_TBI_S_B = RWFGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B). If RWFGV_TBI_S_B is greater than 0, randomly extract RWFGV_TBI_S_B feature values from the feature values counted in AMOUNT_C, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWFGV_TBI_S_A2 = RWFGV_TBI - (AMOUNT_A1+AMOUNT_A2). If RWFGV_TBI_S_A2 is greater than 0, randomly extract RWFGV_TBI_S_A2 feature values from the feature values counted in AMOUNT_B, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, further compute the value RWFGV_TBI_S_A1 = RWFGV_TBI - AMOUNT_A1. If RWFGV_TBI_S_A1 is greater than 0, randomly extract RWFGV_TBI_S_A1 feature values from the feature values counted in AMOUNT_A2, which completes this reduction; if it equals 0, this reduction is complete; if it is less than 0, randomly extract RWFGV_TBI feature values from the feature values counted in AMOUNT_A1, which completes this reduction.
When the value RWFGV_TBI_S_V = RWFGV_TBI - (AMOUNT_A1+AMOUNT_A2+AMOUNT_B+AMOUNT_C+AMOUNT_D+AMOUNT_V) is greater than 0, the document to be identified is short or contains little information, and is therefore not suitable for comparison by feature values.
When the Chinese-foreign participle group free vector dimension WFGFV_TBI of the document to be identified is smaller than the reduced Chinese-foreign participle group vector dimension RWFGV_TBI of the document to be identified, the dimension itself is small; the magnitudes in the remaining dimensions are then equivalent to 0, and the document can be marked directly in the system and recorded and processed separately.
Preferably, to facilitate similarity comparison, the reduced participle vector dimension RWV selected in the system for the material should equal the reduced participle vector dimension RWV_TBI of the document to be identified; the reduced participle group vector dimension RWGV of the material should equal the reduced participle group vector dimension RWGV_TBI of the document to be identified; and the reduced Chinese-foreign participle group vector dimension RWFGV of the material should equal the reduced Chinese-foreign participle group vector dimension RWFGV_TBI of the document to be identified.
According to a specific embodiment of the present invention, the participle feature vector generation module for the document to be identified extracts, according to the participle simplified vector dimension RWV_TBI, the characteristic values in each document to be identified that correspond to that dimension, and generates the participle feature vector WVE_RWV_TBI of the document to be identified, where
WVE_RWV_TBI = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI]
Here W_ID_i represents the unique number of the participle in the participle library, and W_N_i represents the total number of times the participle occurs in the document to be identified; this count is used as the characteristic value of the participle.
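The construction of such a vector can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_feature_vector` is hypothetical, and keeping the most frequent participles when truncating to the simplified dimension is an assumption made only so the example is complete.

```python
from collections import Counter

def build_feature_vector(participle_ids, dimension):
    """Return [(W_ID, W_N), ...] for at most `dimension` participles.

    participle_ids: participle-library IDs of the document's participles.
    dimension: the simplified vector dimension (RWV_TBI in the text).
    Assumption: the `dimension` most frequent participles are kept; the
    patent selects them by its own simplification procedure.
    """
    counts = Counter(participle_ids)
    # Sort by descending count, then ascending ID, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:dimension]

doc_ids = [7, 3, 7, 12, 3, 7, 5]
vector = build_feature_vector(doc_ids, dimension=3)
print(vector)
```

Each pair holds a participle ID and its occurrence count, mirroring the alternating W_ID_i, W_N_i layout of WVE_RWV_TBI.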
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current user detection mode is the common plagiarism identification mode, similarity comparison proceeds as follows. The participle feature vector generation module for the document to be identified generates the participle feature vector of the document to be identified, WVE_RWV_TBI = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV_TBI, W_N_RWV_TBI], whose dimension is RWV_TBI. The participle feature vector generation module generates the participle feature vector of each material in the comparison database, WVE_RWV = [W_ID_1, W_N_1, ..., W_ID_i, W_N_i, ..., W_ID_RWV, W_N_RWV]. The dimension RWV_TBI of the participle feature vector of the document to be identified equals the dimension RWV of the material participle feature vector.
It should be noted that although both WVE_RWV_TBI and WVE_RWV use W_ID_i to denote the unique number of a participle in the participle library and W_N_i to denote the total number of times that participle occurs in the corresponding document, with this count serving as the characteristic value of the participle, the W_ID_i values in WVE_RWV_TBI are very likely to differ from the W_ID_i values in WVE_RWV. Therefore, before similarity comparison, the dimensions of the two participle feature vectors must be made consistent.
According to a specific embodiment of the present invention, the feature vector adjustment module for the document to be identified arranges the W_ID_i values corresponding to all characteristic values in the participle feature vector WVE_RWV_TBI in ascending or descending order of their numbers in the participle library, and inserts the missing W_ID_i values; the characteristic value corresponding to each inserted participle number W_ID_i is 0. Assuming the total number of participle numbers in the participle library is W, W - RWV_TBI participle numbers must be inserted, which yields the extended participle feature vector of the document to be identified: WVE_RWV_TBI_EXT = [W_ID_TBI_EXT_1, W_N_TBI_EXT_1, ..., W_ID_TBI_EXT_i, W_N_TBI_EXT_i, ..., W_ID_TBI_EXT_RWV_TBI, W_N_TBI_EXT_RWV_TBI, ..., W_ID_W, W_N_W].
According to a specific embodiment of the present invention, the material feature vector adjustment module arranges the W_ID_i values corresponding to all characteristic values in the participle feature vector WVE_RWV in ascending or descending order of their numbers in the participle library, and inserts the missing W_ID_i values; the characteristic value corresponding to each inserted participle number W_ID_i is 0. Assuming the total number of participle numbers in the participle library is W, W - RWV participle numbers must be inserted, which yields the extended participle feature vector of the material: WVE_RWV_EXT = [W_ID_EXT_1, W_N_EXT_1, ..., W_ID_EXT_i, W_N_EXT_i, ..., W_ID_EXT_RWV, W_N_EXT_RWV, ..., W_ID_W, W_N_W].
In this way, the participle feature vectors of the document to be identified and of the materials in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the participle library, so that corresponding positions in the two participle feature vectors refer to the same participle.
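The extension step can be illustrated with a short sketch. It assumes, purely for illustration, that participle-library IDs run from 1 to W and that a feature vector is held as a list of (ID, count) pairs; `extend_vector` is a hypothetical name.

```python
def extend_vector(vector, total_ids):
    """Extend a sparse [(id, count), ...] vector to every ID in 1..total_ids.

    IDs missing from `vector` are inserted with characteristic value 0, and
    the result is ordered by ascending ID.
    Assumption: library IDs are the integers 1..W.
    """
    counts = dict(vector)
    return [(pid, counts.get(pid, 0)) for pid in range(1, total_ids + 1)]

doc_ext = extend_vector([(4, 2), (1, 3)], total_ids=5)
mat_ext = extend_vector([(2, 1), (4, 5)], total_ids=5)
print(doc_ext)
print(mat_ext)
```

After extension both vectors have W entries, and the entry at a given position refers to the same participle ID in each.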
The common plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database. The similarity is calculated by a formula (the formula is not reproduced in this text).
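The similarity formula itself does not survive in this text. Purely as a stand-in, the sketch below assumes cosine similarity over the count values of two aligned vectors; this is a common choice for count vectors, not necessarily the measure the patent uses, and `cosine_similarity` here is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length count vectors.

    Assumption: cosine similarity is used only as an illustrative measure;
    the patent's own formula is not available in this text.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares nothing with any other vector
    return dot / (norm_a * norm_b)

doc_counts = [3, 0, 0, 2, 0]   # W_N values of the extended document vector
mat_counts = [0, 1, 0, 5, 0]   # W_N values of the extended material vector
print(cosine_similarity(doc_counts, mat_counts))
```

Only the count entries are passed in, because after alignment the ID entries are identical in both vectors.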
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current user detection mode is the extended plagiarism identification mode, similarity comparison proceeds as follows. The participle group feature vector generation module for the document to be identified generates the participle group feature vector of the document to be identified, WVE_RWGV_TBI = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV_TBI, WG_N_RWGV_TBI], whose dimension is RWGV_TBI. The participle group feature vector generation module generates the participle group feature vector of each material in the comparison database, WVE_RWGV = [WG_ID_1, WG_N_1, ..., WG_ID_i, WG_N_i, ..., WG_ID_RWGV, WG_N_RWGV]. Here WG_ID_i represents the unique number of the participle group in the participle library, and WG_N_i represents the total number of times the participle group occurs in the corresponding document; this count is used as the characteristic value of the participle group. The dimension RWGV_TBI of the participle group feature vector of the document to be identified equals the dimension RWGV of the material participle group feature vector.
Similar to the processing in the common plagiarism identification mode, according to a specific embodiment of the present invention, in the extended plagiarism identification mode the feature vector adjustment module for the document to be identified produces the extended participle group feature vector of the document to be identified, WVE_RWGV_TBI_EXT = [WG_ID_TBI_EXT_1, WG_N_TBI_EXT_1, ..., WG_ID_TBI_EXT_i, WG_N_TBI_EXT_i, ..., WG_ID_TBI_EXT_RWGV_TBI, WG_N_TBI_EXT_RWGV_TBI, ..., WG_ID_W, WG_N_W]; the material feature vector adjustment module produces the extended participle group feature vector of the material, WVE_RWGV_EXT = [WG_ID_EXT_1, WG_N_EXT_1, ..., WG_ID_EXT_i, WG_N_EXT_i, ..., WG_ID_EXT_RWGV, WG_N_EXT_RWGV, ..., WG_ID_W, WG_N_W].
In this way, the participle group feature vectors of the document to be identified and of the materials in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the participle library, so that corresponding positions in the two participle group feature vectors refer to the same participle group.
The extended plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database. The similarity is calculated by a formula (the formula is not reproduced in this text).
According to a specific embodiment of the present invention, when the user detection mode determining module determines that the current user detection mode is the multilingual plagiarism identification mode, similarity comparison proceeds as follows. The Chinese-foreign-language participle group feature vector generation module for the document to be identified generates the Chinese-foreign-language participle group feature vector of the document to be identified, WVE_RWFGV_TBI = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV_TBI, WFG_N_RWFGV_TBI], whose dimension is RWFGV_TBI. The participle group feature vector generation module generates the Chinese-foreign-language participle group feature vector of each material in the comparison database, WVE_RWFGV = [WFG_ID_1, WFG_N_1, ..., WFG_ID_i, WFG_N_i, ..., WFG_ID_RWFGV, WFG_N_RWFGV]. Here WFG_ID_i represents the unique number of the Chinese-foreign-language participle group in the participle library, and WFG_N_i represents the total number of times the Chinese-foreign-language participle group occurs in the corresponding document; this count is used as the characteristic value of the Chinese-foreign-language participle group. The dimension RWFGV_TBI of the Chinese-foreign-language participle group feature vector of the document to be identified equals the dimension RWFGV of the material Chinese-foreign-language participle group feature vector.
Similar to the processing in the common plagiarism identification mode, according to a specific embodiment of the present invention, in the multilingual plagiarism identification mode the feature vector adjustment module for the document to be identified produces the extended Chinese-foreign-language participle group feature vector of the document to be identified, WVE_RWFGV_TBI_EXT = [WFG_ID_TBI_EXT_1, WFG_N_TBI_EXT_1, ..., WFG_ID_TBI_EXT_i, WFG_N_TBI_EXT_i, ..., WFG_ID_TBI_EXT_RWFGV_TBI, WFG_N_TBI_EXT_RWFGV_TBI, ..., WFG_ID_W, WFG_N_W]; the material feature vector adjustment module produces the extended Chinese-foreign-language participle group feature vector of the material, WVE_RWFGV_EXT = [WFG_ID_EXT_1, WFG_N_EXT_1, ..., WFG_ID_EXT_i, WFG_N_EXT_i, ..., WFG_ID_EXT_RWFGV, WFG_N_EXT_RWFGV, ..., WFG_ID_W, WFG_N_W].
In this way, the Chinese-foreign-language participle group feature vectors of the document to be identified and of the materials in the comparison database are both extended to dimension W and arranged uniformly in ascending or descending order of the numbers in the participle library, so that corresponding positions in the two feature vectors refer to the same Chinese-foreign-language participle group.
The multilingual plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database. The similarity is calculated by a formula (the formula is not reproduced in this text).
According to a specific embodiment of the present invention, to avoid an excessively large dimension after extension, the vectors can instead be extended only to the union of the IDs they contain. All participle IDs in the participle feature vector WVE_RWV_TBI are taken as one set and the participle IDs in WVE_RWV as another set; or all participle group IDs in the participle group feature vector WVE_RWGV_TBI are taken as one set and those in WVE_RWGV as another set; or all Chinese-foreign-language participle group IDs in the feature vector WVE_RWFGV_TBI are taken as one set and those in WVE_RWFGV as another set. The union of the two sets gives the total ID set. The feature vectors of the document to be identified and of the material in the comparison database are then extended according to the total ID set: the IDs corresponding to all characteristic values are arranged in ascending or descending order of their numbers in the participle library, and each vector has inserted into it the W_ID_i values that are in the total participle ID set but not in its own original set, the characteristic value of each inserted W_ID_i being 0; or the WG_ID_i values that are in the total participle group ID set but not in its own original set, the characteristic value of each inserted WG_ID_i being 0; or the WFG_ID_i values that are in the total Chinese-foreign-language participle group ID set but not in its own original set, the characteristic value of each inserted WFG_ID_i being 0.
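The union-based alignment can be sketched as follows. The list-of-pairs representation and the name `align_on_union` are illustrative assumptions; the same code applies unchanged to participle, participle group, or Chinese-foreign-language participle group IDs.

```python
def align_on_union(vec_a, vec_b):
    """Align two sparse [(id, count), ...] vectors on the union of their IDs.

    Each vector gains an entry with characteristic value 0 for every ID that
    appears only in the other vector; both results are sorted by ascending ID.
    """
    counts_a, counts_b = dict(vec_a), dict(vec_b)
    all_ids = sorted(set(counts_a) | set(counts_b))
    ext_a = [(i, counts_a.get(i, 0)) for i in all_ids]
    ext_b = [(i, counts_b.get(i, 0)) for i in all_ids]
    return ext_a, ext_b

doc_ext, mat_ext = align_on_union([(4, 2), (1, 3)], [(2, 1), (4, 5)])
print(doc_ext)
print(mat_ext)
```

The aligned dimension is at most the sum of the two original dimensions, which is typically far smaller than the full library size W.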
According to the user's access mode, materials from different sub-libraries of the comparison database are provided for similarity comparison. The comparison is performed by traversal: the feature vectors of all materials within the selected scope are extracted and compared for similarity with the document to be identified. Each calculated similarity value is compared with a predetermined threshold; when the similarity value is higher than the predetermined threshold, the corresponding material is recorded as a suspected material for later use.
After the document to be identified has been compared with all materials, all suspected materials are extracted, and the document to be identified is compared further with the suspected materials.
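The traversal and threshold step can be sketched as below. The similarity function is passed in as a parameter because the patent's formula is not available here; `shared_fraction` is a hypothetical stand-in, and the material names and threshold value are invented for the example.

```python
def select_suspected(doc_vector, materials, similarity, threshold):
    """Traverse all materials and keep those whose similarity exceeds threshold.

    materials: dict mapping a material identifier to its aligned count vector.
    similarity: any function of two equal-length vectors returning a number.
    """
    suspected = []
    for material_id, material_vector in materials.items():
        score = similarity(doc_vector, material_vector)
        if score > threshold:  # "higher than the predetermined threshold"
            suspected.append((material_id, score))
    return suspected

def shared_fraction(a, b):
    """Hypothetical stand-in similarity: share of non-zero positions in common."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

materials = {"m1": [1, 0, 2, 0], "m2": [0, 3, 0, 0], "m3": [2, 0, 1, 1]}
result = select_suspected([1, 0, 1, 0], materials, shared_fraction, threshold=0.5)
print([material_id for material_id, _ in result])
```

Only the shortlisted materials go on to the more expensive full-text comparison.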
According to a preferred embodiment of the present invention, all materials in the proverb and common saying library, the famous quotations library, and the poetry library can be selected as suspected materials.
According to a preferred embodiment of the present invention, materials whose participle free vector dimension WFV is smaller than the participle simplified vector dimension RWV can be selected as suspected materials.
According to a preferred embodiment of the present invention, materials whose participle group free vector dimension WGFV is smaller than the participle group simplified vector dimension RWGV can be selected as suspected materials.
According to a preferred embodiment of the present invention, materials whose Chinese-foreign-language participle group free vector dimension WFGFV is smaller than the Chinese-foreign-language participle group simplified vector dimension RWFGV can be selected as suspected materials.
According to a preferred embodiment of the present invention, suspected materials can further be selected by means of the participle tightness coefficient.
According to a specific embodiment of the present invention, in the common plagiarism identification mode, suspected materials can be screened according to the participle tightness coefficient of the document to be identified and the participle tightness coefficient of the material. The tightness coefficient statistics module for the document to be identified extracts high-density participles and their positions according to the participle tightness coefficient feature vector of each participle in the document to be identified, WGCVE_TBI = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)]. According to the part of speech W_CHAR in the participle tightness coefficient feature vector, the module selects the participles whose part of speech is a notional word and computes the total interval participle count over a predetermined number of adjacent occurrences of the participle (the formula is not reproduced in this text), where n is the predetermined adjacent quantity. When this total is less than the predetermined tightness threshold TH_G, the ID of the participle and the corresponding position are recorded.
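The density test can be illustrated with a running-sum sketch. Because the formula is missing from this text, the example assumes that each G_W_ID_i is the number of participles between the i-th and (i+1)-th occurrence of the participle, and that the total is the sum of n consecutive such intervals; both readings, and the name `dense_runs`, are assumptions.

```python
def dense_runs(gaps, n, threshold):
    """Return start indices k where gaps[k] + ... + gaps[k+n-1] < threshold.

    gaps: interval counts G_1..G_(W_N-1) between consecutive occurrences.
    n: predetermined adjacent quantity (assumed to count consecutive gaps).
    threshold: tightness threshold TH_G.
    """
    hits = []
    if n <= 0 or len(gaps) < n:
        return hits
    window = sum(gaps[:n])
    if window < threshold:
        hits.append(0)
    for k in range(1, len(gaps) - n + 1):
        # Slide the window: add the gap entering it, drop the gap leaving it.
        window += gaps[k + n - 1] - gaps[k - 1]
        if window < threshold:
            hits.append(k)
    return hits

# Occurrences 2 to 5 of the participle sit close together in the document.
print(dense_runs([40, 2, 3, 1, 50, 4], n=3, threshold=10))
```

A hit at index k marks a stretch where the notional word recurs with few intervening participles, which is the situation the text flags for closer comparison.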
According to a specific embodiment of the present invention, in the extended plagiarism identification mode, suspected materials can be screened according to the participle group tightness coefficient of the document to be identified and the participle group tightness coefficient of the material. The tightness coefficient statistics module for the document to be identified extracts high-density participle groups and their positions according to the participle group tightness coefficient feature vector of each participle group in the document to be identified, WGGCVE_TBI = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)]. According to the part of speech WG_CHAR in the participle group tightness coefficient feature vector, the module selects the participle groups whose part of speech is a notional word and computes the total interval participle count over a predetermined number of adjacent occurrences of the participle group (the formula is not reproduced in this text), where n is the predetermined adjacent quantity. When this total is less than the predetermined tightness threshold TH_G, the ID of the participle group and the corresponding position are recorded.
According to a specific embodiment of the present invention, in the multilingual plagiarism identification mode, suspected materials can be screened according to the Chinese-foreign-language participle group tightness coefficient of the document to be identified and that of the material. The tightness coefficient statistics module for the document to be identified extracts high-density participle groups and their positions according to the tightness coefficient feature vector of each Chinese-foreign-language participle group in the document to be identified, WFGGCVE_TBI = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)]. According to the part of speech WFG_CHAR in this feature vector, the module selects the participle groups whose part of speech is a notional word and computes the total interval participle count over a predetermined number of adjacent occurrences of the participle group (the formula is not reproduced in this text), where n is the predetermined adjacent quantity. When this total is less than the predetermined tightness threshold TH_G, the ID of the Chinese-foreign-language participle group and the corresponding position are recorded.
The predetermined adjacent quantity n and the tightness threshold TH_G are preset by the system and can be adjusted as actually needed. When the total interval participle count over the predetermined number of adjacent occurrences is less than TH_G, the notional word can be considered to occur densely at the corresponding position; a particular viewpoint may be elaborated there in a concentrated way, so the position deserves particular attention.
In the common plagiarism identification mode, the tightness coefficient suspected material extraction module takes the participle IDs that were recorded when the total interval participle count over the predetermined number of adjacent occurrences was less than TH_G, and extracts all materials in the comparison database that contain those participle IDs. For each such material, it takes the participle tightness coefficient feature vector corresponding to the participle ID, WGCVE = [W_ID, W_N, W_CHAR, G_W_ID_1, G_W_ID_2, ..., G_W_ID_i, ..., G_W_ID_(W_N-1)], and computes the total interval participle count over the predetermined number of adjacent occurrences (the formula is not reproduced in this text), where n is the predetermined adjacent quantity. When this total is less than TH_G, the material is selected as a suspected material. There may be one or more such participle IDs, and one or more materials may be extracted for them.
In the extended plagiarism identification mode, the tightness coefficient suspected material extraction module takes the participle group IDs that were recorded when the total interval participle count over the predetermined number of adjacent occurrences of a participle group was less than TH_G, and extracts all materials in the comparison database that contain those participle group IDs. For each such material, it takes the participle group tightness coefficient feature vector corresponding to the participle group ID, WGGCVE = [WG_ID, WG_N, WG_CHAR, G_WG_ID_1, G_WG_ID_2, ..., G_WG_ID_i, ..., G_WG_ID_(WG_N-1)], and computes the total interval participle count over the predetermined number of adjacent occurrences of the participle group (the formula is not reproduced in this text), where n is the predetermined adjacent quantity. When this total is less than TH_G, the material is selected as a suspected material. There may be one or more such participle group IDs, and one or more materials may be extracted for them.
In the multilingual plagiarism identification mode, the tightness coefficient suspected material extraction module takes the Chinese-foreign-language participle group IDs that were recorded when the total interval participle count over the predetermined number of adjacent occurrences of a Chinese-foreign-language participle group was less than TH_G, and extracts all materials in the comparison database that contain those IDs. For each such material, it takes the tightness coefficient feature vector corresponding to the Chinese-foreign-language participle group ID, WFGGCVE = [WFG_ID, WFG_N, WFG_CHAR, G_WFG_ID_1, G_WFG_ID_2, ..., G_WFG_ID_i, ..., G_WFG_ID_(WFG_N-1)], and computes the total interval participle count over the predetermined number of adjacent occurrences of the Chinese-foreign-language participle group (the formula is not reproduced in this text), where n is the predetermined adjacent quantity. When this total is less than TH_G, the material is selected as a suspected material. There may be one or more such Chinese-foreign-language participle group IDs, and one or more materials may be extracted for them.
By this extraction method, notional word participles that do not occur many times in total in the document to be identified, but that are concentrated at certain positions, can be extracted together with their positions for further comparison.
According to a specific embodiment of the present invention, in the formula plagiarism identification mode, the formula extraction module extracts the formulas in the document to be identified. The formula decomposition module extracts, for each formula, its independent variable parameters and dependent variable parameters, its operators, and the specific meaning, dimension, and value range of each parameter. The formula comparison module compares these items, one by one, with the independent variable parameters, dependent variable parameters, operators, and the specific meaning, dimension, and value range of each parameter of the formulas stored in the formula library. When the degree of coincidence between the independent variable parameters, dependent variable parameters, operators, dimensions, and value ranges of a formula in the document to be identified and those of a formula stored in the formula library exceeds the formula comparison threshold TH_MATH, the material associated in the formula library with the formula currently being compared is taken as a suspected material. The degree of coincidence is the ratio of the number of independent variable parameters, dependent variable parameters, operators, and dimensions of the formula in the document to be identified that are identical to those of the formula in the formula library, to the total number of independent variable parameters, dependent variable parameters, operators, and dimensions of the current formula in the document to be identified.
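The degree of coincidence can be sketched as a ratio of matching items to total items. The dictionary layout, the category keys, and matching items by their text are simplifying assumptions made for illustration; the patent compares the meaning of each parameter, which plain string matching does not capture.

```python
from collections import Counter

CATEGORIES = ("independent", "dependent", "operators", "dimensions")

def coincidence(candidate, stored):
    """Share of the candidate formula's items that also occur in the stored one.

    candidate, stored: dicts mapping each category name to a list of items.
    Assumption: two items are "identical" when their strings are equal.
    """
    total = sum(len(candidate.get(c, [])) for c in CATEGORIES)
    if total == 0:
        return 0.0
    same = 0
    for c in CATEGORIES:
        # Multiset intersection counts each repeated item at most as often
        # as it appears in both formulas.
        common = Counter(candidate.get(c, [])) & Counter(stored.get(c, []))
        same += sum(common.values())
    return same / total

candidate = {"independent": ["m", "a"], "dependent": ["F"],
             "operators": ["=", "*"], "dimensions": ["kg", "m/s^2", "N"]}
stored = {"independent": ["m", "v"], "dependent": ["p"],
          "operators": ["=", "*"], "dimensions": ["kg", "m/s", "kg*m/s"]}
print(coincidence(candidate, stored))
```

The result would then be compared against TH_MATH to decide whether the associated material becomes a suspected material.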
According to a specific embodiment of the present invention, a sliding window can be used to compare the full text of the document to be identified with a suspected material. The size of the sliding window can be configured by the system and directly affects the comparison result: a window that is too small easily causes false positives, and a window that is too large easily causes missed detections. The sliding step of the sliding window is also preset by the system. As shown in Fig. 2:
Step S0: Start.
Step S1: The sliding window setup module initializes the similar window counter CT1 = 0 and the sliding step counter CT2 = 0.
Step S2: The sliding window setup module places the sliding windows of the document to be identified and of the suspected material at their respective document start positions.
Step S3: The sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspected material and counts the identical notional word participles in them.
Step S4: The sliding window comparison module judges whether the number of identical notional word participles is greater than or equal to the threshold TH_W. If so, the counter is incremented, that is, CT1 = CT1 + 1, and the current positions of the sliding windows of the document to be identified and of the suspected material, together with the contents of the sliding windows, are recorded.
Step S5: The sliding window setup module slides the sliding window of the suspected material by one sliding step.
Step S6: The sliding window setup module judges whether the sliding window of the suspected material is at the document end position. If not, return to step S3; if so, go to step S11.
Step S11: The sliding window setup module judges whether the sliding window of the document to be identified is at the document end position. If not, go to step S12; if so, go to step S13.
Step S12: The sliding window setup module returns the sliding window of the suspected material to the document start position; the sliding window of the document to be identified slides by one sliding step, CT2 = CT2 + 1, and the process goes to step S3.
Step S13: The sliding window comparison module calculates the ratio M of the value of the similar window counter CT1 to the value of the sliding step counter CT2.
Step S14: The sliding window comparison module judges whether the ratio M is greater than or equal to the predetermined threshold TH_M. When M >= TH_M, the document to be identified and the suspected material are considered similar; when M < TH_M, they are considered dissimilar.
Step S15: The sliding window comparison module judges whether any suspected material remains to be compared. If so, return to step S1; if not, go to step S16.
Step S16: The comparison report generation module generates and outputs the comparison report. For the document to be identified and every similar suspected material, the report contains the value of the similar window counter CT1, the value of the sliding step counter CT2, the ratio of the two, and the specific locations and specific content of the similar portions of the document to be identified and the similar suspected material.
Step S17: The comparison ends.
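The loop described in steps S1 to S14 can be sketched for a single suspected material as follows. This is a minimal sketch, not the patent's implementation: inputs are assumed to be lists of notional-word participle IDs, the default `th_m` is an invented value, matching ignores order within a window, and the guard for a document that fits in a single window (where CT2 stays 0) is an assumption the text does not address.

```python
from collections import Counter

def windows(tokens, size, step):
    """Yield (start, window) pairs from the document start towards its end."""
    last_start = max(len(tokens) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, tokens[start:start + size]

def compare(doc, material, size=4, step=1, th_w=3, th_m=0.5):
    """Return (M, similar, matches) for one document and one suspected material."""
    ct1 = 0       # similar window counter CT1
    ct2 = 0       # sliding step counter CT2
    matches = []  # recorded (document position, material position) pairs
    for index, (d_start, d_win) in enumerate(windows(doc, size, step)):
        if index > 0:
            ct2 += 1  # S12: the document window has advanced one step
        for m_start, m_win in windows(material, size, step):  # S3 to S6
            same = sum((Counter(d_win) & Counter(m_win)).values())
            if same >= th_w:  # S4
                ct1 += 1
                matches.append((d_start, m_start))
    # S13: M = CT1 / CT2; the fallback for CT2 == 0 is an assumption.
    ratio = ct1 / ct2 if ct2 else float(ct1 > 0)
    return ratio, ratio >= th_m, matches  # S14

ratio, similar, matches = compare([1, 2, 3, 4, 5, 6], [9, 1, 2, 3, 4, 8])
print(ratio, similar, matches)
```

Because CT1 counts every matching pair of window positions while CT2 counts only the steps taken by the document window, M as described can exceed 1 when one document window matches several material windows.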
According to a specific embodiment of the present invention, in step S3 the sliding window comparison module compares the sliding window of the document to be identified with the sliding window of the suspected material and counts the identical notional word participles in them. In the common plagiarism identification mode, identical notional word participles are notional word participles that have the same ID in the participle library. In the extended plagiarism identification mode, they are notional word participle groups that have the same ID in the participle library. In the multilingual plagiarism identification mode, they are notional word Chinese-foreign-language participle groups that have the same ID in the participle library.
According to a specific embodiment of the present invention, in step S16 the comparison report generation module outputs the comparison report, and the content of the report differs with the identification mode. In the common plagiarism identification mode, the comparison report contains the specific locations and specific content of the similar portions of the document to be identified and each similar suspected material. This mode finds cases in which the document to be identified uses the same form of expression and exactly the same wording as the similar portion of the suspected material, possibly with only the order of individual words adjusted. If the document to be identified rewrites the document it plagiarizes, and the degree of rewriting is large, the common plagiarism identification mode may fail to find the plagiarized document. In the extended plagiarism identification mode, the comparison report likewise contains the specific locations and specific content of the similar portions; if the document to be identified rewrites the plagiarized document with synonyms or near-synonyms and changes the document structure only slightly, the extended plagiarism identification mode may still find the plagiarized document. In the multilingual plagiarism identification mode, the comparison report likewise contains the specific locations and specific content of the similar portions; if the document to be identified is a translated rewrite of the plagiarized document and the document structure is changed only slightly, the multilingual plagiarism identification mode may still find the plagiarized document.
According to a specific embodiment of the present invention, the sliding window being at the document start position means that the leftmost edge of the sliding window coincides with the start of the document; the sliding window being at the document end position means that the rightmost edge of the sliding window coincides with the end of the document.
According to prior system operation tests, a sliding window size of four notional word participles is suitable, although other sizes can be selected as needed. During comparison, the sliding window slides by a step of one notional word participle each time. When three or more notional word participles in the sliding windows are identical (the order of the notional word participles is not considered), the current positions and contents of the sliding windows in the document to be identified and in the suspected material are recorded.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that result in equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.
Claims (10)
1. A distributed text detection system, characterized by comprising:
a comparison database containing the materials that serve as comparison objects; the comparison database is stored at different site locations in a distributed manner; when the comparison database is accessed, a specific site is selected for access according to the load of the different sites;
the comparison database further comprises the following sub-libraries: a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library, and a poetry library;
a word segmentation library containing segmented words and their corresponding parts of speech; each segmented word in the word segmentation library is given a unique number, and W_ID denotes the unique number of a given segmented word in the word segmentation library;
a word segmentation module for segmenting each material and saving the word segmentation result into the comparison database; the word segmentation module compares the word segmentation result with the parts of speech stored in the word segmentation library to determine the part of speech of each segmented word;
a segmented-word feature value generation module, which counts the number of times each segmented word occurs in the corresponding material and generates, for each segmented word, the part-of-speech feature value WCCV = [W_ID, W_N, W_CHAR] and the feature value WCV = [W_ID, W_N], where W_ID denotes the unique number of the segmented word in the word segmentation library, W_N denotes the total number of times the segmented word occurs in the material, and W_CHAR denotes the part of speech of the segmented word;
a free segmented-word vector dimension determining module, which determines the free segmented-word vector dimension WFV according to the word segmentation result of the material; WFV equals the number of distinct segmented words obtained after segmenting the specific material;
a reduced segmented-word vector dimension generation module for reducing the free segmented-word vector dimension WFV of each material to generate the reduced segmented-word vector dimension RWV;
a segmented-word feature vector generation module for extracting from each material, according to the reduced dimension RWV, the feature values corresponding to RWV and generating the segmented-word feature vector WVE_RWV:
$WVE\_RWV = [W\_ID_1, W\_N_1, \ldots, W\_ID_i, W\_N_i, \ldots, W\_ID_{RWV}, W\_N_{RWV}]$
where $W\_ID_i$ denotes the unique number of the segmented word in the word segmentation library and $W\_N_i$ denotes the total number of times that segmented word occurs in the material; this count is used as the feature value of the segmented word;
a user access mode detection module for prompting the user to upload the document to be identified;
a user detection mode determining module for determining the current user detection mode; when the mode is the ordinary plagiarism identification mode, a word segmentation module for the document to be identified segments the document to be identified to obtain a word segmentation result;
a free segmented-word vector dimension determining module for the document to be identified, which determines the free segmented-word vector dimension WFV_TBI according to the word segmentation result of the document to be identified;
a reduced segmented-word vector dimension generation module for the document to be identified, which reduces the free segmented-word vector dimension WFV_TBI of the document to be identified to generate the reduced segmented-word vector dimension RWV_TBI;
a segmented-word feature vector generation module for the document to be identified, which extracts from each document to be identified, according to the reduced dimension RWV_TBI, the feature values corresponding to RWV_TBI and generates the segmented-word feature vector WVE_RWV_TBI of the document to be identified, wherein
$WVE\_RWV\_TBI = [W\_ID_1, W\_N_1, \ldots, W\_ID_i, W\_N_i, \ldots, W\_ID_{RWV\_TBI}, W\_N_{RWV\_TBI}]$
where $W\_ID_i$ denotes the unique number of the segmented word in the word segmentation library and $W\_N_i$ denotes the total number of times that segmented word occurs in the document to be identified; this count is used as the feature value of the segmented word;
when the user detection mode determining module determines that the current user detection mode is the ordinary plagiarism identification mode and a similarity comparison is carried out, the segmented-word feature vector generation module for the document to be identified generates the segmented-word feature vector WVE_RWV_TBI of the document to be identified:
$WVE\_RWV\_TBI = [W\_ID_1, W\_N_1, \ldots, W\_ID_i, W\_N_i, \ldots, W\_ID_{RWV\_TBI}, W\_N_{RWV\_TBI}]$
the dimension of the segmented-word feature vector of the document to be identified is RWV_TBI; the segmented-word feature vector generation module generates the segmented-word feature vector WVE_RWV of a material in the comparison database:
$WVE\_RWV = [W\_ID_1, W\_N_1, \ldots, W\_ID_i, W\_N_i, \ldots, W\_ID_{RWV}, W\_N_{RWV}]$
wherein the segmented-word feature vector dimension RWV_TBI of the document to be identified is equal to the segmented-word feature vector dimension RWV of the material;
a feature vector adjustment module for the document to be identified, for arranging the $W\_ID_i$ values corresponding to all feature values in the segmented-word feature vector WVE_RWV_TBI in ascending or descending order according to the numbering in the word segmentation library and inserting the $W\_ID_i$ values that are missing; the feature value corresponding to each inserted number $W\_ID_i$ is 0; this yields the expanded segmented-word feature vector of the document to be identified:
$WVE\_RWV\_TBI\_EXT = [W\_ID_{TBI\_EXT\_1}, W\_N_{TBI\_EXT\_1}, \ldots, W\_ID_{TBI\_EXT\_i}, W\_N_{TBI\_EXT\_i}, \ldots, W\_ID_{TBI\_EXT\_RWV\_TBI}, W\_N_{TBI\_EXT\_RWV\_TBI}, \ldots, W\_ID_W, W\_N_W]$
a material feature vector adjustment module, for arranging the $W\_ID_i$ values corresponding to all feature values in the segmented-word feature vector WVE_RWV in ascending or descending order according to the numbering in the word segmentation library and inserting the $W\_ID_i$ values that are missing; the feature value corresponding to each inserted number $W\_ID_i$ is 0; this yields the expanded segmented-word feature vector of the material:
$WVE\_RWV\_EXT = [W\_ID_{EXT\_1}, W\_N_{EXT\_1}, \ldots, W\_ID_{EXT\_i}, W\_N_{EXT\_i}, \ldots, W\_ID_{EXT\_RWV}, W\_N_{EXT\_RWV}, \ldots, W\_ID_W, W\_N_W]$
an ordinary plagiarism identification similarity calculation module, which calculates the similarity between the document to be identified and any material in the comparison database; the similarity is calculated by the following formula:
[formula not reproduced in the source text]
after the document to be identified has been compared with all materials, all suspected materials are extracted, and the document to be identified is further compared with the suspected materials.
2. The distributed text detection system according to claim 1, wherein the parts of speech stored in the word segmentation library are noun, verb, adjective, numeral, quantifier, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, and onomatopoeia.
3. The distributed text detection system according to claim 1 or 2, wherein the document to be identified and the suspected material are compared in full text.
4. The distributed text detection system according to claim 3, wherein the reduced segmented-word vector dimension generation module reduces the free segmented-word vector dimension WFV by a part-of-speech screening method, the reduction proceeding as follows:
the feature values of the word segmentation result are classified according to the part of speech of the corresponding segmented word into A1-class notional-word feature values, A2-class notional-word feature values, B-class notional-word feature values, C-class notional-word feature values, D-class notional-word feature values, and V-class function-word feature values;
the number of feature values in each class is counted: AMOUNT_A1, AMOUNT_A2, AMOUNT_B, AMOUNT_C, and AMOUNT_D are the numbers of A1-, A2-, B-, C-, and D-class notional-word feature values respectively, and AMOUNT_V is the number of V-class function-word feature values;
the value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated; if it is greater than 0, this reduction is exited; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_D = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is further calculated;
if RWV_S_D is greater than 0, RWV_S_D feature values are extracted at random from the feature values counted by AMOUNT_V, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_C = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is further calculated;
if RWV_S_C is greater than 0, RWV_S_C feature values are extracted at random from the feature values counted by AMOUNT_D, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_B = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is further calculated;
if RWV_S_B is greater than 0, RWV_S_B feature values are extracted at random from the feature values counted by AMOUNT_C, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_A2 = RWV - (AMOUNT_A1 + AMOUNT_A2) is further calculated;
if RWV_S_A2 is greater than 0, RWV_S_A2 feature values are extracted at random from the feature values counted by AMOUNT_B, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_A1 = RWV - AMOUNT_A1 is further calculated;
if RWV_S_A1 is greater than 0, RWV_S_A1 feature values are extracted at random from the feature values counted by AMOUNT_A2, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, RWV feature values are extracted at random from the feature values counted by AMOUNT_A1, and this reduction is complete.
5. The distributed text detection system according to claim 4, wherein, when the calculated value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0, the corresponding material is taken as a suspected material.
6. A distributed text detection method, characterized by comprising:
a comparison database contains the materials that serve as comparison objects; the comparison database is stored at different site locations in a distributed manner; when the comparison database is accessed, a specific site is selected for access according to the load of the different sites; the comparison database further comprises the following sub-libraries: a book library, a paper library, a patent library, a formula library, a proverb and common-saying library, a proverb library, a famous-quotation library, and a poetry library;
a word segmentation library contains segmented words and their corresponding parts of speech; each segmented word in the word segmentation library is given a unique number, and W_ID denotes the unique number of a given segmented word in the word segmentation library;
a word segmentation module segments each material and saves the word segmentation result into the comparison database; the word segmentation module compares the word segmentation result with the parts of speech stored in the word segmentation library to determine the part of speech of each segmented word;
a segmented-word feature value generation module counts the number of times each segmented word occurs in the corresponding material and generates, for each segmented word, the part-of-speech feature value WCCV = [W_ID, W_N, W_CHAR] and the feature value WCV = [W_ID, W_N], where W_ID denotes the unique number of the segmented word in the word segmentation library, W_N denotes the total number of times the segmented word occurs in the material, and W_CHAR denotes the part of speech of the segmented word;
a free segmented-word vector dimension determining module determines the free segmented-word vector dimension WFV according to the word segmentation result of the material; WFV equals the number of distinct segmented words obtained after segmenting the specific material;
a reduced segmented-word vector dimension generation module reduces the free segmented-word vector dimension WFV of each material to generate the reduced segmented-word vector dimension RWV;
a segmented-word feature vector generation module extracts from each material, according to the reduced dimension RWV, the feature values corresponding to RWV and generates the segmented-word feature vector WVE_RWV:
$WVE\_RWV = [W\_ID_1, W\_N_1, \ldots, W\_ID_i, W\_N_i, \ldots, W\_ID_{RWV}, W\_N_{RWV}]$
where $W\_ID_i$ denotes the unique number of the segmented word in the word segmentation library and $W\_N_i$ denotes the total number of times that segmented word occurs in the material; this count is used as the feature value of the segmented word;
a user access mode detection module prompts the user to upload the document to be identified;
when a user detection mode determining module determines that the current user detection mode is the ordinary plagiarism identification mode, a word segmentation module for the document to be identified segments the document to be identified to obtain a word segmentation result;
a free segmented-word vector dimension determining module for the document to be identified determines the free segmented-word vector dimension WFV_TBI according to the word segmentation result of the document to be identified;
a reduced segmented-word vector dimension generation module for the document to be identified reduces the free segmented-word vector dimension WFV_TBI of the document to be identified to generate the reduced segmented-word vector dimension RWV_TBI;
a segmented-word feature vector generation module for the document to be identified extracts from each document to be identified, according to the reduced dimension RWV_TBI, the feature values corresponding to RWV_TBI and generates the segmented-word feature vector WVE_RWV_TBI of the document to be identified, wherein
$WVE\_RWV\_TBI = [W\_ID_1, W\_N_1, \ldots, W\_ID_i, W\_N_i, \ldots, W\_ID_{RWV\_TBI}, W\_N_{RWV\_TBI}]$
where $W\_ID_i$ denotes the unique number of the segmented word in the word segmentation library and $W\_N_i$ denotes the total number of times that segmented word occurs in the document to be identified; this count is used as the feature value of the segmented word;
when the user detection mode determining module determines that the current user detection mode is the ordinary plagiarism identification mode and a similarity comparison is carried out, the segmented-word feature vector generation module for the document to be identified generates the segmented-word feature vector WVE_RWV_TBI of the document to be identified:
$WVE\_RWV\_TBI = [W\_ID_1, W\_N_1, \ldots, W\_ID_i, W\_N_i, \ldots, W\_ID_{RWV\_TBI}, W\_N_{RWV\_TBI}]$
the dimension of the segmented-word feature vector of the document to be identified is RWV_TBI; the segmented-word feature vector generation module generates the segmented-word feature vector WVE_RWV of a material in the comparison database:
$WVE\_RWV = [W\_ID_1, W\_N_1, \ldots, W\_ID_i, W\_N_i, \ldots, W\_ID_{RWV}, W\_N_{RWV}]$
wherein the segmented-word feature vector dimension RWV_TBI of the document to be identified is equal to the segmented-word feature vector dimension RWV of the material;
a feature vector adjustment module for the document to be identified arranges the $W\_ID_i$ values corresponding to all feature values in the segmented-word feature vector WVE_RWV_TBI in ascending or descending order according to the numbering in the word segmentation library and inserts the $W\_ID_i$ values that are missing; the feature value corresponding to each inserted number $W\_ID_i$ is 0; this yields the expanded segmented-word feature vector of the document to be identified:
$WVE\_RWV\_TBI\_EXT = [W\_ID_{TBI\_EXT\_1}, W\_N_{TBI\_EXT\_1}, \ldots, W\_ID_{TBI\_EXT\_i}, W\_N_{TBI\_EXT\_i}, \ldots, W\_ID_{TBI\_EXT\_RWV\_TBI}, W\_N_{TBI\_EXT\_RWV\_TBI}, \ldots, W\_ID_W, W\_N_W]$
a material feature vector adjustment module arranges the $W\_ID_i$ values corresponding to all feature values in the segmented-word feature vector WVE_RWV in ascending or descending order according to the numbering in the word segmentation library and inserts the $W\_ID_i$ values that are missing; the feature value corresponding to each inserted number $W\_ID_i$ is 0; this yields the expanded segmented-word feature vector of the material:
$WVE\_RWV\_EXT = [W\_ID_{EXT\_1}, W\_N_{EXT\_1}, \ldots, W\_ID_{EXT\_i}, W\_N_{EXT\_i}, \ldots, W\_ID_{EXT\_RWV}, W\_N_{EXT\_RWV}, \ldots, W\_ID_W, W\_N_W]$
an ordinary plagiarism identification similarity calculation module calculates the similarity between the document to be identified and any material in the comparison database; the similarity is calculated by the following formula:
[formula not reproduced in the source text]
after the document to be identified has been compared with all materials, all suspected materials are extracted, and the document to be identified is further compared with the suspected materials.
7. The distributed text detection method according to claim 6, wherein the parts of speech stored in the word segmentation library are noun, verb, adjective, numeral, quantifier, pronoun, adverb, preposition, conjunction, auxiliary word, interjection, and onomatopoeia.
8. The distributed text detection method according to claim 6 or 7, wherein the document to be identified and the suspected material are compared in full text.
9. The distributed text detection method according to claim 8, wherein the reduced segmented-word vector dimension generation module reduces the free segmented-word vector dimension WFV by a part-of-speech screening method, the reduction proceeding as follows:
the feature values of the word segmentation result are classified according to the part of speech of the corresponding segmented word into A1-class notional-word feature values, A2-class notional-word feature values, B-class notional-word feature values, C-class notional-word feature values, D-class notional-word feature values, and V-class function-word feature values;
the number of feature values in each class is counted: AMOUNT_A1, AMOUNT_A2, AMOUNT_B, AMOUNT_C, and AMOUNT_D are the numbers of A1-, A2-, B-, C-, and D-class notional-word feature values respectively, and AMOUNT_V is the number of V-class function-word feature values;
the value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is calculated; if it is greater than 0, this reduction is exited; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_D = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D) is further calculated;
if RWV_S_D is greater than 0, RWV_S_D feature values are extracted at random from the feature values counted by AMOUNT_V, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_C = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C) is further calculated;
if RWV_S_C is greater than 0, RWV_S_C feature values are extracted at random from the feature values counted by AMOUNT_D, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_B = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B) is further calculated;
if RWV_S_B is greater than 0, RWV_S_B feature values are extracted at random from the feature values counted by AMOUNT_C, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_A2 = RWV - (AMOUNT_A1 + AMOUNT_A2) is further calculated;
if RWV_S_A2 is greater than 0, RWV_S_A2 feature values are extracted at random from the feature values counted by AMOUNT_B, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, the value RWV_S_A1 = RWV - AMOUNT_A1 is further calculated;
if RWV_S_A1 is greater than 0, RWV_S_A1 feature values are extracted at random from the feature values counted by AMOUNT_A2, and this reduction is complete; if it is equal to 0, this reduction is complete; if it is less than 0, RWV feature values are extracted at random from the feature values counted by AMOUNT_A1, and this reduction is complete.
10. The distributed text detection method according to claim 9, wherein, when the calculated value RWV_S_V = RWV - (AMOUNT_A1 + AMOUNT_A2 + AMOUNT_B + AMOUNT_C + AMOUNT_D + AMOUNT_V) is greater than 0, the corresponding material is taken as a suspected material.
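Two steps recited in the claims, the feature-vector expansion of claims 1 and 6 and the part-of-speech screening of claims 4 and 9, can be sketched as follows. This is an illustrative reading, not the patented implementation. The function names `expand_vector` and `reduce_features` are hypothetical, as is the representation of a feature value as a `(W_ID, W_N)` tuple. The sketch assumes that library numbers run contiguously from 1 to W, and that feature values arrive already sorted into the classes A1, A2, B, C, D, and V, since the mapping from parts of speech to classes is not given in this section. The cascade of differences in the claims is read here as "take whole classes in priority order while they fit, then sample the remainder at random from the next class".

```python
import random

# Priority order taken from the cascade in claims 4 and 9.
PRIORITY = ["A1", "A2", "B", "C", "D", "V"]


def expand_vector(features, library_size):
    """Sort by W_ID and insert every missing W_ID with a count of 0.

    Assumes W_ID values run contiguously from 1 to library_size.
    """
    counts = dict(features)
    return [(w_id, counts.get(w_id, 0)) for w_id in range(1, library_size + 1)]


def reduce_features(features_by_class, rwv, rng=None):
    """Select exactly rwv feature values, preferring higher-priority classes.

    Returns None when fewer than rwv feature values exist in total, which
    corresponds to RWV_S_V > 0 (the reduction is exited).
    """
    rng = rng or random.Random()
    total = sum(len(features_by_class.get(cls, [])) for cls in PRIORITY)
    if total < rwv:
        return None
    selected = []
    for cls in PRIORITY:
        remaining = rwv - len(selected)
        if remaining == 0:
            break
        pool = features_by_class.get(cls, [])
        if len(pool) <= remaining:
            selected.extend(pool)  # the whole class fits
        else:
            selected.extend(rng.sample(pool, remaining))  # random remainder
    return selected


# Hypothetical data, for illustration only.
classes = {"A1": [(1, 5), (2, 3)], "A2": [(3, 2)],
           "B": [(4, 1), (5, 1), (6, 1)], "V": [(7, 9)]}
reduced = reduce_features(classes, rwv=4, rng=random.Random(0))
print(reduced)
print(expand_vector(reduced, library_size=7))
```

Expanding both vectors over the same numbering puts each W_ID at the same index in both, which is what allows the two vectors to be compared position by position.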
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610020566.XA CN105550172B (en) | 2016-01-13 | 2016-01-13 | A kind of distributed text detection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610020566.XA CN105550172B (en) | 2016-01-13 | 2016-01-13 | A kind of distributed text detection method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105550172A CN105550172A (en) | 2016-05-04 |
CN105550172B true CN105550172B (en) | 2018-06-01 |
Family
ID=55829361
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610020566.XA Active CN105550172B (en) | 2016-01-13 | 2016-01-13 | A kind of distributed text detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105550172B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334325A (en) * | 2019-07-16 | 2019-10-15 | 同方知网数字出版技术股份有限公司 | A full-text similarity analysis method for joint comparison of publishers' off-site resources |
CN111159337A (en) * | 2019-12-20 | 2020-05-15 | 中国建设银行股份有限公司 | Chemical expression extraction method, device and equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103226546A (en) * | 2013-04-15 | 2013-07-31 | 北京邮电大学 | Suffix tree clustering method on basis of word segmentation and part-of-speech analysis |
CN104899304A (en) * | 2015-06-12 | 2015-09-09 | 北京京东尚科信息技术有限公司 | Named entity identification method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014026455A (en) * | 2012-07-26 | 2014-02-06 | Nippon Telegr & Teleph Corp <Ntt> | Media data analysis device, method and program |
Non-Patent Citations (2)
Title |
---|
云环境中的近似复制文本检测;许君 等;《计算机研究与发展》;20121231;第329-335页 * |
基于左归词频向量空间模型的中文文本抄袭检测算法;谢松山 等;《西南大学学报(自然科学版)》;20150531;第37卷(第5期);第158-161页 * |
Also Published As
Publication number | Publication date |
---|---|
CN105550172A (en) | 2016-05-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||