CN110297882A - Method and device for determining training corpus - Google Patents

Method and device for determining training corpus

Info

Publication number
CN110297882A
CN110297882A (application CN201910156947.4A)
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910156947.4A
Other languages
Chinese (zh)
Inventor
崔家亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to application CN201910156947.4A
Publication of CN110297882A
Legal status: pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a method and device for determining training corpus. The method comprises: obtaining a text to be analyzed and at least one standard text; calculating the Jaccard similarity between the text to be analyzed and each standard text; determining, through at least one vectorization computation, the corresponding vector cosine similarity between the text to be analyzed and each standard text; determining an aggregated similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities; and determining, according to the aggregated similarity between the text to be analyzed and each standard text, whether the text to be analyzed is to be used as training corpus.

Description

Method and device for determining training corpus
Technical field
This specification relates to the field of computer technology.
Background
Natural language processing, or text processing, is an important branch of machine learning with wide-ranging applications and research significance in real-world business.
Most text analysis today is based on supervised learning; that is, it relies on training corpus, using well-prepared corpus to train models and tune their parameters. Collecting the relevant corpus is the first task of every machine learning project, yet it currently depends on manual assessment, so the labor cost is very high.
For example, an e-commerce website that wants to train a sentiment analyzer for customer reviews needs a large number of relevant reviews. Although technologies such as web crawlers can automatically grab text in bulk, a human still has to decide where and what to crawl. Moreover, Internet text is full of uncertainty: simply crawling full text without any filtering easily yields too much garbage text, which hinders later annotation and model training.
Summary of the invention
The purpose of the present invention is to provide a method and device for determining training corpus that can both determine training corpus more accurately and effectively reduce labor cost.
To solve the above problems, this application discloses a method for determining training corpus, comprising:
obtaining a text to be analyzed and at least one standard text;
calculating the Jaccard similarity between the text to be analyzed and each standard text;
determining, through at least one vectorization computation, the corresponding vector cosine similarity between the text to be analyzed and each standard text;
determining an aggregated similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities between them;
determining, according to the aggregated similarity between the text to be analyzed and each standard text, whether the text to be analyzed is to be used as training corpus.
In a preferred embodiment, the step of determining, through at least one vectorization computation, the corresponding vector cosine similarity between the text to be analyzed and each standard text comprises:
performing at least one vectorization computation on the text to be analyzed and each standard text to obtain their corresponding vectorization results, and calculating the vector cosine similarity between the text to be analyzed and each standard text according to those results.
In a preferred embodiment, the step of determining, through at least one vectorization computation, the corresponding vector cosine similarity between the text to be analyzed and each standard text includes any one, or any combination, of the following vectorization computations:
performing vectorization on the text to be analyzed and each standard text through Word2Vec, and calculating their corresponding cosine similarity under Word2Vec according to the vectorization results; and/or
performing vectorization on the text to be analyzed and each standard text through TF-IDF, and calculating their corresponding cosine similarity under TF-IDF according to the vectorization results; and/or
performing vectorization on the text to be analyzed and each standard text through LSI, and calculating their corresponding cosine similarity under LSI according to the vectorization results.
In a preferred embodiment, the step of determining the aggregated similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities comprises:
aggregating the Jaccard similarity and all the vector cosine similarities between the text to be analyzed and each standard text by mean or weighted-mean computation to obtain their aggregated similarity.
In a preferred embodiment, the step of determining, according to the aggregated similarity between the text to be analyzed and each standard text, whether the text to be analyzed is to be used as training corpus comprises:
comparing the aggregated similarity between the text to be analyzed and each standard text against a first threshold, and taking the text to be analyzed as training corpus if any aggregated similarity exceeds the first threshold; and/or
comparing the average of the aggregated similarities between the text to be analyzed and each standard text against a second threshold, and taking the text to be analyzed as training corpus if the average exceeds the second threshold.
In a preferred embodiment, the step of obtaining a text to be analyzed and at least one standard text comprises:
crawling the text to be analyzed from the network using a web crawler.
In a preferred embodiment, the step of obtaining a text to be analyzed and at least one standard text further comprises:
preprocessing the text to be analyzed, wherein the preprocessing includes Chinese word segmentation and text cleaning.
This application also discloses a device for determining training corpus, comprising:
an obtaining module, configured to obtain a text to be analyzed and at least one standard text;
a Jaccard similarity module, configured to calculate the Jaccard similarity between the text to be analyzed and each standard text;
a vector cosine similarity module, configured to determine, through at least one vectorization computation, the corresponding vector cosine similarity between the text to be analyzed and each standard text;
an aggregated similarity module, configured to determine the aggregated similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities;
a training corpus determining module, configured to determine, according to the aggregated similarity between the text to be analyzed and each standard text, whether the text to be analyzed is to be used as training corpus.
This application also discloses an apparatus for determining training corpus, comprising:
a memory for storing computer-executable instructions; and
a processor for implementing the steps of the method described above when executing the computer-executable instructions.
This application also discloses a computer-readable storage medium in which computer-executable instructions are stored; when executed by a processor, the computer-executable instructions implement the steps of the method described above.
According to the technical solution of this application, the similarities between the large number of texts to be analyzed crawled from the network and the standard texts are first calculated along multiple dimensions, and the aggregated similarity of each text to be analyzed with the standard texts is determined by taking the mean or weighted mean of those similarities. Whether the text to be analyzed is to be used as training corpus is then decided based on the aggregated similarity. This effectively prevents the deviation introduced by any individual similarity value from having an outsized influence on the final accuracy, thereby reducing the error caused by relying on a single method. Moreover, this approach also significantly reduces manual effort.
A large number of technical features are described in this specification and distributed across the technical solutions; setting out every possible combination of features (i.e., every technical solution) would make the specification excessively long. To avoid this problem, the technical features disclosed in the summary above, in the embodiments and examples below, and in the drawings may be freely combined with one another to constitute various new technical solutions (all of which should be regarded as having been recorded in this specification), unless such a combination of features is technically infeasible. For example, suppose one example discloses features A+B+C and another discloses features A+B+D+E, where C and D are equivalent means serving the same function that can only be used alternatively, not together, and E can technically be combined with C. Then the solution A+B+C+D should not be regarded as recorded, because it is technically infeasible, whereas the solution A+B+C+E should be regarded as recorded.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for determining training corpus according to the first embodiment of this specification;
Fig. 2 is a flow diagram of part of the method for determining training corpus according to the first embodiment of this specification;
Fig. 3 is a structural diagram of the device for determining training corpus according to the second embodiment of this specification.
Detailed description of the embodiments
In the following description, many technical details are presented to help the reader better understand this application. However, those skilled in the art will understand that the technical solutions claimed in this application can be implemented even without these technical details, and with various changes and modifications based on the embodiments below.
Embodiments of this specification are described in further detail below with reference to the drawings.
The first embodiment of this specification relates to a method for determining training corpus, whose flow is shown in Fig. 1. The method comprises the following steps:
Step 110: obtain a text to be analyzed and at least one standard text.
Specifically, in this step, a portion of existing corpus is selected and organized in advance to serve as the standard texts. The number of standard texts can be determined according to actual conditions, but there is at least one.
Note that these standard texts are defined as the texts that can best represent, and are most typical of, the current research or project purpose.
In addition, in this step, a web crawler is used to crawl texts to be analyzed from the network. In the subsequent steps, the degree of similarity between these texts to be analyzed and the above standard texts is analyzed, and it is determined accordingly whether each text to be analyzed can serve as training corpus.
Note that after the texts to be analyzed are crawled, they need to be preprocessed. The preprocessing may include Chinese word segmentation and text cleaning, among other steps; specifically, text cleaning removes irrelevant information such as stop words and punctuation.
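The cleaning step described above can be sketched as follows. This is a minimal illustration, not part of the patent: the stop-word list and the regex-based tokenizer are placeholders invented here, and a real pipeline for Chinese text would use a dedicated word-segmentation library rather than whitespace splitting.

```python
import re

# Illustrative English stop words; a real deployment would use a
# language-appropriate stop-word list and a proper Chinese segmenter.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is"}

def preprocess(text: str) -> list:
    """Lower-case the text, strip punctuation, tokenize, and drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess("The seller shipped the order quickly, and the quality is great!")
# tokens keeps only the content words, e.g. ["seller", "shipped", "order", ...]
```

The resulting token lists are what the similarity computations in the following steps would operate on.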
Step 120: calculate the Jaccard similarity between the text to be analyzed and each standard text.
Note that the Jaccard similarity is defined as the ratio of the size of the intersection of two sample sets to the size of their union; it is a method of comparing text similarity along the keyword-matching dimension. The Jaccard similarity is calculated as follows:
J(A, B) = |A ∩ B| / |A ∪ B|
where A represents the text to be analyzed and B represents the standard text; A ∩ B represents the intersection of the word sets of the text to be analyzed and the standard text, and A ∪ B represents their union.
The advantage of this approach is that only the intersection and union of the specific words in the standard text and the text to be analyzed are needed, making it easy to implement and compute.
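The Jaccard computation above can be sketched directly over word sets. The example word sets are hypothetical, and the convention for two empty texts is a choice made here rather than something the patent specifies.

```python
def jaccard_similarity(words_a: set, words_b: set) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over the two texts' word sets."""
    if not words_a and not words_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical tokenized texts: a candidate crawled text vs. a standard text.
candidate = {"refund", "seller", "slow", "delivery"}
standard = {"seller", "fast", "delivery", "great"}
score = jaccard_similarity(candidate, standard)  # 2 shared / 6 distinct words
```

The score is one of the per-method similarities that step 140 later aggregates.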
Step 130: determine, through at least one vectorization computation, the corresponding vector cosine similarity between the text to be analyzed and each standard text.
Specifically, in this step, at least one vectorization computation is performed on the text to be analyzed and each standard text to obtain their corresponding vectorization results, and the corresponding vector cosine similarity between the text to be analyzed and each standard text is calculated from those results.
Specifically, this step includes any one, or any combination, of the following steps:
Step 132: perform vectorization on the text to be analyzed and each standard text through Word2Vec, and calculate their corresponding cosine similarity under Word2Vec according to the vectorization results.
Note that Word2Vec is a deep-learning method that trains on words and their contexts to finally obtain vector representations of words. Word2Vec vectors trained on a large-scale corpus possess some semantic properties; for example, it has been observed that 'King' - 'Man' + 'Woman' ≈ 'Queen'. In addition, cosine similarity is well suited to measuring the degree of similarity between two vectors in a high-dimensional space.
Specifically, the cosine similarity formula used in this step is:
cos(A, B) = (A · B) / (|A| |B|)
where A represents the vector of the text to be analyzed and B represents the vector of the standard text.
The advantage of this approach is that the input texts, i.e., the crawled text and the standard text, are represented with high-quality Word2Vec vectors trained in advance, and the degree of similarity between the two is then calculated by cosine similarity. Because the pretrained Word2Vec model contains information learned from other vast corpora, it can vectorize a specific text from a semantic perspective.
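A minimal sketch of the Word2Vec + cosine route follows. The tiny 3-dimensional "word vectors" are invented stand-ins for a real pretrained model (in practice the vectors would be loaded from, e.g., a Word2Vec model trained on a large corpus), and averaging word vectors into a document vector is one common choice of text representation, not one the patent prescribes.

```python
import math

# Toy vectors standing in for a pretrained Word2Vec model (hypothetical values).
WORD_VECTORS = {
    "seller":   [0.9, 0.1, 0.0],
    "boss":     [0.8, 0.2, 0.1],
    "delivery": [0.1, 0.9, 0.2],
    "slow":     [0.0, 0.3, 0.9],
}

def doc_vector(tokens):
    """Embed a text as the average of its known word vectors."""
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    dim = 3
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine_similarity(a, b):
    """cos(A, B) = A·B / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

sim = cosine_similarity(doc_vector(["seller", "delivery"]),
                        doc_vector(["boss", "slow"]))
```

Because the comparison happens in the embedding space, texts with different surface words but related meanings can still score high, which is the semantic advantage described above.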
Step 134: perform vectorization on the text to be analyzed and each standard text through TF-IDF (term frequency-inverse document frequency), and calculate their corresponding cosine similarity under TF-IDF according to the vectorization results.
Note that TF-IDF is a statistical method for assessing how important a word is to one document within a document collection or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with the frequency with which it appears across the whole corpus.
Specifically, TF-IDF represents a text using the combination of term frequency and inverse document frequency: the more often a word appears in an article while appearing in fewer documents overall, the better that word represents the article.
The advantage of this approach is therefore that TF-IDF can filter out words that appear in nearly all texts and select the words that best represent a given text, effectively compensating for the deficiency of single-word matching methods. Specifically, in word-matching methods every word carries the same weight, but in practice some words are more informative than others and better distinguish an article's theme and content. Filtering with TF-IDF can therefore effectively reduce the weight of insignificant words and raise the weight of important ones.
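The TF-IDF weighting described above can be sketched in a few lines. This uses the simple unsmoothed variant tf × log(N / df); real libraries (e.g. scikit-learn's TfidfVectorizer) use smoothed formulas, and the example documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dict.

    tf = raw count of the term in the document; idf = log(N / df), where
    df is the number of documents containing the term (unsmoothed variant).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [["seller", "fast", "delivery"],
        ["seller", "slow", "delivery"],
        ["refund", "slow", "service"]]
vectors = tfidf_vectors(docs)
# "seller" occurs in 2 of 3 docs, so its weight log(3/2) is lower than
# that of "fast", which occurs in only 1 doc and gets log(3/1).
```

The resulting sparse weight vectors would then be compared with the same cosine similarity formula as in step 132.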
Step 136: perform vectorization on the text to be analyzed and each standard text through LSI (Latent Semantic Indexing), and calculate their corresponding cosine similarity under LSI according to the vectorization results.
Note that LSI can be regarded as an improvement on the classical vector space model (VSM). LSI is a method built on statistics: it tries to find the association patterns between objects and the hidden structural relationships among them.
Specifically, LSI uses singular value decomposition (SVD) to map words and texts into a new space, which can resolve both polysemy (one word with many meanings) and synonymy (many words with one meaning).
For example, 'seller' and 'boss' often refer to the same thing in e-commerce comments; as another example, 'bank' can mean a financial institution or a riverbank.
The advantage of this approach is therefore that, by accounting for polysemy, the content of a text can be captured more effectively at the semantic level.
Note that in the embodiments of this application, at least one vector cosine similarity computation is performed on the text to be analyzed and each standard text. The specific computations may include, but are not limited to: Word2Vec + cosine similarity, TF-IDF + cosine similarity, LSI + cosine similarity, and so on. Any single vector cosine similarity computation, or any combination of several of them, may be used; the details are not repeated here.
Note that in the embodiments of this application, other feature-engineering methods may also be used, such as graph algorithms. Specifically, for example, deep learning models such as recurrent neural networks (RNN) or convolutional neural networks (CNN) may be applied; other pretrained word vectors such as GloVe may replace Word2Vec; or similarity measures such as Euclidean distance may substitute for cosine similarity; and so on.
Note that, compared with the Jaccard similarity in step 120, step 130, which computes cosine similarity after vectorizing the text to be analyzed and the standard text, uses an entirely different feature construction and model.
The advantage of this is that aggregating the various vectorization methods above with the Jaccard similarity method can effectively reduce the overall bias. For example, in some cases the Jaccard similarity calculated between a text to be analyzed and a standard text is very high, while the TF-IDF + cosine similarity or LSI + cosine similarity is very low; aggregating the similarities obtained by these different algorithms then effectively reduces the overall bias.
Step 140: determine the aggregated similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities between them.
Specifically, the Jaccard similarity and all the vector cosine similarities between the text to be analyzed and each standard text are aggregated by mean or weighted-mean computation to obtain the aggregated similarity between the text to be analyzed and that standard text.
Note that aggregation here essentially means taking the mean, or the weighted mean, of the multiple similarities obtained in the previous steps between the text to be analyzed and each standard text, thereby obtaining the aggregated similarity between the text to be analyzed and that standard text.
It can be understood that this aggregated similarity simultaneously reflects, to some extent, the similarities between the text to be analyzed and each standard text obtained in a number of different ways, and can therefore effectively reduce the overall bias.
Specifically, if the Jaccard similarity and the vector cosine similarities between the text to be analyzed and a standard text are aggregated by weighted-mean computation, the weighted mean can be computed, for example, by traversing parameter combinations in actual training runs to determine specific weight values.
The advantage of this is that the weighting can be flexibly adjusted and optimized for texts from different background domains.
Note that according to the embodiments of this application, there can be many ways to compute the weighted mean, which are not limited to the specific way described above.
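The aggregation step can be sketched as follows. The per-method scores and the weight vector are hypothetical; as the description notes, in practice the weights would be tuned by sweeping parameter combinations on real training runs.

```python
def aggregate_similarity(scores, weights=None):
    """Aggregate per-method similarity scores into one aggregated similarity.

    With no weights this is a plain mean; otherwise a weighted mean,
    normalized by the sum of the weights.
    """
    if weights is None:
        return sum(scores) / len(scores)
    assert len(weights) == len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# One candidate text vs. one standard text, scored by four methods
# (Jaccard, Word2Vec+cos, TF-IDF+cos, LSI+cos -- hypothetical values):
scores = [0.90, 0.20, 0.25, 0.30]
plain = aggregate_similarity(scores)                        # plain mean
tuned = aggregate_similarity(scores, weights=[1, 2, 2, 2])  # down-weights Jaccard
```

Here the plain mean pulls the outlying Jaccard score of 0.90 toward the other three estimates, which is exactly the bias-reduction effect the description claims for aggregation.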
Step 150: determine, according to the aggregated similarity between the text to be analyzed and each standard text, whether the text to be analyzed is to be used as training corpus.
Specifically, in this step, a first threshold and a second threshold are preset. The aggregated similarity between the text to be analyzed and each standard text is then compared against the preset first threshold; if any aggregated similarity exceeds the first threshold, the text to be analyzed is very close to one of the standard texts, and accordingly it is determined that the text to be analyzed is to be used as training corpus.
Alternatively, in this step, the average of the aggregated similarities between the text to be analyzed and each standard text can further be compared against the preset second threshold; if the average exceeds the second threshold, the text to be analyzed is likewise very close to the standard texts, and accordingly it is determined that the text to be analyzed is to be used as training corpus.
For example, if there are four standard texts in total, the average of the four aggregated similarities between the text to be analyzed and the four standard texts is compared with the second threshold; if the average exceeds the second threshold, the text to be analyzed is determined to be training corpus.
Note that in the embodiments of this application, the first and second thresholds can be obtained through iterative training, parameter optimization, and similar means.
Note that in the embodiments of this application, either of the above two conditions, or their combination, can be used as the condition for determining whether the text to be analyzed serves as training corpus.
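The two-condition decision rule can be sketched as below. Following the reasoning in the description, a candidate close enough to any one standard text, or close enough on average to all of them, is kept; the threshold values are hypothetical placeholders that would in practice be obtained by iterative training and parameter optimization.

```python
def is_training_corpus(agg_scores, first_threshold=0.8, second_threshold=0.5):
    """Decide whether a candidate text qualifies as training corpus.

    agg_scores holds the aggregated similarity of the candidate against
    each standard text. Keep the text if it is very close to any one
    standard text, or close enough on average to all of them.
    """
    close_to_one = any(s > first_threshold for s in agg_scores)
    close_on_average = sum(agg_scores) / len(agg_scores) > second_threshold
    return close_to_one or close_on_average

# Four standard texts -> four aggregated similarities per candidate:
keep = is_training_corpus([0.85, 0.40, 0.30, 0.20])  # True: 0.85 exceeds 0.8
drop = is_training_corpus([0.45, 0.40, 0.30, 0.20])  # False: no score > 0.8
```

Either condition alone, or the combination shown here, matches the alternatives the description enumerates.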
The technical effect of the above embodiment is as follows. To address the large uncertainty in the mass of texts to be analyzed crawled from the network, which makes the labor cost of text analysis very high, the similarities between each text to be analyzed and the standard texts are calculated along multiple dimensions, for example the Jaccard similarity, the corresponding cosine similarity under Word2Vec, the corresponding cosine similarity under TF-IDF, and the corresponding cosine similarity under LSI, and these similarities are aggregated to obtain an aggregated similarity. This both solves the problem of high labor cost and prevents the deviation of any specific similarity from having an outsized influence on the accuracy of the overall result. Further, by presetting multiple thresholds and judgment conditions suited to different situations and comparing the aggregated similarity against the thresholds, the selection of training corpus can be made more reasonable.
The second embodiment of this specification relates to a device for determining training corpus, whose structure is shown in Fig. 3. The device for determining training corpus comprises: an obtaining module, a Jaccard similarity module, a vector cosine similarity module, an aggregated similarity module, and a training corpus determining module. Specifically:
the obtaining module is configured to obtain a text to be analyzed and at least one standard text;
the Jaccard similarity module is configured to calculate the Jaccard similarity between the text to be analyzed and each standard text;
the vector cosine similarity module is configured to determine, through at least one vectorization computation, the corresponding vector cosine similarity between the text to be analyzed and each standard text;
the aggregated similarity module is configured to determine the aggregated similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities;
the training corpus determining module is configured to determine, according to the aggregated similarity between the text to be analyzed and each standard text, whether the text to be analyzed is to be used as training corpus.
The first embodiment is the method embodiment corresponding to this embodiment; the technical details in the first embodiment can be applied to this embodiment, and the technical details in this embodiment can likewise be applied to the first embodiment.
It should be noted that, as those skilled in the art will understand, the functions implemented by the modules shown in the above embodiment of the device for determining training corpus can be understood with reference to the foregoing description of the method for determining training corpus. The functions of the modules shown in the above embodiment of the device may be implemented by a program (executable instructions) running on a processor, or by specific logic circuits. If the above device of the embodiments of this specification is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this specification, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods of the embodiments of this specification. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of this specification are not limited to any specific combination of hardware and software.
Correspondingly, the embodiments of this specification also provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement each of the method embodiments of this specification. Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, and any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable storage media do not include transitory media, such as modulated data signals and carrier waves.
In addition, the embodiments of this specification also provide a training-corpus determining device, including a memory for storing computer-executable instructions, and a processor; the processor implements the steps in the above method embodiments when executing the computer-executable instructions in the memory. The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. The aforementioned memory may be a read-only memory (ROM), a random-access memory (RAM), flash memory, a hard disk, a solid-state disk, or the like. The steps of the methods disclosed in the embodiments of the present invention may be executed and completed directly by a hardware processor, or by a combination of hardware and software modules in a processor.
It should be noted that, in the application documents of this patent, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element. In the application documents of this patent, if it is mentioned that an action is performed according to an element, this means that the action is performed at least according to that element, covering two cases: performing the action only according to that element, and performing the action according to that element together with other elements. Expressions such as "multiple", "several", and "various" include two or more items, times, or kinds.
All documents referred to in this specification are considered to be incorporated into the disclosure of this specification in their entirety, so that they can serve as a basis for amendment if necessary. In addition, it should be understood that the foregoing is merely a preferred embodiment of this specification and is not intended to limit the protection scope of this specification. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of this specification shall be included within the protection scope of one or more embodiments of this specification.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Claims (10)

1. A training corpus determining method, characterized by comprising:
obtaining a text to be analyzed and at least one standard text;
calculating the Jaccard similarity between the text to be analyzed and each standard text;
determining, through at least one vectorization calculation, the corresponding vector cosine similarity between the text to be analyzed and each standard text;
determining the aggregate similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities between the text to be analyzed and that standard text;
determining, according to the aggregate similarity between the text to be analyzed and each standard text, whether the text to be analyzed is used as a training corpus.
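For illustration only (and not as part of the claimed subject matter), the flow of claim 1 may be sketched in Python as follows. The whitespace tokenizer, the single bag-of-words cosine, the equal-weight mean aggregation, and the threshold value are all hypothetical implementation choices:

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def bag_of_words(tokens):
    """Sparse term-count vector as a dict: term -> count."""
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1
    return vec

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(c * vec_b.get(t, 0) for t, c in vec_a.items())
    norm = lambda v: sum(x * x for x in v.values()) ** 0.5
    na, nb = norm(vec_a), norm(vec_b)
    return dot / (na * nb) if na and nb else 0.0

def is_training_corpus(text, standard_texts, threshold=0.8):
    """Keep `text` as a training corpus if its aggregate similarity to
    any standard text falls below the threshold (cf. claim 5, branch 1)."""
    t_tokens = text.split()  # hypothetical whitespace tokenizer
    for std in standard_texts:
        s_tokens = std.split()
        jac = jaccard(t_tokens, s_tokens)
        cos = cosine(bag_of_words(t_tokens), bag_of_words(s_tokens))
        aggregate = (jac + cos) / 2  # plain mean aggregation (cf. claim 4)
        if aggregate < threshold:
            return True
    return False
```

In practice the single cosine above would be replaced by one or more of the vectorization-based cosines enumerated in claim 3.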
2. The method according to claim 1, characterized in that the step of determining, through at least one vectorization calculation, the corresponding vector cosine similarity between the text to be analyzed and each standard text comprises:
performing at least one vectorization calculation on the text to be analyzed and each standard text to obtain corresponding vectorization results of the text to be analyzed and each standard text, and calculating the vector cosine similarity between the text to be analyzed and each standard text according to those corresponding vectorization results.
3. The method according to claim 1 or 2, characterized in that the step of determining, through at least one vectorization calculation, the corresponding vector cosine similarity between the text to be analyzed and each standard text comprises any one of the following vectorization calculations, or any combination thereof:
performing vectorization calculation on the text to be analyzed and each standard text by Word2Vec, and calculating, according to the vectorization results, the corresponding cosine similarity under Word2Vec between the text to be analyzed and each standard text; and/or
performing vectorization calculation on the text to be analyzed and each standard text by TF-IDF, and calculating, according to the vectorization results, the corresponding cosine similarity under TF-IDF between the text to be analyzed and each standard text; and/or
performing vectorization calculation on the text to be analyzed and each standard text by LSI, and calculating, according to the vectorization results, the corresponding cosine similarity under LSI between the text to be analyzed and each standard text.
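As one possible instantiation of the TF-IDF branch of claim 3, offered purely as a sketch: the smoothed idf formula and the joint vectorization of the text to be analyzed together with the standard texts are implementation choices, not requirements of the claim.

```python
import math

def tfidf_vectors(tokenized_docs):
    """TF-IDF vectorization: one sparse term->weight dict per document.
    idf is smoothed (+1 in numerator/denominator, +1 overall) to avoid
    zero weights for terms appearing in every document."""
    n = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in tokenized_docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors

def cosine_sim(a, b):
    """Cosine similarity of two sparse term->weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

The Word2Vec and LSI branches would feed the same `cosine_sim`, differing only in how the document vectors are produced.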
4. The method according to claim 1, characterized in that the step of determining the aggregate similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities comprises:
aggregating, by mean calculation or weighted mean calculation, the Jaccard similarity and all the vector cosine similarities between the text to be analyzed and each standard text, to obtain the aggregate similarity between the text to be analyzed and that standard text.
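A minimal sketch of claim 4's aggregation; the example weights below are hypothetical:

```python
def aggregate(jaccard_sim, cosine_sims, weights=None):
    """Aggregate one Jaccard similarity and any number of vector cosine
    similarities into a single score by plain mean (weights=None) or
    weighted mean, per claim 4."""
    sims = [jaccard_sim] + list(cosine_sims)
    if weights is None:
        return sum(sims) / len(sims)  # plain mean
    if len(weights) != len(sims):
        raise ValueError("one weight per similarity is required")
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)
```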
5. The method according to claim 1, characterized in that the step of determining, according to the aggregate similarity between the text to be analyzed and each standard text, whether the text to be analyzed is used as a training corpus comprises:
comparing the aggregate similarity between the text to be analyzed and each standard text with a first threshold, and if any aggregate similarity is less than the first threshold, using the text to be analyzed as a training corpus; and/or
comparing the average of the aggregate similarities between the text to be analyzed and the standard texts with a second threshold, and if the average is less than the second threshold, using the text to be analyzed as a training corpus.
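The two decision branches of claim 5 can be sketched as follows; the default threshold values are hypothetical, and the "and/or" of the claim is rendered here as either branch sufficing:

```python
def decide(aggregate_sims, first_threshold=0.8, second_threshold=0.7):
    """Return True if the text to be analyzed should become a training
    corpus: either some aggregate similarity is below the first threshold
    (branch 1), or their average is below the second threshold (branch 2)."""
    any_below = any(s < first_threshold for s in aggregate_sims)
    avg_below = sum(aggregate_sims) / len(aggregate_sims) < second_threshold
    return any_below or avg_below
```

Intuitively, a text dissimilar to at least one standard text, or dissimilar to the standard texts on average, adds coverage to the corpus and is therefore kept.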
6. The method according to claim 1, characterized in that the step of obtaining a text to be analyzed and at least one standard text comprises:
crawling the text to be analyzed from a network using a web crawler.
7. The method according to claim 1, characterized in that the step of obtaining a text to be analyzed and at least one standard text comprises:
preprocessing the text to be analyzed, wherein the preprocessing includes Chinese word segmentation and text cleaning.
8. A training corpus determining apparatus, characterized by comprising:
an obtaining module, configured to obtain a text to be analyzed and at least one standard text;
a Jaccard similarity module, configured to calculate the Jaccard similarity between the text to be analyzed and each standard text;
a vector cosine similarity module, configured to determine, through at least one vectorization calculation, the corresponding vector cosine similarity between the text to be analyzed and each standard text;
an aggregate similarity module, configured to determine the aggregate similarity between the text to be analyzed and each standard text according to the Jaccard similarity and all the vector cosine similarities between them;
a training corpus determining module, configured to determine, according to the aggregate similarity between the text to be analyzed and each standard text, whether the text to be analyzed is used as a training corpus.
9. A training corpus determining device, characterized by comprising:
a memory, for storing computer-executable instructions; and
a processor, for implementing the steps in the method according to any one of claims 1 to 7 when executing the computer-executable instructions.
10. A computer-readable storage medium storing computer-executable instructions, characterized in that the computer-executable instructions, when executed by a processor, implement the steps in the method according to any one of claims 1 to 7.
CN201910156947.4A 2019-03-01 2019-03-01 Training corpus determining method and device Pending CN110297882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910156947.4A CN110297882A (en) Training corpus determining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910156947.4A CN110297882A (en) Training corpus determining method and device

Publications (1)

Publication Number Publication Date
CN110297882A true CN110297882A (en) 2019-10-01

Family

ID=68026396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910156947.4A Pending CN110297882A (en) 2019-03-01 2019-03-01 Training corpus determines method and device

Country Status (1)

Country Link
CN (1) CN110297882A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836027A (en) * 2019-11-25 2021-05-25 京东方科技集团股份有限公司 Method for determining text similarity, question answering method and question answering system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170083507A1 (en) * 2015-09-22 2017-03-23 International Business Machines Corporation Analyzing Concepts Over Time
CN106815297A (en) * 2016-12-09 2017-06-09 宁波大学 A kind of academic resources recommendation service system and method
CN109145085A (en) * 2018-07-18 2019-01-04 北京市农林科学院 The calculation method and system of semantic similarity



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200927

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200927

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman, British Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: Fourth floor, P.O. Box 847, Capital Building, Grand Cayman, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20191001