CN107229605A - Method and device for calculating text similarity - Google Patents

Method and device for calculating text similarity

Info

Publication number
CN107229605A
Authority
CN
China
Prior art keywords
text
word segment
sample library
black sample
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710223484.XA
Other languages
Chinese (zh)
Other versions
CN107229605B (en)
Inventor
郑丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710223484.XA (CN107229605B)
Priority to CN202010419437.4A (CN111611786B)
Publication of CN107229605A
Application granted
Publication of CN107229605B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method for calculating text similarity. Word segmentation is performed on the text samples in an original black sample library and on a newly entered text sample. Based on the same discard policy, the resulting word segments are filtered at multiple graded text filtering ratios, and the word segments that remain after filtering are used to reconstruct the text samples in the original black sample library and the newly entered text sample respectively. The filtering ratio of the word segments is then used to characterize the similarity between the newly entered text sample and the black samples: by matching the word segments of the reconstructed black sample libraries against those of the newly entered text sample, a black-sample similarity is set for the word segments obtained from the newly entered text sample. The application can significantly improve the efficiency of calculating the similarity between a newly entered text sample and the text samples in the black sample library.

Description

Method and device for calculating text similarity
Technical field
The present application relates to the field of computer applications, and in particular to a method and device for calculating text similarity.
Background
Social networking applications generally face the problem of content review. A social product may have tens of millions or even hundreds of millions of users, and a huge volume of information is exchanged on it every day. How to use harmful historical content that has already been identified through review to quickly carry out real-time online prevention and control of all kinds of harmful content is therefore of great significance.
In the related art, real-time online prevention and control of harmful content based on previously reviewed harmful content is usually implemented through text similarity. For example, an algorithm such as edit distance or cosine distance can be used to calculate the text similarity between a text sample produced by a social networking application and each black sample that has been found to contain harmful content, and real-time online prevention and control of harmful content is then carried out according to the calculated text similarity.
However, when an algorithm such as edit distance or cosine distance is used to calculate the similarity between a text sample produced by a social networking application and each black sample, 1:N polling is generally required. When the number of black samples is large, polling all black samples and calculating the similarity for each in turn is too slow to meet the response-time requirement of real-time online prevention and control.
Summary of the invention
The present application proposes a method for calculating text similarity, applied to a computer device that includes multiple black sample libraries. The multiple black sample libraries are created, based on a preset filtering policy, from the text samples that remain after part of the text samples in an original black sample library are filtered out, and the multiple black sample libraries correspond to different text filtering ratios. The method includes:
performing word segmentation on a newly entered text sample to obtain several word segments;
selecting each of the multiple black sample libraries in turn as a target sample library and, based on the preset filtering policy and according to the text filtering ratio corresponding to the target sample library, filtering out part of the several word segments;
selecting each of the remaining word segments in turn as a target word segment, and matching the target word segment against the word segments in the target sample library one by one;
if the target word segment matches any word segment in the target sample library, setting a black-sample similarity for the target word segment based on the text filtering ratio corresponding to the target sample library.
The present application also proposes a device for calculating text similarity, applied to a computer device that includes multiple black sample libraries. The multiple black sample libraries are created, based on a preset filtering policy, from the text samples that remain after part of the text samples in an original black sample library are filtered out, and the multiple black sample libraries correspond to different text filtering ratios. The device includes:
a word segmentation module, which performs word segmentation on a newly entered text sample to obtain several word segments;
a filtering module, which selects each of the multiple black sample libraries in turn as a target sample library and, based on the preset filtering policy and according to the text filtering ratio corresponding to the target sample library, filters out part of the several word segments;
a matching module, which selects each of the remaining word segments in turn as a target word segment and matches the target word segment against the word segments in the target sample library one by one;
a setting module, which, if the target word segment matches any word segment in the target sample library, sets a black-sample similarity for the target word segment based on the text filtering ratio corresponding to the target sample library.
In the present application, the word segments obtained by segmenting the text samples in the original black sample library and the newly entered text sample are filtered under the same discard policy at multiple graded text filtering ratios, and the word segments that remain after filtering are used to reconstruct the text samples in the original black sample library and the newly entered text sample respectively. The text filtering ratio of the word segments is then used to characterize the similarity between the newly entered text sample and the black samples: by matching the word segments of the reconstructed black sample libraries against those of the newly entered text sample, a black-sample similarity is set for the word segments obtained from the newly entered text sample. This can significantly improve the efficiency of calculating the similarity between a newly entered text sample and the text samples in the black sample library, so that when real-time online prevention and control is carried out on newly entered text samples based on black samples, content review of those samples can be completed quickly and the response speed of the system is improved.
Brief description of the drawings
Fig. 1 is a flowchart of a method for calculating text similarity according to an embodiment of the present application;
Fig. 2 is an overall design architecture diagram of a text similarity algorithm according to an embodiment of the present application;
Fig. 3 is a processing flowchart of reconstructing the social texts in an original black sample library according to an embodiment of the present application;
Fig. 4 is a processing flowchart of performing similarity scoring on a newly entered social text according to an embodiment of the present application;
Fig. 5 is a logical block diagram of a device for calculating text similarity according to an embodiment of the present application;
Fig. 6 is a hardware structure diagram of a computer device that carries the device for calculating text similarity according to an embodiment of the present application.
Detailed description
In the related art, content review of the social texts produced in a social networking application, based on black samples that have been found to contain harmful content, so as to achieve real-time online prevention and control, can generally be implemented in the following ways:
In one implementation, when a social networking application first goes online, dedicated risk-control personnel can be assigned to manually browse the social texts produced by the application and judge whether the social texts published by users through the application, such as messages or service content, contain harmful content that violates the rules. As the number of users keeps growing and manual work can no longer support fast review, the risk-control personnel can configure a large number of keyword rules based on experience, and the review platform can then use the configured keyword rules to automatically check whether the social texts produced by the application contain harmful keywords.
However, keyword rules are usually distilled by reviewers from their past review experience and cannot cover all historical review information. Content review by keyword is also rather mechanical, usually relying on direct matching, and produces a large number of misjudgments.
In another implementation, the social texts produced in the social networking application can be matched exactly against the black samples that have been found to contain harmful content, thereby completing content review of the social texts produced by the application.
However, although exact matching can meet the response-speed requirement of real-time online prevention and control, the content of texts produced by social networking applications is expressed in rich and varied forms, so exact content matching yields a recall rate that is too low. Moreover, the review platform has to spend a large amount of processing resources on exact queries, so the timeliness of content review is poor and real-time requirements cannot be met.
In a third implementation, a similarity algorithm such as edit distance or cosine distance can be used to calculate the text similarity between a social text produced by the social networking application and each black sample found to contain harmful content, so that the social text is fuzzily matched against the black samples, and real-time online prevention and control of harmful content is then carried out according to the calculated text similarity.
However, with fuzzy matching based on a similarity algorithm such as edit distance or cosine distance, calculating the similarity between a social text and each black sample generally involves 1:N polling: the text similarity between a single social text produced by the application and every black sample in the black sample library has to be calculated in turn. When the number of black samples is large, polling all of them in turn is too slow to meet the response-time requirement of real-time online prevention and control.
It can be seen that, at present, when content review is performed on social texts produced by a social networking application to achieve real-time online prevention and control, the accuracy of the review and the response efficiency of the system cannot both be well satisfied. How to use the large number of black samples containing harmful content accumulated by the review platform to review social texts quickly and efficiently has therefore become an urgent problem in the industry.
In view of this, the present application proposes an algorithm that uses the text filtering ratio of word segments to characterize the text similarity between a newly entered text and the black samples, and that achieves fuzzy matching between the newly entered text sample and the black samples by exactly matching word segments, thereby obtaining the text similarity of the two.
In this algorithm, the word segments obtained by segmenting the text samples in the original black sample library and the newly entered text sample are filtered under the same discard policy at multiple graded text filtering ratios, and the word segments that remain after filtering are used to reconstruct the text samples in the original black sample library and the newly entered text sample respectively. The text filtering ratio of the word segments is then used to characterize the similarity between the newly entered text sample and the black samples: by matching the word segments of the reconstructed black sample libraries against those of the newly entered text sample, a black-sample similarity is set for the word segments obtained from the newly entered text sample. This can significantly improve the efficiency of calculating the similarity between a newly entered text sample and the text samples in the black sample library, so that when real-time online prevention and control is carried out on newly entered text samples based on black samples, content review of those samples can be completed quickly and the response speed of the system is improved.
The present application is described below through specific embodiments and with reference to specific application scenarios.
Referring to Fig. 1, Fig. 1 shows a method for calculating text similarity according to an embodiment of the present application. The method is applied to a computer device that includes multiple black sample libraries. The multiple black sample libraries are created, based on a preset filtering policy, from the text samples that remain after part of the text samples in an original black sample library are filtered out. The multiple black sample libraries correspond to different text filtering ratios, and these text filtering ratios form a gradient. The method performs the following steps:
Step 101: perform word segmentation on a newly entered text sample to obtain several word segments;
Step 102: select each of the multiple black sample libraries in turn as a target sample library and, based on the preset filtering policy and according to the text filtering ratio corresponding to the selected target sample library, filter out part of the several word segments;
Step 103: select each of the remaining word segments in turn as a target word segment, and match the target word segment against the word segments in the target sample library one by one;
Step 104: if the target word segment matches any word segment in the target sample library, set a black-sample similarity for the target word segment based on the text filtering ratio corresponding to the target sample library.
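Steps 102 to 104 can be illustrated with a minimal Python sketch. It is not the claimed implementation and rests on several assumptions that the text above does not fix: each black sample library is modeled as a plain set of word segments keyed by its filtering ratio, the filtering policy discards the lowest-weighted segments, and the similarity assigned on a match is taken to be `1 - ratio`. The function names `drop_lowest` and `score_segments` are hypothetical.

```python
from typing import Dict, List, Set


def drop_lowest(segments: List[str], weights: Dict[str, float], ratio: float) -> List[str]:
    """Discard the lowest-weighted fraction `ratio` of the segments (assumed policy)."""
    n_drop = int(len(segments) * ratio)
    # Rank positions by weight; unknown segments get weight 0.0.
    order = sorted(range(len(segments)), key=lambda i: weights.get(segments[i], 0.0))
    dropped = set(order[:n_drop])
    return [s for i, s in enumerate(segments) if i not in dropped]


def score_segments(
    segments: List[str],
    libraries: Dict[float, Set[str]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Return a black-sample similarity for every segment that hits a library."""
    scores: Dict[str, float] = {}
    for ratio, library in libraries.items():        # step 102: each library in turn
        kept = drop_lowest(segments, weights, ratio)
        for seg in kept:                             # step 103: each remaining segment
            if seg in library:                       # step 104: exact segment match
                similarity = 1.0 - ratio             # assumption: mapping from ratio
                scores[seg] = max(scores.get(seg, 0.0), similarity)
    return scores
```

Because each match is a set lookup, the cost per new text grows with the number of libraries and segments rather than with the number of black samples, which is the efficiency argument made for the method.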
The computer device may be any form of computer device that carries the text similarity algorithm shown in steps 101-104 and that reviews the content of newly entered text samples based on black samples that have already been reviewed and found to contain harmful content. In practice, the computer device may be a server device or a client device; for example, it may be a server in a content review platform, or a PC terminal that is connected to the content review platform and used to perform content review.
The text samples may specifically include social texts produced by a social networking application; for example, they may include chat messages that users publish through the application, as well as service messages related to user social activity that are generated by the application the user is using, and so on.
The newly entered text sample may be a new social text, extracted by the computer device, that a user enters while using the social networking application; the text samples in the black sample library may be the large number of social texts containing harmful content that have accumulated in the historical review records of the content review platform. Of course, in practice the text samples may also be other types of online text, other than social text, that require content review and real-time online prevention and control; this is not specifically limited in the present application.
The present application proposes an algorithm that uses the text filtering ratio of word segments to characterize the text similarity between a newly entered text and the black samples, and that achieves fuzzy matching between the newly entered text sample and the black samples by exactly matching word segments, thereby obtaining the text similarity of the two.
Referring to Fig. 2, Fig. 2 is an overall design architecture diagram of the text similarity algorithm of the present application.
As shown in Fig. 2, in this algorithm, the word segments obtained by segmenting all text samples in the black sample library and the newly entered text sample can be filtered under the same discard policy at multiple graded filtering ratios. The discrete values of the remaining word segments are used to reconstruct the text samples in the original black sample library and the newly entered text sample respectively. The text filtering ratio of the word segments is then used to characterize the similarity between the newly entered text sample and the black samples, and by matching the word segments of the reconstructed black sample libraries against those of the newly entered text sample, a black-sample similarity is set for the word segments obtained from the newly entered text sample.
In this similarity algorithm, the text similarity calculation can be completed quickly through simple word segment matching, which sets a black-sample similarity for the word segments obtained from the newly entered text sample. This can significantly improve the efficiency of calculating the similarity between a newly entered text sample and the text samples in the black sample library, so that when real-time online prevention and control is carried out on newly entered text samples based on black samples, content review of those samples can be completed quickly and the response speed of the system is improved.
The following description takes as an example the case in which the text samples are social texts produced by a social networking application, in the application scenario of reviewing the content of social texts to achieve real-time online prevention and control. Obviously, taking social texts as the text samples is only exemplary and is not intended to limit the technical solution of the present application.
In the present application, the computer device can collect a large number of general social texts to create a general sample library. The social texts in the general sample library can cover the social texts produced by the social networking application whose text content the computer device needs to review, and can also cover the social texts produced by all other social networking applications on the Internet that the computer device is able to collect. In other words, the computer device can collect the social texts produced by various social networking applications on the Internet and create the general sample library from the collected texts.
In practice, the number of social texts in the general sample library needs to stay at a large order of magnitude, so that the texts in the library cover, as far as possible, all keywords that users may produce in everyday online social activity. For example, in one illustrated example, the computer device can collect 20 billion general online social texts to create the general sample library.
After the general sample library has been created, word segmentation can first be performed on each of the social texts in the general sample library. The word segmentation algorithm used is not specifically limited in the present application; when implementing the technical solution of the present application, those skilled in the art may refer to the descriptions in the related art.
After word segmentation of all social texts in the general sample library is complete, the large number of word segments obtained may include some invalid ones, such as punctuation marks and stop words that have no substantive meaning. Therefore, after segmentation, the computer device can further filter the word segments obtained, removing the punctuation marks and, with the help of a built-in stop-word dictionary, removing the stop words.
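The punctuation and stop-word filtering described above might look like the following sketch. The stop-word set here is a tiny hypothetical stand-in for the stop-word dictionary the text mentions, and the input is assumed to be a list of segments already produced by some word segmentation algorithm.

```python
import unicodedata
from typing import Iterable, List

# Hypothetical stop-word dictionary; a real deployment would load a full list.
STOP_WORDS = {"the", "a", "of"}


def is_punctuation(segment: str) -> bool:
    """True when every character belongs to a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in segment)


def clean_segments(segments: Iterable[str]) -> List[str]:
    """Remove empty segments, pure punctuation, and stop words."""
    cleaned = []
    for seg in segments:
        seg = seg.strip()
        if not seg:
            continue
        if is_punctuation(seg):
            continue
        if seg.lower() in STOP_WORDS:
            continue
        cleaned.append(seg)
    return cleaned
```

Using Unicode categories rather than an ASCII punctuation list means full-width marks such as "，" and "。" are removed as well, which matters for Chinese social text.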
Of course, in practice, in addition to filtering out punctuation marks and stop words, other forms of filtering policy can be introduced according to actual needs. For example, part-of-speech analysis can be performed on the word segments obtained from segmentation, and the word segments that have substantive meaning can be selectively retained according to the result; for instance, only the word segments related to subjects, predicates, and objects are retained.
After the word segments have been further filtered, the computer device can use a preset statistical analysis algorithm to quantify the importance of each word segment with respect to the general sample library, obtaining a weight value of each word segment with respect to the general sample library.
The statistical method used to quantify the importance of each word segment with respect to the general sample library is not specifically limited in the present application.
In one illustrated embodiment, the weight value may specifically be an IDF (inverse document frequency) value; the computer device can use IDF values to characterize the importance of each word segment with respect to the general sample library.
When the IDF value of a target word in a corpus is calculated, it is generally obtained by dividing the total number of documents in the corpus by the number of documents that contain the target word and taking the logarithm of the quotient. Accordingly, when calculating the importance of each word segment with respect to the general sample library, the computer device can count, for each word segment in turn, the number of social texts in the general sample library that contain it, divide the total number of social texts in the general sample library by that count, and take the logarithm of the quotient to obtain the IDF value of the word segment with respect to the general sample library.
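The IDF calculation described above can be sketched as follows. The general sample library is assumed to be available as a list of already segmented texts, the natural logarithm is an arbitrary choice of base, and the function name `idf_weights` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_weights(corpus: List[List[str]]) -> Dict[str, float]:
    """Compute log(total texts / texts containing the segment) for every segment."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for segments in corpus:
        # Count each segment at most once per text.
        doc_freq.update(set(segments))
    return {seg: math.log(n_docs / df) for seg, df in doc_freq.items()}
```

A segment that appears in every text gets a weight of 0, and rarer segments get larger weights, which is what allows the weight to serve as a measure of importance when segments are later ranked for filtering.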
Of course, in practice, in addition to using IDF values to characterize the importance of word segments with respect to the general sample library, other statistical methods can also be used to quantify that importance.
For example, statistical methods such as the chi-square test or information entropy can also be used to quantify the importance of each word segment with respect to the general sample library. These are not described in detail in the present application; when implementing the technical solution of the present application, those skilled in the art may refer to the descriptions in the related art.
In this example, the computer device can be preconfigured with an original black sample library, which stores the large number of social texts found to contain harmful content (that is, black samples) that have accumulated in the content review platform. After the computer device has quantified the importance of each word segment with respect to the general sample library and obtained the corresponding weight values, it can use those weight values as the basis for text filtering: according to multiple preconfigured graded text filtering ratios, it filters out part of the black samples in the original black sample library and then reconstructs the original black sample library from the remaining black samples, obtaining multiple reconstructed black sample libraries.
Referring to Fig. 3, Fig. 3 is a processing flowchart of reconstructing the social texts in the original black sample library according to the present application.
In the initial state, a large number of social texts found to contain harmful content have usually accumulated in the content review platform. To make full use of these reviewed social texts, the computer device can take them as black samples to create the original black sample library, and then reconstruct all social texts in the original black sample library.
As shown in Fig. 3, when all social texts in the black sample library are reconstructed, word segmentation can first be performed on each of them. It should be noted that the word segments obtained by segmenting the social texts in the black sample library are generally a subset of the word segments obtained by segmenting the general sample library.
After word segmentation is complete, the computer device can further filter out the punctuation marks and stop words among the word segments, or introduce other filtering policies to filter the word segments; the specific implementation is not repeated here.
Still referring to Fig. 3, after the word segments have been obtained from the black sample library and further filtered, the computer device can, based on the preset filtering policy and according to the multiple preconfigured graded text filtering ratios, filter out part of the word segments obtained from the original black sample library at each ratio, and reconstruct the black sample library from the discrete values of the remaining word segments in each case. The reconstructed black sample libraries then correspond to different text filtering ratios respectively.
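One way to picture the reconstruction is the sketch below, which builds one library per filtering ratio. Several details are assumptions made only for illustration: filtering is applied per black sample, the lowest-weighted segments are the ones discarded, and the surviving segments are stored directly in a set instead of as discrete values. The function name `build_libraries` is hypothetical.

```python
from typing import Dict, Iterable, List, Set


def build_libraries(
    black_samples: List[List[str]],
    weights: Dict[str, float],
    ratios: Iterable[float],
) -> Dict[float, Set[str]]:
    """Return one reconstructed black sample library per text filtering ratio."""
    libraries: Dict[float, Set[str]] = {}
    for ratio in ratios:
        rebuilt: Set[str] = set()
        for segments in black_samples:
            n_drop = int(len(segments) * ratio)
            # Rank segments from least to most important and keep the tail.
            ranked = sorted(segments, key=lambda s: weights.get(s, 0.0))
            rebuilt.update(ranked[n_drop:])
        libraries[ratio] = rebuilt
    return libraries
```

The libraries are built once, offline, so the work of filtering the black samples does not add to the latency of scoring a newly entered text.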
In a kind of embodiment shown, because each text participle in general Sample Storehouse has quantified phase in advance For the significance level of general sample, and the weighted value for the significance level that can characterize each text participle is calculated;Moreover, right For the text participle that word segmentation processing is obtained is carried out for the social text in original black Sample Storehouse, typically for above-mentioned Social text in general Sample Storehouse carries out the subset for the text participle that word segmentation processing is obtained;Therefore, for original black sample For each social text in storehouse, there is a weighted value relative to general Sample Storehouse.
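The text does not fix how the weight values are computed at this point; later examples use IDF values. A minimal sketch of one common IDF variant, log(N / df), is given below. The formula variant, the function name `idf_weights` and the toy corpus are assumptions for illustration.

```python
import math

def idf_weights(tokenized_docs):
    """Return {token: idf}, with idf = log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for doc in tokenized_docs:
        for tok in set(doc):  # count each token once per document
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

# Toy "general sample library" of three already-segmented texts.
general = [["free", "prize", "now"], ["meeting", "now"], ["free", "lunch", "now"]]
weights = idf_weights(general)
```

A token that appears in every text, such as "now" here, gets weight 0, while rarer tokens get larger weights.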
In this case, the preset filtering policy can be defined with reference to the weight value of each text token in the original black sample library, so that tokens are filtered selectively to complete the reconstruction of the original black sample library.
In one illustrated embodiment, the preset filtering policy can be any one of the following discard policies:
Discard only the text tokens with the highest weight values;
Discard only the text tokens with the lowest weight values;
Discard the text tokens with the highest and the lowest weight values at the same time.
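The three discard policies above can be sketched as a single function. The policy names, the function name `apply_drop_policy` and the even split of the discarded share between both ends under the combined policy are assumptions made for this sketch.

```python
def apply_drop_policy(weights, ratio, policy="both"):
    """Return the tokens kept after discarding a `ratio` share by weight.

    weights: {token: weight}; policy: "highest", "lowest" or "both".
    With "both", the discarded share is split evenly between the two ends.
    """
    ranked = sorted(weights, key=weights.get)  # ascending by weight
    n_drop = int(round(len(ranked) * ratio))
    if policy == "lowest":
        low, high = n_drop, 0
    elif policy == "highest":
        low, high = 0, n_drop
    elif policy == "both":
        low = n_drop // 2
        high = n_drop - low
    else:
        raise ValueError(f"unknown policy: {policy}")
    return ranked[low:len(ranked) - high]

# Ten hypothetical tokens t0..t9 whose weight equals their index.
toy = {f"t{i}": float(i) for i in range(10)}
kept_both = apply_drop_policy(toy, 0.2, "both")
```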
In this application, the text filter ratio of the text tokens is used to characterize the text similarity between a newly entered text and the black samples. The proportion of text tokens that is ultimately discarded therefore affects the final text similarity result to some degree.
The text tokens with the lowest weight values are the least important and have the smallest influence on the final similarity result. Filtering them out first helps improve the precision of the final text similarity result. However, higher precision may reduce the number of hits when the content auditing platform uses text similarity to judge whether a newly entered social text hits a text token in the black sample library, so the platform's recall of social texts containing harmful content may become too low. Therefore, if those skilled in the art care more about the accuracy of the final result, the preset filtering policy can be set to "discard only the text tokens with the lowest weight values".
Similarly, the text tokens with the highest weight values are the most important and have the greatest influence on the final similarity result. Filtering them out first lowers the precision of the final text similarity result but increases the number of hits when the content auditing platform uses text similarity to judge whether a newly entered social text hits a text token in the black sample library, so the platform's recall of social texts containing harmful content rises. Therefore, if those skilled in the art care more about that recall, the preset filtering policy can be set to "discard only the text tokens with the highest weight values".
Of course, in practice a content auditing platform usually needs to balance the accuracy of the text similarity result against the recall of social texts containing harmful content. In that case, those skilled in the art can set the preset filtering policy to "discard the text tokens with the highest and the lowest weight values at the same time"; this is the filtering policy shown in Fig. 3.
In one illustrated embodiment, neither the number of gradient text filter ratios nor the step between successive ratios is specifically limited in this application; those skilled in the art can configure them according to actual needs or engineering experience. For example, in one implementation, four gradient filter ratios are preset: 10%, 20%, 40% and 50%.
Continuing with Fig. 3, assume the gradient text filter ratios are the four values 10%, 20%, 40% and 50%. The computer device can select these four ratios in turn as the target filter ratio. For each target filter ratio, it discards part of the text tokens obtained by segmenting the black sample library according to the preset discard policy, calculates the discrete value (for example, a hash value) of each remaining text token, and creates, from the discrete values of the remaining text tokens of the original black sample library, a discrete-value sample library corresponding to that target filter ratio. This discrete-value sample library is the reconstructed black sample library.
In one illustrated embodiment, when selecting the preset gradient filter ratios in turn as the target filter ratio, the computer device can select them in order from the lowest filter ratio to the highest.
With continued reference to Fig. 3, take the case where the IDF value characterizes the importance of each text token relative to the general sample library. The computer device can first apply the 10% filter ratio: among the text tokens obtained by segmenting the black sample library, it discards those whose IDF value is above the 95th percentile (the highest 5% of IDF values) and those whose IDF value is below the 5th percentile (the lowest 5%). It then calculates the discrete value of each remaining text token and generates the first discrete-value sample library from these values.
After generating the first discrete-value sample library, the computer device can continue with the 20% filter ratio: among the text tokens obtained by segmenting the black sample library, it discards those whose IDF value is above the 90th percentile and those whose IDF value is below the 10th percentile, calculates the discrete value of each remaining text token, and generates the second discrete-value sample library from these values.
By analogy, the computer device can then apply the 40% filter ratio, discarding the text tokens whose IDF value is above the 80th percentile and those whose IDF value is below the 20th percentile, calculating the discrete value of each remaining text token, and generating the third discrete-value sample library. It can likewise apply the 50% filter ratio, discarding the text tokens whose IDF value is above the 60th percentile and those whose IDF value is below the 30th percentile, calculating the discrete value of each remaining text token, and generating the fourth discrete-value sample library.
As shown in Fig. 3, after reconstructing the black sample library in this way, the computer device obtains four discrete-value sample libraries, each corresponding to a different filter ratio, and can load the discrete-value records of each reconstructed library into memory. The reconstruction of the original black sample library is then complete: the original black sample library has been rebuilt, according to the different text filter ratios, into multiple discrete-value sample libraries. Because each reconstructed library contains only the discrete values of a number of text tokens from the black sample library, the amount of data the computer device needs to load is greatly reduced.
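The reconstruction described above can be sketched as building one in-memory set of hash values per filter ratio. This is a sketch under assumptions: SHA-256 is chosen here only as an example of a discrete value, the discarded share is split evenly between the highest and lowest IDF values for every ratio, and the names `discrete_value` and `build_libraries` are hypothetical.

```python
import hashlib

FILTER_RATIOS = [0.10, 0.20, 0.40, 0.50]

def discrete_value(token):
    """Hash a token; SHA-256 is an illustrative choice of discrete value."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

def build_libraries(black_tokens, idf, ratios=FILTER_RATIOS):
    """Map each filter ratio to the set of hash values of surviving tokens."""
    ranked = sorted(set(black_tokens), key=lambda t: idf[t])  # ascending IDF
    libraries = {}
    for ratio in ratios:
        cut = int(round(len(ranked) * ratio / 2))  # dropped at each end
        kept = ranked[cut:len(ranked) - cut]
        libraries[ratio] = {discrete_value(t) for t in kept}
    return libraries

# Twenty hypothetical tokens w0..w19 whose IDF equals their index.
idf = {f"w{i}": float(i) for i in range(20)}
libs = build_libraries(list(idf), idf)
```

Keeping each library as a set makes the later membership test a constant-time exact lookup.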
Refer to Fig. 4, which is a flow chart of the similarity scoring process performed on a newly entered social text as shown in this application.
As shown in Fig. 4, after extracting a social text newly entered by a user through a social application, the computer device can reconstruct that text in turn, using the same discard policy and the filter ratios corresponding to the reconstructed discrete-value sample libraries.
First, the computer device can perform word segmentation on the extracted newly entered social text to obtain a number of text tokens. After segmentation, it can further filter out punctuation marks and stop words from the text tokens, or apply other filtering strategies; the specific implementation is not described further here.
After the newly entered social text has been segmented and the resulting text tokens filtered, the computer device can select the reconstructed discrete-value sample libraries in turn as the target sample library.
In one illustrated embodiment, when selecting the discrete-value sample libraries in turn as the target sample library, the computer device can select them in order of their corresponding filter ratios from low to high.
Once a target sample library has been selected, the computer device can, based on the same filtering policy and the filter ratio corresponding to the selected target sample library, filter out part of the text tokens obtained by word segmentation, completing the first reconstruction of the newly entered social text.
After the first reconstruction, the remaining text tokens can be selected in turn as the target token. The discrete value of the selected target token is calculated and then matched against the discrete values of the target sample library loaded in memory. If the discrete value of the target token matches any discrete value in the target sample library, a black sample similarity can be set for the target token based on the text filter ratio corresponding to the target sample library.
In one illustrated embodiment, to set the black sample similarity for the target token, the text filter ratio corresponding to the target sample library can be converted into a target value, the difference between 1 and the target value calculated, and the black sample similarity of the target token set to be greater than or equal to that difference. For example, when the target filter ratio is 10%, the similarity between the target token and the black samples in the black sample library can be set to be greater than or equal to 0.9.
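The conversion from filter ratio to similarity is simple arithmetic and can be sketched as below. The function names and the percent-string input format are assumptions; the returned number is the lower bound of the similarity, as in the 10% to 0.9 example.

```python
def parse_ratio(text):
    """Convert a percent string such as "10%" into the target value 0.10."""
    return float(text.rstrip("%")) / 100.0

def black_similarity_floor(filter_ratio):
    """Lower bound of the black sample similarity: 1 minus the filter ratio."""
    if not 0.0 <= filter_ratio <= 1.0:
        raise ValueError("filter ratio must lie in [0, 1]")
    return 1.0 - filter_ratio

floors = [black_similarity_floor(parse_ratio(p)) for p in ("10%", "20%", "40%", "50%")]
```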
Of course, if the discrete value of the target token matches none of the discrete values in the target sample library, the next text token can be selected as the target token and the above process repeated, and so on, until the discrete values of all text tokens have been matched against the discrete values in the target sample library. Discrete-value matching after the first reconstruction is then complete.
After discrete-value matching for the first reconstruction is complete, some of the text tokens obtained from the newly entered social text may still have no similarity set. In that case, the next discrete-value sample library can be selected as the target sample library, the newly entered social text reconstructed a second time in the manner shown above according to the text filter ratio of that target sample library, and the discrete-value matching and scoring process for each text token performed again. This continues until the newly entered social text has been reconstructed according to the text filter ratio of every discrete-value sample library and the corresponding discrete-value matching has been completed.
Note that if the target sample libraries are selected in ascending order of their filter ratios, a text token that already received a similarity score after an earlier reconstruction need not take part in the similarity scoring that follows later reconstructions.
With continued reference to Fig. 4, take the case where IDF values characterize the importance of each text token relative to the general sample library and the black sample library has been reconstructed with the four gradient filter ratios 10%, 20%, 40% and 50% into four discrete-value sample libraries. In implementation, the four discrete-value sample libraries can be selected in turn as the target sample library in ascending order of their corresponding filter ratios.
As shown in Fig. 4, the first discrete-value sample library, whose filter ratio is 10%, can be selected as the target sample library first. According to the 10% filter ratio, among the text tokens obtained by segmenting the newly entered social text, those whose IDF value is above the 95th percentile (the highest 5% of IDF values) and those whose IDF value is below the 5th percentile (the lowest 5%) are filtered out, and the discrete value of each remaining text token is calculated. The remaining text tokens are then selected in turn as the target token, and the discrete value of the target token is matched against the discrete values in the first discrete-value sample library. If it matches any of them, the similarity of the target token to the black samples in the black sample library can be set to no less than 90%.
Of course, if the discrete value of the target token matches none of the discrete values in the first discrete-value sample library, the next text token can be selected as the target token and the above process repeated, and so on, until the discrete values of all text tokens have been matched against the discrete values in the first discrete-value sample library.
Continuing with Fig. 4, once the discrete values of all text tokens obtained from the newly entered social text have been matched against the first discrete-value sample library, some text tokens may still have no similarity score. In that case, the second discrete-value sample library, whose text filter ratio is 20%, can be selected as the target sample library. According to the 20% filter ratio, among the text tokens obtained by segmenting the newly entered social text, those whose IDF value is above the 90th percentile and those whose IDF value is below the 10th percentile are filtered out, and the discrete value of each remaining text token is calculated. The remaining text tokens are then selected in turn as the target token, and the discrete value of the target token is matched against the discrete values in the second discrete-value sample library. If it matches any of them, the similarity of the target token to the black samples in the black sample library can be set to no less than 80%.
If the discrete value of the target token matches none of the discrete values in the second discrete-value sample library, the next text token can be selected as the target token and the above process repeated, and so on, until the discrete values of all text tokens have been matched against the discrete values in the second discrete-value sample library.
Similarly, once matching against the second discrete-value sample library is complete, if some text tokens still have no similarity score, the third discrete-value sample library, whose filter ratio is 40%, can be selected as the target sample library. According to the 40% filter ratio, the text tokens of the newly entered social text whose IDF value is above the 80th percentile and those whose IDF value is below the 20th percentile are filtered out, and the similarity scoring process shown above is performed again.
Further, once matching against the third discrete-value sample library is complete, if some text tokens still have no similarity score, the fourth discrete-value sample library, whose text filter ratio is 50%, can be selected as the target sample library. According to the 50% filter ratio, the text tokens of the newly entered social text whose IDF value is above the 60th percentile and those whose IDF value is below the 30th percentile are filtered out, and the similarity scoring process shown above is performed again; the specific process is not described further here.
Of course, in practice, after the text tokens of the newly entered social text have been filtered and reconstructed according to the filter ratio of each discrete-value sample library and the discrete values of all text tokens have been matched against all discrete values in the corresponding libraries, any text token of the newly entered text whose discrete value matched no discrete value in any of the discrete-value sample libraries can have its black sample similarity, that is, its similarity to the text samples in the black sample library, set to 0.
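The whole scoring loop can be sketched as follows. Several points are assumptions of this sketch rather than statements of the text: percentiles are taken over the new text's own tokens, the discarded share is split evenly between both ends, SHA-256 serves as the discrete value, and `libraries` is a hypothetical mapping from filter ratio to a set of hash values.

```python
import hashlib

def discrete_value(token):
    """Hash a token; SHA-256 is an illustrative choice of discrete value."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

def filter_tokens(tokens, idf, ratio):
    """Drop ratio/2 of the tokens at each end of the IDF ranking."""
    ranked = sorted(set(tokens), key=lambda t: idf.get(t, 0.0))
    cut = int(round(len(ranked) * ratio / 2))
    return ranked[cut:len(ranked) - cut]

def score_tokens(tokens, idf, libraries):
    """Return {token: similarity floor}; unmatched tokens score 0.0."""
    scores = {}
    for ratio in sorted(libraries):          # lowest filter ratio first
        for tok in filter_tokens(tokens, idf, ratio):
            if tok in scores:                # already scored in an earlier pass
                continue
            if discrete_value(tok) in libraries[ratio]:
                scores[tok] = 1.0 - ratio
    for tok in set(tokens):
        scores.setdefault(tok, 0.0)          # matched no library at all
    return scores

idf = {"free": 1.0, "prize": 2.0, "click": 3.0, "hello": 4.0}
libraries = {
    0.10: {discrete_value("prize")},
    0.20: {discrete_value("prize"), discrete_value("click")},
}
scores = score_tokens(["free", "prize", "click", "hello"], idf, libraries)
```

Because the libraries are visited in ascending ratio order, each token keeps the highest similarity floor it qualifies for.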
As can be seen, the text filter ratio of the text tokens is used to characterize the text similarity between the newly entered social text and the black samples, and discrete-value matching is used to set a similarity score relative to the black samples for each text token of the newly entered social text. Fuzzy matching between the newly entered text and the black samples is thereby accomplished by exact matching. Compared with traditional approaches that compute this fuzzy match with similarity algorithms such as edit distance or cosine distance, computational efficiency can be significantly improved.
In this example, after the similarity scoring flow of Fig. 4 has produced a similarity score for each text token obtained from the newly entered social text, the computer device can perform content auditing on that text based on the scoring results.
Specifically, the computer device can preset a similarity threshold and compare the similarity score of each text token in the newly entered social text with it. If the similarity of any text token in the newly entered social text reaches the threshold, that token can be determined to be a sensitive keyword, and the newly entered social text can be treated as a black sample containing harmful content and subjected to real-time security protection with a corresponding measure (for example, blocking the text).
Of course, if the similarity scores of all text tokens in the newly entered social text are below the similarity threshold, the text is a normal social text and no processing is needed.
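The threshold check can be sketched as below. The function name `audit`, the default threshold of 0.8 and the returned tuple shape are assumptions chosen for illustration; the text leaves the threshold value open.

```python
def audit(scores, threshold=0.8):
    """Return (is_black, sensitive_tokens) for per-token similarity scores.

    A text is flagged as soon as any token reaches the threshold; the
    flagged tokens are reported as sensitive keywords.
    """
    sensitive = sorted(tok for tok, s in scores.items() if s >= threshold)
    return bool(sensitive), sensitive

flagged = audit({"prize": 0.9, "hello": 0.0})
normal = audit({"hello": 0.0, "lunch": 0.5})
```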
In addition, after the newly entered social text has been treated as a black sample and the corresponding security processing performed based on the similarity scores, the text can be added as a black sample to the original black sample library. In this way, the black samples in the original black sample library are incrementally updated according to the content auditing results, continuously enriching its data samples.
Corresponding to the above method embodiments, this application also provides device embodiments.
Referring to Fig. 5, this application proposes a text similarity computing device 50, applied to a computer device that includes multiple black sample libraries. The multiple black sample libraries are created, based on a preset filtering policy, from the text samples remaining after part of the text samples in an original black sample library have been filtered out; each of the black sample libraries corresponds to a different text filter ratio. Referring to Fig. 6, the hardware architecture of the computer device carrying the text similarity computing device 50 generally includes a CPU, memory, non-volatile storage, a network interface, an internal bus and so on. Taking a software implementation as an example, the text similarity computing device 50 can generally be understood as a computer program loaded in memory that, when run by the CPU, forms a logical device combining software and hardware. The device 50 includes:
Word segmentation module 501, which performs word segmentation on a newly entered text sample to obtain a number of text tokens;
Filtering module 502, which selects the multiple black sample libraries in turn as the target sample library and, based on the preset filtering policy and the text filter ratio corresponding to the target sample library, filters out part of the text tokens;
Matching module 503, which selects the remaining text tokens in turn as the target text token and matches the target text token against the text tokens in the target sample library;
Setting module 504, which, if the target text token matches any text token in the target sample library, sets a black sample similarity for the target text token based on the text filter ratio corresponding to the target sample library.
In this example, the word segmentation module 501 further:
Performs word segmentation in turn on the text samples in the black sample library;
The filtering module 502 further:
Selects the preset gradient text filter ratios in turn as the target filter ratio and, based on the preset discard policy and the target filter ratio, filters out part of the text tokens obtained by segmenting the black sample library;
The device 50 also includes:
Creation module 505 (not shown in Fig. 5), which calculates the discrete values of the remaining text tokens of the black sample library and, based on the calculated discrete values, creates the black sample library corresponding to the target filter ratio.
In this example, the text filter ratios corresponding to the multiple black sample libraries form a gradient; the filtering module 502 further:
Selects the multiple black sample libraries in turn as the target sample library, in order of their corresponding text filter ratios from low to high.
In this example, the preset filtering policy includes any one of the following policies:
Discard only the text tokens with the highest weight values;
Discard only the text tokens with the lowest weight values;
Discard the text tokens with the highest and the lowest weight values at the same time.
In this example, the weight value is the IDF value of the text token relative to the general sample library.
In this example, the setting module 504:
Converts the text filter ratio corresponding to the target sample library into a target value;
Calculates the difference between 1 and the target value;
Sets the black sample similarity of the target text token to be greater than or equal to the difference.
In this example, the setting module 504 further:
When any text token in the newly entered text sample matches none of the text tokens in the multiple black sample libraries, sets the black sample similarity of that text token to 0.
In this example, the device 50 also includes:
Protection module 506 (not shown in Fig. 5), which, when the black sample similarity of any text token in the newly entered text sample reaches a preset threshold, treats the newly entered text sample as a black sample containing harmful content and applies real-time security protection to it.
In this example, the text sample is a social text; the text samples in the black sample library are social texts containing harmful content.
Because the device embodiments essentially correspond to the method embodiments, the relevant parts of the method embodiment description apply. The device embodiments described above are only illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this application. Those of ordinary skill in the art can understand and implement it without creative effort.
The systems, devices, modules or units described in the above embodiments can be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, e-mail sending and receiving device, game console, tablet computer, wearable device, or a combination of any of these devices.
Other embodiments of this application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses or adaptations of this application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in this application. The specification and embodiments are to be regarded as exemplary only; the true scope and spirit of this application are indicated by the following claims.
It should be understood that this application is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
The foregoing describes only preferred embodiments of this application and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims (18)

1. A method for calculating text similarity, characterized in that it is applied to a computer device that includes multiple black sample libraries; the multiple black sample libraries are created, based on a preset filtering policy, from the text samples remaining after part of the text samples in an original black sample library have been filtered out; the multiple black sample libraries each correspond to a different text filter ratio; the method includes:
Performing word segmentation on a newly entered text sample to obtain a number of text tokens;
Selecting the multiple black sample libraries in turn as the target sample library and, based on the preset filtering policy and the text filter ratio corresponding to the target sample library, filtering out part of the text tokens;
Selecting the remaining text tokens in turn as the target text token, and matching the target text token against the text tokens in the target sample library;
If the target text token matches any text token in the target sample library, setting a black sample similarity for the target text token based on the text filter ratio corresponding to the target sample library.
2. according to the method described in claim 1, it is characterised in that methods described also includes:
Word segmentation processing is carried out successively for the samples of text in the black Sample Storehouse;
By the text filtering ratio of default multiple holding gradients, goal filtering ratio is chosen to be successively;
Based on the default drop policy, according to the goal filtering ratio, carry out word segmentation processing for the black Sample Storehouse and obtain To text participle in part text participle filtered;
The centrifugal pump of remaining text participle in the black Sample Storehouse is calculated, and based on the remaining text participle calculated Centrifugal pump, create corresponding to the goal filtering ratio black Sample Storehouse.
3. according to the method described in claim 1, it is characterised in that the corresponding text filtering ratio of the multiple black Sample Storehouse is protected Hold gradient;
It is described that the multiple black Sample Storehouse is chosen to be target sample storehouse successively, including:
Order by the multiple black Sample Storehouse according to corresponding text filtering ratio from low to high, is chosen to be target sample successively Storehouse.
4. method according to claim 1 or 2, it is characterised in that the default filtering policy is included in following strategy It is any:
Only abandon weighted value highest text participle;
Only abandon the minimum text participle of weighted value;
Weighted value highest and minimum text participle are abandoned simultaneously.
5. method according to claim 4, it is characterised in that the weighted value is that the text participle corresponds to general sample The IDF values in this storehouse.
6. according to the method described in claim 1, it is characterised in that described to be based on text mistake corresponding with the target sample storehouse Filter ratio, is that the target text participle sets black Sample Similarity, including:
Text filtering ratio corresponding with the target sample storehouse is converted into target value;
Calculate the difference of 1 and the target value;
By the black Sample Similarity of the target text participle, it is set greater than being equal to the difference.
7. according to the method described in claim 1, it is characterised in that methods described also includes:
When any text participle in the samples of text of the new typing, with the text participle in the multiple black Sample Storehouse not During matching, the black Sample Similarity of text participle is set 0.
8. according to the method described in claim 1, it is characterised in that methods described also includes:
, will be described when the black Sample Similarity of any text participle in the samples of text of the new typing reaches predetermined threshold value The samples of text of new typing carries out real-time security as the black sample comprising harmful content.
9. according to the method described in claim 1, it is characterised in that the samples of text is social text;The black Sample Storehouse In samples of text be the social text comprising harmful content.
10. A device for calculating text similarity, characterized in that it is applied to a computer device, the computer device comprising a plurality of black sample libraries; the plurality of black sample libraries are created, based on a preset filtering policy, from the remaining text samples after some of the text samples in an original black sample library have been filtered out; wherein the plurality of black sample libraries correspond to different text filtering ratios respectively; the device comprising:
a word segmentation module, configured to perform word segmentation on a newly entered text sample to obtain a number of word segments;
a filtering module, configured to select the plurality of black sample libraries in turn as a target sample library, and, based on the preset filtering policy, filter out some of the word segments according to the text filtering ratio corresponding to the target sample library;
a matching module, configured to select the remaining word segments in turn as a target word segment, and match the target word segment in turn against the word segments in the target sample library;
a setting module, configured to, if the target word segment matches any word segment in the target sample library, set a black sample similarity for the target word segment based on the text filtering ratio corresponding to the target sample library.
11. The device according to claim 10, characterized in that the word segmentation module is further configured to:
perform word segmentation in turn on the text samples in the black sample library;
the filtering module is further configured to:
select a plurality of preset text filtering ratios that form a gradient in turn as a target filtering ratio; and, based on the preset discard policy, filter out, according to the target filtering ratio, some of the word segments obtained by performing word segmentation on the black sample library;
the device further comprises:
a creation module, configured to calculate discrete values of the remaining word segments in the black sample library, and create the black sample library corresponding to the target filtering ratio based on the calculated discrete values of the remaining word segments.
12. The device according to claim 10, characterized in that the text filtering ratios corresponding to the plurality of black sample libraries form a gradient;
the filtering module is further configured to:
select the plurality of black sample libraries in turn as the target sample library in ascending order of their corresponding text filtering ratios.
13. The device according to claim 10 or 11, characterized in that the preset filtering policy comprises any one of the following policies:
discarding only the word segments with the highest weight values;
discarding only the word segments with the lowest weight values;
discarding the word segments with the highest and the lowest weight values at the same time.
14. The device according to claim 13, characterized in that the weight value is the IDF value of the word segment with respect to a general sample library.
15. The device according to claim 10, characterized in that the setting module is configured to:
convert the text filtering ratio corresponding to the target sample library into a target value;
calculate the difference between 1 and the target value;
set the black sample similarity of the target word segment to be greater than or equal to the difference.
16. The device according to claim 10, characterized in that the setting module is further configured to:
when any word segment in the newly entered text sample matches none of the word segments in the plurality of black sample libraries, set the black sample similarity of that word segment to 0.
17. The device according to claim 10, characterized in that the device further comprises:
a protection module, configured to, when the black sample similarity of any word segment in the newly entered text sample reaches a preset threshold, perform real-time security protection on the newly entered text sample as a black sample containing harmful content.
18. The device according to claim 10, characterized in that the text sample is a social text; the text samples in the black sample library are social texts containing harmful content.
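
By way of illustration only, the claimed steps might be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of the application: the whitespace tokenizer, the use of a SHA-1 digest as the "discrete value", the rounding of the number of discarded word segments, the even split used when discarding from both ends, the default threshold, and all function names are hypothetical and do not appear in the claims.

```python
import hashlib

def tokenize(text):
    # Assumption: a whitespace split stands in for a real word segmenter.
    return text.lower().split()

def discrete_value(segment):
    # Assumption: a SHA-1 hex digest plays the role of the "discrete value".
    return hashlib.sha1(segment.encode("utf-8")).hexdigest()

def filter_segments(segments, ratio, weights, policy="lowest"):
    """Discard a `ratio` share of word segments by weight (claims 4 and 5)."""
    n_drop = int(len(segments) * ratio + 1e-9)  # assumption: round down
    # Positions in ascending order of weight; unknown segments get weight 0.
    order = sorted(range(len(segments)),
                   key=lambda i: weights.get(segments[i], 0.0))
    if policy == "lowest":
        dropped = order[:n_drop]
    elif policy == "highest":
        dropped = order[len(order) - n_drop:]
    elif policy == "both":
        low = (n_drop + 1) // 2  # assumption: split evenly, extra goes low
        dropped = order[:low] + order[len(order) - (n_drop - low):]
    else:
        raise ValueError(f"unknown policy: {policy}")
    dropped = set(dropped)
    return [s for i, s in enumerate(segments) if i not in dropped]

def build_libraries(black_samples, weights, ratios=(0.0, 0.2, 0.4),
                    policy="lowest"):
    """Create one black sample library per filtering ratio (claim 2)."""
    libraries = {}
    for ratio in ratios:
        values = set()
        for sample in black_samples:
            kept = filter_segments(tokenize(sample), ratio, weights, policy)
            values.update(discrete_value(s) for s in kept)
        libraries[ratio] = values
    return libraries

def score_segments(text, libraries, weights, policy="lowest"):
    """Assign a black sample similarity to each word segment (claims 1, 3, 6, 7)."""
    segments = tokenize(text)
    scores = {s: 0.0 for s in segments}  # unmatched segments stay at 0
    for ratio in sorted(libraries):      # lowest filtering ratio first
        for seg in filter_segments(segments, ratio, weights, policy):
            if discrete_value(seg) in libraries[ratio]:
                # Lower bound from claim 6: 1 minus the filtering ratio.
                scores[seg] = max(scores[seg], 1.0 - ratio)
    return scores

def is_black_sample(text, libraries, weights, threshold=0.8):
    """Flag the text when any segment reaches the threshold (claim 8)."""
    scores = score_segments(text, libraries, weights)
    return any(score >= threshold for score in scores.values())

# Hypothetical IDF-style weights and a single hypothetical black sample.
weights = {"buy": 1.0, "cheap": 2.0, "pills": 5.0, "online": 4.0,
           "now": 0.5, "hello": 1.5, "friend": 3.0}
libraries = build_libraries(["buy cheap pills online now"], weights,
                            ratios=(0.2, 0.4))
print(score_segments("hello friend buy pills now", libraries, weights))
```

In this sketch a library is simply a set of digests keyed by its filtering ratio, so the matching step reduces to a set lookup. A production system would presumably substitute a proper word segmenter and IDF values computed from a general sample library.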
CN201710223484.XA 2017-04-07 2017-04-07 Text similarity calculation method and device Active CN107229605B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710223484.XA CN107229605B (en) 2017-04-07 2017-04-07 Text similarity calculation method and device
CN202010419437.4A CN111611786B (en) 2017-04-07 2017-04-07 Text similarity calculation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710223484.XA CN107229605B (en) 2017-04-07 2017-04-07 Text similarity calculation method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202010419437.4A Division CN111611786B (en) 2017-04-07 2017-04-07 Text similarity calculation method and device

Publications (2)

Publication Number Publication Date
CN107229605A (en) 2017-10-03
CN107229605B (en) 2020-05-29

Family

ID=59934406

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201710223484.XA Active CN107229605B (en) 2017-04-07 2017-04-07 Text similarity calculation method and device
CN202010419437.4A Active CN111611786B (en) 2017-04-07 2017-04-07 Text similarity calculation method and device

Country Status (1)

Country Link
CN (2) CN107229605B (en)

Also Published As

Publication number Publication date
CN111611786A (en) 2020-09-01
CN107229605B (en) 2020-05-29
CN111611786B (en) 2023-03-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200924

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200924

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: Alibaba Group Holding Ltd.