CN107145555A

CN107145555A - A fuzzy sentence search method based on word segmentation

Info

Publication number: CN107145555A
Application number: CN201710296379.9A
Authority: CN
Inventors: 常帅; 邓皓钟
Original assignee: Beijing An Information Technology Co Ltd
Current assignee: Beijing An Information Technology Co Ltd
Priority date: 2017-04-28
Filing date: 2017-04-28
Publication date: 2017-09-08
Anticipated expiration: 2037-04-28
Also published as: CN107145555B

Abstract

The invention discloses a fuzzy sentence search method based on word segmentation. The method segments the original text into words and records the start position of each segment; merges repeated words and records the start positions of each repeated word; segments the keyword, denoting the number of keyword segments as i and the number of keyword segments that occur at least once in the original text as w; and calculates the occurrence rate p. If p is greater than a preset value, the segmentation result of the original text is searched using the keyword segments to obtain the positions of the keyword segments in the original text; if p is less than the preset value, the search exits. The method then calculates the distance d between the positions of the keyword segments in the original text and checks whether the difference k between d and the length of the corresponding keyword segment lies within an allowed range; if it does, a fuzzy search result is matched. The invention can retrieve sentences that have been obfuscated or whose word order has been changed, gives more accurate results, and improves retrieval efficiency.

Description

A fuzzy sentence search method based on word segmentation

Technical field

The present invention relates to a fuzzy sentence search method based on word segmentation, and belongs to the field of information security technology.

Background art

Today, as networks grow ever more developed, both beneficial information and information that causes instability are spreading more and more widely. For the sake of the ideological health of internet users and the harmony of society, some content in many public-facing venues must be reviewed before it can be shown. In the early days of network review, everything was reviewed manually. Although manual review is accurate and intelligent, its efficiency is negligible compared with the speed at which text is produced on the network. Automatic keyword filtering is used at present, but its accuracy is very low: it easily filters out information that is not harmful, and it cannot guarantee that all relevant harmful content is filtered out.

At present there are two common text search algorithms: text search based on word segmentation, and text search based on multi-pattern matching. Segmentation-based search first segments the original text and then matches the keyword to be searched against the segments, but this algorithm can only search for words and cannot match sentences. Search based on multi-pattern matching, that is, matching with algorithms such as ACBM and Wu-Manber, can search for both words and sentences but cannot analyze semantics.

There is one situation that neither algorithm can match. Take the keyword "somewhere taxi stops doing business" as an example: if the content of the original text is "stop doing business somewhere taxi" or "the taxi in somewhere stops doing business", the keyword will not be found. Although one can search by shortening or altering the keyword, the search results then become very noisy, there are too many cases to cover, the shortened or altered keyword may be incomplete, and the search speed drops considerably.

It can be seen that both algorithms, text search based on word segmentation and text search based on multi-pattern matching, fall short: neither can search for sentences that have been obfuscated or whose word order has been changed.

Summary of the invention

The object of the invention is to provide a fuzzy sentence search method based on word segmentation, in order to solve the problem that existing methods cannot search for sentences that have been obfuscated or whose word order has been changed.

The technical solution by which the invention solves the above technical problem is as follows: a fuzzy sentence search method based on word segmentation, the search method comprising the following steps:

Step 1: Segment the original text within the scope to be searched, and record the start position of each segment in the original text;

Step 2: After the original text is segmented, merge repeated words and record the start positions of each repeated word in the original text;

Step 3: Segment the keyword to be searched; denote the number of keyword segments as i, and the number of keyword segments that occur at least once in the original text within the scope to be searched as w;

Step 4: Calculate the occurrence rate p of the keyword segments in the original text within the scope to be searched, p = w/i. If p is greater than a preset value, search the segmentation result of the original text using the segmentation result of the keyword to obtain the positions at which each keyword segment occurs in the original text; if p is less than the preset value, exit the search;

Step 5: Calculate the distance d between the positions at which different keyword segments occur in the original text, and check whether the difference k between d and the length of the corresponding keyword segment lies within the allowed range; if k is within the allowed range, a fuzzy search result is matched.

It should be noted that segmentation may use existing algorithms, such as segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics; segmentation methods belong to the prior art.

In the fuzzy sentence search method based on word segmentation, the start position refers to the position of the first character of a segment, and character positions in the original text within the scope to be searched are numbered from 0. Taking "stop doing business somewhere taxi" as an example, segmentation yields three results, "stop doing business", "somewhere", and "taxi", and the position of the segment "stop doing business" is 0.
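As a minimal sketch of how such 0-based start positions could be computed, the snippet below measures each segment's offset in UTF-8 bytes, which is how the offsets in the embodiment appear to be counted. The Chinese string is an assumed back-translation of the example sentence, and `byte_offset` is a hypothetical helper name, not part of the patent.

```python
def byte_offset(text: str, char_index: int) -> int:
    """UTF-8 byte offset of the character at char_index, counting from 0."""
    return len(text[:char_index].encode("utf-8"))


# Assumed Chinese for "stop doing business / somewhere / taxi".
text = "停运某地出租车"
segments = ["停运", "某地", "出租车"]

offsets = {}
cursor = 0  # character index where the next segment is expected to start
for seg in segments:
    char_index = text.index(seg, cursor)
    offsets[seg] = byte_offset(text, char_index)
    cursor = char_index + len(seg)

print(offsets)
```

Each Chinese character occupies 3 bytes here, so the three segments start at bytes 0, 6, and 12.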

In the fuzzy sentence search method based on word segmentation, both the original text within the scope to be searched and the keyword to be searched use UTF-8 encoding. UTF-8 is a variable-length character encoding for Unicode (also called the universal code). As originally defined, UTF-8 encodes a character into 1 to 6 bytes depending on the size of its code point (RFC 3629 later restricted this to at most 4 bytes): common English letters are encoded as 1 byte, Chinese characters are usually 3 bytes, and only very rare characters need 4 or more bytes. UTF-8 can be read and written quickly using bit masks and shift operations. When strings are compared, strcmp() on UTF-8 strings gives the same ordering as wcscmp() on the corresponding wide-character strings, which makes sorting easier.
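The two properties mentioned above can be checked with a short sketch. It uses Python rather than the C functions named in the text: Python compares `str` values by code point, so sorting by the UTF-8 bytes plays the role of `strcmp()` and the default sort plays the role of `wcscmp()`. The sample words are assumed Chinese for the example segments.

```python
# Byte length of a few characters under UTF-8: A, é, 中, 😀.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]

# Sorting by raw UTF-8 bytes gives the same order as sorting by code point.
words = ["taxi", "某地", "出租车", "停运"]
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_code_point == by_utf8_bytes)  # True
```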

In the fuzzy sentence search method based on word segmentation, in step 3 a keyword segment that occurs at least once in the original text is counted as 1 and a keyword segment that does not occur in the original text is counted as 0; w is the number of keyword segments that occur at least once. It should be further noted that w is not the number of times the keyword segments occur in the original text, but the number of keyword segments that occur in it at all. Taking "stop doing business somewhere taxi" as an example, segmentation yields three results, "stop doing business", "somewhere", and "taxi". If "stop doing business" does not occur even once in the original text to be searched while "somewhere" and "taxi" both occur, then w is 2, no matter how many times "somewhere" and "taxi" occur.

In the fuzzy sentence search method based on word segmentation, the occurrence rate p in step 4 is greater than or equal to 0 and less than or equal to 1. The occurrence rate is the ratio of the number of keyword segments that occur in the original text within the scope to be searched to the total number of keyword segments. A keyword segment that occurs several times in the original text still counts as having occurred once, so it increases the count by 1. The preset value is also between 0 and 1 inclusive; the closer it is to 1, the higher the occurrence rate required of the keyword segments in the original text. Only when p is greater than the preset value is it necessary to search the segmentation result of the original text using the segmentation result of the keyword. Taking "stop doing business somewhere taxi" as an example, segmentation yields three results, "stop doing business", "somewhere", and "taxi". If "stop doing business" does not occur even once in the original text to be searched while "somewhere" and "taxi" both occur, then w is 2; the keyword has 3 segments, so p is 2/3. Since w cannot exceed the number of segments, 3, p lies between 0 and 1. If the preset value is 3/4, then p = 2/3 is less than 3/4, so the search is not carried out; with a lower preset value such as 1/2, p = 2/3 would exceed it and the search would proceed. If neither "stop doing business" nor "somewhere" occurs in the original text to be searched and only "taxi" occurs, then p is 1/3, which is less than the preset value 3/4, so there is likewise no need to search the segmentation result of the original text using the segmentation result of the keyword. It can be seen that the preset value is chosen according to the required retrieval accuracy: the higher the preset value, the higher the accuracy of the overall search.
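The definitions of w and p can be written compactly as follows. The symbols t_1, ..., t_i (the keyword segments), S (the set of segments found in the original text), and θ (the preset value) are notation introduced here for illustration only.

```latex
w = \sum_{j=1}^{i} \mathbf{1}\left[\, t_j \in S \,\right], \qquad
p = \frac{w}{i}, \qquad 0 \le p \le 1

% Worked values for i = 3 and \theta = 3/4:
% w = 3:\; p = 1 > \tfrac{3}{4}            \quad\text{(search proceeds)}
% w = 2:\; p = \tfrac{2}{3} < \tfrac{3}{4} \quad\text{(search exits)}
% w = 1:\; p = \tfrac{1}{3} < \tfrac{3}{4} \quad\text{(search exits)}
```

Because each segment contributes at most 1 to w regardless of how often it occurs, p depends only on which keyword segments are present, not on their frequency.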

In the fuzzy sentence search method based on word segmentation, in step 5, if the difference k between the distance d and the length of the corresponding keyword segment exceeds the allowed range, that result is not displayed. The allowed range is determined from the difference k and the length of the corresponding keyword segment. If d equals the length of the corresponding keyword segment, so that k is 0, the keyword appears in the original text without obfuscation or reordering; a nonzero difference indicates that the keyword appears in the original text in obfuscated or reordered form; and the larger the difference, the less likely it is that the original text contains an obfuscated or reordered form of the keyword. If k exceeds the allowed range, the original text within the scope to be searched contains nothing identical or similar to the search keyword, so no result is displayed. If k is within the allowed range, the original text contains content identical or similar to the search keyword, and both exact results and results that have been obfuscated or reordered are retrieved one by one. A more detailed explanation is given in the embodiment below.

In the fuzzy sentence search method based on word segmentation, the different keyword segments in step 5 are the distinct segments obtained by segmenting the keyword, and d is the distance between the positions at which different segments occur in the original text. Taking "stop doing business somewhere taxi" as an example, segmentation yields three results, "stop doing business", "somewhere", and "taxi". Because the original text to be searched may have been obfuscated or reordered, the distance d between occurrence positions in the original text must be calculated for every pair among "stop doing business", "somewhere", and "taxi".
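One way to picture the pairwise calculation is the sketch below, which takes the occurrence positions listed in the embodiment and computes, for every pair of distinct segments, the smallest distance between any two of their positions. Treating d as this pairwise minimum is an assumption drawn from the embodiment's use of "shortest distance"; `min_distance` is a hypothetical helper.

```python
from itertools import combinations, product

# Occurrence positions (UTF-8 byte offsets) taken from the embodiment.
positions = {
    "stop doing business": [0, 90],
    "somewhere": [6, 30, 72],
    "taxi": [12, 81],
}


def min_distance(a, b):
    """Smallest absolute distance between any position in a and any in b."""
    return min(abs(x - y) for x, y in product(a, b))


# One entry per unordered pair of distinct keyword segments.
pairwise = {
    (s, t): min_distance(positions[s], positions[t])
    for s, t in combinations(positions, 2)
}

for pair, d in pairwise.items():
    print(pair, d)
```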

In the fuzzy sentence search method based on word segmentation, word segmentation refers to splitting a sentence into words and phrases. The splitting of the keyword is not limited to phrases; a segment may be a single word or multiple words.

The method of the invention has the following advantages. The original text within the scope to be searched is segmented, and the start position of each segment in the original text is recorded. After segmentation, repeated words are merged and the start positions of each repeated word are recorded. The keyword to be searched is segmented; the number of keyword segments is denoted i, and the number of keyword segments that occur at least once in the original text is denoted w. The occurrence rate p = w/i is calculated; if p is greater than the preset value, the segmentation result of the original text is searched using the keyword segments to obtain the positions at which each keyword segment occurs, and if p is less than the preset value, the search exits. The distance d between the positions at which different keyword segments occur is calculated, and the difference k between d and the length of the corresponding keyword segment is checked against the allowed range; if k is within the range, a fuzzy search result is matched. The method can thus retrieve sentences that have been obfuscated or whose word order has been changed, gives more accurate results, improves retrieval efficiency, effectively protects national information security, and promotes social harmony and stability.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the fuzzy sentence search method based on word segmentation;

Fig. 2 is an algorithm sketch of the fuzzy sentence search method based on word segmentation.

Embodiment

The following embodiment illustrates the present invention but does not limit its scope.

As shown in Fig. 1, a fuzzy sentence search method based on word segmentation comprises the following steps:

S1: Segment the original text within the scope to be searched, and record the start position of each segment in the original text;

S2: After the original text is segmented, merge repeated words and record the start positions of each repeated word in the original text;

S3: Segment the keyword to be searched; denote the number of keyword segments as i, and the number of keyword segments that occur at least once in the original text within the scope to be searched as w;

S4: Calculate the occurrence rate p of the keyword segments in the original text within the scope to be searched, p = w/i. If p is greater than a preset value, search the segmentation result of the original text using the segmentation result of the keyword to obtain the positions at which each keyword segment occurs in the original text; if p is less than the preset value, exit the search;

S5: Calculate the distance d between the positions at which different keyword segments occur in the original text, and check whether the difference k between d and the length of the corresponding keyword segment lies within the allowed range; if k is within the allowed range, a fuzzy search result is matched.

The method is explained in more detail below, taking as the original text "stop doing business somewhere taxi, I love somewhere scenic outlook, on the scenic outlook the sun rises, the taxi in somewhere stops doing business" and as the search keyword "somewhere taxi stops doing business".

Step 1: Segment the original text. The segmentation result is:

[{"word": "stop doing business", "offset": 0},
 {"word": "somewhere", "offset": 6},
 {"word": "hire out", "offset": 12},
 {"word": "hire a car", "offset": 15},
 {"word": "taxi", "offset": 12},
 {"word": ",", "offset": 21},
 {"word": "I", "offset": 24},
 {"word": "love", "offset": 27},
 {"word": "somewhere", "offset": 30},
 {"word": "view", "offset": 36},
 {"word": "scenic outlook", "offset": 36},
 {"word": ",", "offset": 45},
 {"word": "view", "offset": 48},
 {"word": "scenic outlook", "offset": 48},
 {"word": "on", "offset": 57},
 {"word": "sun", "offset": 60},
 {"word": "sun rises", "offset": 60},
 {"word": ",", "offset": 69},
 {"word": "somewhere", "offset": 72},
 {"word": "'s", "offset": 78},
 {"word": "hire out", "offset": 81},
 {"word": "hire a car", "offset": 84},
 {"word": "taxi", "offset": 81},
 {"word": "stop doing business", "offset": 90}]

Here word is a segment and offset is the position of that segment in the original text. The offsets are UTF-8 byte offsets into the original Chinese text, in which each Chinese character and each full-width punctuation mark occupies 3 bytes, so they do not correspond to character counts in the English glosses. Overlapping segments such as "hire out", "hire a car", and "taxi" are all reported.
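The patent treats the segmenter as prior art and does not say which one produced this result. Purely as an illustration, the sketch below uses a naive dictionary scan that reports every dictionary word found at every position, which happens to reproduce the overlapping segments and byte offsets above. The Chinese text is an assumed back-translation of the example, and `segment_all` and `DICTIONARY` are hypothetical names.

```python
# Assumed Chinese original of the example text (32 characters, 96 bytes).
TEXT = "停运某地出租车，我爱某地观景台，观景台上太阳升，某地的出租车停运"

# Hypothetical dictionary; a real segmenter would use a far larger one.
DICTIONARY = {
    "停运", "某地", "出租", "租车", "出租车", "我", "爱", "观景",
    "观景台", "上", "太阳", "太阳升", "的", "，",
}


def segment_all(text, dictionary, max_len=3):
    """Report every dictionary word in text with its UTF-8 byte offset."""
    results = []
    byte_pos = 0  # byte offset of the character at index i
    for i, ch in enumerate(text):
        for n in range(1, max_len + 1):
            word = text[i:i + n]
            if len(word) == n and word in dictionary:
                results.append({"word": word, "offset": byte_pos})
        byte_pos += len(ch.encode("utf-8"))
    return results


tokens = segment_all(TEXT, DICTIONARY)
for tok in tokens[:5]:
    print(tok)
```

The tokens come out ordered by offset, which differs slightly from the order listed in the text, but the set of (word, offset) pairs is the same.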

Step 2: After the original text is segmented, some words occur more than once; for example, "somewhere" occurs at offset = 6, offset = 72, and other positions. The segmentation result is therefore processed to merge repeated words. The merged result is as follows:

{
  "hire out": [12, 81],
  "stop doing business": [0, 90],
  "love": [27],
  "on": [57],
  "somewhere": [6, 30, 72],
  "taxi": [12, 81],
  "hire a car": [15, 84],
  "I": [24],
  ",": [21, 45, 69],
  "view": [36, 48],
  "scenic outlook": [36, 48],
  "sun": [60],
  "sun rises": [60],
  "'s": [78]
}

In each entry, the word is to the left of the colon and its array of offsets is to the right.
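The merge in step 2 amounts to building a word-to-offsets index. A minimal sketch, assuming tokens shaped like the step 1 result (only a subset is listed to keep it short) and a hypothetical function name `merge_repeats`:

```python
from collections import defaultdict


def merge_repeats(tokens):
    """Group token offsets by word, keeping each word's offsets sorted."""
    index = defaultdict(list)
    for tok in tokens:
        index[tok["word"]].append(tok["offset"])
    return {word: sorted(offsets) for word, offsets in index.items()}


# Subset of the step 1 segmentation result.
tokens = [
    {"word": "stop doing business", "offset": 0},
    {"word": "somewhere", "offset": 6},
    {"word": "taxi", "offset": 12},
    {"word": "somewhere", "offset": 30},
    {"word": "somewhere", "offset": 72},
    {"word": "taxi", "offset": 81},
    {"word": "stop doing business", "offset": 90},
]

merged = merge_repeats(tokens)
print(merged["somewhere"])  # [6, 30, 72]
```

Looking up a keyword segment in such an index is a single dictionary access, which is presumably why the merge is done before the search.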

Step 3: The keyword is also segmented. The segmentation result is as follows:

["stop doing business", "somewhere", "taxi"]

Offsets do not need to be recorded for the keyword.

Step 4: Search the segmentation result of the original text using the segmentation result of the keyword, which gives the following result:

{
  "stop doing business": [0, 90],
  "somewhere": [6, 30, 72],
  "taxi": [12, 81]
}

This gives the positions at which the keyword segments occur in the original text. An occurrence rate can be set here:

occurrence rate = (number of keyword segments that occur in the original text) / (number of keyword segments)

The occurrence rate is thus a number greater than or equal to 0 and less than or equal to 1. Only when the actual occurrence rate is greater than the configured threshold does the method go on to the next step. If the threshold is set to 0.75, the occurrence rate in this example is 3/3 = 1; clearly 1 > 0.75, so the calculation proceeds to the next step.

If the keyword provided is "sell cigarettes", its segmentation result is ["sell", "cigarettes"]; the occurrence rate of ["sell", "cigarettes"] in the given original text is 0/2 = 0, and the search exits.
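Steps 3 and 4 of the example can be sketched as an index lookup followed by the threshold test. The index below is a subset of the merged result from step 2; `locate` is a hypothetical name, keyword segments are assumed to be distinct, and because the text does not say what happens when the rate equals the threshold, this sketch exits in that case.

```python
# Subset of the merged index from step 2 (UTF-8 byte offsets).
index = {
    "stop doing business": [0, 90],
    "somewhere": [6, 30, 72],
    "taxi": [12, 81],
    "I": [24],
    "love": [27],
}


def locate(keyword_segments, index, threshold=0.75):
    """Return (occurrence rate, positions) or (occurrence rate, None) on exit."""
    hits = {seg: index[seg] for seg in keyword_segments if seg in index}
    rate = len(hits) / len(keyword_segments)
    if not rate > threshold:
        return rate, None  # too few keyword segments occur: exit the search
    return rate, hits


print(locate(["stop doing business", "somewhere", "taxi"], index))
print(locate(["sell", "cigarettes"], index))
```

The first call passes the test with a rate of 1.0 and returns the three position lists; the second has a rate of 0.0 and exits.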

Step 5: Compare the offsets in the result of step 4 and calculate the shortest distances.

Because UTF-8 encoding is used, the lengths in bytes of the words in the result of step 4 are, respectively:

{
  "stop doing business": 6,
  "somewhere": 6,
  "taxi": 9
}

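Comparison gives the group

{
  "stop doing business": 0,
  "somewhere": 6,
  "taxi": 12
}

In this group the shortest distance between adjacent offsets is 6, which equals the lengths of "stop doing business" and "somewhere", so the condition is met and the keyword is matched at position 0.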

{
  "stop doing business": 90,
  "somewhere": 72,
  "taxi": 81
}

In this group the shortest distance between "taxi" and "stop doing business" is 9, which equals the length of "taxi" and so meets the condition. However, the shortest distance between "somewhere" and "taxi" is also 9, whereas it would have to be 6, the length of "somewhere", to match exactly. To handle this, a maximum allowed difference between two words can be set. If this value is set to 5, then 9 - 6 = 3 < 5, so the condition is met, and the keyword is likewise matched at position 72.
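The arithmetic for the two groups can be checked with a short sketch. It sorts a group by offset, takes the gap d between neighbouring segments, and compares k = d minus the byte length of the earlier segment with the allowed maximum. Reading the test this way (neighbours only, k at least 0 and strictly below the maximum) is an interpretation of the example, and `gaps_ok` is a hypothetical name.

```python
# UTF-8 byte lengths of the keyword segments, as given in step 5.
LENGTHS = {"stop doing business": 6, "somewhere": 6, "taxi": 9}


def gaps_ok(group, lengths, max_diff):
    """True if every neighbouring pair satisfies 0 <= d - length < max_diff."""
    ordered = sorted(group.items(), key=lambda item: item[1])
    for (seg, off), (_, next_off) in zip(ordered, ordered[1:]):
        d = next_off - off        # distance between neighbouring segments
        k = d - lengths[seg]      # extra bytes between them
        if not 0 <= k < max_diff:
            return False
    return True


group_at_0 = {"stop doing business": 0, "somewhere": 6, "taxi": 12}
group_at_72 = {"stop doing business": 90, "somewhere": 72, "taxi": 81}
scattered = {"stop doing business": 0, "somewhere": 30, "taxi": 12}

print(gaps_ok(group_at_0, LENGTHS, max_diff=5))   # True
print(gaps_ok(group_at_72, LENGTHS, max_diff=5))  # True
print(gaps_ok(scattered, LENGTHS, max_diff=5))    # False
```

With the maximum set to 3 instead of 5, the group at 72 would be rejected, since its k of 3 would no longer be strictly below the limit.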

The matched results are therefore "stop doing business somewhere taxi" and "the taxi in somewhere stops doing business", which achieves fuzzy matching. This effectively solves the problem that traditional algorithms cannot retrieve text that has been obfuscated or whose word order has been changed.

With further reference to Fig. 2, the idea of the method can be understood more easily. First the original text and the keyword are segmented. Then the method checks whether the keyword segments occur in the segmentation result of the original text; that is, it calculates the occurrence rate and compares it with the preset value to decide whether to proceed to the next search step. In the next step, it decides whether to display the search result by checking whether the difference k, between the distance d separating the positions at which different keyword segments occur in the original text and the length of the corresponding keyword segment, lies within the allowed range.
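The overall flow can be put together in one end-to-end sketch. Several choices here are assumptions rather than anything the patent specifies: the Chinese strings are a back-translation of the example; steps 1 and 2 are replaced by a stand-in that locates each keyword segment directly in the UTF-8 bytes (which gives the same offsets for this example); every keyword segment must take part in a match; and segments are chained in any order as long as each gap stays below `max_diff`. All function names are hypothetical.

```python
def find_all(haystack: bytes, needle: bytes):
    """All byte offsets at which needle occurs in haystack."""
    found, start = [], haystack.find(needle)
    while start != -1:
        found.append(start)
        start = haystack.find(needle, start + 1)
    return found


def fuzzy_search(text, keyword_segments, threshold=0.75, max_diff=5):
    raw = text.encode("utf-8")
    # Stand-in for steps 1-2: positions of the keyword segments only.
    index = {s: find_all(raw, s.encode("utf-8")) for s in keyword_segments}
    lengths = {s: len(s.encode("utf-8")) for s in keyword_segments}

    # Steps 3-4: occurrence rate p = w / i, exit unless p exceeds threshold.
    w = sum(1 for s in keyword_segments if index[s])
    if not w / len(keyword_segments) > threshold:
        return []

    # Step 5: chain the remaining segments, allowing a small gap each time.
    def chain(end, remaining):
        if not remaining:
            return end
        for seg in remaining:
            for off in index[seg]:
                if 0 <= off - end < max_diff:
                    result = chain(off + lengths[seg], remaining - {seg})
                    if result is not None:
                        return result
        return None

    matches = []
    for seg in keyword_segments:
        for off in index[seg]:
            rest = frozenset(keyword_segments) - {seg}
            end = chain(off + lengths[seg], rest)
            if end is not None:
                matches.append((off, raw[off:end].decode("utf-8")))
    return sorted(matches)


TEXT = "停运某地出租车，我爱某地观景台，观景台上太阳升，某地的出租车停运"
print(fuzzy_search(TEXT, ["某地", "出租车", "停运"]))
print(fuzzy_search(TEXT, ["出售", "香烟"]))
```

On the example text this reports matches starting at byte offsets 0 and 72, in line with the embodiment, and returns an empty list for the "sell cigarettes" keyword.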

Although the present invention has been described in detail above by way of a general description and a specific embodiment, modifications or improvements may be made on the basis of the invention, as will be apparent to those skilled in the art. Such modifications or improvements, made without departing from the spirit of the invention, therefore fall within the scope of protection claimed.

Claims

1. A fuzzy sentence search method based on word segmentation, characterized in that the search method comprises the following steps:

Step 1: Segment the original text within the scope to be searched, and record the start position of each segment in the original text;

Step 2: After the original text is segmented, merge repeated words and record the start positions of each repeated word in the original text;

Step 3: Segment the keyword to be searched; denote the number of keyword segments as i, and the number of keyword segments that occur at least once in the original text within the scope to be searched as w;

Step 4: Calculate the occurrence rate p of the keyword segments in the original text within the scope to be searched, p = w/i. If p is greater than a preset value, search the segmentation result of the original text using the segmentation result of the keyword to obtain the positions at which each keyword segment occurs in the original text; if p is less than the preset value, exit the search;

Step 5: Calculate the distance d between the positions at which different keyword segments occur in the original text, and check whether the difference k between d and the length of the corresponding keyword segment lies within the allowed range; if k is within the allowed range, a fuzzy search result is matched.

2. The fuzzy sentence search method based on word segmentation according to claim 1, characterized in that the start position refers to the position of the first character of a segment, and character positions in the original text within the scope to be searched are numbered from 0.

3. The fuzzy sentence search method based on word segmentation according to claim 1, characterized in that the original text within the scope to be searched and the keyword to be searched use UTF-8 encoding.

4. The fuzzy sentence search method based on word segmentation according to claim 1, characterized in that in step 3 a keyword segment that occurs at least once in the original text is counted as 1, a keyword segment that does not occur in the original text is counted as 0, and w is the number of keyword segments that occur at least once.

5. The fuzzy sentence search method based on word segmentation according to claim 1, characterized in that in step 4 the occurrence rate p is greater than or equal to 0 and less than or equal to 1; the occurrence rate is the ratio of the number of keyword segments that occur in the original text within the scope to be searched to the number of keyword segments; a keyword segment that occurs several times in the original text counts as having occurred once, increasing the count by 1; the preset value is greater than or equal to 0 and less than or equal to 1, and the closer it is to 1, the higher the occurrence rate required of the keyword segments in the original text; and when p is greater than the preset value, it is necessary to search the segmentation result of the original text using the segmentation result of the keyword.

6. The fuzzy sentence search method based on word segmentation according to claim 1, characterized in that in step 5, if the difference k between the distance d and the length of the corresponding keyword segment exceeds the allowed range, that result is not displayed; the allowed range is determined from the difference k and the length of the corresponding keyword segment; if d equals the length of the corresponding keyword segment, so that k is 0, the keyword appears in the original text without obfuscation or reordering; a nonzero difference indicates that the keyword appears in the original text in obfuscated or reordered form; and the larger the difference, the less likely it is that the original text contains an obfuscated or reordered form of the keyword.

7. The fuzzy sentence search method based on word segmentation according to claim 1, characterized in that the different keyword segments in step 5 are the distinct segments obtained by segmenting the keyword, and d is the distance between the positions at which different segments occur in the original text.

8. The fuzzy sentence search method based on word segmentation according to claim 1, characterized in that word segmentation refers to splitting a sentence into words and phrases.