CN109190014A - Regular expression generation method, apparatus, and electronic device
- Publication number: CN109190014A
- Application number: CN201810695221.3A
- Authority: CN (China)
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present invention provide a regular expression generation method, apparatus, and electronic device. The scheme includes: obtaining known bad search terms; based on a search-click bipartite graph, obtaining each search term that retrieves the same clicked files as a known bad search term, as similar search terms; extracting regex fragments from the similar search terms to obtain regular expressions, as candidate regular expressions; matching each similar search term against the candidate regular expressions; and, based on how the candidate regular expressions hit the similar search terms that took part in matching, selecting a preset number of candidate regular expressions as the regular expressions used to filter search terms. With this scheme, candidate regular expressions generated from similar search terms allow the existing regular expressions to be updated continuously.
Description
Technical field
The present invention relates to the field of network information retrieval, and in particular to a regular expression generation method, apparatus, and electronic device.
Background
As Internet resources grow richer, search engines proactively recommend search terms to users during network information retrieval in order to offer them more information. These may be query suggestions, default search terms, and the like. For example, when "quotation marks" is entered into a search engine, related recommendations such as "function of quotation marks" and "usage of quotation marks" appear by default. Search terms of this kind may be called recommended search terms. Recommended search terms make input easier and surface content that users are interested in. However, as the Internet becomes more widespread and its user base broader, some recommended search terms in current search engines remain unsuitable for recommendation. For example, search terms involving pornography or violence can harm minors. Search terms that can adversely affect users in this way may be called bad search terms. A search engine can filter recommended search terms in order to block bad information.
However, because the network changes quickly, variants of a bad search term tend to appear after it has been filtered, and these variants can retrieve the same bad information. The search engine therefore has to keep filtering these bad search terms and their variants.
A common approach today is the regular-expression method: a set of regular expressions is maintained manually, any search term matched by one of them is treated as a bad search term, and that search term is filtered out so that bad information is blocked.
In the prior art, maintenance of the regular expressions depends mainly on discovered bad cases: regex fragments are extracted from a discovered bad case to obtain a regular expression, which is then used to update the existing regular expressions. However, because neither the timing nor the number of bad cases follows any discernible pattern, the existing regular expressions cannot be updated continuously in this way.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a regular expression generation method, apparatus, and electronic device that continuously update the existing regular expressions based on similar search terms. The specific technical solutions are as follows:
An embodiment of the present invention provides a regular expression generation method, the method comprising:
obtaining known bad search terms;
based on a search-click bipartite graph, obtaining each search term that retrieves the same clicked files as the known bad search terms, as similar search terms, wherein the search-click bipartite graph represents the connections between search terms and the files that users clicked in the corresponding search results;
extracting regex fragments from the similar search terms to obtain regular expressions, as candidate regular expressions;
matching each similar search term against the candidate regular expressions;
for each similar search term that took part in matching, determining whether the similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term;
based on how the candidate regular expressions hit the similar search terms that took part in matching, selecting a preset number of candidate regular expressions as the regular expressions used to filter search terms.
Further, obtaining, based on the search-click bipartite graph, each search term that retrieves the same clicked files as the known bad search terms, as similar search terms, comprises:
for each known bad search term, obtaining from the search-click bipartite graph each search term that is connected to the same clicked file as the known bad search term and has not been judged to be a bad search term, as a search term to be screened;
for each search term to be screened, calculating the similarity between the search term to be screened and the known bad search term;
according to the magnitude of the similarity, selecting the search terms to be screened whose similarity is greater than a first preset threshold, as similar search terms.
Further, obtaining, based on the search-click bipartite graph, each search term that retrieves the same clicked files as the known bad search terms, as similar search terms, comprises:
in the search-click bipartite graph, for each known bad search term, checking its weight, and selecting the known bad search terms whose weight is greater than a second preset threshold as specific bad search terms;
based on the search-click bipartite graph, for each specific bad search term, obtaining each search term that is connected to the same clicked file and has not been judged to be a bad search term, as a search term to be screened;
for each search term to be screened, obtaining the clicked files connected to both the specific bad search term and the search term to be screened, checking the weight of the edge between each such clicked file and the search term to be screened, and selecting the search terms to be screened whose edge weight is greater than a third preset threshold, as similar search terms.
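The weight-based selection just described can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name, the dictionary layout, and the threshold parameters `t2` and `t3` are hypothetical, and the sketch assumes that a single shared edge above the third threshold is enough to select a search term, which the text does not state explicitly.

```python
def similar_terms_by_edge_weight(term_weight, edges, known_bad, t2, t3):
    """Select similar search terms using term weights and edge weights.

    term_weight: dict mapping search term -> weight (e.g. search count)
    edges:       dict mapping search term -> {clicked file -> edge weight}
    known_bad:   set of known bad search terms
    t2, t3:      second and third preset thresholds (hypothetical names)
    """
    # Known bad terms whose weight exceeds the second threshold
    specific_bad = {t for t in known_bad if term_weight.get(t, 0) > t2}

    similar = set()
    for bad in specific_bad:
        bad_files = set(edges.get(bad, {}))
        for term, files in edges.items():
            if term in known_bad:
                continue  # terms already judged bad are not candidates
            shared = bad_files & set(files)
            # Assumption: one shared edge above the third threshold suffices
            if any(files[f] > t3 for f in shared):
                similar.add(term)
    return similar
```

Requiring every shared edge to exceed the threshold instead would only change `any` to `all` (plus a check that `shared` is not empty).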
Further, before matching each similar search term against the candidate regular expressions, the method further comprises:
based on the search-click bipartite graph, calculating the badness degree of each candidate regular expression, the badness degree representing the association between the candidate regular expression and the known bad search terms;
and matching each similar search term against the candidate regular expressions comprises:
matching each similar search term against the candidate regular expressions whose badness degree satisfies a preset condition.
Further, calculating, based on the search-click bipartite graph, the badness degree of each candidate regular expression comprises:
for each candidate regular expression, calculating the association between the candidate regular expression and the known bad search terms according to the degree of similarity between the known bad search terms and each similar search term from which the candidate regular expression was generated;
for the i-th candidate regular expression, the badness degree Z_i is expressed by a formula (not reproduced in this text) in which:
n denotes the number of similar search terms from which the candidate regular expression can be generated; m denotes the number of known bad search terms that are connected to the same clicked files as those similar search terms; S_ij denotes the degree of similarity between the i-th similar search term and the j-th known bad search term connected to the same clicked file; and C_j denotes the weight of the j-th known bad search term;
wherein the degree of similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same clicked file is expressed by a formula (not reproduced in this text) in which:
p denotes the number of clicked files connected to both the similar search term and the known bad search term; W_ik denotes the weight of the edge between the similar search term and the k-th clicked file; W_i denotes the sum of the weights of the edges between the similar search term and all of its connected clicked files; W_jk denotes the weight of the edge between the known bad search term and the k-th clicked file; and W_j denotes the weight of the edges between the known bad search term and all of its connected clicked files.
Further, determining, for each similar search term that took part in matching, whether the similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term, comprises:
if the matching result for the similar search term is a match, and the similar search term is judged by other means to be a bad search term, determining that the similar search term is hit, and otherwise determining that it is missed;
or, if the matching result for the similar search term is no match, and the similar search term is judged by other means not to be a bad search term, determining that the similar search term is hit, and otherwise determining that it is missed.
Further, determining, for each similar search term that took part in matching, whether the similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term, comprises:
if the matching result for the similar search term is a match, and the similar search term is judged by other means to be a bad search term, determining that the similar search term is hit; if the matching result for the similar search term is a match, and the similar search term is judged by other means not to be a bad search term, determining that the similar search term is missed.
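The two hit-determination variants above can be written as small predicates. This is a sketch under stated assumptions: the function names are hypothetical, the first function is one reading that combines both rules of the first variant (a hit whenever the regex verdict agrees with the independent judgment), and returning `None` for unmatched terms in the second function is an illustrative choice, since that variant only rules on matched terms.

```python
from typing import Optional


def is_hit_agreement(matched: bool, judged_bad: bool) -> bool:
    """First variant, read as a single rule: a hit when the regex result
    and the independent judgment agree (match + bad, or no match + not bad)."""
    return matched == judged_bad


def is_hit_matched_only(matched: bool, judged_bad: bool) -> Optional[bool]:
    """Second variant: only matched terms receive a verdict.

    Returns True (hit) or False (miss) for matched terms, and None for
    unmatched terms, which this variant does not rule on.
    """
    if not matched:
        return None
    return judged_bad
```

The second variant counts only false positives as misses, so it suits a setting where the main concern is a candidate expression filtering harmless search terms.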
An embodiment of the present invention provides a regular expression generation apparatus, the apparatus comprising:
a bad search term obtaining module, configured to obtain known bad search terms;
a similar search term obtaining module, configured to obtain, based on a search-click bipartite graph, each search term that retrieves the same clicked files as the known bad search terms, as similar search terms, wherein the search-click bipartite graph represents the connections between search terms and the files that users clicked in the corresponding search results;
a regular expression generation module, configured to extract regex fragments from the similar search terms to obtain regular expressions, as candidate regular expressions;
a matching module, configured to match each similar search term against the candidate regular expressions;
a hit determination module, configured to determine, for each similar search term that took part in matching, whether the similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term;
a regular expression selection module, configured to select, based on how the candidate regular expressions hit the similar search terms that took part in matching, a preset number of candidate regular expressions as the regular expressions used to filter search terms.
Further, the similar search term obtaining module comprises:
a to-be-screened search term obtaining submodule, configured to obtain, for each known bad search term, from the search-click bipartite graph, each search term that is connected to the same clicked file as the known bad search term and has not been judged to be a bad search term, as a search term to be screened;
a similarity calculation submodule, configured to calculate, for each search term to be screened, the similarity between the search term to be screened and the known bad search term;
a similar search term selection submodule, configured to select, according to the magnitude of the similarity, the search terms to be screened whose similarity is greater than a first preset threshold, as similar search terms.
Further, the similar search term obtaining module comprises:
a specific bad search term obtaining submodule, configured to check, in the search-click bipartite graph, the weight of each known bad search term, and to select the known bad search terms whose weight is greater than a second preset threshold as specific bad search terms;
a to-be-screened search term obtaining submodule, configured to obtain, based on the search-click bipartite graph, for each specific bad search term, each search term that is connected to the same clicked file and has not been judged to be a bad search term, as a search term to be screened;
a similar search term selection submodule, configured to obtain, for each search term to be screened, the clicked files connected to both the specific bad search term and the search term to be screened, to check the weight of the edge between each such clicked file and the search term to be screened, and to select the search terms to be screened whose edge weight is greater than a third preset threshold, as similar search terms.
Further, the above apparatus further comprises:
a badness degree calculation module, configured to calculate, based on the search-click bipartite graph, the badness degree of each candidate regular expression, the badness degree representing the association between the candidate regular expression and the known bad search terms;
the matching module being specifically configured to match each similar search term against the candidate regular expressions whose badness degree satisfies a preset condition.
Further, the badness degree calculation module is specifically configured to calculate, for each candidate regular expression, the association between the candidate regular expression and the known bad search terms according to the degree of similarity between the known bad search terms and each similar search term from which the candidate regular expression was generated;
for the i-th candidate regular expression, the badness degree Z_i is expressed by a formula (not reproduced in this text) in which:
n denotes the number of similar search terms from which the candidate regular expression can be generated; m denotes the number of known bad search terms that are connected to the same clicked files as those similar search terms; S_ij denotes the degree of similarity between the i-th similar search term and the j-th known bad search term connected to the same clicked file; and C_j denotes the weight of the j-th known bad search term;
wherein the degree of similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same clicked file is expressed by a formula (not reproduced in this text) in which:
p denotes the number of clicked files connected to both the similar search term and the known bad search term; W_ik denotes the weight of the edge between the similar search term and the k-th clicked file; W_i denotes the sum of the weights of the edges between the similar search term and all of its connected clicked files; W_jk denotes the weight of the edge between the known bad search term and the k-th clicked file; and W_j denotes the weight of the edges between the known bad search term and all of its connected clicked files.
Further, the hit determination module comprises:
a first hit determination submodule, configured to determine that the similar search term is hit if the matching result for the similar search term is a match and the similar search term is judged by other means to be a bad search term, and otherwise to determine that it is missed;
a second hit determination submodule, configured to determine that the similar search term is hit if the matching result for the similar search term is no match and the similar search term is judged by other means not to be a bad search term, and otherwise to determine that it is missed.
Further, the hit determination module is specifically configured to determine that the similar search term is hit if the matching result for the similar search term is a match and the similar search term is judged by other means to be a bad search term, and to determine that the similar search term is missed if the matching result is a match and the similar search term is judged by other means not to be a bad search term.
An embodiment of the present invention provides an electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the steps of any of the above regular expression generation methods when executing the program stored in the memory.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the above regular expression generation methods.
An embodiment of the present invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to execute any of the above regular expression generation methods.
The regular expression generation method, system, and electronic device provided by the embodiments of the present invention obtain known bad search terms; obtain, based on a search-click bipartite graph, each search term that retrieves the same clicked files as the known bad search terms, as similar search terms; extract regex fragments from the similar search terms to obtain regular expressions, as candidate regular expressions; match each similar search term against the candidate regular expressions; and, based on how the candidate regular expressions hit the similar search terms that took part in matching, select a preset number of candidate regular expressions as the regular expressions used to filter search terms. In this way, the existing regular expressions can be updated continuously based on similar search terms.
Of course, a product or method implementing the present invention need not achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below.
Fig. 1 is a flowchart of a regular expression generation method provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a search-click bipartite graph provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a regular expression generation apparatus provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings.
In the scheme provided by the embodiments of the present invention, known bad search terms are obtained; based on a search-click bipartite graph, each search term that retrieves the same clicked files as a known bad search term is obtained as a similar search term; regex fragments are extracted from the similar search terms to obtain regular expressions, as candidate regular expressions; each similar search term is matched against the candidate regular expressions; and, based on how the candidate regular expressions hit the similar search terms that took part in matching, a preset number of candidate regular expressions are selected as the regular expressions used to filter search terms. The candidate regular expressions generated from similar search terms thus allow the existing regular expressions to be updated continuously.
An embodiment of the present invention provides a regular expression generation method which, as shown in Fig. 1, may include the following steps:
Step S101: obtain known bad search terms.
In this step, the existing set of regular expressions in the search engine's library and the set of search terms searched by users may be obtained. Each search term is matched against each regular expression in the existing set, and the search terms that match an existing regular expression are taken as known bad search terms.
Further, suppose an existing search term is "what to do when the windows license is nearly expired" and an existing regular expression is "_expired". The regex fragments extracted from this search term may include "windows", "license", "license_expired", and "windows_expired". When the above regular expression is matched against the search term, the search term contains the pattern "_expired" and therefore satisfies the pattern defined by the regular expression, so the search term and the regular expression are considered to match.
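Step S101 can be sketched with Python's standard `re` module. This is a minimal illustration only: the function name is hypothetical, and ordinary Python regex syntax stands in for the engine's own pattern notation (in which "_" acts as a wildcard).

```python
import re


def find_known_bad(search_terms, existing_patterns):
    """Return the search terms matched by at least one existing regex."""
    compiled = [re.compile(p) for p in existing_patterns]
    return [
        term
        for term in search_terms
        if any(c.search(term) for c in compiled)
    ]
```

For instance, with the pattern `"expired"`, the search term "what to do when the windows license is nearly expired" would be returned as a known bad search term, while an unrelated term would not.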
Step S102: based on the search-click bipartite graph, obtain each search term that retrieves the same clicked files as the known bad search terms, as similar search terms.
In this step, for each known bad search term, the connections between search terms and clicked files in the search-click bipartite graph are used to select each search term that is connected to the same clicked file as the known bad search term and has not been judged to be a bad search term, as a similar search term.
Specifically, the search-click bipartite graph represents the connections between the search terms that users enter in a search engine and the files that users select in the corresponding search results. In the graph, each search term may be connected to one or more clicked files, and likewise each clicked file may be connected to one or more search terms. For each known bad search term, one or more search terms connected to the same clicked file can therefore be found in the graph; the search terms among them that are bad search terms are removed, and the remaining search terms are taken as similar search terms.
Further, take Fig. 2 as an example, where query denotes a search term and doc denotes a clicked file. Suppose query5 is the only bad search term. From the connections in the figure, the only clicked file connected to query5 is doc3, and the search terms connected to doc3 are query1, query4, and query5. Thus query1 and query4 may be the similar search terms corresponding to query5.
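The lookup in the Fig. 2 example can be sketched as follows. The function name and the edge-list representation are hypothetical, and only the three edges named in the text are assumed; the remaining edges of Fig. 2 are not reproduced here.

```python
from collections import defaultdict


def similar_terms(edges, known_bad):
    """Find search terms sharing a clicked file with a known bad term.

    edges:     iterable of (search term, clicked file) pairs
    known_bad: set of known bad search terms
    """
    term_to_files = defaultdict(set)
    file_to_terms = defaultdict(set)
    for term, doc in edges:
        term_to_files[term].add(doc)
        file_to_terms[doc].add(term)

    result = set()
    for bad in known_bad:
        for doc in term_to_files.get(bad, ()):
            result |= file_to_terms[doc]
    # Terms already judged bad are removed from the result
    return result - set(known_bad)


# Only the edges stated in the text for Fig. 2
fig2_edges = [("query1", "doc3"), ("query4", "doc3"), ("query5", "doc3")]
print(similar_terms(fig2_edges, {"query5"}))
```

With these edges the function yields query1 and query4, in agreement with the example.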
Step S103: extract regex fragments from the similar search terms to obtain regular expressions, as candidate regular expressions.
In this step, for each similar search term, one or more regex fragments may be extracted from it according to candidate patterns. Each regex fragment yields one regular expression, so a set of candidate regular expressions is obtained from the set of similar search terms.
Further, suppose the similar search term is "the impact of biodiversity on environmental carrying capacity". To extract as many regex fragments as possible, regular patterns with preferential quantifiers may be specified as candidate patterns. Take the candidate patterns "*C1C2*" and "*C1_C2C3*" as examples, where C1, C2, and C3 denote different characters, C1C2 denotes a word made up of two characters, and "*" and "_" may stand for arbitrary content, which may be a single character, a word, or nothing at all. Regex fragments are extracted from the similar search term according to the candidate pattern: when the candidate pattern is "*C1C2*", the corresponding regex fragments may be "biology", "diversity_impact", and "environment_impact"; when the candidate pattern is "*C1_C2C3*", the corresponding regex fragments may be "diversity", "carrying capacity", and "on_impact". Each of these regex fragments yields one candidate regular expression.
It should be noted that, because the structure of similar search terms is generally simple, the candidate patterns may be simple regular patterns extracted from the search engine's regular pattern library according to the structure of the similar search terms, or simple regular patterns set according to the structure of the similar search terms.
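The conversion from a regex fragment to a candidate regular expression can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, it covers only the conversion step (not the segmentation that produces fragments), and it assumes "_" in a fragment means "any content, possibly empty", rendered here as `.*` in Python's `re` syntax.

```python
import re


def fragment_to_regex(fragment: str) -> "re.Pattern[str]":
    """Turn a fragment such as 'license_expired' into a compiled regex.

    Literal parts are escaped; each '_' becomes '.*' (any content,
    including nothing), following the wildcard meaning given in the text.
    """
    parts = [re.escape(part) for part in fragment.split("_")]
    return re.compile(".*".join(parts))
```

A fragment such as "license_expired" then matches any search term in which "license" is eventually followed by "expired", and does not match a term in which the two words appear in the opposite order.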
Step S104: match each similar search term against the candidate regular expressions.
In this step, each candidate regular expression is matched against each similar search term in turn. For each similar search term, one or more regex fragments are extracted from it; if the characters in its regex fragments satisfy the pattern defined by the candidate regular expression, the two match.
Step S105: for each similar search term that took part in matching, determine whether the similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term.
In this step, for each similar search term that took part in matching, its matching result against the candidate regular expression is obtained, together with the result of judging by other means whether it is a bad search term. From the matching result and the judgment result, it is determined whether the candidate regular expression hits the similar search term.
Specifically, the judgment by other means may be a manual judgment of the similar search term: a similar search term that retrieves bad information is judged to be a bad search term.
Step S106: based on how the candidate regular expressions hit the similar search terms that took part in matching, select a preset number of candidate regular expressions as the regular expressions used to filter search terms.
In this step, according to how the candidate regular expressions hit the similar search terms that took part in matching, a preset number of candidate regular expressions may be selected and added to the existing set of regular expressions for filtering search terms. The other candidate regular expressions may be added to a blacklist; a candidate regular expression on the blacklist will not be used as a candidate again for a certain period of time.
Specifically, to decide whether a candidate regular expression is selected as a regular expression for filtering search terms, the number of similar search terms hit by each candidate regular expression may be counted, and a preset number of candidate regular expressions selected according to the hit counts.
In the embodiments of the present invention, the hit rate of each candidate regular expression over the similar search terms may also be calculated, and a preset number of candidate regular expressions selected according to the hit rate. For each candidate regular expression, the hit rate may be expressed as the ratio of the number of similar search terms it hits to the total number of similar search terms. For example, suppose there are two candidate regular expressions, one hitting 50 similar search terms and the other hitting only 2. The candidate that hits 50 similar search terms may be selected as the regular expression for filtering search terms; alternatively, the hit rates of the two candidates may be calculated, and any candidate whose hit rate is greater than 70% selected as a regular expression for filtering search terms.
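Both selection strategies in step S106 can be sketched as follows. The function names are hypothetical, and the total of 60 similar search terms in the usage note is an assumed figure, since the text gives only the hit counts 50 and 2.

```python
def select_top_k(hit_counts, k):
    """Select the k candidates with the most hits.

    hit_counts: dict mapping candidate regex -> number of similar
                search terms it hits
    Returns (selected, blacklist); the blacklist holds the candidates
    that were not selected.
    """
    ranked = sorted(hit_counts, key=lambda r: hit_counts[r], reverse=True)
    return ranked[:k], set(ranked[k:])


def select_by_hit_rate(hit_counts, total_similar, min_rate):
    """Select candidates whose hit rate exceeds min_rate."""
    if total_similar <= 0:
        return []
    return [
        regex
        for regex, hits in hit_counts.items()
        if hits / total_similar > min_rate
    ]
```

With hit counts of 50 and 2, `select_top_k(counts, 1)` keeps the first candidate and blacklists the second. Assuming 60 similar search terms in total, `select_by_hit_rate(counts, 60, 0.7)` also keeps only the first, whose hit rate is about 83%.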
As can be seen from the above, these steps extract regex fragments from similar search terms to obtain candidate regular expressions, from which the existing regular expressions can be selectively updated.
In one embodiment of the above regular expression generation method, step S102 (obtaining, based on the search-click bipartite graph, each search term that retrieves the same clicked files as the known bad search terms, as similar search terms) may be carried out as follows:
for each known bad search term, obtaining from the search-click bipartite graph each search term that is connected to the same clicked file as the known bad search term and has not been judged to be a bad search term, as a search term to be screened;
for each search term to be screened, calculating the similarity between the search term to be screened and the known bad search term;
according to the magnitude of the similarity, selecting the search terms to be screened whose similarity is greater than the first preset threshold, as similar search terms.
Specifically, in the search click bipartite graph, each search term has a word weight, and each edge connecting a search term to a click file has an edge weight. The word weight may be the number of times users searched for the term, and the edge weight may be the number of times users clicked the click file when searching for the corresponding search term. For example, suppose a search term has only two click files, file one and file two, and statistics show that users searched for the term 200 times, clicked file one 150 times and clicked file two 80 times. Then, for this search term, the word weight may be 200, the weight of the edge connecting it to file one is 150, and the weight of the edge connecting it to file two is 80.
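One possible in-memory representation of such a graph is sketched below. The class and method names are hypothetical; only the figures 200, 150 and 80 come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class ClickGraph:
    # search term -> word weight (number of searches)
    word_weight: dict = field(default_factory=dict)
    # (search term, click file) -> edge weight (number of clicks)
    edge_weight: dict = field(default_factory=dict)

    def add_search_term(self, term, searches):
        self.word_weight[term] = searches

    def add_clicks(self, term, click_file, clicks):
        self.edge_weight[(term, click_file)] = clicks

    def click_files(self, term):
        """All click files connected to the given search term."""
        return {f for (t, f) in self.edge_weight if t == term}


graph = ClickGraph()
graph.add_search_term("some term", 200)
graph.add_clicks("some term", "file one", 150)
graph.add_clicks("some term", "file two", 80)
```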
According to the connections between search terms and click files in the search click bipartite graph, for each known bad search term, each search term that is connected to the same click file as the known bad search term and has not been determined to be a bad search term is obtained from the graph as a search term to be screened.
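A minimal sketch of this lookup, assuming the graph is held as a mapping from each search term to the set of click files it is connected to. The function name and the sample terms are hypothetical.

```python
def terms_to_screen(edges, known_bad):
    """For each known bad search term, collect the search terms that share
    at least one click file with it and are not themselves known bad.

    edges: dict mapping search term -> set of connected click files.
    """
    result = {}
    for bad in known_bad:
        bad_files = edges.get(bad, set())
        result[bad] = {
            term for term, files in edges.items()
            if term not in known_bad and files & bad_files
        }
    return result


# Hypothetical graph: bad1 shares d1 with q1; nothing shares d2 except bad terms.
edges = {
    "bad1": {"d1", "d2"},
    "bad2": {"d2"},
    "q1": {"d1"},
    "q2": {"d3"},
}
candidates = terms_to_screen(edges, known_bad={"bad1", "bad2"})
```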
According to the word weights and edge weights in the search click bipartite graph, for each search term to be screened, the degree of similarity between the search term to be screened and the known bad search terms is calculated as the similarity. In the search click bipartite graph, one or more search terms to be screened can be obtained from each known bad search term, and each search term to be screened may be obtained from one or more known bad search terms. Therefore, for the i-th search term to be screened, the similarity S_i between it and the known bad search terms can be expressed as:
where m denotes the number of known bad search terms connected to the same click files as the search term to be screened, j denotes the j-th of the m known bad search terms, p denotes the number of click files to which the search term to be screened and the j-th known bad search term are both connected, W_ik denotes the weight of the edge between the search term to be screened and the k-th click file, W_i denotes the sum of the weights of the edges between the search term to be screened and all of its click files, W_jk denotes the weight of the edge between the known bad search term and the k-th click file, and W_j denotes the sum of the weights of the edges between the known bad search term and all of its click files.
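The expression for the similarity is not reproduced in the text here, so the sketch below does not implement it. It only shows how the quantities defined above could be combined under one assumed form: for each known bad search term, the normalised edge weights W_ik / W_i and W_jk / W_j are multiplied and summed over the common click files, and the results are summed over the m known bad search terms. That form, and every name in the code, is an assumption made for illustration.

```python
def similarity(term_edges, bad_edges_list):
    """Assumed similarity between one search term to be screened and the
    known bad search terms that share click files with it.

    term_edges: dict click file -> edge weight for the term to be screened.
    bad_edges_list: one such dict per known bad search term.
    """
    w_i = sum(term_edges.values())          # W_i: all edges of the term
    total = 0.0
    for bad_edges in bad_edges_list:
        w_j = sum(bad_edges.values())       # W_j: all edges of the bad term
        if w_i == 0 or w_j == 0:
            continue
        common = term_edges.keys() & bad_edges.keys()   # the p common files
        total += sum((term_edges[k] / w_i) * (bad_edges[k] / w_j)
                     for k in common)
    return total


# Hypothetical weights: one common click file, d1.
s = similarity({"d1": 3, "d2": 1}, [{"d1": 2, "d3": 2}])
```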
For each search term to be screened, according to its similarity, the search terms to be screened whose similarity is greater than a preset value may be taken as similarity words, for example those whose similarity is greater than 70%. Alternatively, the search terms to be screened may be sorted by similarity in descending order and a predetermined number of them chosen as similarity words, for example the search terms to be screened with the 20 highest similarities.
As can be seen from the above, with this method the similarity words are obtained by screening according to the similarity between the known bad search terms and the search terms to be screened. This provides one way of determining the similarity words and, at the same time, effectively controls their number, which facilitates the subsequent determination of the hit situation.
In one embodiment of the above regular expression generation method, in step S102, obtaining, based on the search click bipartite graph, each search term that retrieves the same click files as a known bad search term, as a similarity word, may alternatively be performed as follows:
In the search click bipartite graph, for each known bad search term, judge the weight of the known bad search term, and select the known bad search terms whose weight is greater than a second preset threshold, as specific bad search terms;
Based on the search click bipartite graph, for each specific bad search term, obtain each search term that is connected to the same click file as the specific bad search term and has not been determined to be a bad search term, as a search term to be screened;
For each search term to be screened, obtain each click file to which the specific bad search term and the search term to be screened are both connected, judge the weight of the edge connecting each such click file to the search term to be screened, and select the search terms to be screened whose edge weight is greater than a third preset threshold, as similarity words.
Specifically, according to the word weights in the search click bipartite graph, the known bad search terms whose word weight is greater than a predetermined value may be selected preferentially as specific bad search terms, for example the known bad search terms whose word weight is greater than 2000.
For each specific bad search term, one or more search terms that are connected to the same click file as the specific bad search term and have not been determined to be bad search terms are obtained from the search click bipartite graph, as search terms to be screened.
For each search term to be screened, the click files to which the specific bad search term and the search term to be screened are both connected are obtained from the search click bipartite graph; for each such click file, the search terms to be screened whose edge to the click file has a weight greater than the threshold are selected preferentially as similarity words.
Further, taking Fig. 2 as an example, suppose the only known bad search terms in the figure are query3 and query4, and the number on each line connecting a query to a doc denotes the weight of the edge between the corresponding search term and click file. If the word weight of query3 is 1 and the word weight of query4 is 10, then according to the word weights query4 can be taken as the specific bad search term. As the figure shows, query4 is connected to the click files doc1, doc2 and doc3, and the search terms connected to the same click files as query4 are query1, query2, query3 and query5. Because query3 is a known bad search term, the search terms to be screened are only query1, query2 and query5. If a search term to be screened is taken as a similarity word when the weight of its edge to the click file is greater than 2, then, as the figure shows, the similarity words of query4 are query1, query2 and query5.
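The screening in this example can be sketched as follows. The word weights of query3 and query4 and the edge threshold of 2 come from the text; the individual edge weights, the word-weight threshold of 5 and the extra term query6 are hypothetical, because the edge weights are shown only in the figure.

```python
def similar_terms(word_weight, edges, known_bad, word_threshold, edge_threshold):
    """edges: dict search term -> {click file: edge weight}."""
    # Step 1: keep known bad terms whose word weight exceeds the threshold.
    specific_bad = [t for t in known_bad if word_weight[t] > word_threshold]
    similar = set()
    for bad in specific_bad:
        for term, files in edges.items():
            if term in known_bad:
                continue  # already determined to be bad
            # Step 2: click files shared with the specific bad search term.
            common = files.keys() & edges[bad].keys()
            # Step 3: keep the term if an edge to a shared file is heavy enough.
            if any(files[f] > edge_threshold for f in common):
                similar.add(term)
    return specific_bad, similar


word_weight = {"query3": 1, "query4": 10}
edges = {  # hypothetical edge weights
    "query1": {"doc1": 5},
    "query2": {"doc1": 3, "doc2": 4},
    "query3": {"doc2": 6},
    "query4": {"doc1": 7, "doc2": 3, "doc3": 8},
    "query5": {"doc3": 4},
    "query6": {"doc4": 9},  # shares no click file with query4
}
specific_bad, similar = similar_terms(
    word_weight, edges, known_bad={"query3", "query4"},
    word_threshold=5, edge_threshold=2,
)
```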
As can be seen from the above, this method determines the similarity words according to the edge weights and the weights of the bad search terms. It controls both the number of similarity words and their degree of similarity to the bad search terms more precisely, which facilitates the subsequent determination of the hit situation.
In one embodiment of the above regular expression generation method, before each similarity word is matched using the candidate regular expressions in step S104, the method may further include:
Based on the search click bipartite graph, calculating the undesirable level of each candidate regular expression, where the undesirable level indicates the relevance between the candidate regular expression and the known bad search terms;
Matching each similarity word using the candidate regular expressions then comprises:
Matching each similarity word using the candidate regular expressions whose undesirable level meets a preset condition.
Specifically, in the search click bipartite graph, one or more similarity words can be obtained for each known bad search term, one or more canonical segments can be generated from each similarity word, and each canonical segment corresponds to one candidate regular expression. Each candidate regular expression therefore has a certain relevance to the known bad search terms, and this relevance can be expressed by the undesirable level of the candidate regular expression.
According to the undesirable level of each candidate regular expression, the candidate regular expressions whose undesirable level meets the preset condition can be chosen to match each similarity word. The preset condition may be a preset undesirable level, for example candidate regular expressions whose undesirable level is greater than 10; alternatively, it may be a preset number of candidate regular expressions, for example the 10 candidate regular expressions with the highest undesirable levels.
As can be seen from the above, this method is mainly a process of screening the candidate regular expressions. Its main purpose is to select the candidate regular expressions that are strongly related to the known bad search terms, which facilitates the subsequent statistics on the hit situation of the candidate regular expressions.
In one embodiment of the above regular expression generation method, calculating the undesirable level of a candidate regular expression based on the search click bipartite graph may specifically be performed as follows:
For each candidate regular expression, the relevance between the candidate regular expression and the known bad search terms is calculated according to the degree of similarity between each similarity word from which the candidate regular expression was generated and the known bad search terms;
For the i-th candidate regular expression, the undesirable level Z_i is expressed as:
where n denotes the number of similarity words from which the candidate regular expression was generated, m denotes the number of known bad search terms connected to the same click files as those similarity words, S_ij denotes the degree of similarity between the i-th similarity word and the j-th known bad search term connected to the same click file, and C_j denotes the weight of the j-th known bad search term;
The degree of similarity S_ij between the i-th similarity word and the j-th known bad search term connected to the same click file is expressed as:
where p denotes the number of click files to which the similarity word and the known bad search term are both connected, W_ik denotes the weight of the edge between the similarity word and the k-th click file, W_i denotes the sum of the weights of the edges between the similarity word and all of its click files, W_jk denotes the weight of the edge between the known bad search term and the k-th click file, and W_j denotes the sum of the weights of the edges between the known bad search term and all of its click files.
Further, taking Fig. 2 as an example, suppose the known bad search term is query4 and the similarity words are query1, query2, query3 and query5. Given a candidate regular expression obtained by extracting canonical segments from query2 and query5, where the word weights of query2 and query5 are 3, the undesirable level Z of this candidate regular expression is calculated as follows:
As can be seen from the above, candidate regular expressions are chosen mainly according to their relevance to the known bad search terms: the greater the relevance, the better the candidate regular expression is likely to be at filtering undesirable information.
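The expression for the undesirable level is likewise not reproduced in the text, so the following is a sketch under an assumption rather than the calculation of the embodiment: each similarity degree S_ij is weighted by the weight C_j of the corresponding known bad search term, and the products are summed over the n similarity words and the m known bad search terms. The similarity degrees are taken as given inputs, and all numbers are hypothetical.

```python
def undesirable_level(similarity, bad_weight):
    """Assumed form: sum of S_ij * C_j over all (similarity word, bad term) pairs.

    similarity: dict (similarity word, known bad term) -> S_ij.
    bad_weight: dict known bad term -> C_j.
    """
    return sum(s_ij * bad_weight[bad] for (_, bad), s_ij in similarity.items())


# Hypothetical inputs for a candidate generated from query2 and query5.
similarity = {("query2", "query4"): 0.5, ("query5", "query4"): 0.25}
bad_weight = {"query4": 10}
z = undesirable_level(similarity, bad_weight)
```

Under this assumed form, a candidate generated from similarity words that closely resemble heavily searched known bad search terms receives a higher value, which is consistent with the selection rule described above.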
In one embodiment of the above regular expression generation method, in step S105, for each similarity word that participates in matching, determining whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term, may in one implementation include:
If the similarity word matches the candidate regular expression and has been determined by other means to be a bad search term, it is determined that the similarity word is hit; otherwise, it is determined that the similarity word is missed.
Alternatively, if the similarity word does not match the candidate regular expression and has been determined by other means to be a non-bad search term, it is determined that the similarity word is hit; otherwise, it is determined that the similarity word is missed.
Specifically, when each similarity word is matched using a candidate regular expression, four situations can arise for a similarity word: it matches and has been determined by other means to be a bad search term; it does not match and has been determined by other means to be a non-bad search term; it matches but has been determined by other means to be a non-bad search term; or it does not match but has been determined by other means to be a bad search term. For each candidate regular expression, in either of the first two situations the candidate regular expression is considered to hit the corresponding similarity word, and in either of the last two situations it is considered to miss the corresponding similarity word.
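The four situations reduce to a single comparison: a similarity word is hit exactly when the matching result agrees with the judgment made by other means. A minimal sketch follows; the pattern and the sample terms are hypothetical, and the use of `re.search` (a match anywhere in the term) is an assumption, since the description does not say how matching is performed.

```python
import re


def is_hit(pattern, term, judged_bad):
    """True when the match result agrees with the independent judgment."""
    matched = re.search(pattern, term) is not None
    return matched == judged_bad


pattern = r"free.*download"  # hypothetical candidate regular expression
results = {
    "matched, judged bad": is_hit(pattern, "free movie download", True),
    "unmatched, judged non-bad": is_hit(pattern, "weather today", False),
    "matched, judged non-bad": is_hit(pattern, "free software download", False),
    "unmatched, judged bad": is_hit(pattern, "pirated film", True),
}
```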
As can be seen from the above, when this method is used to determine whether a candidate regular expression hits each similarity word that participates in matching, the four situations cover every case that can arise, so the hit situation is determined more accurately.
In one embodiment of the above regular expression generation method, in step S105, for each similarity word that participates in matching, determining whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term, may in another implementation include:
If the similarity word matches the candidate regular expression and has been determined by other means to be a bad search term, it is determined that the similarity word is hit; if the similarity word matches the candidate regular expression and has been determined by other means to be a non-bad search term, it is determined that the similarity word is missed.
In this second implementation, when each similarity word is matched using a candidate regular expression, whether a similarity word is hit is determined only when the candidate regular expression and the similarity word match. In this case only one kind of situation is considered: for each similarity word that matches the candidate regular expression, if the similarity word has been determined by other means to be a bad search term, it is determined that the similarity word is hit; conversely, if it has been determined by other means to be a non-bad search term, it is determined that the similarity word is missed.
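A minimal sketch of this second implementation, in which similarity words that do not match the candidate regular expression are simply left out of the count. The pattern, the sample terms and the use of `re.search` are assumptions for illustration.

```python
import re


def count_hits_matched_only(pattern, judged):
    """judged: dict similarity word -> True if judged bad by other means.

    Returns (hits, misses) over the similarity words that match the pattern.
    """
    hits = misses = 0
    for term, is_bad in judged.items():
        if re.search(pattern, term) is None:
            continue  # unmatched similarity words are not counted
        if is_bad:
            hits += 1
        else:
            misses += 1
    return hits, misses


judged = {  # hypothetical similarity words and judgments
    "free movie download": True,
    "free software download": False,
    "weather today": False,
    "pirated film": True,
}
hits, misses = count_hits_matched_only(r"free.*download", judged)
```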
As can be seen from the above, when this method is used to determine whether a candidate regular expression hits each similarity word that participates in matching, only the similarity words that match the candidate regular expression are considered for each candidate regular expression. This way of determining the hit situation is simpler and more convenient in practice.
In conclusion the regular expression generation method provided according to embodiments of the present invention, can according to similarity word,
Regular expression is generated, the continuous renewal to existing regular expression may be implemented.
Based on the same inventive concept as the regular expression generation method provided by the embodiments of the present invention, an embodiment of the present invention further provides a regular expression generation device which, as shown in Fig. 3, comprises the following modules:
A bad search term obtaining module 201, configured to obtain known bad search terms;
A similarity word obtaining module 202, configured to obtain, based on the search click bipartite graph, each search term that retrieves the same click files as a known bad search term, as a similarity word, where the search click bipartite graph represents the connections between search terms and the click files that users clicked in the corresponding search results;
A regular expression generation module 203, configured to extract canonical segments from the similarity words to obtain regular expressions, as candidate regular expressions;
A matching module 204, configured to match each similarity word using the candidate regular expressions;
A hit situation determining module 205, configured to determine, for each similarity word that participates in matching, whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term;
A regular expression selection module 206, configured to select, based on the hit situation of the candidate regular expressions for each similarity word that participates in matching, a preset number of candidate regular expressions as regular expressions for filtering search terms.
Further, the similarity word obtaining module 202 may include:
A to-be-screened search term obtaining submodule, configured to obtain, for each known bad search term, from the search click bipartite graph, each search term that is connected to the same click file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
A similarity calculation submodule, configured to calculate, for each search term to be screened, the similarity between the search term to be screened and the known bad search terms;
A similarity word selection submodule, configured to select, according to the similarity, the search terms to be screened whose similarity is greater than a first preset threshold, as similarity words.
Further, the similarity word obtaining module 202 may alternatively include:
A specific bad search term obtaining submodule, configured to judge, in the search click bipartite graph, the weight of each known bad search term, and to select the known bad search terms whose weight is greater than a second preset threshold, as specific bad search terms;
A to-be-screened search term obtaining submodule, configured to obtain, based on the search click bipartite graph, for each specific bad search term, each search term that is connected to the same click file as the specific bad search term and has not been determined to be a bad search term, as a search term to be screened;
A similarity word selection submodule, configured to obtain, for each search term to be screened, each click file to which the specific bad search term and the search term to be screened are both connected, to judge the weight of the edge connecting each such click file to the search term to be screened, and to select the search terms to be screened whose edge weight is greater than a third preset threshold, as similarity words.
Further, the above device may also include:
An undesirable level calculation module, configured to calculate, based on the search click bipartite graph, the undesirable level of each candidate regular expression, where the undesirable level indicates the relevance between the candidate regular expression and the known bad search terms;
The matching module 204 is then specifically configured to match each similarity word using the candidate regular expressions whose undesirable level meets a preset condition.
Further, the undesirable level calculation module is specifically configured to calculate, for each candidate regular expression, the relevance between the candidate regular expression and the known bad search terms according to the degree of similarity between each similarity word from which the candidate regular expression was generated and the known bad search terms;
For the i-th candidate regular expression, the undesirable level Z_i is expressed as:
where n denotes the number of similarity words from which the candidate regular expression was generated, m denotes the number of known bad search terms connected to the same click files as those similarity words, S_ij denotes the degree of similarity between the i-th similarity word and the j-th known bad search term connected to the same click file, and C_j denotes the weight of the j-th known bad search term;
The degree of similarity S_ij between the i-th similarity word and the j-th known bad search term connected to the same click file is expressed as:
where p denotes the number of click files to which the similarity word and the known bad search term are both connected, W_ik denotes the weight of the edge between the similarity word and the k-th click file, W_i denotes the sum of the weights of the edges between the similarity word and all of its click files, W_jk denotes the weight of the edge between the known bad search term and the k-th click file, and W_j denotes the sum of the weights of the edges between the known bad search term and all of its click files.
Further, the hit situation determining module 205 may include:
A first hit situation determining submodule, configured to determine that the similarity word is hit if the similarity word matches the candidate regular expression and has been determined by other means to be a bad search term, and otherwise to determine that the similarity word is missed;
A second hit situation determining submodule, configured to determine that the similarity word is hit if the similarity word does not match the candidate regular expression and has been determined by other means to be a non-bad search term, and otherwise to determine that the similarity word is missed.
Further, the hit situation determining module 205 is specifically configured to determine that the similarity word is hit if the similarity word matches the candidate regular expression and has been determined by other means to be a bad search term, and to determine that the similarity word is missed if the similarity word matches the candidate regular expression and has been determined by other means to be a non-bad search term.
An embodiment of the present invention further provides an electronic device which, as shown in Fig. 4, includes a processor 401, a communication interface 402, a memory 403 and a communication bus 404, where the processor 401, the communication interface 402 and the memory 403 communicate with one another through the communication bus 404;
The memory 403 is configured to store a computer program;
The processor 401 is configured to implement the following steps when executing the program stored in the memory 403:
Obtaining known bad search terms;
Based on the search click bipartite graph, obtaining each search term that retrieves the same click files as a known bad search term, as a similarity word, where the search click bipartite graph represents the connections between search terms and the click files that users clicked in the corresponding search results;
Extracting canonical segments from the similarity words to obtain regular expressions, as candidate regular expressions;
Matching each similarity word using the candidate regular expressions;
For each similarity word that participates in matching, determining whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term;
Based on the hit situation of the candidate regular expressions for each similarity word that participates in matching, selecting a preset number of candidate regular expressions as regular expressions for filtering search terms.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, it is shown as a single thick line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include random access memory (RAM), and may also include non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment provided by the present invention, a computer-readable storage medium is further provided. A computer program is stored in the computer-readable storage medium, and when executed by a processor the computer program implements the steps of any of the above regular expression generation methods.
In another embodiment provided by the present invention, a computer program product comprising instructions is further provided which, when run on a computer, causes the computer to execute any of the regular expression generation methods in the above embodiments.
The above embodiments may be implemented wholly or partly in software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium (for example, a solid state disk (SSD)).
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a related manner; for the parts that are the same or similar across embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, because the embodiments of the device, the electronic device and the like are substantially similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the corresponding parts of the description of the method embodiments.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.
Claims (15)
1. A regular expression generation method, characterized by comprising:
Obtaining known bad search terms;
Based on a search click bipartite graph, obtaining each search term that retrieves the same click files as the known bad search term, as a similarity word, wherein the search click bipartite graph represents the connections between search terms and the click files that users clicked in the corresponding search results;
Extracting canonical segments from the similarity words to obtain regular expressions, as candidate regular expressions;
Matching each similarity word using the candidate regular expressions;
For each similarity word that participates in matching, determining whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term;
Based on the hit situation of the candidate regular expressions for each similarity word that participates in matching, selecting a preset number of candidate regular expressions as regular expressions for filtering search terms.
2. The method according to claim 1, characterized in that obtaining, based on the search-click bipartite graph, each search term whose search results include the same clicked file as a known bad search term, as a similar search term, comprises:
for each known bad search term, obtaining, in the search-click bipartite graph, each search term that is connected to the same clicked file as the known bad search term and has not been judged to be a bad search term, as a search term to be screened;
for each search term to be screened, calculating the similarity between the search term to be screened and the known bad search term;
according to the similarity values, selecting the search terms to be screened whose similarity is greater than a first preset threshold as similar search terms.
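The screening in claim 2 can be sketched as follows. The claim does not say how similarity is computed, so this sketch assumes Jaccard similarity over the sets of clicked files; `jaccard`, `select_similar` and the sample data are hypothetical names introduced only for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_similar(clicks, bad_term, judged_bad, threshold):
    """Keep co-clicked, not-yet-bad terms whose similarity exceeds the threshold."""
    bad_files = clicks[bad_term]
    selected = []
    for term, files in clicks.items():
        if term == bad_term or term in judged_bad:
            continue  # only terms not already judged bad are screened
        if not (files & bad_files):
            continue  # not connected to any common clicked file
        if jaccard(files, bad_files) > threshold:
            selected.append(term)
    return sorted(selected)

# Hypothetical click data: search term -> clicked files.
clicks = {
    "bad": {"d1", "d2"},
    "q1": {"d1", "d2", "d3"},        # similarity 2/3
    "q2": {"d2", "d4", "d5", "d6"},  # similarity 1/5
    "q3": {"d9"},                    # no common clicked file
}
print(select_similar(clicks, "bad", judged_bad=set(), threshold=0.5))
```

With a first preset threshold of 0.5, only `q1` survives the screening.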
3. The method according to claim 1, characterized in that obtaining, based on the search-click bipartite graph, each search term whose search results include the same clicked file as a known bad search term, as a similar search term, comprises:
in the search-click bipartite graph, for each known bad search term, evaluating the weight of the known bad search term, and selecting the known bad search terms whose weight is greater than a second preset threshold as specific bad search terms;
based on the search-click bipartite graph, for each specific bad search term, obtaining each search term that is connected to the same clicked file as the specific bad search term and has not been judged to be a bad search term, as a search term to be screened;
for each search term to be screened, obtaining the clicked files to which the specific bad search term and the search term to be screened are both connected, evaluating the weight of each edge connecting those clicked files to the search term to be screened, and selecting the search terms to be screened whose edge weight is greater than a third preset threshold as similar search terms.
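The two weight thresholds of claim 3 can be sketched in Python. The sketch assumes edge weights are stored as `edges[term][file]`, that term weights are given in a separate dictionary, and that a term qualifies when any one of its edges to a common clicked file exceeds the third threshold; the claim does not state whether one or all edges must exceed it, so that choice, like all names here, is an assumption.

```python
def select_by_weight(edges, term_weight, known_bad, judged_bad, t2, t3):
    """Pick similar terms via heavy known-bad terms (t2) and heavy shared edges (t3)."""
    # Step 1: keep only known bad terms whose weight exceeds the second threshold.
    specific_bad = [b for b in known_bad if term_weight.get(b, 0) > t2]

    similar = set()
    for bad in specific_bad:
        bad_files = set(edges.get(bad, {}))
        for term, file_weights in edges.items():
            if term in known_bad or term in judged_bad:
                continue  # only terms not already judged bad are screened
            # Step 2: clicked files connected to both terms.
            common = bad_files & set(file_weights)
            # Step 3: assumed rule, any shared edge heavier than the third threshold.
            if any(file_weights[f] > t3 for f in common):
                similar.add(term)
    return similar

# Hypothetical weighted graph: term -> {clicked file: edge weight}.
edges = {
    "bad1": {"d1": 5, "d2": 1},
    "bad2": {"d3": 4},
    "q1": {"d1": 7},
    "q2": {"d2": 1},
    "q3": {"d3": 9},
}
term_weight = {"bad1": 10, "bad2": 2}
print(select_by_weight(edges, term_weight, {"bad1", "bad2"}, set(), t2=5, t3=3))
```

In this data `bad2` is too light to count as a specific bad search term, so `q3` is never screened, and `q2` is dropped because its shared edge weight does not exceed the third threshold.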
4. The method according to claim 1, characterized in that, before matching each similar search term against the candidate regular expressions, the method further comprises:
based on the search-click bipartite graph, calculating an undesirable level of each candidate regular expression, the undesirable level representing the relevance between the candidate regular expression and the known bad search terms;
and matching each similar search term against the candidate regular expressions comprises:
matching each similar search term against the candidate regular expressions whose undesirable level satisfies a preset condition.
5. The method according to claim 4, characterized in that calculating, based on the search-click bipartite graph, the undesirable level of each candidate regular expression comprises:
for each candidate regular expression, calculating the relevance between the candidate regular expression and the known bad search terms according to the similarity between the known bad search terms and each similar search term from which the candidate regular expression is generated;
for the i-th candidate regular expression, the undesirable level Z_i is expressed as:
[formula not reproduced in the source text]
where n denotes the number of similar search terms from which the candidate regular expression can be generated, m denotes the number of known bad search terms connected to the same clicked file as the similar search terms from which the candidate regular expression is generated, S_ij denotes the similarity between the i-th similar search term and the j-th known bad search term connected to the same clicked file, and C_j denotes the weight of the j-th known bad search term;
wherein the similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same clicked file is expressed as:
[formula not reproduced in the source text]
where p denotes the number of clicked files to which the similar search term and the known bad search term are both connected, W_ik denotes the weight of the edge connecting the similar search term to the k-th clicked file, W_i denotes the sum of the weights of the edges connecting the similar search term to all of its clicked files, W_jk denotes the weight of the edge connecting the known bad search term to the k-th clicked file, and W_j denotes the weight of the edges connecting the known bad search term to all of its clicked files.
6. The method according to claim 1, characterized in that determining, for each similar search term participating in the matching, whether the similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term, comprises:
if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, determining that the similar search term is hit, and otherwise determining that the similar search term is missed;
or, if the matching result for the similar search term is no match and the similar search term has been judged by other means to be a non-bad search term, determining that the similar search term is hit, and otherwise determining that the similar search term is missed.
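The hit rule of claim 6, combined with the final selection step of claim 1, can be sketched as follows. The sketch reads the two alternatives together, so a hit means the regular expression and the independent judgment agree, and it assumes candidates are ranked by their number of hits; the patent only says the selection is based on the hit situation, so the ranking criterion and all names are assumptions.

```python
import re

def is_hit(matched, judged_bad):
    """Hit when the regex verdict agrees with the independent bad/non-bad judgment."""
    return matched == judged_bad

def rank_candidates(patterns, labelled_terms, top_k):
    """Return the top_k candidate patterns by hit count (assumed ranking criterion)."""
    scores = []
    for p in patterns:
        rx = re.compile(p)
        hits = sum(
            is_hit(rx.search(term) is not None, bad)
            for term, bad in labelled_terms.items()
        )
        scores.append((hits, p))
    # Stable sort: ties keep the original candidate order.
    scores.sort(key=lambda s: -s[0])
    return [p for _, p in scores[:top_k]]

# Hypothetical similar search terms with the judgment obtained by other means.
labelled = {"free m0vie": True, "free movie": True, "movie times": False}
candidates = [r"m.vie", r"free m.vie", r"times"]
print(rank_candidates(candidates, labelled, top_k=2))
```

The pattern `free m.vie` agrees with the judgment on all three terms, `m.vie` on two (it wrongly matches "movie times"), and `times` on none.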
7. The method according to claim 1, characterized in that determining, for each similar search term participating in the matching, whether the similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term, comprises:
if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, determining that the similar search term is hit; if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a non-bad search term, determining that the similar search term is missed.
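Claim 7 only assigns an outcome to terms that the regular expression matches. A minimal sketch of that variant is given below; treating unmatched terms as "not counted" and summarising a candidate by the fraction of matched terms that are hits are both assumptions of this sketch, as are the names `hit_status` and `precision`.

```python
import re
from typing import Dict, Optional

def hit_status(matched: bool, judged_bad: bool) -> Optional[bool]:
    """True = hit, False = miss, None = unmatched term, left uncounted (assumption)."""
    if not matched:
        return None
    return judged_bad

def precision(pattern: str, labelled_terms: Dict[str, bool]) -> float:
    """Fraction of matched terms that are hits; 0.0 when nothing matches."""
    rx = re.compile(pattern)
    outcomes = [
        hit_status(rx.search(term) is not None, bad)
        for term, bad in labelled_terms.items()
    ]
    counted = [o for o in outcomes if o is not None]
    return sum(counted) / len(counted) if counted else 0.0

# Hypothetical similar search terms with the judgment obtained by other means.
labelled = {"free m0vie": True, "free movie": True, "movie times": False}
print(precision(r"free", labelled))
```

Under this reading a candidate is penalised only for matching terms judged non-bad, not for failing to match bad ones.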
8. A regular expression generation device, characterized by comprising:
a bad search term obtaining module, configured to obtain known bad search terms;
a similar search term obtaining module, configured to obtain, based on a search-click bipartite graph, each search term whose search results include the same clicked file as a known bad search term, as a similar search term, wherein the search-click bipartite graph represents the connection relationships between search terms and the clicked files that users clicked in the corresponding search results;
a regular expression generation module, configured to extract regular-expression fragments from the similar search terms to obtain regular expressions, as candidate regular expressions;
a matching module, configured to match each similar search term against the candidate regular expressions;
a hit situation determining module, configured to determine, for each similar search term participating in the matching, whether the similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term;
a regular expression selection module, configured to select, based on how the candidate regular expressions hit the similar search terms participating in the matching, a preset number of candidate regular expressions as the regular expressions used for filtering search terms.
9. The device according to claim 8, characterized in that the similar search term obtaining module comprises:
a to-be-screened search term obtaining submodule, configured to obtain, for each known bad search term, in the search-click bipartite graph, each search term that is connected to the same clicked file as the known bad search term and has not been judged to be a bad search term, as a search term to be screened;
a similarity calculation submodule, configured to calculate, for each search term to be screened, the similarity between the search term to be screened and the known bad search term;
a similar search term selection submodule, configured to select, according to the similarity values, the search terms to be screened whose similarity is greater than a first preset threshold as similar search terms.
10. The device according to claim 8, characterized in that the similar search term obtaining module comprises:
a specific bad search term obtaining submodule, configured to evaluate, in the search-click bipartite graph, the weight of each known bad search term, and to select the known bad search terms whose weight is greater than a second preset threshold as specific bad search terms;
a to-be-screened search term obtaining submodule, configured to obtain, based on the search-click bipartite graph, for each specific bad search term, each search term that is connected to the same clicked file as the specific bad search term and has not been judged to be a bad search term, as a search term to be screened;
a similar search term selection submodule, configured to obtain, for each search term to be screened, the clicked files to which the specific bad search term and the search term to be screened are both connected, to evaluate the weight of each edge connecting those clicked files to the search term to be screened, and to select the search terms to be screened whose edge weight is greater than a third preset threshold as similar search terms.
11. The device according to claim 8, characterized by further comprising:
an undesirable level calculation module, configured to calculate, based on the search-click bipartite graph, an undesirable level of each candidate regular expression, the undesirable level representing the relevance between the candidate regular expression and the known bad search terms;
wherein the matching module is specifically configured to match each similar search term against the candidate regular expressions whose undesirable level satisfies a preset condition.
12. The device according to claim 11, characterized in that the undesirable level calculation module is specifically configured to calculate, for each candidate regular expression, the relevance between the candidate regular expression and the known bad search terms according to the similarity between the known bad search terms and each similar search term from which the candidate regular expression is generated;
for the i-th candidate regular expression, the undesirable level Z_i is expressed as:
[formula not reproduced in the source text]
where n denotes the number of similar search terms from which the candidate regular expression can be generated, m denotes the number of known bad search terms connected to the same clicked file as the similar search terms from which the candidate regular expression is generated, S_ij denotes the similarity between the i-th similar search term and the j-th known bad search term connected to the same clicked file, and C_j denotes the weight of the j-th known bad search term;
wherein the similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same clicked file is expressed as:
[formula not reproduced in the source text]
where p denotes the number of clicked files to which the similar search term and the known bad search term are both connected, W_ik denotes the weight of the edge connecting the similar search term to the k-th clicked file, W_i denotes the sum of the weights of the edges connecting the similar search term to all of its clicked files, W_jk denotes the weight of the edge connecting the known bad search term to the k-th clicked file, and W_j denotes the weight of the edges connecting the known bad search term to all of its clicked files.
13. The device according to claim 8, characterized in that the hit situation determining module comprises:
a first hit situation determining submodule, configured to determine that the similar search term is hit if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, and otherwise to determine that the similar search term is missed;
a second hit situation determining submodule, configured to determine that the similar search term is hit if the matching result for the similar search term is no match and the similar search term has been judged by other means to be a non-bad search term, and otherwise to determine that the similar search term is missed.
14. The device according to claim 8, characterized in that the hit situation determining module is specifically configured to: determine that the similar search term is hit if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term; and determine that the similar search term is missed if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a non-bad search term.
15. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 1-6 when executing the program stored in the memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810695221.3A CN109190014B (en) | 2018-06-29 | 2018-06-29 | Regular expression generation method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109190014A true CN109190014A (en) | 2019-01-11 |
CN109190014B CN109190014B (en) | 2021-11-26 |
Family
ID=64948682
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810695221.3A Active CN109190014B (en) | 2018-06-29 | 2018-06-29 | Regular expression generation method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109190014B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101847242A (en) * | 2010-05-27 | 2010-09-29 | 武汉大学 | Method and system for automatically acquiring aliases of contraband on line |
US20140136517A1 (en) * | 2012-11-10 | 2014-05-15 | Chian Chiu Li | Apparatus And Methods for Providing Search Results |
CN104809108A (en) * | 2015-05-20 | 2015-07-29 | 成都布林特信息技术有限公司 | Information monitoring and analyzing system |
CN106919603A (en) * | 2015-12-25 | 2017-07-04 | 北京奇虎科技有限公司 | The method and apparatus for calculating participle weight in query word pattern |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110083758A (en) * | 2019-04-30 | 2019-08-02 | 闻康集团股份有限公司 | A kind of medical treatment search engine data platform system |
CN111292205A (en) * | 2019-12-17 | 2020-06-16 | 东方微银科技(北京)有限公司 | Judicial data analysis method, device, equipment and storage medium |
CN111292205B (en) * | 2019-12-17 | 2021-05-25 | 东方微银科技股份有限公司 | Judicial data analysis method, device, equipment and storage medium |
CN113343715A (en) * | 2021-06-29 | 2021-09-03 | 深圳前海微众银行股份有限公司 | Method, device and equipment for automatically generating regular expression and storage medium |
CN113656538A (en) * | 2021-07-09 | 2021-11-16 | 深圳价值在线信息科技股份有限公司 | Method and device for generating regular expression, computing equipment and storage medium |
CN113656659A (en) * | 2021-08-31 | 2021-11-16 | 上海观安信息技术股份有限公司 | Data extraction method, device and system and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109190014B (en) | 2021-11-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||