CN109190014A - Regular expression generation method, apparatus, and electronic device - Google Patents

Info

Publication number
CN109190014A
Authority
CN
China
Prior art keywords
search term
similarity
word
regular expression
bad
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810695221.3A
Other languages
Chinese (zh)
Other versions
CN109190014B (en)
Inventor
黄腾玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810695221.3A priority Critical patent/CN109190014B/en
Publication of CN109190014A publication Critical patent/CN109190014A/en
Application granted granted Critical
Publication of CN109190014B publication Critical patent/CN109190014B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a regular expression generation method, apparatus, and electronic device. The scheme includes: obtaining known bad search terms; based on a search-click bipartite graph, obtaining each search term that retrieves the same clicked files as a known bad search term, as similar search terms; extracting regular-expression fragments from the similar search terms to obtain regular expressions, as candidate regular expressions; matching each similar search term against the candidate regular expressions; and, based on how the candidate regular expressions hit the similar search terms that took part in matching, selecting a preset number of candidate regular expressions as the regular expressions used to filter search terms. With this scheme, the candidate regular expressions generated from similar search terms allow the existing regular expressions to be updated continuously.

Description

Regular expression generation method, apparatus, and electronic device
Technical field
The present invention relates to the field of network information retrieval, and in particular to a regular expression generation method, apparatus, and electronic device.
Background
As Internet resources grow ever richer, search engines proactively recommend search terms to users during retrieval in order to offer them more information. These may be query suggestions shown while the user searches, default search terms, and so on. For example, when "quotation marks" is entered into a search engine, related recommendations such as "function of quotation marks" and "usage of quotation marks" appear by default. Search terms of this kind can be called recommended search terms. They make input easier and surface content the user may be interested in. However, as the network becomes ubiquitous and its user base grows broader, some recommended search terms are still unsuitable for recommendation. For example, search terms involving pornography or violence can harm minors. Search terms that can harm users in this way can be called bad search terms. By filtering recommended search terms, a search engine can shield users from harmful information.
However, because the network changes quickly, variants of a bad search term tend to appear after it has been filtered, and these variants can retrieve the same harmful information. The search engine therefore has to keep filtering the bad search terms and their variants.
A common approach at present is the regular-expression method: a set of regular expressions is maintained manually, any search term matched by one of them is deemed a bad search term, and that search term is filtered out so that harmful information is shielded.
In the prior art, maintaining the regular expressions depends mainly on discovered bad cases: regular-expression fragments are extracted from a discovered bad case to obtain a regular expression, which is then used to update the existing regular expressions. However, when bad cases appear and how many of them appear follow no discernible pattern, so this kind of maintenance cannot update the existing regular expressions continuously.
Summary of the invention
The purpose of the embodiments of the invention is to provide a regular expression generation method, apparatus, and electronic device for continuously updating the existing regular expressions based on similar search terms. The specific technical solution is as follows:
An embodiment of the invention provides a regular expression generation method, the method comprising:
obtaining known bad search terms;
based on a search-click bipartite graph, obtaining each search term that retrieves the same clicked files as the known bad search term, as similar search terms, wherein the search-click bipartite graph represents the connections between search terms and the files that users clicked in the corresponding search results;
extracting regular-expression fragments from the similar search terms to obtain regular expressions, as candidate regular expressions;
matching each similar search term against the candidate regular expressions;
for each similar search term that took part in matching, determining whether that similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term;
based on how the candidate regular expressions hit the similar search terms that took part in matching, selecting a preset number of candidate regular expressions as the regular expressions used to filter search terms.
Further, obtaining, based on the search-click bipartite graph, each search term that retrieves the same clicked files as the known bad search term, as similar search terms, comprises:
for each known bad search term, obtaining from the search-click bipartite graph each search term that is connected to the same clicked file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
for each search term to be screened, calculating the similarity between that search term and the known bad search term;
according to the similarity, selecting the search terms to be screened whose similarity is greater than a first preset threshold, as similar search terms.
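The screening step above can be sketched as follows. The text does not say how the similarity is computed at this point, so this sketch assumes a Jaccard overlap of clicked files; the function names, the sample data, and the threshold value are all hypothetical.

```python
def jaccard(docs_a, docs_b):
    """Share of clicked files the two terms have in common (0.0 to 1.0).

    Assumption: Jaccard overlap stands in for the unspecified similarity.
    """
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def screen_by_similarity(bad_docs, to_screen, first_threshold):
    """Keep the terms to be screened whose similarity exceeds the threshold."""
    return {
        term
        for term, docs in to_screen.items()
        if jaccard(bad_docs, docs) > first_threshold
    }


# Hypothetical data: clicked files of one known bad term and two candidates.
bad_docs = {"doc1", "doc2"}
to_screen = {
    "termA": {"doc1", "doc2", "doc3"},          # 2 shared of 3 files in total
    "termB": {"doc1", "doc4", "doc5", "doc6"},  # 1 shared of 5 files in total
}
similar = screen_by_similarity(bad_docs, to_screen, first_threshold=0.5)
```

With these sample values only `termA` passes the threshold; any other similarity measure could be swapped in for `jaccard` without changing the screening logic.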
Further, obtaining, based on the search-click bipartite graph, each search term that retrieves the same clicked files as the known bad search term, as similar search terms, comprises:
in the search-click bipartite graph, for each known bad search term, evaluating the weight of the known bad search term, and selecting the known bad search terms whose weight is greater than a second preset threshold, as specific bad search terms;
based on the search-click bipartite graph, for each specific bad search term, obtaining each search term that is connected to the same clicked file as the specific bad search term and has not been determined to be a bad search term, as a search term to be screened;
for each search term to be screened, obtaining each clicked file connected to both the specific bad search term and that search term, evaluating the weight of the edge between each such clicked file and the search term to be screened, and selecting the search terms to be screened whose edge weight is greater than a third preset threshold, as similar search terms.
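A minimal sketch of this weight-based variant follows. It assumes the graph is stored as two plain dictionaries (term weights and edge weights), and it assumes a term qualifies when at least one shared edge exceeds the third threshold, since the text leaves open whether one or all shared edges must do so. All names and numbers are hypothetical.

```python
def similar_by_weights(word_weight, edge_weight, known_bad, second_threshold, third_threshold):
    """Select similar terms using term weights and edge weights."""
    # Step 1: keep only the heavily searched known bad terms.
    specific = {b for b in known_bad if word_weight.get(b, 0) > second_threshold}

    # Index the edges as term -> {clicked file: edge weight}.
    docs_of = {}
    for (term, doc), weight in edge_weight.items():
        docs_of.setdefault(term, {})[doc] = weight

    similar = set()
    for bad in specific:
        bad_docs = docs_of.get(bad, {})
        for term, docs in docs_of.items():
            if term in known_bad:
                continue  # already determined to be bad
            # Step 2: clicked files connected to both terms.
            shared = bad_docs.keys() & docs.keys()
            # Step 3 (assumption): one sufficiently heavy shared edge is enough.
            if any(docs[d] > third_threshold for d in shared):
                similar.add(term)
    return similar


# Hypothetical weights.
word_weight = {"bad1": 500, "bad2": 10}
edge_weight = {
    ("bad1", "d1"): 300, ("q1", "d1"): 120, ("q2", "d1"): 5,
    ("bad2", "d2"): 8, ("q3", "d2"): 90,
}
result = similar_by_weights(word_weight, edge_weight, {"bad1", "bad2"}, 100, 50)
```

Here `q1` is kept, `q2` is dropped for its light edge, and `q3` is dropped because its only neighbor `bad2` is not a specific bad search term.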
Further, before matching each similar search term against the candidate regular expressions, the method further comprises:
based on the search-click bipartite graph, calculating a badness level of each candidate regular expression, the badness level representing the association between the candidate regular expression and the known bad search terms;
matching each similar search term against the candidate regular expressions then comprises:
matching each similar search term against the candidate regular expressions whose badness level satisfies a preset condition.
Further, calculating the badness level of the candidate regular expression based on the search-click bipartite graph comprises:
for each candidate regular expression, calculating the association between the candidate regular expression and the known bad search terms according to the similarity between each similar search term from which the candidate regular expression was generated and the known bad search terms;
for the i-th candidate regular expression, the badness level Z_i is expressed in terms of the following quantities:
n is the number of similar search terms from which the candidate regular expression can be generated; m is the number of known bad search terms connected to the same clicked file as those similar search terms; S_ij is the similarity between the i-th similar search term and the j-th known bad search term connected to the same clicked file; and C_j is the weight of the j-th known bad search term;
where the similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same clicked file is expressed in terms of the following quantities:
p is the number of clicked files connected to both the similar search term and the known bad search term; W_ik is the weight of the edge between the similar search term and the k-th clicked file; W_i is the sum of the weights of the edges between the similar search term and all the clicked files it is connected to; W_jk is the weight of the edge between the known bad search term and the k-th clicked file; and W_j is the weight of the edges between the known bad search term and all the clicked files it is connected to.
Further, determining, for each similar search term that took part in matching, whether that similar search term is hit, based on the matching result and on the result of judging by other means whether it is a bad search term, comprises:
if the matching result for the similar search term is a match, and the similar search term has been judged by other means to be a bad search term, determining that the similar search term is hit, and otherwise determining that it is missed;
or, if the matching result for the similar search term is no match, and the similar search term has been judged by other means to be a non-bad search term, determining that the similar search term is hit, and otherwise determining that it is missed.
Further, determining, for each similar search term that took part in matching, whether that similar search term is hit, based on the matching result and on the result of judging by other means whether it is a bad search term, comprises:
if the matching result for the similar search term is a match, and the similar search term has been judged by other means to be a bad search term, determining that the similar search term is hit; if the matching result for the similar search term is a match, and the similar search term has been judged by other means to be a non-bad search term, determining that the similar search term is missed.
An embodiment of the invention provides a regular expression generation apparatus, the apparatus comprising:
a bad search term obtaining module, configured to obtain known bad search terms;
a similar search term obtaining module, configured to obtain, based on a search-click bipartite graph, each search term that retrieves the same clicked files as the known bad search term, as similar search terms, wherein the search-click bipartite graph represents the connections between search terms and the files that users clicked in the corresponding search results;
a regular expression generation module, configured to extract regular-expression fragments from the similar search terms to obtain regular expressions, as candidate regular expressions;
a matching module, configured to match each similar search term against the candidate regular expressions;
a hit determination module, configured to determine, for each similar search term that took part in matching, whether that similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term;
a regular expression selection module, configured to select, based on how the candidate regular expressions hit the similar search terms that took part in matching, a preset number of candidate regular expressions as the regular expressions used to filter search terms.
Further, the similar search term obtaining module comprises:
a to-be-screened search term obtaining submodule, configured to obtain, for each known bad search term, from the search-click bipartite graph, each search term that is connected to the same clicked file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
a similarity calculation submodule, configured to calculate, for each search term to be screened, the similarity between that search term and the known bad search term;
a similar search term selection submodule, configured to select, according to the similarity, the search terms to be screened whose similarity is greater than a first preset threshold, as similar search terms.
Further, the similar search term obtaining module comprises:
a specific bad search term obtaining submodule, configured to evaluate, in the search-click bipartite graph, the weight of each known bad search term, and to select the known bad search terms whose weight is greater than a second preset threshold, as specific bad search terms;
a to-be-screened search term obtaining submodule, configured to obtain, based on the search-click bipartite graph, for each specific bad search term, each search term that is connected to the same clicked file as the specific bad search term and has not been determined to be a bad search term, as a search term to be screened;
a similar search term selection submodule, configured to obtain, for each search term to be screened, each clicked file connected to both the specific bad search term and that search term, to evaluate the weight of the edge between each such clicked file and the search term to be screened, and to select the search terms to be screened whose edge weight is greater than a third preset threshold, as similar search terms.
Further, the apparatus above also comprises:
a badness level calculation module, configured to calculate, based on the search-click bipartite graph, a badness level of each candidate regular expression, the badness level representing the association between the candidate regular expression and the known bad search terms;
the matching module being specifically configured to match each similar search term against the candidate regular expressions whose badness level satisfies a preset condition.
Further, the badness level calculation module is specifically configured to calculate, for each candidate regular expression, the association between the candidate regular expression and the known bad search terms according to the similarity between each similar search term from which the candidate regular expression was generated and the known bad search terms;
for the i-th candidate regular expression, the badness level Z_i is expressed in terms of the following quantities:
n is the number of similar search terms from which the candidate regular expression can be generated; m is the number of known bad search terms connected to the same clicked file as those similar search terms; S_ij is the similarity between the i-th similar search term and the j-th known bad search term connected to the same clicked file; and C_j is the weight of the j-th known bad search term;
where the similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same clicked file is expressed in terms of the following quantities:
p is the number of clicked files connected to both the similar search term and the known bad search term; W_ik is the weight of the edge between the similar search term and the k-th clicked file; W_i is the sum of the weights of the edges between the similar search term and all the clicked files it is connected to; W_jk is the weight of the edge between the known bad search term and the k-th clicked file; and W_j is the weight of the edges between the known bad search term and all the clicked files it is connected to.
Further, the hit determination module comprises:
a first hit determination submodule, configured to determine that the similar search term is hit if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, and otherwise to determine that it is missed;
a second hit determination submodule, configured to determine that the similar search term is hit if the matching result for the similar search term is no match and the similar search term has been judged by other means to be a non-bad search term, and otherwise to determine that it is missed.
Further, the hit determination module is specifically configured to determine that the similar search term is hit if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, and to determine that the similar search term is missed if the matching result is a match and the similar search term has been judged by other means to be a non-bad search term.
An embodiment of the invention provides an electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the steps of any of the regular expression generation methods described above when executing the program stored in the memory.
An embodiment of the invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the regular expression generation methods described above.
An embodiment of the invention further provides a computer program product containing instructions which, when run on a computer, cause the computer to execute any of the regular expression generation methods described above.
The regular expression generation method, apparatus, and electronic device provided by the embodiments of the invention obtain known bad search terms; obtain, based on a search-click bipartite graph, each search term that retrieves the same clicked files as a known bad search term, as similar search terms; extract regular-expression fragments from the similar search terms to obtain regular expressions, as candidate regular expressions; match each similar search term against the candidate regular expressions; and, based on how the candidate regular expressions hit the similar search terms that took part in matching, select a preset number of candidate regular expressions as the regular expressions used to filter search terms. In this way the existing regular expressions can be updated continuously based on similar search terms.
Of course, implementing any product or method of the invention does not necessarily require achieving all of the advantages described above at the same time.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below.
Fig. 1 is a flowchart of a regular expression generation method provided by an embodiment of the invention;
Fig. 2 is a schematic structural diagram of a search-click bipartite graph provided by an embodiment of the invention;
Fig. 3 is a schematic structural diagram of a regular expression generation apparatus provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described below with reference to the accompanying drawings of those embodiments.
In the scheme provided by the embodiments of the invention, known bad search terms are obtained; based on a search-click bipartite graph, each search term that retrieves the same clicked files as a known bad search term is obtained, as similar search terms; regular-expression fragments are extracted from the similar search terms to obtain regular expressions, as candidate regular expressions; each similar search term is matched against the candidate regular expressions; and, based on how the candidate regular expressions hit the similar search terms that took part in matching, a preset number of candidate regular expressions are selected as the regular expressions used to filter search terms. The candidate regular expressions generated from similar search terms thus allow the existing regular expressions to be updated continuously.
An embodiment of the invention provides a regular expression generation method which, as shown in Fig. 1, may include the following steps:
Step S101: obtain known bad search terms.
In this step, the existing regular expression set in the search engine's library and the set of search terms that users have searched for can be obtained. Each search term is matched against every regular expression in the existing set, and the search terms that match an existing regular expression are taken as known bad search terms.
For example, suppose a search term is "what to do when the windows license is about to expire" and an existing regular expression is "_expired". Regular-expression fragments extracted from the search term may include "windows", "license", "license_expired", "windows_expired", and so on. When the search term is matched against the regular expression above, it contains the pattern "_expired" and therefore satisfies the pattern the regular expression defines, so the search term and the regular expression can be considered to match.
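Step S101 can be sketched in a few lines, under the assumption that the existing regular expressions are written in Python `re` syntax. The function name, the sample terms, and the sample pattern are hypothetical and serve only as an illustration.

```python
import re


def find_known_bad_terms(search_terms, existing_patterns):
    """Return the search terms matched by at least one existing regex."""
    compiled = [re.compile(p) for p in existing_patterns]
    return [
        term
        for term in search_terms
        if any(rx.search(term) for rx in compiled)
    ]


# Hypothetical inputs: the set of searched terms and one existing pattern.
searched = [
    "what to do when the windows license is about to expire",
    "usage of quotation marks",
]
existing = [r"license.*expire"]
known_bad = find_known_bad_terms(searched, existing)
```

`re.search` is used rather than `re.fullmatch` because a pattern only needs to occur somewhere inside the search term.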
Step S102: based on the search-click bipartite graph, obtain each search term that retrieves the same clicked files as a known bad search term, as similar search terms.
In this step, for each known bad search term, the connections between search terms and clicked files in the search-click bipartite graph can be used to select each search term that is connected to the same clicked file as the known bad search term and has not been determined to be a bad search term, as a similar search term.
Specifically, the search-click bipartite graph represents the connections between the search terms that users enter when searching in a search engine and the files that users click in the corresponding search results. In the graph, each search term can be connected to one or more clicked files, and likewise each clicked file can be connected to one or more search terms. For each known bad search term, one or more search terms connected to the same clicked file can therefore be found in the graph; after the search terms that are themselves bad search terms are removed, the remaining search terms are taken as similar search terms.
Taking Fig. 2 as an example, in the figure query denotes a search term and doc denotes a clicked file. Consider query5: if query5 is the only bad search term, the connections in the figure show that doc3 is the only clicked file connected to query5, and that the search terms connected to doc3 are query1, query4, and query5. query1 and query4 can then be the similar search terms corresponding to query5.
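The lookup in the Fig. 2 example can be sketched with two adjacency maps. Only the edges around doc3 and query5 come from the description; the remaining edges, and all function names, are hypothetical filler so that the graph is non-trivial.

```python
from collections import defaultdict


def build_index(edges):
    """Build both directions of the bipartite graph from (term, file) edges."""
    term_to_docs = defaultdict(set)
    doc_to_terms = defaultdict(set)
    for term, doc in edges:
        term_to_docs[term].add(doc)
        doc_to_terms[doc].add(term)
    return term_to_docs, doc_to_terms


def similar_terms(bad_term, known_bad, term_to_docs, doc_to_terms):
    """Terms sharing a clicked file with bad_term, minus known bad terms."""
    found = set()
    for doc in term_to_docs.get(bad_term, set()):
        found |= doc_to_terms[doc]
    return found - set(known_bad)


edges = [
    ("query1", "doc3"), ("query4", "doc3"), ("query5", "doc3"),  # from the text
    ("query1", "doc1"), ("query2", "doc1"), ("query3", "doc2"),  # hypothetical
]
term_to_docs, doc_to_terms = build_index(edges)
result = similar_terms("query5", {"query5"}, term_to_docs, doc_to_terms)
```

For query5 this yields query1 and query4, in line with the example above.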
Step S103: extract regular-expression fragments from the similar search terms to obtain regular expressions, as candidate regular expressions.
In this step, for each similar search term, one or more regular-expression fragments can be extracted from it according to candidate patterns. Each fragment yields one regular expression, so a set of candidate regular expressions is obtained from the set of similar search terms.
For example, suppose the similar search term is "the influence of biodiversity on environmental carrying capacity". To extract as many fragments as possible, regular patterns that give priority to quantifiers can be specified as candidate patterns. Take the candidate patterns "*C1C2*" and "*C1_C2C3*", where C1, C2, and C3 denote different characters, C1C2 denotes a word made up of two characters, and "*" and "_" can each stand for any characters, which may be a single character or a word, or even nothing at all. Fragments can be extracted from the similar search term according to the candidate pattern: when the candidate pattern is "*C1C2*", the corresponding fragments can be "biology", "diversity_influence", and "environment_influence"; when the candidate pattern is "*C1_C2C3*", the corresponding fragments can be "diversity", "carrying capacity", and "on_influence". Each of these fragments can yield one candidate regular expression.
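As a rough illustration of turning fragments into candidate regular expressions, the sketch below builds candidates from single words and from ordered word pairs joined by a wildcard gap. This is a simplification and not the candidate patterns named above: it assumes the search term has already been segmented into words, and the function name and sample words are hypothetical.

```python
import re
from itertools import combinations


def candidate_regexes(words):
    """Build candidates from single words and ordered word pairs.

    A pair "a.*b" allows any characters between the two words, which is
    the role the "_" placeholder plays in the fragments.
    """
    singles = [re.escape(w) for w in words]
    pairs = [
        f"{re.escape(a)}.*{re.escape(b)}"
        for a, b in combinations(words, 2)  # keeps the original word order
    ]
    return singles + pairs


# Hypothetical segmentation of one similar search term.
words = ["biodiversity", "influence", "environment"]
candidates = candidate_regexes(words)
```

`re.escape` keeps any regex metacharacters in a word from being interpreted as part of the pattern.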
Note that, because the structure of similar search terms is generally simple, the candidate patterns above may be simple regular patterns extracted from the search engine's regular pattern library according to the structure of the similar search terms, or simple regular patterns defined according to that structure.
Step S104: match each similar search term against the candidate regular expressions.
In this step, each candidate regular expression is used to match every similar search term. For each similar search term, one or more regular-expression fragments are extracted from it; if the characters in a fragment satisfy the pattern defined by the candidate regular expression, the two match.
Step S105: for each similar search term that took part in matching, determine whether that similar search term is hit, based on the matching result and on the result of judging by other means whether the similar search term is a bad search term.
In this step, for each similar search term that took part in matching, its matching result against the candidate regular expression is obtained, together with the result of judging by other means whether it is a bad search term. From these two results it is determined whether the candidate regular expression hits the similar search term.
Specifically, judging by other means whether a term is a bad search term may be a manual judgment of the similar search term: a similar search term that retrieves harmful information is judged to be a bad search term.
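The two hit rules described for this step can be sketched as small predicates. The first reads the matched and unmatched cases together, so a hit means the regular expression agrees with the external judgment; the second scores only matched terms. The function names, sample terms, and sample pattern are hypothetical, and the external judgments are assumed to be available as booleans.

```python
import re


def hit_both_ways(matched, judged_bad):
    """Hit when the regex result agrees with the external judgment."""
    return matched == judged_bad


def hit_matched_only(matched, judged_bad):
    """Score matched terms only; unmatched terms are left unscored (None)."""
    if not matched:
        return None
    return judged_bad


def score(candidate, judgments, rule):
    """Apply one hit rule to every judged similar search term."""
    rx = re.compile(candidate)
    return {
        term: rule(rx.search(term) is not None, judged_bad)
        for term, judged_bad in judgments.items()
    }


# Hypothetical external judgments: True means "judged bad".
judgments = {
    "bad phrase one": True,
    "bad phrase lookalike": False,
    "weather today": False,
}
both = score(r"bad phrase", judgments, hit_both_ways)
matched_only = score(r"bad phrase", judgments, hit_matched_only)
```

Under the first rule the harmless lookalike counts as a miss and the unmatched harmless term as a hit; under the second rule the unmatched term is simply not scored.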
Step S106: based on how the candidate regular expressions hit the similar search terms that took part in matching, select a preset number of candidate regular expressions as the regular expressions used to filter search terms.
In this step, according to how the candidate regular expressions hit the similar search terms that took part in matching, a preset number of candidate regular expressions can be selected and added to the existing regular expression set for filtering search terms. The other candidate regular expressions can be added to a blacklist; a candidate regular expression on the blacklist will not be used as a candidate again for a certain period of time.
Specifically, to decide whether a candidate regular expression is selected for filtering search terms, the number of similar search terms that each candidate regular expression hits can be counted, and a preset number of candidate regular expressions can be selected according to that count.
In the embodiments of the invention, the hit rate of each candidate regular expression over the similar search terms can also be calculated, and a preset number of candidate regular expressions can be selected according to the hit rate. For each candidate regular expression, the hit rate can be expressed as the ratio of the number of similar search terms it hits to the number of similar search terms. For example, suppose there are two candidate regular expressions, one hitting 50 similar search terms and the other hitting only 2. The candidate that hits 50 similar search terms can be selected as a regular expression for filtering search terms; alternatively, the hit rates of the two candidates can be calculated and the candidates whose hit rate is greater than 70% selected.
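Both selection strategies can be sketched as follows. The hit counts 50 and 2 and the 70% threshold come from the example above; the total of 60 similar search terms, the candidate names, and the function names are hypothetical.

```python
def select_by_hit_count(hit_counts, preset_quantity):
    """Pick the preset number of candidates with the most hits."""
    ranked = sorted(hit_counts, key=hit_counts.get, reverse=True)
    return ranked[:preset_quantity]


def select_by_hit_rate(hit_counts, total_terms, min_rate):
    """Pick the candidates whose hit rate exceeds min_rate."""
    return [
        candidate
        for candidate, hits in hit_counts.items()
        if hits / total_terms > min_rate
    ]


# Hit counts from the example; the total of 60 terms is an assumption.
hit_counts = {"candidate_a": 50, "candidate_b": 2}
top = select_by_hit_count(hit_counts, preset_quantity=1)
above = select_by_hit_rate(hit_counts, total_terms=60, min_rate=0.7)
```

With these numbers both strategies keep only `candidate_a`, whose hit rate is 50/60.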
As can be seen from the above, these steps extract canonical segments from the similarity words to obtain candidate regular expressions, and the existing regular expressions can then be selectively updated from the candidate regular expressions.
In one embodiment of the above regular expression generation method, the step in S102 of obtaining, based on the search click bipartite graph, each search term that retrieves the same click file as a known bad search term, as a similarity word, may specifically be handled as follows:
For each known bad search term, obtain, in the search click bipartite graph, each search term that is connected to the same click file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
For each search term to be screened, calculate the similarity between the search term to be screened and the known bad search terms;
According to the similarity, select the search terms to be screened whose similarity is greater than a first preset threshold, as similarity words.
Specifically, in the search click bipartite graph, each search term has a corresponding word weight, and each edge connecting a search term to a click file has a corresponding edge weight. The weight of a word can be the number of times users searched for that word, and the weight of an edge can be the number of times users clicked the click file when searching for the corresponding search term. For example, suppose a search term has only two corresponding click files, file one and file two, and statistics show that users searched for the term 200 times, clicked file one 150 times, and clicked file two 80 times. Then, for this search term, the weight of the word can be 200, the weight of the edge connecting the search term to file one is 150, and the weight of the edge connecting it to file two is 80.
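One possible in-memory representation of such a graph is sketched below, using the 200 / 150 / 80 example. The class name, method names and identifiers are hypothetical; the patent does not prescribe a data structure.

```python
from collections import defaultdict


class SearchClickGraph:
    """Bipartite graph: search terms on one side, click files on the other."""

    def __init__(self):
        self.term_weight = {}                  # search term -> number of searches
        self.edges = defaultdict(dict)         # search term -> {click file: clicks}
        self.terms_by_file = defaultdict(set)  # click file -> connected search terms

    def set_term_weight(self, term, searches):
        self.term_weight[term] = searches

    def add_edge(self, term, click_file, clicks):
        self.edges[term][click_file] = clicks
        self.terms_by_file[click_file].add(term)


# The worked example from the text: 200 searches, 150 and 80 clicks.
graph = SearchClickGraph()
graph.set_term_weight("example term", 200)
graph.add_edge("example term", "file_one", 150)
graph.add_edge("example term", "file_two", 80)
```

Keeping a reverse index from click files to search terms is a design choice made here so that the terms sharing a click file with a known bad search term can be looked up directly.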
According to the connection relationships between search terms and click files in the search click bipartite graph, for each known bad search term, each search term that is connected to the same click file as the known bad search term and has not been determined to be a bad search term is obtained in the graph, as a search term to be screened.
According to the word weights and edge weights in the search click bipartite graph, for each search term to be screened, the degree of similarity between the search term to be screened and the bad search terms is calculated, as the similarity. In the search click bipartite graph, each known bad search term can yield one or more search terms to be screened, and each search term to be screened can be obtained from one or more known bad search terms. Therefore, for the i-th search term to be screened, the similarity S_i between it and the known bad search terms can be expressed as:
Here, m denotes the number of known bad search terms connected to the same click file as the search term to be screened; k denotes the k-th known bad search term among the m known bad search terms; p denotes the number of click files to which the search term to be screened and the k-th known bad search term are both connected; W_ik denotes the weight of the edge connecting the search term to be screened to the k-th click file; W_i denotes the sum of the weights of the edges connecting the search term to be screened to all of its click files; W_jk denotes the weight of the edge connecting the known bad search term to the k-th click file; and W_j denotes the weight of the edges connecting the known bad search term to all of its click files.
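The formula for S_i is not reproduced in this text, so the sketch below rests on an assumed reading of the symbol definitions: for every known bad search term j and every click file k shared with it, the normalized edge weights W_ik / W_i and W_jk / W_j are multiplied and the products are summed, with W_j taken as a sum in the same way as W_i. That form, the function names and the toy graph are all assumptions made for illustration.

```python
from typing import Dict, Set

Edges = Dict[str, Dict[str, int]]  # search term -> {click file: edge weight}


def terms_to_screen(known_bad: Set[str], edges: Edges) -> Set[str]:
    """Terms sharing at least one click file with a known bad search term."""
    bad_files = set()
    for bad in known_bad:
        bad_files.update(edges[bad])
    return {
        term for term, files in edges.items()
        if term not in known_bad and files.keys() & bad_files
    }


def similarity(term: str, known_bad: Set[str], edges: Edges) -> float:
    """Assumed form: sum over bad terms j and shared files k of
    (W_ik / W_i) * (W_jk / W_j)."""
    w_i = sum(edges[term].values())
    score = 0.0
    for bad in known_bad:
        w_j = sum(edges[bad].values())
        for click_file in edges[term].keys() & edges[bad].keys():
            score += (edges[term][click_file] / w_i) * (edges[bad][click_file] / w_j)
    return score


# Hypothetical graph; none of these names or numbers come from the patent.
edges = {
    "bad_query": {"doc1": 8, "doc2": 2},
    "candidate_a": {"doc1": 5, "doc3": 5},
    "candidate_b": {"doc2": 1, "doc3": 4},
    "unrelated": {"doc4": 3},
}
known_bad = {"bad_query"}
to_screen = terms_to_screen(known_bad, edges)
scores = {term: similarity(term, known_bad, edges) for term in to_screen}
```

With this toy graph, candidate_a shares the heavily clicked doc1 with the bad query and so scores higher than candidate_b, which only shares doc2.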
For each search term to be screened, the search terms to be screened whose similarity is greater than a preset value can be taken as similarity words, for example, those with a similarity greater than 70%. Alternatively, the search terms to be screened can be sorted by similarity in descending order and a predetermined number of them selected as similarity words, for example, the 20 search terms to be screened with the highest similarity.
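The two selection options in this paragraph (a similarity threshold, or a fixed number of top-ranked terms) can be sketched as below. The function names and the example scores are hypothetical.

```python
from typing import Dict, List


def select_above_threshold(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Terms whose similarity is strictly greater than the threshold."""
    return [term for term, score in similarities.items() if score > threshold]


def select_top_n(similarities: Dict[str, float], n: int) -> List[str]:
    """The n terms with the highest similarity, in descending order."""
    return sorted(similarities, key=similarities.get, reverse=True)[:n]


# Hypothetical similarity scores for three search terms to be screened.
similarities = {"term_a": 0.9, "term_b": 0.75, "term_c": 0.3}
above = select_above_threshold(similarities, 0.7)
top_one = select_top_n(similarities, 1)
```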
As can be seen from the above, this method screens similarity words according to the similarity between the known bad search terms and the search terms to be screened. It provides one way of determining similarity words and, at the same time, effectively controls the number of similarity words, which makes the subsequent determination of the hit situation more convenient.
In one embodiment of the above regular expression generation method, the step in S102 of obtaining, based on the search click bipartite graph, each search term that retrieves the same click file as a known bad search term, as a similarity word, may alternatively be handled as follows:
In the search click bipartite graph, for each known bad search term, evaluate the weight of the known bad search term, and select the known bad search terms whose weight is greater than a second preset threshold, as specific bad search terms;
Based on the search click bipartite graph, for each specific bad search term, obtain each search term that is connected to the same click file as the specific bad search term and has not been determined to be a bad search term, as a search term to be screened;
For each search term to be screened, obtain each click file to which both the specific bad search term and the search term to be screened are connected, evaluate the weight of the edge connecting each such click file to the search term to be screened, and select the search terms to be screened whose edge weight is greater than a third preset threshold, as similarity words.
Specifically, according to the word weights in the search click bipartite graph, the known bad search terms whose word weight is greater than a predetermined value can be preferentially selected as specific bad search terms, for example, the known bad search terms whose word weight is greater than 2000.
For each specific bad search term, one or more search terms that are connected to the same click file as the specific bad search term and have not been determined to be bad search terms are obtained in the search click bipartite graph, as search terms to be screened.
For each search term to be screened, the click files to which both the specific bad search term and the search term to be screened are connected are obtained in the search click bipartite graph. For each such click file, the search terms to be screened whose edge to the click file has a weight greater than the threshold are preferentially selected, as similarity words.
Further, taking Fig. 2 as an example, suppose the only known bad search terms in the figure are query3 and query4, and the number on each line connecting a query to a doc represents the weight of the edge connecting the corresponding search term to the click file. If the word weight of query3 is 1 and the word weight of query4 is 10, then, according to the word weights, query4 can be taken as the specific bad search term. As the figure shows, the click files connected to query4 are doc1, doc2 and doc3, and the search terms connected to the same click files as query4 are query1, query2, query3 and query5. Because query3 is a known bad search term, the search terms to be screened are only query1, query2 and query5. If a search term to be screened is taken as a similarity word when the weight of its edge to the click file is greater than 2, then, as the figure shows, the similarity words of query4 are query1, query2 and query5.
As can be seen from the above, this method determines similarity words according to the edge weights and the weights of the bad search terms. It gives noticeably more precise control over both the number of similarity words and their degree of similarity to the bad search terms, which in turn makes the subsequent determination of the hit situation easier.
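The weight-based variant can be sketched as follows. Fig. 2 is not available in this text, so the edge weights below are invented to be consistent with the narrative (query3 has word weight 1, query4 has word weight 10, edge threshold 2); the word-weight threshold of 5 and the function name are likewise assumptions.

```python
from typing import Dict, Set, Tuple

Edges = Dict[str, Dict[str, int]]  # search term -> {click file: edge weight}


def similarity_words_by_weight(
    term_weight: Dict[str, int],
    edges: Edges,
    known_bad: Set[str],
    word_threshold: int,
    edge_threshold: int,
) -> Tuple[Set[str], Set[str]]:
    """Return (specific bad search terms, similarity words)."""
    # Step 1: keep only the known bad terms with a large enough word weight.
    specific_bad = {t for t in known_bad if term_weight[t] > word_threshold}
    similar = set()
    for bad in specific_bad:
        for term, term_edges in edges.items():
            if term in known_bad:
                continue  # already determined to be a bad search term
            # Step 2: click files shared with the specific bad search term.
            shared = term_edges.keys() & edges[bad].keys()
            # Step 3: keep the term if an edge to a shared file is heavy enough.
            if any(term_edges[f] > edge_threshold for f in shared):
                similar.add(term)
    return specific_bad, similar


# Invented edge weights; only the word weights 1 and 10 come from the text.
term_weight = {"query3": 1, "query4": 10}
edges = {
    "query1": {"doc1": 3},
    "query2": {"doc2": 4},
    "query3": {"doc2": 1},
    "query4": {"doc1": 6, "doc2": 5, "doc3": 7},
    "query5": {"doc3": 5},
}
specific_bad, similar = similarity_words_by_weight(
    term_weight, edges, {"query3", "query4"}, word_threshold=5, edge_threshold=2
)
```

Under these assumed weights the result matches the outcome described for Fig. 2: query4 is the specific bad search term and query1, query2 and query5 are its similarity words.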
In one embodiment of the above regular expression generation method, before each similarity word is matched using the candidate regular expressions in step S104, the method may further include:
Based on the search click bipartite graph, calculating the undesirable level of each candidate regular expression, where the undesirable level indicates the relevance between the candidate regular expression and the known bad search terms;
In this case, matching each similarity word using the candidate regular expressions includes:
Matching each similarity word using the candidate regular expressions whose undesirable level meets a preset condition.
Specifically, in the search click bipartite graph, one or more similarity words can be obtained for each known bad search term, each similarity word can generate one or more canonical segments, and each canonical segment corresponds to one candidate regular expression. Therefore, each candidate regular expression has a certain relevance to the known bad search terms, and this relevance can be represented by the undesirable level of the candidate regular expression.
According to the undesirable level of each candidate regular expression, the candidate regular expressions whose undesirable level meets the preset condition can be selected to match each similarity word. The preset condition can be a preset undesirable level, for example, candidate regular expressions whose undesirable level is greater than 10; alternatively, it can be a preset number of candidate regular expressions, for example, the 10 candidate regular expressions with the highest undesirable level.
As can be seen from the above, this method is mainly a process of screening the candidate regular expressions. Its main purpose is to pick out the candidate regular expressions that are strongly related to the known bad search terms, which makes the subsequent statistics on the hit situation of the candidate regular expressions easier.
In one embodiment of the above regular expression generation method, calculating the undesirable level of a candidate regular expression based on the search click bipartite graph may specifically be implemented as follows:
For each candidate regular expression, calculate the relevance between the candidate regular expression and the known bad search terms according to the degree of similarity between each similarity word from which the candidate regular expression was generated and the known bad search terms;
For the i-th candidate regular expression, the undesirable level Z_i is expressed as:
where n denotes the number of similarity words from which the candidate regular expression can be generated; m denotes the number of known bad search terms connected to the same click file as the similarity words from which the candidate regular expression was generated; S_ij denotes the degree of similarity between the i-th similarity word and the j-th known bad search term connected to the same click file; and C_j denotes the weight of the j-th known bad search term;
The degree of similarity S_ij between the i-th similarity word and the j-th known bad search term connected to the same click file is expressed as:
where p denotes the number of click files to which the similarity word and the known bad search term are both connected; W_ik denotes the weight of the edge connecting the similarity word to the k-th click file; W_i denotes the sum of the weights of the edges connecting the similarity word to all of its click files; W_jk denotes the weight of the edge connecting the known bad search term to the k-th click file; and W_j denotes the weight of the edges connecting the known bad search term to all of its click files.
Further, taking Fig. 2 as an example, assume that the known bad search term is query4 and the similarity words are query1, query2, query3 and query5, and that there is one candidate regular expression, obtained by extracting canonical segments from query2 and query5, where the word weights of query2 and query5 are both 3. The specific calculation of the undesirable level Z of this candidate regular expression is then as follows:
As can be seen from the above, candidate regular expressions are chosen mainly according to the strength of their relevance to the known bad search terms. The greater the relevance, the better the candidate regular expression is likely to be at filtering undesirable information.
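Neither the formula for the undesirable level nor the Fig. 2 calculation is reproduced in this text. The sketch below therefore assumes one form that is consistent with the symbol definitions: S_ij is the sum over shared click files of (W_ik / W_i) * (W_jk / W_j), and Z is the sum over generating similarity words i and known bad search terms j of S_ij * C_j. The function names and all numbers are hypothetical and are not the values of Fig. 2.

```python
from typing import Dict, Iterable

Edges = Dict[str, Dict[str, int]]  # search term -> {click file: edge weight}


def pair_similarity(term: str, bad: str, edges: Edges) -> float:
    """Assumed S_ij: sum over shared click files of normalized edge weights."""
    w_i = sum(edges[term].values())
    w_j = sum(edges[bad].values())
    shared = edges[term].keys() & edges[bad].keys()
    return sum((edges[term][f] / w_i) * (edges[bad][f] / w_j) for f in shared)


def undesirable_level(
    generating_terms: Iterable[str],
    known_bad: Iterable[str],
    edges: Edges,
    bad_weight: Dict[str, int],
) -> float:
    """Assumed Z: sum of S_ij * C_j over generating terms i and bad terms j."""
    bad_terms = list(known_bad)
    total = 0.0
    for term in generating_terms:
        for bad in bad_terms:
            # Terms with no shared click file contribute zero.
            total += pair_similarity(term, bad, edges) * bad_weight[bad]
    return total


# Hypothetical graph: one known bad term and two generating similarity words.
edges = {
    "query4": {"doc1": 2, "doc2": 2},
    "query2": {"doc1": 1, "doc3": 1},
    "query5": {"doc2": 3},
}
z = undesirable_level(["query2", "query5"], ["query4"], edges, {"query4": 10})
```

Here S for query2 is 0.5 * 0.5 = 0.25 and S for query5 is 1.0 * 0.5 = 0.5, so with C = 10 the assumed undesirable level is 7.5.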
In one embodiment of the above regular expression generation method, in step S105, for each similarity word that participated in matching, determining whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term, may be implemented in the following way:
If the matching result for the similarity word is a match, and the similarity word has been determined by other means to be a bad search term, it is determined that the similarity word is hit; otherwise, it is determined that the similarity word is missed.
Alternatively, if the matching result for the similarity word is a non-match, and the similarity word has been determined by other means to be a non-bad search term, it is determined that the similarity word is hit; otherwise, it is determined that the similarity word is missed.
Specifically, when the candidate regular expressions are used to match each similarity word, four situations can arise for a similarity word: it matches and has been determined by other means to be a bad search term; it does not match and has been determined by other means to be a non-bad search term; it matches but has been determined by other means to be a non-bad search term; or it does not match but has been determined by other means to be a bad search term. For each candidate regular expression, in the first two situations the candidate regular expression is considered to hit the corresponding similarity word, and in the last two situations it is considered to miss the corresponding similarity word.
As can be seen from the above, this method of determining whether a candidate regular expression hits each similarity word that participated in matching covers four situations and therefore accounts for every case that can arise, which makes the determination of the hit situation more accurate.
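The four situations reduce to a single comparison: a similarity word is hit exactly when the matching result agrees with the external judgment. A minimal sketch follows; the function name, the example pattern and the use of `re.search` (rather than a full-string match) are assumptions, since the text does not say how matching is performed.

```python
import re


def is_hit(pattern: str, similarity_word: str, judged_bad: bool) -> bool:
    """Hit when the regex match result agrees with the external judgment."""
    matched = re.search(pattern, similarity_word) is not None
    return matched == judged_bad


# Hypothetical candidate regular expression and similarity words.
pattern = r"free\s*movie"
case_match_bad = is_hit(pattern, "free movie download", judged_bad=True)    # hit
case_nomatch_ok = is_hit(pattern, "weather today", judged_bad=False)        # hit
case_match_ok = is_hit(pattern, "free movie review", judged_bad=False)      # miss
case_nomatch_bad = is_hit(pattern, "pirated film online", judged_bad=True)  # miss
```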
In one embodiment of the above regular expression generation method, in step S105, for each similarity word that participated in matching, determining whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term, may alternatively be implemented in the following way:
If the matching result for the similarity word is a match, and the similarity word has been determined by other means to be a bad search term, it is determined that the similarity word is hit; if the matching result for the similarity word is a match, and the similarity word has been determined by other means to be a non-bad search term, it is determined that the similarity word is missed.
In this second implementation, when the candidate regular expressions are used to match each similarity word, whether a similarity word is hit is determined only when the candidate regular expression and the similarity word match. That is, for each similarity word that matches the candidate regular expression, if the similarity word has been determined by other means to be a bad search term, it is determined that the similarity word is hit; conversely, if it has been determined by other means to be a non-bad search term, it is determined that the similarity word is missed.
As can be seen from the above, with this method of determining whether a candidate regular expression hits each similarity word that participated in matching, only the similarity words that match a candidate regular expression are considered when determining its hit situation, which makes this method simpler and more convenient in practice.
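A sketch of this second approach is given below: non-matching similarity words are left out of the tally, and matching ones count as a hit or a miss according to the external judgment. The function names, the use of `None` for "not considered", the choice of `re.search`, and the example data are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Tuple


def hit_if_matched(pattern: str, similarity_word: str, judged_bad: bool) -> Optional[bool]:
    """None when the word does not match; otherwise True for hit, False for miss."""
    if re.search(pattern, similarity_word) is None:
        return None  # non-matching words are not considered in this variant
    return judged_bad


def tally(pattern: str, judgments: Dict[str, bool]) -> Tuple[int, int]:
    """Count (hits, misses) over the similarity words that match the pattern."""
    results = [hit_if_matched(pattern, w, bad) for w, bad in judgments.items()]
    return results.count(True), results.count(False)


# Hypothetical similarity words mapped to the external bad / non-bad judgment.
judgments = {
    "free movie download": True,
    "free movie review": False,
    "weather today": False,
}
hits, misses = tally(r"free\s*movie", judgments)
```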
In summary, the regular expression generation method provided by the embodiments of the present invention can generate regular expressions from similarity words and thereby continuously update the existing regular expressions.
Based on the same inventive concept, and corresponding to the regular expression generation method provided by the embodiments of the present invention, an embodiment of the present invention further provides a regular expression generating device which, as shown in Fig. 3, comprises the following modules:
A bad search term obtaining module 201, configured to obtain known bad search terms;
A similarity word obtaining module 202, configured to obtain, based on the search click bipartite graph, each search term that retrieves the same click file as a known bad search term, as a similarity word, where the search click bipartite graph represents the connection relationships between search terms and the click files that users click in the corresponding search results;
A regular expression generation module 203, configured to extract canonical segments from the similarity words to obtain regular expressions, as candidate regular expressions;
A matching module 204, configured to match each similarity word using the candidate regular expressions;
A hit situation determining module 205, configured to determine, for each similarity word that participated in matching, whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term;
A regular expression selection module 206, configured to select, based on the hit situation of the candidate regular expressions on the similarity words that participated in matching, a preset number of candidate regular expressions as the regular expressions for filtering search terms.
Further, the similarity word obtaining module 202 may include:
A to-be-screened search term obtaining submodule, configured to obtain, for each known bad search term, in the search click bipartite graph, each search term that is connected to the same click file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
A similarity calculation submodule, configured to calculate, for each search term to be screened, the similarity between the search term to be screened and the known bad search terms;
A similarity word selection submodule, configured to select, according to the similarity, the search terms to be screened whose similarity is greater than the first preset threshold, as similarity words.
Further, the similarity word obtaining module 202 may alternatively include:
A specific bad search term obtaining submodule, configured to evaluate, in the search click bipartite graph, the weight of each known bad search term, and to select the known bad search terms whose weight is greater than the second preset threshold, as specific bad search terms;
A to-be-screened search term obtaining submodule, configured to obtain, based on the search click bipartite graph, for each specific bad search term, each search term that is connected to the same click file as the specific bad search term and has not been determined to be a bad search term, as a search term to be screened;
A similarity word selection submodule, configured to obtain, for each search term to be screened, each click file to which both the specific bad search term and the search term to be screened are connected, to evaluate the weight of the edge connecting each such click file to the search term to be screened, and to select the search terms to be screened whose edge weight is greater than the third preset threshold, as similarity words.
Further, the above device may also include:
An undesirable level computing module, configured to calculate, based on the search click bipartite graph, the undesirable level of each candidate regular expression, where the undesirable level indicates the relevance between the candidate regular expression and the known bad search terms;
The matching module 204 is specifically configured to match each similarity word using the candidate regular expressions whose undesirable level meets a preset condition.
Further, the undesirable level computing module is specifically configured to calculate, for each candidate regular expression, the relevance between the candidate regular expression and the known bad search terms according to the degree of similarity between each similarity word from which the candidate regular expression was generated and the known bad search terms;
For the i-th candidate regular expression, the undesirable level Z_i is expressed as:
where n denotes the number of similarity words from which the candidate regular expression can be generated; m denotes the number of known bad search terms connected to the same click file as the similarity words from which the candidate regular expression was generated; S_ij denotes the degree of similarity between the i-th similarity word and the j-th known bad search term connected to the same click file; and C_j denotes the weight of the j-th known bad search term;
The degree of similarity S_ij between the i-th similarity word and the j-th known bad search term connected to the same click file is expressed as:
where p denotes the number of click files to which the similarity word and the known bad search term are both connected; W_ik denotes the weight of the edge connecting the similarity word to the k-th click file; W_i denotes the sum of the weights of the edges connecting the similarity word to all of its click files; W_jk denotes the weight of the edge connecting the known bad search term to the k-th click file; and W_j denotes the weight of the edges connecting the known bad search term to all of its click files.
Further, the hit situation determining module 205 may include:
A first hit situation determining submodule, configured to determine that the similarity word is hit if the matching result for the similarity word is a match and the similarity word has been determined by other means to be a bad search term, and otherwise to determine that the similarity word is missed;
A second hit situation determining submodule, configured to determine that the similarity word is hit if the matching result for the similarity word is a non-match and the similarity word has been determined by other means to be a non-bad search term, and otherwise to determine that the similarity word is missed.
Further, the hit situation determining module 205 is specifically configured to determine that the similarity word is hit if the matching result for the similarity word is a match and the similarity word has been determined by other means to be a bad search term, and to determine that the similarity word is missed if the matching result for the similarity word is a match and the similarity word has been determined by other means to be a non-bad search term.
An embodiment of the present invention further provides an electronic device which, as shown in Fig. 4, includes a processor 401, a communication interface 402, a memory 403 and a communication bus 404, where the processor 401, the communication interface 402 and the memory 403 communicate with one another through the communication bus 404;
The memory 403 is configured to store a computer program;
The processor 401 is configured to implement the following steps when executing the program stored in the memory 403:
Obtaining known bad search terms;
Obtaining, based on the search click bipartite graph, each search term that retrieves the same click file as a known bad search term, as a similarity word, where the search click bipartite graph represents the connection relationships between search terms and the click files that users click in the corresponding search results;
Extracting canonical segments from the similarity words to obtain regular expressions, as candidate regular expressions;
Matching each similarity word using the candidate regular expressions;
For each similarity word that participated in matching, determining whether the similarity word is hit, based on the matching result and on the result of judging by other means whether the similarity word is a bad search term;
Selecting, based on the hit situation of the candidate regular expressions on the similarity words that participated in matching, a preset number of candidate regular expressions as the regular expressions for filtering search terms.
The communication bus mentioned for the above electronic device can be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include random access memory (RAM), and may also include non-volatile memory (NVM), for example, at least one magnetic disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or a discrete hardware component.
In another embodiment provided by the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the steps of any of the above regular expression generation methods.
In another embodiment provided by the present invention, a computer program product containing instructions is further provided which, when run on a computer, causes the computer to execute any of the regular expression generation methods in the above embodiments.
The above embodiments can be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions can be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium can be any usable medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium can be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, because the device and electronic device embodiments are substantially similar to the method embodiment, they are described relatively briefly; for relevant details, see the description of the method embodiment.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (15)

1. A regular expression generation method, characterized by comprising:
obtaining known bad search terms;
based on a search-click bipartite graph, obtaining each search term that retrieves the same clicked file as a known bad search term, as a similar search term, wherein the search-click bipartite graph represents the connection relationships between search terms and the clicked files that users clicked in the corresponding search results;
extracting regular-expression fragments from the similar search terms to obtain regular expressions, as candidate regular expressions;
matching each similar search term against the candidate regular expressions;
for each similar search term that participates in matching, determining whether the candidate regular expression hits for that similar search term, based on the matching result and on the result of judging by other means whether that similar search term is a bad search term;
based on the hit status of the candidate regular expressions for the similar search terms that participated in matching, selecting a preset number of candidate regular expressions as the regular expressions used to filter search terms.
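As an illustration of the first and last steps of claim 1, the sketch below builds a search-click bipartite graph from a click log, collects the search terms that share a clicked file with a known bad search term, and picks a preset number of candidates by hit count. The log contents, the function names (`build_graph`, `co_clicked_terms`, `select_top`), and the use of click counts as edge weights are all assumptions made for this example; the claims do not prescribe them.

```python
from collections import defaultdict

# Hypothetical click log: (search term, clicked file, click count).
CLICKS = [
    ("bad_a", "file1", 5),
    ("term_x", "file1", 3),
    ("term_y", "file2", 4),
    ("bad_a", "file3", 1),
    ("term_z", "file3", 2),
]

def build_graph(clicks):
    """Return two adjacency maps: term -> {file: weight} and file -> {term: weight}."""
    term_to_files = defaultdict(dict)
    file_to_terms = defaultdict(dict)
    for term, file_id, weight in clicks:
        # Accumulate click counts so repeated log rows add up on one edge.
        term_to_files[term][file_id] = term_to_files[term].get(file_id, 0) + weight
        file_to_terms[file_id][term] = file_to_terms[file_id].get(term, 0) + weight
    return term_to_files, file_to_terms

def co_clicked_terms(bad_term, term_to_files, file_to_terms):
    """Search terms connected to at least one clicked file that bad_term is connected to."""
    found = set()
    for file_id in term_to_files.get(bad_term, {}):
        found.update(file_to_terms[file_id])
    found.discard(bad_term)
    return found

def select_top(hit_counts, preset_quantity):
    """Pick the preset number of candidate patterns with the most hits (ties broken by name)."""
    ranked = sorted(hit_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pattern for pattern, _ in ranked[:preset_quantity]]

term_to_files, file_to_terms = build_graph(CLICKS)
similar = co_clicked_terms("bad_a", term_to_files, file_to_terms)
chosen = select_top({"a.*b": 3, "c+": 5, "d?": 1}, 2)
```

Keeping both adjacency maps makes the two-hop lookup (bad term to files, files to other terms) a pair of dictionary reads rather than a scan of the whole log.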
2. The method according to claim 1, characterized in that obtaining, based on the search-click bipartite graph, each search term that retrieves the same clicked file as the known bad search term, as a similar search term, comprises:
for each known bad search term, obtaining from the search-click bipartite graph each search term that is connected to the same clicked file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
for each search term to be screened, calculating the similarity between the search term to be screened and the known bad search term;
according to the magnitude of the similarity, selecting the search terms to be screened whose similarity is greater than a first preset threshold, as similar search terms.
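The screening in claim 2 can be sketched as follows. Claim 2 does not say how the similarity is computed, so this example assumes a Jaccard similarity over the sets of clicked files; the data, the threshold value, and the function names are hypothetical.

```python
def jaccard(files_a, files_b):
    """Jaccard similarity of two sets of clicked files (assumed measure)."""
    union = files_a | files_b
    return len(files_a & files_b) / len(union) if union else 0.0

def screen_by_similarity(bad_term, term_files, known_bad, first_threshold):
    """Return {term: similarity} for terms that share a clicked file with bad_term,
    are not already known to be bad, and exceed the first preset threshold."""
    bad_files = term_files[bad_term]
    selected = {}
    for term, files in term_files.items():
        # Skip known bad terms and terms with no clicked file in common.
        if term in known_bad or not (files & bad_files):
            continue
        score = jaccard(files, bad_files)
        if score > first_threshold:
            selected[term] = score
    return selected

# Hypothetical graph: search term -> set of clicked files.
TERM_FILES = {
    "bad_a": {"f1", "f2", "f3"},
    "bad_b": {"f1"},
    "term_x": {"f1", "f2"},
    "term_y": {"f3", "f4", "f5", "f6"},
    "term_z": {"f9"},
}
result = screen_by_similarity("bad_a", TERM_FILES, {"bad_a", "bad_b"}, 0.5)
```

Here `term_x` shares two of three files in the union and passes the 0.5 threshold, while `term_y` shares only one of six and is dropped.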
3. The method according to claim 1, characterized in that obtaining, based on the search-click bipartite graph, each search term that retrieves the same clicked file as the known bad search term, as a similar search term, comprises:
in the search-click bipartite graph, for each known bad search term, evaluating the weight of the known bad search term, and selecting the known bad search terms whose weight is greater than a second preset threshold, as specific bad search terms;
based on the search-click bipartite graph, for each specific bad search term, obtaining each search term that is connected to the same clicked file as the specific bad search term and has not been determined to be a bad search term, as a search term to be screened;
for each search term to be screened, obtaining each clicked file to which the specific bad search term and the search term to be screened are both connected, evaluating the weight of the edge connecting each such clicked file to the search term to be screened, and selecting the search terms to be screened whose edge weight is greater than a third preset threshold, as similar search terms.
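The two weight thresholds of claim 3 can be sketched as below. The example assumes that the weight of each known bad search term is supplied as a number, that edge weights are click counts, and that a search term qualifies when at least one of its shared edges exceeds the third threshold; none of these details is fixed by the claim, and all names and values are hypothetical.

```python
def specific_bad_terms(bad_weights, second_threshold):
    """Known bad search terms whose weight exceeds the second preset threshold."""
    return {term for term, weight in bad_weights.items() if weight > second_threshold}

def similar_by_edge_weight(edges, bad_weights, second_threshold, third_threshold):
    """edges maps each search term to {clicked file: edge weight}."""
    similar = set()
    for bad in specific_bad_terms(bad_weights, second_threshold):
        bad_files = set(edges.get(bad, {}))
        for term, files in edges.items():
            if term in bad_weights:
                continue  # already determined to be a bad search term
            shared = bad_files & set(files)
            # Assumed rule: one sufficiently heavy shared edge is enough.
            if any(files[f] > third_threshold for f in shared):
                similar.add(term)
    return similar

# Hypothetical data.
EDGES = {
    "bad_a": {"f1": 10, "f2": 4},
    "bad_b": {"f3": 7},
    "term_x": {"f1": 6, "f4": 1},
    "term_y": {"f2": 1},
    "term_z": {"f3": 9},
}
BAD_WEIGHTS = {"bad_a": 0.9, "bad_b": 0.2}
similar = similar_by_edge_weight(EDGES, BAD_WEIGHTS, 0.5, 3)
```

In this data only `bad_a` is a specific bad search term, so `term_z` is never considered even though its edge to `f3` is heavy.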
4. The method according to claim 1, characterized in that, before matching each similar search term against the candidate regular expressions, the method further comprises:
based on the search-click bipartite graph, calculating a badness degree of each candidate regular expression, the badness degree representing the relevance between the candidate regular expression and the known bad search terms;
and matching each similar search term against the candidate regular expressions comprises:
matching each similar search term against the candidate regular expressions whose badness degree satisfies a preset condition.
5. The method according to claim 4, characterized in that calculating, based on the search-click bipartite graph, the badness degree of the candidate regular expression comprises:
for each candidate regular expression, calculating the relevance between the candidate regular expression and the known bad search terms according to the degree of similarity between each similar search term from which the candidate regular expression was generated and the known bad search terms;
for the i-th candidate regular expression, the badness degree Z_i is expressed as:
where n denotes the number of similar search terms from which the candidate regular expression can be generated, m denotes the number of known bad search terms that are connected to the same clicked file as those similar search terms, S_ij denotes the degree of similarity between the i-th similar search term and the j-th known bad search term connected to the same clicked file, and C_j denotes the weight of the j-th known bad search term;
wherein the degree of similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same clicked file is expressed as:
where p denotes the number of clicked files to which the similar search term and the known bad search term are both connected, W_ik denotes the weight of the edge connecting the similar search term to the k-th clicked file, W_i denotes the sum of the weights of the edges connecting the similar search term to all of its clicked files, W_jk denotes the weight of the edge connecting the known bad search term to the k-th clicked file, and W_j denotes the sum of the weights of the edges connecting the known bad search term to all of its clicked files.
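The formula bodies for Z_i and S_ij are not reproduced in this text, so the sketch below is only one form that is consistent with the symbol definitions, not the formula of the patent. It assumes S_ij = sum over the p shared clicked files of (W_ik / W_i) * (W_jk / W_j), and Z = sum over the n similar search terms and m known bad search terms of S_ij * C_j, with no normalization. The function names and the data are hypothetical.

```python
def similarity(term_edges, bad_edges):
    """Assumed S_ij: sum over shared clicked files k of (W_ik / W_i) * (W_jk / W_j)."""
    w_i = sum(term_edges.values())  # W_i: total edge weight of the similar search term
    w_j = sum(bad_edges.values())   # W_j: total edge weight of the known bad search term
    if w_i == 0 or w_j == 0:
        return 0.0
    shared = term_edges.keys() & bad_edges.keys()
    return sum((term_edges[k] / w_i) * (bad_edges[k] / w_j) for k in shared)

def badness_degree(source_terms, edges, bad_weights):
    """Assumed Z: sum of S_ij * C_j over the similar search terms that produced the
    candidate (source_terms) and the known bad search terms (bad_weights holds C_j)."""
    total = 0.0
    for term in source_terms:
        for bad, c_j in bad_weights.items():
            # Bad terms with no shared clicked file contribute zero.
            total += similarity(edges[term], edges[bad]) * c_j
    return total

# Hypothetical graph: search term -> {clicked file: edge weight}.
EDGES = {
    "term_x": {"f1": 2, "f2": 2},
    "bad_a": {"f1": 1, "f3": 3},
}
BAD_WEIGHTS = {"bad_a": 0.8}

s = similarity(EDGES["term_x"], EDGES["bad_a"])
z = badness_degree(["term_x"], EDGES, BAD_WEIGHTS)
```

Under this assumed form, a candidate could then be kept for matching only when its score satisfies the preset condition of claim 4, for example `z` exceeding some threshold.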
6. The method according to claim 1, characterized in that determining, for each similar search term that participates in matching, whether the candidate regular expression hits for that similar search term, based on the matching result and on the result of judging by other means whether that similar search term is a bad search term, comprises:
if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, determining a hit for the similar search term; otherwise, determining a miss for the similar search term;
or, if the matching result for the similar search term is a non-match and the similar search term has been judged by other means to be a non-bad search term, determining a hit for the similar search term; otherwise, determining a miss for the similar search term.
7. The method according to claim 1, characterized in that determining, for each similar search term that participates in matching, whether the candidate regular expression hits for that similar search term, based on the matching result and on the result of judging by other means whether that similar search term is a bad search term, comprises:
if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, determining a hit for the similar search term; if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a non-bad search term, determining a miss for the similar search term.
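The two hit rules of claims 6 and 7 can be contrasted in a short sketch. It reads claim 6 as counting a hit whenever the regular expression's verdict agrees with the independent judgment, and claim 7 as scoring only the search terms that the regular expression matches. Using `re.search` for matching, returning `None` for unmatched terms under claim 7, and the sample pattern and labels are all assumptions of this example.

```python
import re

def hit_agreement(pattern, term, is_bad):
    """Claim 6 reading: hit when (match and bad) or (no match and not bad)."""
    matched = re.search(pattern, term) is not None
    return matched == is_bad

def hit_on_match_only(pattern, term, is_bad):
    """Claim 7 reading: among matched terms, bad is a hit and non-bad is a miss.
    Unmatched terms are left unscored (None), which is an assumption."""
    if re.search(pattern, term) is None:
        return None
    return is_bad

# Hypothetical candidate pattern and independent bad / non-bad judgments.
PATTERN = r"free.*movie"
LABELS = {
    "free hd movie": True,
    "free movie times": False,
    "weather": False,
}

agreement = {t: hit_agreement(PATTERN, t, bad) for t, bad in LABELS.items()}
match_only = {t: hit_on_match_only(PATTERN, t, bad) for t, bad in LABELS.items()}
```

The first rule rewards a pattern for correctly leaving non-bad terms alone, while the second measures only the precision of what the pattern actually matches.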
8. A regular expression generation device, characterized by comprising:
a bad search term obtaining module, configured to obtain known bad search terms;
a similar search term obtaining module, configured to obtain, based on a search-click bipartite graph, each search term that retrieves the same clicked file as a known bad search term, as a similar search term, wherein the search-click bipartite graph represents the connection relationships between search terms and the clicked files that users clicked in the corresponding search results;
a regular expression generation module, configured to extract regular-expression fragments from the similar search terms to obtain regular expressions, as candidate regular expressions;
a matching module, configured to match each similar search term against the candidate regular expressions;
a hit status determining module, configured to determine, for each similar search term that participates in matching, whether the candidate regular expression hits for that similar search term, based on the matching result and on the result of judging by other means whether that similar search term is a bad search term;
a regular expression selection module, configured to select, based on the hit status of the candidate regular expressions for the similar search terms that participated in matching, a preset number of candidate regular expressions as the regular expressions used to filter search terms.
9. The device according to claim 8, characterized in that the similar search term obtaining module comprises:
a to-be-screened search term obtaining submodule, configured to obtain, for each known bad search term, from the search-click bipartite graph, each search term that is connected to the same clicked file as the known bad search term and has not been determined to be a bad search term, as a search term to be screened;
a similarity calculation submodule, configured to calculate, for each search term to be screened, the similarity between the search term to be screened and the known bad search term;
a similar search term selection submodule, configured to select, according to the magnitude of the similarity, the search terms to be screened whose similarity is greater than a first preset threshold, as similar search terms.
10. The device according to claim 8, characterized in that the similar search term obtaining module comprises:
a specific bad search term obtaining submodule, configured to evaluate, in the search-click bipartite graph, the weight of each known bad search term, and to select the known bad search terms whose weight is greater than a second preset threshold, as specific bad search terms;
a to-be-screened search term obtaining submodule, configured to obtain, based on the search-click bipartite graph, for each specific bad search term, each search term that is connected to the same clicked file as the specific bad search term and has not been determined to be a bad search term, as a search term to be screened;
a similar search term selection submodule, configured to obtain, for each search term to be screened, each clicked file to which the specific bad search term and the search term to be screened are both connected, to evaluate the weight of the edge connecting each such clicked file to the search term to be screened, and to select the search terms to be screened whose edge weight is greater than a third preset threshold, as similar search terms.
11. The device according to claim 8, characterized by further comprising:
a badness degree calculation module, configured to calculate, based on the search-click bipartite graph, a badness degree of each candidate regular expression, the badness degree representing the relevance between the candidate regular expression and the known bad search terms;
wherein the matching module is specifically configured to match each similar search term against the candidate regular expressions whose badness degree satisfies a preset condition.
12. The device according to claim 11, characterized in that the badness degree calculation module is specifically configured to calculate, for each candidate regular expression, the relevance between the candidate regular expression and the known bad search terms according to the degree of similarity between each similar search term from which the candidate regular expression was generated and the known bad search terms;
for the i-th candidate regular expression, the badness degree Z_i is expressed as:
where n denotes the number of similar search terms from which the candidate regular expression can be generated, m denotes the number of known bad search terms that are connected to the same clicked file as those similar search terms, S_ij denotes the degree of similarity between the i-th similar search term and the j-th known bad search term connected to the same clicked file, and C_j denotes the weight of the j-th known bad search term;
wherein the degree of similarity S_ij between the i-th similar search term and the j-th known bad search term connected to the same clicked file is expressed as:
where p denotes the number of clicked files to which the similar search term and the known bad search term are both connected, W_ik denotes the weight of the edge connecting the similar search term to the k-th clicked file, W_i denotes the sum of the weights of the edges connecting the similar search term to all of its clicked files, W_jk denotes the weight of the edge connecting the known bad search term to the k-th clicked file, and W_j denotes the sum of the weights of the edges connecting the known bad search term to all of its clicked files.
13. The device according to claim 8, characterized in that the hit status determining module comprises:
a first hit status determining submodule, configured to determine a hit for the similar search term if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, and otherwise to determine a miss for the similar search term;
a second hit status determining submodule, configured to determine a hit for the similar search term if the matching result for the similar search term is a non-match and the similar search term has been judged by other means to be a non-bad search term, and otherwise to determine a miss for the similar search term.
14. The device according to claim 8, characterized in that the hit status determining module is specifically configured to determine a hit for the similar search term if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a bad search term, and to determine a miss for the similar search term if the matching result for the similar search term is a match and the similar search term has been judged by other means to be a non-bad search term.
15. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 1-6 when executing the program stored in the memory.
CN201810695221.3A 2018-06-29 2018-06-29 Regular expression generation method and device and electronic equipment Active CN109190014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810695221.3A CN109190014B (en) 2018-06-29 2018-06-29 Regular expression generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810695221.3A CN109190014B (en) 2018-06-29 2018-06-29 Regular expression generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109190014A true CN109190014A (en) 2019-01-11
CN109190014B CN109190014B (en) 2021-11-26

Family

ID=64948682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810695221.3A Active CN109190014B (en) 2018-06-29 2018-06-29 Regular expression generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109190014B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101847242A (en) * 2010-05-27 2010-09-29 武汉大学 Method and system for automatically acquiring aliases of contraband on line
US20140136517A1 (en) * 2012-11-10 2014-05-15 Chian Chiu Li Apparatus And Methods for Providing Search Results
CN104809108A (en) * 2015-05-20 2015-07-29 成都布林特信息技术有限公司 Information monitoring and analyzing system
CN106919603A (en) * 2015-12-25 2017-07-04 北京奇虎科技有限公司 The method and apparatus for calculating participle weight in query word pattern

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083758A (en) * 2019-04-30 2019-08-02 闻康集团股份有限公司 A kind of medical treatment search engine data platform system
CN111292205A (en) * 2019-12-17 2020-06-16 东方微银科技(北京)有限公司 Judicial data analysis method, device, equipment and storage medium
CN111292205B (en) * 2019-12-17 2021-05-25 东方微银科技股份有限公司 Judicial data analysis method, device, equipment and storage medium
CN113343715A (en) * 2021-06-29 2021-09-03 深圳前海微众银行股份有限公司 Method, device and equipment for automatically generating regular expression and storage medium
CN113656538A (en) * 2021-07-09 2021-11-16 深圳价值在线信息科技股份有限公司 Method and device for generating regular expression, computing equipment and storage medium
CN113656659A (en) * 2021-08-31 2021-11-16 上海观安信息技术股份有限公司 Data extraction method, device and system and computer readable storage medium

Also Published As

Publication number Publication date
CN109190014B (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN109190014A (en) A kind of regular expression generation method, device and electronic equipment
Hotho et al. Information retrieval in folksonomies: Search and ranking
US20080114755A1 (en) Identifying sources of media content having a high likelihood of producing on-topic content
CN107862022B (en) Culture resource recommendation system
CN109885770A (en) A kind of information recommendation method, device, electronic equipment and storage medium
US20130110839A1 (en) Constructing an analysis of a document
CN109189990B (en) Search word generation method and device and electronic equipment
US20140189525A1 (en) User behavior models based on source domain
CN109684483A (en) Construction method, device, computer equipment and the storage medium of knowledge mapping
Huang et al. Topic detection from large scale of microblog stream with high utility pattern clustering
US10152478B2 (en) Apparatus, system and method for string disambiguation and entity ranking
WO2014056408A1 (en) Information recommending method, device and server
WO2011008848A2 (en) Activity based users' interests modeling for determining content relevance
Tibély et al. Extracting tag hierarchies
Avarikioti et al. Structure and content of the visible Darknet
CN104899215A (en) Data processing method, recommendation source information organization, information recommendation method and information recommendation device
CN103678710A (en) Information recommendation method based on user behaviors
Schinas et al. Mgraph: multimodal event summarization in social media using topic models and graph-based ranking
CN112989118B (en) Video recall method and device
CN109933691A (en) Method, apparatus, equipment and storage medium for content retrieval
CN112836126A (en) Recommendation method and device based on knowledge graph, electronic equipment and storage medium
CN107944001A (en) Hot news detection method and device and electronic equipment
Vandic et al. A semantic-based approach for searching and browsing tag spaces
Tuomchomtam et al. Community recommendation for text post in social media: A case study on Reddit
Giummolè et al. A study on microblog and search engine user behaviors: How twitter trending topics help predict *** hot queries

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant