CN107220333B - character search method based on Sunday algorithm - Google Patents

Publication number
CN107220333B
Authority
CN
China
Prior art keywords
string
characters
character
text
pattern string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710375615.6A
Other languages
Chinese (zh)
Other versions
CN107220333A (en)
Inventor
刘小垒
张小松
牛伟纳
胡斌
王中晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201710375615.6A
Publication of CN107220333A
Application granted
Publication of CN107220333B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of computer software applications and discloses a character search method based on the Sunday algorithm. The method matches a pattern string against a text string by judging whether the combination of the last character of the text window and the next character outside the text window appears in the pattern string. If the match succeeds, the program ends; if it fails, the text window is moved and the judgment is repeated in the same way until the match succeeds or the text window reaches the end of the text string, at which point the program ends.

Description

Character search method based on the Sunday algorithm
Technical Field
The invention relates to character search, and in particular to a character search method based on the Sunday algorithm, used for searching characters in the field of computer networks.
Background
As the internet grows and the volume of information increases, how to quickly find the information a user needs among massive data has become a hot topic in network search research. String matching algorithms play a very important role here: an efficient string matching algorithm can greatly improve the efficiency and quality of search. String matching has extensive applications in the network field, such as spell checking, language translation, data compression, search engines, and network intrusion detection.
Commonly used string matching algorithms include brute force, KMP, Boyer-Moore (BM), Sunday, Rabin-Karp, and bitap, among which the Sunday algorithm is a matching scheme with relatively high average efficiency.
Its core idea is that, during matching, the pattern string need not be compared one character at a time from left to right or from right to left; when a mismatch is found, the algorithm skips as many characters as possible before the next matching step, thereby improving matching efficiency.
The idea of the Sunday algorithm is similar to that of the BM algorithm. When a match fails, it looks at the character in the text string immediately after the last character of the current alignment. If that character does not appear in the pattern string, it is skipped directly, i.e. the shift equals the length of the pattern string + 1; otherwise, as in the BM algorithm, the shift equals the distance from the rightmost occurrence of that character in the pattern string to the end of the pattern string + 1.
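For reference, the shift rule of the classic Sunday algorithm described above can be sketched in C roughly as follows. This is a minimal illustration of the prior-art algorithm, not of the invention; the function name sunday_search and the 256-entry shift table are choices made here for the example.

```c
#include <stddef.h>
#include <string.h>

/* Classic Sunday matching (illustrative sketch).
 * Returns the index of the first occurrence of pat in text, or -1. */
long sunday_search(const char *text, const char *pat)
{
    size_t n = strlen(text);
    size_t m = strlen(pat);
    size_t shift[256];
    size_t i, pos;

    if (m == 0)
        return 0;
    if (m > n)
        return -1;

    /* Character absent from the pattern: shift by pattern length + 1. */
    for (i = 0; i < 256; i++)
        shift[i] = m + 1;
    /* Character present: distance from its rightmost occurrence to the
     * end of the pattern, plus 1. Later occurrences overwrite earlier ones. */
    for (i = 0; i < m; i++)
        shift[(unsigned char)pat[i]] = m - i;

    pos = 0;
    while (pos + m <= n) {
        if (memcmp(text + pos, pat, m) == 0)
            return (long)pos;
        if (pos + m >= n)      /* no character after the window is left */
            break;
        /* Look at the character just past the current alignment. */
        pos += shift[(unsigned char)text[pos + m]];
    }
    return -1;
}
```

Calling sunday_search("abcabd", "abd") compares at position 0, reads the character 'a' just past the window, shifts by 3, and finds the match at position 3.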
When the pattern string is long (as it is in many practical applications), the probability that the character just outside the text window appears in the pattern string is high, which causes many invalid match attempts and significantly reduces the efficiency of the Sunday algorithm.
Disclosure of Invention
In view of the above technical problem, the invention provides a character search method based on the Sunday algorithm, which solves the technical problem of many invalid matches being generated during character search.
The technical scheme of the invention is as follows:
A character search method based on the Sunday algorithm, comprising the following steps:
Step 1: construct an auxiliary array from the information of every two adjacent characters in the pattern string, where the pattern string is the character string to be matched;
Step 2: left-align the text window with the text string, where the text string is the text to be searched and the text window slides along the text string; use the auxiliary array to judge whether the last two characters in the text window appear in the pattern string; if not, jump to step 4; if so, align the occurrence of these characters in the pattern string with the last two characters and judge whether the characters at corresponding positions of the pattern string and the text string match; if they match, the match succeeds and the program ends; if not, jump to step 3;
Step 3: perform one match and one jump using the Sunday algorithm; if the Sunday algorithm matches successfully, the program ends; if it fails, judge whether the jump length of the Sunday algorithm exceeds the length of the text window; if it does, jump to step 4; if it does not, repeat step 3;
Step 4: move the text window to the right by J characters, take the last character in the text window and the adjacent character just outside the text window as the combined characters, and use the auxiliary array to judge whether the combined characters appear in the pattern string; if so, jump to step 5; if not, judge whether the text window has passed the end of the text string; if it has, the match fails and the program ends; if it has not, repeat step 4; here the pattern string and the text window are both J characters long;
Step 5: align the occurrence of the combined characters in the pattern string with the combined characters in the text string, and judge whether the characters at corresponding positions of the pattern string and the text string match; if they match, the match succeeds and the program ends; if not, jump to step 3.
Further, the auxiliary array in step 1 is constructed as follows:
S201: calculate the hash value of every two adjacent characters in the pattern string from the ASCII code values of the characters and a preset index value, using the following formula:
H[i] = A[i] × index + A[i+1] × (index + 1)    (1)
where i is the position number of a character in the pattern string, H[i] is the hash value of the two adjacent characters at positions i and i+1, and A[i] is the ASCII code value of the i-th character;
S202: the maximum hash value that any pair of characters can generate is used as the number of elements in the auxiliary array, and the hash value of a character pair is the position number of its element;
S203: map the hash values calculated in S201 into the auxiliary array as follows:
the element at the position corresponding to a hash value H[i] is set to 1; if two or more groups of adjacent characters in the pattern string generate the same hash value H[i], the element at that position is set to a number greater than 1; the elements at all remaining positions in the auxiliary array are 0.
Further, judging with the auxiliary array works as follows: calculate the hash value of the two adjacent characters to be judged in the text string and look up the element at the corresponding position in the auxiliary array; if the element is 1, the two adjacent characters in the text string match, or appear in, the pattern string; if the element is 0, they do not match, or do not appear in, the pattern string; if the element is a number greater than 1, the hash value of this character combination is shared with other character combinations, and step 3 is carried out.
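The construction in S201 to S203 and the lookup just described can be sketched in C as below. This is only an illustration under stated assumptions: the source does not give the value of the preset index, so PAIR_INDEX = 1 is hypothetical; the names pair_hash, build_aux and aux_lookup are invented here; the array is sized for all 8-bit character values and has one element more than the maximum hash so that the maximum hash is itself a valid position; and "a number greater than 1" is stored as 2.

```c
#include <stddef.h>
#include <string.h>

#define PAIR_INDEX 1   /* hypothetical value of the preset "index" */
/* One slot per possible hash value, 0 .. hash(255, 255). */
#define AUX_SIZE (255 * PAIR_INDEX + 255 * (PAIR_INDEX + 1) + 1)

/* Formula (1): H = A[i] * index + A[i+1] * (index + 1). */
size_t pair_hash(unsigned char a, unsigned char b)
{
    return (size_t)a * PAIR_INDEX + (size_t)b * (PAIR_INDEX + 1);
}

/* S203: 0 = hash not produced by the pattern, 1 = produced by exactly
 * one adjacent pair, 2 = produced by two or more adjacent pairs. */
void build_aux(const char *pat, unsigned char aux[AUX_SIZE])
{
    size_t m = strlen(pat);
    size_t i;

    memset(aux, 0, AUX_SIZE);
    for (i = 0; i + 1 < m; i++) {
        size_t h = pair_hash((unsigned char)pat[i],
                             (unsigned char)pat[i + 1]);
        if (aux[h] < 2)        /* saturate at 2, meaning "more than one" */
            aux[h]++;
    }
}

/* Lookup for two adjacent text characters a and b. */
unsigned char aux_lookup(const unsigned char aux[AUX_SIZE], char a, char b)
{
    return aux[pair_hash((unsigned char)a, (unsigned char)b)];
}
```

With these assumptions, the pattern "abcdef" yields 1 for the pair "ab" and 0 for "xy". The pair "ca" also yields 1 because it hashes to the same value as "ab", which is why a nonzero lookup still has to be confirmed by the character-by-character comparison in the steps above.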
In summary, by adopting the above technical solution, the invention has the following beneficial effects:
The hash value is calculated from the ASCII values of the characters and the index value, which is simple and easy to implement; matching on combined characters effectively reduces the probability of invalid matches and improves the efficiency of the Sunday algorithm.
Once the pattern string is determined, the auxiliary array is obtained by preprocessing before matching begins, which improves efficiency in practical use. Examples include malicious code detection in network intrusion detection, large-text search in paper retrieval, and multi-signature virus scanning; the pattern strings in these applications are very long, so the invention can greatly improve their efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a chart of the experimental results of the present invention.
Detailed Description
All features disclosed in this specification may be combined in any combination, except features and/or steps that are mutually exclusive.
The present invention will be described in detail with reference to the accompanying drawings.
A character search method based on the Sunday algorithm, comprising the following steps:
Step 1: construct an auxiliary array from the information of every two adjacent characters in the pattern string, where the pattern string is the character string to be matched;
the auxiliary array is constructed as follows:
S201: calculate the hash value of every two adjacent characters in the pattern string from the ASCII code values of the characters and a preset index value, using the following formula:
H[i] = A[i] × index + A[i+1] × (index + 1)    (2)
where i is the position number of a character in the pattern string, H[i] is the hash value of the two adjacent characters at positions i and i+1, and A[i] is the ASCII code value of the i-th character;
S202: the maximum hash value that any pair of characters can generate is used as the number of elements in the auxiliary array, and the hash value of a character pair is the position number of its element;
S203: map the hash values calculated in S201 into the auxiliary array as follows:
the element at the position corresponding to a hash value H[i] is set to 1; if two or more groups of adjacent characters in the pattern string generate the same hash value H[i], the element at that position is set to a number greater than 1; the elements at all remaining positions in the auxiliary array are 0.
Step 2: left-align the text window with the text string, where the text string is the text to be searched and the text window slides along the text string; calculate the hash value of the last two characters in the text window; if the element at the position corresponding to this hash value in the auxiliary array is 0, the last two characters do not appear in the pattern string, and the method jumps to step 4; if the element is 1, the last two characters appear in the pattern string: align their occurrence in the pattern string with the last two characters and judge whether the characters at corresponding positions of the pattern string and the text string match; if they match, the match succeeds and the program ends; if not, jump to step 3; if the element is a value greater than 1, another character combination has the same hash value, and the method jumps to step 3 to reduce the error rate.
Step 3: perform one match and one jump using the Sunday algorithm; if the Sunday algorithm matches successfully, the program ends; if it fails, judge whether the jump length of the Sunday algorithm exceeds the length of the text window; if it does, jump to step 4; if it does not, repeat step 3;
Step 4: move the text window to the right by J characters, take the last character in the text window and the adjacent character just outside the text window as the combined characters, and calculate the hash value of the combined characters; if the element at the position corresponding to this hash value in the auxiliary array is 0, the combined characters do not appear in the pattern string: judge whether the text window has passed the end of the text string; if it has, the match fails and the program ends; if it has not, repeat step 4; if the element is 1, the combined characters appear in the pattern string, and the method jumps to step 5; if the element is a value greater than 1, another character combination has the same hash value, and the method jumps to step 3 to reduce the error rate; here the pattern string and the text window are both J characters long;
Step 5: align the occurrence of the combined characters in the pattern string with the combined characters in the text string, and judge whether the characters at corresponding positions of the pattern string and the text string match; if they match, the match succeeds and the program ends; if not, jump to step 3.
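The reason steps 2 and 4 treat an element greater than 1 separately is that formula (2) can give different character pairs the same hash value. As a worked illustration, assume index = 1 (the source does not state the value it uses) and take the ASCII codes a = 97, b = 98, c = 99:

```latex
\begin{align*}
H(\texttt{ab}) &= 97 \times 1 + 98 \times (1 + 1) = 97 + 196 = 293 \\
H(\texttt{ca}) &= 99 \times 1 + 97 \times (1 + 1) = 99 + 194 = 293
\end{align*}
```

Under this assumption, a pattern string containing both "ab" and "ca" would store a value greater than 1 at position 293 of the auxiliary array, and the lookup alone could not tell which of the two pairs a text pair corresponds to.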
The working principle of the method is as follows: an auxiliary array is constructed and the text window slides along the text string, each slide covering the length of the pattern string; the last character of the text window is combined with the next character outside the text window, and the auxiliary array is used to judge whether this character combination appears in the pattern string, thereby matching the string.
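The sliding-and-probing part of this principle (the loop of step 4) might be sketched in C as follows. It is deliberately partial: it only finds the next window whose combined characters may appear in the pattern string, and it leaves out the alignment and comparison of step 5 and the Sunday fallback of step 3. The index value PAIR_INDEX = 1, all function names, and the choice to stop when no character remains after the window are assumptions made for this example, not details given in the source.

```c
#include <stddef.h>
#include <string.h>

#define PAIR_INDEX 1   /* hypothetical value of the preset "index" */
#define AUX_SIZE (255 * PAIR_INDEX + 255 * (PAIR_INDEX + 1) + 1)

static size_t pair_hash(unsigned char a, unsigned char b)
{
    return (size_t)a * PAIR_INDEX + (size_t)b * (PAIR_INDEX + 1);
}

/* 0 = hash not in pattern, 1 = one adjacent pair, 2 = several pairs. */
static void build_aux(const char *pat, unsigned char aux[AUX_SIZE])
{
    size_t m = strlen(pat);
    size_t i;

    memset(aux, 0, AUX_SIZE);
    for (i = 0; i + 1 < m; i++) {
        size_t h = pair_hash((unsigned char)pat[i],
                             (unsigned char)pat[i + 1]);
        if (aux[h] < 2)
            aux[h]++;
    }
}

/* Step 4 loop only: starting from a window at position start, keep
 * moving the window right by J = strlen(pat) characters until the
 * combined characters (last character in the window plus the character
 * just outside it) have a nonzero entry in the auxiliary array.
 * Returns the start of that window, or -1 if the text runs out. */
long next_candidate_window(const char *text, const char *pat, size_t start)
{
    unsigned char aux[AUX_SIZE];
    size_t n = strlen(text);
    size_t J = strlen(pat);
    size_t pos = start;

    if (J == 0)
        return -1;
    build_aux(pat, aux);
    for (;;) {
        pos += J;                   /* move the window right by J */
        if (pos + J >= n)           /* assumption: stop when there is */
            return -1;              /* no character after the window  */
        if (aux[pair_hash((unsigned char)text[pos + J - 1],
                          (unsigned char)text[pos + J])] != 0)
            return (long)pos;       /* candidate: go on to step 5 or 3 */
    }
}
```

For the text "zzzzzzzzabzz" and the pattern "abc", starting from position 0, the first window moved to (position 3) is rejected because "zz" has a zero entry, and the window at position 6 is returned because its combined characters are "ab".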
The present invention will now be described in further detail with reference to specific examples.
Specific Embodiments
Specific example 1
The following is a C-language implementation of the invention:
Specific example 2
The following chart of experimental results compares the efficiency of the invention with that of the Sunday algorithm.
A 10 KB random string of English letters and digits was generated as the text string, and strings with specific characteristics were selected as pattern strings. Searches were run with both the original Sunday algorithm and the algorithm of the invention to compare their actual efficiency; each search was called 1000 times and the total time (in milliseconds) was recorded, giving the following chart.
In the figure, A represents the Sunday algorithm and B represents the algorithm of the invention;
the pattern string used for column 1 is: abcdef;
the pattern string used for column 2 is: abcdefghijklmnopqrstuvwyz;
the pattern string used for column 3 is:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ;
the pattern string used for column 4 is:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567
The above describes an embodiment of the present invention. The invention is not limited to the above embodiment; any structural change made under the teaching of the invention whose technical solution is identical or similar to that of the invention falls within its protection scope.

Claims (3)

1. A character search method based on the Sunday algorithm, characterized in that it comprises the following steps:
Step 1: construct an auxiliary array from the information of every two adjacent characters in the pattern string, where the pattern string is the character string to be matched;
Step 2: left-align the text window with the text string, where the text string is the text to be searched and the text window slides along the text string; use the auxiliary array to judge whether the last two characters in the text window appear in the pattern string; if not, jump to step 4; if so, align the occurrence of these characters in the pattern string with the last two characters and judge whether the characters at corresponding positions of the pattern string and the text string match; if they match, the match succeeds and the program ends; if not, jump to step 3;
Step 3: perform one match and one jump using the Sunday algorithm; if the Sunday algorithm matches successfully, the program ends; if it fails, judge whether the jump length of the Sunday algorithm exceeds the length of the text window; if it does, jump to step 4; if it does not, repeat step 3;
Step 4: move the text window to the right by J characters, take the last character in the text window and the adjacent character just outside the text window as the combined characters, and use the auxiliary array to judge whether the combined characters appear in the pattern string; if so, jump to step 5; if not, judge whether the text window has passed the end of the text string; if it has, the match fails and the program ends; if it has not, repeat step 4; here the pattern string and the text window are both J characters long;
Step 5: align the occurrence of the combined characters in the pattern string with the combined characters in the text string, and judge whether the characters at corresponding positions of the pattern string and the text string match; if they match, the match succeeds and the program ends; if not, jump to step 3.
2. The character search method based on the Sunday algorithm as claimed in claim 1, wherein the auxiliary array in step 1 is constructed as follows:
S201: calculate the hash value of every two adjacent characters in the pattern string from the ASCII code values of the characters and a preset index value, using the following formula:
H[i] = A[i] × index + A[i+1] × (index + 1)
where i is the position number of a character in the pattern string, H[i] is the hash value of the two adjacent characters at positions i and i+1, and A[i] is the ASCII code value of the i-th character;
S202: the maximum hash value that any pair of characters can generate is used as the number of elements in the auxiliary array, and the hash value of a character pair is the position number of its element;
S203: map the hash values calculated in S201 into the auxiliary array as follows:
the element at the position corresponding to a hash value H[i] is set to 1; if two or more groups of adjacent characters in the pattern string generate the same hash value H[i], the element at that position is set to a number greater than 1; the elements at all remaining positions in the auxiliary array are 0.
3. The character search method based on the Sunday algorithm as claimed in claim 2, wherein judging with the auxiliary array comprises: calculating the hash value of the two adjacent characters to be judged in the text string and looking up the element at the corresponding position in the auxiliary array; if the element is 1, the two adjacent characters in the text string match, or appear in, the pattern string; if the element is 0, they do not match, or do not appear in, the pattern string; if the element is a number greater than 1, the hash value of this character combination is shared with other character combinations, and step 3 is carried out.
CN201710375615.6A 2017-05-24 2017-05-24 character search method based on Sunday algorithm Expired - Fee Related CN107220333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710375615.6A CN107220333B (en) 2017-05-24 2017-05-24 character search method based on Sunday algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710375615.6A CN107220333B (en) 2017-05-24 2017-05-24 character search method based on Sunday algorithm

Publications (2)

Publication Number Publication Date
CN107220333A CN107220333A (en) 2017-09-29
CN107220333B CN107220333B (en) 2020-01-31

Family

ID=59944598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710375615.6A Expired - Fee Related CN107220333B (en) 2017-05-24 2017-05-24 character search method based on Sunday algorithm

Country Status (1)

Country Link
CN (1) CN107220333B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109348304B (en) * 2018-09-30 2021-04-27 武汉斗鱼网络科技有限公司 Bullet screen data verification method and device and terminal
CN109977276B (en) * 2019-03-22 2020-12-22 华南理工大学 Sunday algorithm-based improved single-mode matching method
CN111814009B (en) * 2020-06-28 2022-03-01 四川长虹电器股份有限公司 Mode matching method based on search engine retrieval information
CN112671413B (en) * 2020-12-25 2022-09-06 浪潮云信息技术股份公司 Data compression method and system based on LZSS algorithm and Sunday algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120063879A (en) * 2010-12-08 2012-06-18 서울대학교산학협력단 Method for searching string matching on multi-byte character set texts
CN104519056A (en) * 2014-12-15 2015-04-15 广东科学技术职业学院 Double-jump-based single mode matching method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
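Poster: A Y_C Sunday algorithm based on improved Sunday algorithm; Yang Z et al.; International Conference on Communications & Networking in China; 20151231; main text, pp. 680-681 *
Efficiency improvement of the Sunday string matching algorithm; Xu Shan et al.; Computer Engineering and Applications; 20111231; main text, pp. 96-98 *
An improved Sunday pattern matching algorithm; Wan Xiaoyu et al.; Computer Engineering; 20090430; main text, pp. 125-126 *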

Also Published As

Publication number Publication date
CN107220333A (en) 2017-09-29

Similar Documents

Publication Publication Date Title
CN107220333B (en) character search method based on Sunday algorithm
EP2585962B1 (en) Password checking
JP3889762B2 (en) Data compression method, program, and apparatus
CN109977276B (en) Sunday algorithm-based improved single-mode matching method
JP5862413B2 (en) Information conversion rule generation program, information conversion rule generation device, and information conversion rule generation method
EP3179383A1 (en) Device and method for error correction in data search
JP2013206187A (en) Information conversion device, information search device, information conversion method, information search method, information conversion program and information search program
CN111382298B (en) Image retrieval method and device based on picture content and electronic equipment
JP6447161B2 (en) Semantic structure search program, semantic structure search apparatus, and semantic structure search method
JP4114600B2 (en) Variable length character string search device, variable length character string search method and program
CN111801665A (en) Hierarchical Locality Sensitive Hash (LSH) partition indexing for big data applications
CN112163145A (en) Website retrieval method, device and equipment based on edit distance and cosine included angle
CN101685502A (en) Mode matching method and device
CN113626645B (en) Hierarchical optimization efficient ciphertext fuzzy retrieval method and related equipment
CN116975864A (en) Malicious code detection method and device, electronic equipment and storage medium
CN1243431C (en) Analysis of universal route platform command lines
CN110909214A (en) KMP matching algorithm-based rapid character string matching method
CN113065419B (en) Pattern matching algorithm and system based on flow high-frequency content
CN115088038A (en) Improved quality value compression framework in aligned sequencing data based on new context
US11250064B2 (en) System and method for generating filters for K-mismatch search
CN109710607A (en) A kind of hash query method solved based on weight towards higher-dimension big data
Petri et al. Efficient indexing algorithms for approximate pattern matching in text
CN109857264B (en) Pinyin error correction method and device based on spatial key positions
CN107798060B (en) Real-time streaming data processing application software feature recognition method
Rădescu et al. The Original Method of Fixed Constraints Transform for Lossless Text Compression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20200131)