CN112632343A

CN112632343A - Character string matching method, device and equipment and readable storage medium

Info

Publication number: CN112632343A
Application number: CN202011627370.XA
Authority: CN
Inventors: 黄运新; 朱庆春
Original assignee: Shenzhen Dapu Microelectronics Co Ltd
Current assignee: Shenzhen Dapu Microelectronics Co Ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-04-09
Anticipated expiration: 2040-12-30
Also published as: CN112632343B

Abstract

The application discloses a character string matching method, apparatus, device and readable storage medium. For each substring, the application provides the following matching information: index, character repetition count range, length, number of preceding substrings, preceding mark, repeat mark and initial mark. The target result used to calculate the count enable signal is the OR of several target values, namely the matching result values of each preceding substring of the substring against character i-L, where i is the index of the character to be matched and L is the length of the substring. Because the count range is calculated from the actual substring length and the actual number of preceding substrings, the method is not limited to a single preceding substring or to a substring length of 1, so matching is more flexible and more regular expressions are supported. The character string matching apparatus, device and readable storage medium have the same technical effects.

Description

Character string matching method, device and equipment and readable storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for matching a character string.

Background

At present, the search of character strings can be performed in a storage device in a hardware manner by using a bit-split model. The bit-split model can only process one character, i.e., one byte of data, per clock cycle. Therefore, a Hawk architecture is added on the basis of the bit-split model, so that data of multiple bytes can be processed in each clock cycle, but more hardware resources are consumed when data of one byte is processed. Also, the Hawk architecture does not support the handling of special characters, so that the application is relatively limited.

To support the handling of special characters, the HARE (Hardware Accelerator for Regular Expressions) architecture later emerged. It supports some special characters, such as "+", but it has many limitations in use: certain special character combinations are not supported, and actual data matching is not flexible enough. Specifically, the HARE architecture supports only the case in which each substring (component) has exactly one preceding substring, and it allows only substrings of length 1. In actual regular expressions, a substring often has several preceding substrings. The HARE architecture is therefore not flexible enough in practical applications and supports few regular expression types.

A preceding substring of a given substring is a substring that is permitted to appear immediately before it; it is not necessarily the substring listed directly before it. For example, the regular expression [ab][bc]+d?efc{2} is decomposed into 5 substrings: [ab], [bc]+, d?, ef and c{2}. For the substring ef, d? means that d is optional (the matched text may contain no d), so both [bc]+ and d? are preceding substrings of ef. The substring ef therefore has two preceding substrings. In addition, ef has a length of 2, so the HARE architecture cannot be used to search for matches of this expression.
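The notion of a preceding substring can be illustrated with a short Python sketch. It is not taken from the application: the function name `preceding_offsets`, the `(text, optional)` tuple layout and the rule "walk backwards until the first non-optional substring" are assumptions made here to reproduce the example above.

```python
from typing import List, Tuple

def preceding_offsets(components: List[Tuple[str, bool]]) -> List[List[int]]:
    """For each substring j, list the offsets k such that substring j-k
    may appear immediately before substring j.

    Each entry of `components` is (text, optional); optional=True means the
    substring may match zero times (for example "d?"). Hypothetical helper.
    """
    result = []
    for j in range(len(components)):
        offsets = []
        k = 1
        while j - k >= 0:
            offsets.append(k)
            _, optional = components[j - k]
            if not optional:
                break  # a mandatory substring hides everything before it
            k += 1
        result.append(offsets)
    return result

# Decomposition of [ab][bc]+d?efc{2} used in the example above.
components = [("[ab]", False), ("[bc]+", False), ("d?", True),
              ("ef", False), ("c{2}", False)]
offsets = preceding_offsets(components)
print(offsets[3])  # offsets of the preceding substrings of "ef"
```

Under these assumptions, ef (index 3) gets the offsets 1 and 2, which correspond to d? and [bc]+.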

Therefore, how to improve flexibility and universality of character string matching is a problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the above, an object of the present application is to provide a method, an apparatus, a device and a readable storage medium for string matching, so as to improve flexibility and universality of string matching. The specific scheme is as follows:

in a first aspect, the present application provides a method for matching a character string, including:

acquiring characters to be matched, sub character strings and matching information of the sub character strings; the matching information includes: index, character repetition frequency value range, length, number of preposed substrings, preposed mark, repeated mark and initial mark;

determining a count enable signal based on the initial marker, the target result, the repeat marker, and a historical count range of values;

determining a current counting value range according to the counting enabling signal;

if the current counting value range is overlapped with the character repetition frequency value range, determining that the character to be matched is successfully matched with the sub-character string;

wherein the target result is the OR of several target values; the target values are the matching result values of each preceding substring of the substring against character i-L, where i is the index of the character to be matched and L is the length; and the historical count range is the count range corresponding to the matching result of the character preceding the character to be matched against the substring.

Preferably, the determining a count enable signal based on the initial flag, the target result, the repeat flag, and the historical count value range includes:

acquiring a minimum value and a maximum value in the historical counting value range;

respectively determining a first mark corresponding to the minimum numerical value and a second mark corresponding to the maximum numerical value;

performing an AND operation on the first mark and the repeat mark to obtain a first AND operation result; performing an AND operation on the second mark and the repeat mark to obtain a second AND operation result;

performing an OR operation on the first AND operation result, the initial mark and the target result to obtain a first OR operation result; performing an OR operation on the second AND operation result, the initial mark and the target result to obtain a second OR operation result;

determining the first OR operation result as the minimum enable signal of the count enable signals, and determining the second OR operation result as the maximum enable signal of the count enable signals.

Preferably, the determining the first mark corresponding to the minimum value and the second mark corresponding to the maximum value respectively includes:

if the minimum value is greater than 0, determining that the first flag is 1; otherwise, determining that the first flag is 0;

if the maximum value is greater than 0, determining that the second flag is 1; otherwise, determining that the second flag is 0.

Preferably, the determining a current count value range according to the count enable signal includes:

if the minimum enable signal and the target vector are both valid and the matching result of the character to be matched against the previous substring of the substring is invalid, determining that the minimum value of the current count range is the minimum value of the historical count range incremented by 1; otherwise, determining that the minimum value of the current count range is 0; a valid target vector indicates that L consecutive characters appear in the substring in order, where L is the length; the L characters comprise the character to be matched and the L-1 characters before it;

if the maximum enable signal and the target vector are both valid, determining that the maximum value of the current count range is the maximum value of the historical count range incremented by 1; otherwise, determining that the maximum value of the current count range is 0.

Preferably, the method further comprises the following steps:

if the termination mark of the substring is invalid, performing the next round of matching;

and if the termination mark of the substring is valid, ending the matching process.

Preferably, the method further comprises the following steps:

and if the current counting value range is not overlapped with the character repeating time value range, determining that the matching of the character to be matched and the sub-character string is unsuccessful.
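The overlap test between the current count range and the character repetition count range can be sketched as follows. The application only says that the two ranges "overlap"; treating both as closed integer intervals is an assumption made here, and `ranges_overlap` is a hypothetical name.

```python
def ranges_overlap(cur_min: int, cur_max: int, bl: int, bu: int) -> bool:
    """Return True when the closed interval [cur_min, cur_max] shares at
    least one value with the closed interval [bl, bu].

    Assumption: both ranges are closed integer intervals.
    """
    return cur_min <= bu and bl <= cur_max

# c{2} would require a repetition count of exactly 2, i.e. (BL, BU) = (2, 2).
print(ranges_overlap(2, 3, 2, 2))  # counts 2..3 include 2
print(ranges_overlap(0, 1, 2, 2))  # counts 0..1 do not reach 2
```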

Preferably, the calculation formula of the target result is:

TMP_j = RMV[i-L][j-1] || RMV[i-L][j-2] || … || RMV[i-L][j-k];

wherein L is the length, i is the index of the character to be matched, j is the index of the substring, and 1, 2 … k are the preceding marks of the respective preceding substrings of the substring;

TMP_j is the target result;

RMV[i-L][j-1] is the matching result value of the first preceding substring of the substring against character i-L;

RMV[i-L][j-2] is the matching result value of the second preceding substring of the substring against character i-L;

RMV[i-L][j-k] is the matching result value of the k-th preceding substring of the substring against character i-L.

In a second aspect, the present application provides a character string matching apparatus, including:

the acquisition module is used for acquiring characters to be matched, sub character strings and matching information of the sub character strings; the matching information includes: index, character repetition frequency value range, length, number of preposed substrings, preposed mark, repeated mark and initial mark;

a determination module for determining a count enable signal based on the initial marker, a target result, the repeat marker, and a historical count value range;

the counting module is used for determining the current counting value range according to the counting enabling signal;

the matching module is used for determining that the character to be matched is successfully matched with the sub-character string if the current counting value range is overlapped with the character repetition frequency value range;

In a third aspect, the present application provides a character string matching apparatus, including:

a memory for storing a computer program;

a processor for executing the computer program to implement the character string matching method disclosed in the foregoing.

In a fourth aspect, the present application provides a readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the character string matching method disclosed in the foregoing.

According to the above scheme, the application provides a character string matching method, comprising: acquiring a character to be matched, a substring and matching information of the substring, the matching information including an index, a character repetition count range, a length, a number of preceding substrings, a preceding mark, a repeat mark and an initial mark; determining a count enable signal based on the initial mark, a target result, the repeat mark and a historical count range; determining a current count range according to the count enable signal; and, if the current count range overlaps the character repetition count range, determining that the character to be matched is successfully matched with the substring. The target result is the OR of several target values; the target values are the matching result values of each preceding substring of the substring against character i-L, where i is the index of the character to be matched and L is the length. The historical count range is the count range corresponding to the matching result of the character preceding the character to be matched against the substring.

It can be seen that, during character string matching, the application provides the following matching information for each substring: index, character repetition count range, length, number of preceding substrings, preceding mark, repeat mark and initial mark. The target result used to calculate the count enable signal is the OR of several target values, namely the matching result values of each preceding substring of the substring against character i-L. Therefore, when the count range (namely MAX and MIN) is calculated, MAX and MIN are calculated from the actual length of each substring and its actual number of preceding substrings, without being limited to a single preceding substring or to a substring length of 1, so matching is more flexible. The application can thus support search matching with multiple preceding substrings and substrings longer than 1, which makes actual data search more flexible, supports more regular expressions, improves the flexibility and universality of character string matching, and brings the behavior closer to a real grep algorithm (a software search method).

Correspondingly, the character string matching device, the equipment and the readable storage medium provided by the application also have the technical effects.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flow chart of a string matching method disclosed herein;

FIG. 2 is a diagram of the existing HARE architecture;

FIG. 3 is a schematic diagram of IMV values disclosed in the present application;

FIG. 4 is a schematic diagram of MIN_EN and MAX_EN values disclosed in the present application;

FIG. 5 is a schematic diagram of MIN and MAX values disclosed herein;

FIG. 6 is a schematic diagram of RMV values disclosed herein;

FIG. 7 is a schematic diagram of a string matching apparatus disclosed in the present application;

FIG. 8 is a schematic diagram of a character string matching device disclosed in the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Currently, the HARE architecture supports only one preceding substring for each substring, and only allows each substring to be 1 in length. In an actual regular expression, a case where a certain substring has a plurality of preceding substrings often occurs. Therefore, the HARE architecture is not flexible enough in practical applications and supports fewer regular expression types. Therefore, the character string matching scheme is provided, the searching and matching of a plurality of preposed sub-character strings can be supported, the condition that the length of the sub-character strings is larger than 1 is supported, the actual data searching is more flexible, more regular expressions are supported, and the flexibility and the universality of the character string matching are improved.

To more clearly introduce the present application, the matching process for the existing HARE architecture is now described as follows.

The program code for the existing HARE architecture to compute MIN_EN[i][j] and MAX_EN[i][j] is as follows:

for i = 1 to W-1 do
    for j = 1 to S-1 do
        MIN_EN[i][j] = RMV[i-1][j-1] || MIN[i-1][j] > 0;
        MAX_EN[i][j] = RMV[i-1][j-1] || MAX[i-1][j] > 0;
    end for
end for

It can be seen that in the existing matching process, MIN_EN[i][j] (or MAX_EN[i][j]) is the OR of two intermediate results: RMV[i-1][j-1] and MIN[i-1][j] > 0 (or MAX[i-1][j] > 0). RMV[i-1][j-1] represents only the match between character i-1 and substring j-1. In other words, the substring is limited to a length of 1, and substring j is assumed to have only one preceding substring. The 1 in i-1 is the length of substring j, and the 1 in j-1 is the label of the single preceding substring j-1 of substring j.
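As an illustration of this baseline rule, the following Python sketch evaluates the two enable signals for one substring j at one character position. The function name `hare_enables` and the list-based layout (one value per substring, all taken at character i-1) are assumptions made for the example, not part of the HARE design.

```python
from typing import List, Tuple

def hare_enables(rmv_prev: List[int], min_prev: List[int],
                 max_prev: List[int], j: int) -> Tuple[bool, bool]:
    """Baseline enable signals for substring j (j >= 1).

    rmv_prev[s] stands for RMV[i-1][s], min_prev[s] for MIN[i-1][s] and
    max_prev[s] for MAX[i-1][s]. Only substring j-1 is consulted, which is
    exactly the single-preceding-substring limitation described above.
    """
    prev_matched = bool(rmv_prev[j - 1])
    min_en = prev_matched or min_prev[j] > 0
    max_en = prev_matched or max_prev[j] > 0
    return min_en, max_en

# Substring 0 matched at character i-1, so substring 1 may start counting.
print(hare_enables([1, 0], [0, 0], [0, 0], 1))
# No preceding match, but MAX[i-1][1] > 0 keeps the maximum counter enabled.
print(hare_enables([0, 0], [0, 0], [0, 2], 1))
```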

The program code for the existing HARE architecture to compute MIN[i][j] is as follows:

for i = 1 to W-1 do
    for j = 1 to S-1 do
        if MIN_EN[i][j] & IMV[i][j] then
            MIN[i][j] = RMV[i][j-1] ? 0 : MIN[i-1][j] + 1;
        end if
    end for
end for

The program code for the existing HARE architecture to compute MAX[i][j] is as follows:

for i = 1 to W-1 do
    for j = 1 to S-1 do
        if MAX_EN[i][j] & IMV[i][j] then
            MAX[i][j] = MAX[i-1][j] + 1;
        end if
    end for
end for

It can be seen that, in the existing matching process, MIN_EN[i][j] and MAX_EN[i][j] are calculated on the assumption that the substring has a length of 1 and only one preceding substring. The calculation of MIN[i][j] and MAX[i][j] inherits the same two restrictions and cannot cover many situations, so the HARE architecture is not flexible enough in practical applications and supports few regular expression types.

The present application improves on the min-max matching algorithm of the CRU (Counter-based Reduction Unit) in the existing HARE architecture. The improved algorithm is called the ACMU (Advanced Counter-based Match Unit) matching algorithm.

For completeness of the present solution, the hardware structure of the HARE architecture is now described as follows. The HARE architecture includes two parts, compiler (compiler) and hardware (hardware). The compiler is used for analyzing and dividing the regular expression input by the user to obtain the substrings to be matched.

Referring to FIG. 2, the hardware structure of the HARE architecture includes: a memory, a CCU (Character Class Unit) module, an algorithm module (Pattern Automata), an IMU (Intermediate Match Unit) module, and a CRU (Counter-based Reduction Unit) module.

A large number of characters to be matched are stored in the memory. The hardware first reads some of the characters to be matched, together with the parsing result of the regular expression, into the memory so that the CCU module can group the characters to be matched. The bit-split state machine in the algorithm module then processes the characters to be matched and the substrings, and its results are input to the IMU module to obtain the IMV vector. The IMV vector enters the CRU module, where the final matching is completed according to the min-max matching algorithm and the matching result is output.

The IMV vector is a matrix of S rows and W columns. IMV[i][j] represents the match between character i and substring j. IMV[i][j] equals 1 when substring j ends exactly at the position of character i; otherwise IMV[i][j] equals 0. If the length of substring j is 1, IMV[i][j] is valid (takes the value 1) when character i appears in j. If the length of substring j is greater than 1, IMV[i][j] is valid only if the consecutive characters to be matched appear in substring j in order. For example, for the substring ef, when two consecutive characters to be matched are e and f, the current index i corresponds to the character f, and the index j corresponds to the substring ef, the consecutive e and f appear in order in ef, so IMV[i][j] is valid. If the two consecutive characters to be matched are a and f, IMV[i][j] is invalid for the substring ef. A valid IMV[i][j] therefore indicates that the current character to be matched and the L-1 consecutive characters before it appear in order in the current substring, where L is the number of characters in the current substring.
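The definition of IMV can be checked with a small software model. The sketch below is only an illustration: in the described hardware the IMV values come from the bit-split state machine and the IMU module, whereas here they are computed by direct comparison. The function name `imv_column` and the representation of a substring as a list of character sets are assumptions.

```python
from typing import List, Set

def imv_column(text: str, classes: List[Set[str]]) -> List[int]:
    """Return IMV[i][j] for one substring j and every character index i.

    classes[p] is the set of characters accepted at position p of the
    substring, so len(classes) is the substring length L. The value is 1
    when the L characters ending at index i match the substring in order.
    """
    length = len(classes)
    column = []
    for i in range(len(text)):
        start = i - length + 1
        ok = start >= 0 and all(
            text[start + p] in classes[p] for p in range(length)
        )
        column.append(1 if ok else 0)
    return column

ef = [{"e"}, {"f"}]    # substring "ef", L = 2
ab = [{"a", "b"}]      # substring "[ab]", L = 1
print(imv_column("aefaf", ef))  # valid only where "ef" ends
print(imv_column("abc", ab))    # valid wherever the character is a or b
```

In this model the pair a, f does not make IMV valid for ef, which agrees with the example in the paragraph above.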

The min-max matching algorithm defines two counter matrices, MIN and MAX, and the corresponding enable signal matrices MIN_EN and MAX_EN. MIN_EN[i][j] equal to 1 indicates that MIN[i][j] is allowed to count. MAX_EN[i][j] equal to 1 indicates that MAX[i][j] is allowed to count. A matching vector matrix RMV is also defined to represent the match between a given character and a given substring.

For pipelining, the upstream module (IMU) inputs an IMV matrix of S rows and W columns to the CRU module every clock cycle. Within that clock cycle, all matrix values such as MIN_EN, MAX_EN, MIN, MAX and RMV must be calculated so that the matching result is output in real time. The present application improves the calculation of MIN_EN, MAX_EN, MIN, MAX and RMV.

Referring to fig. 1, an embodiment of the present application discloses a character string matching method, including:

s101, acquiring characters to be matched, sub character strings and matching information of the sub character strings; the matching information includes: index, character repetition number value range, length, number of preposed substrings, preposed mark, repeated mark and initial mark.

It should be noted that the character to be matched may be read from a hardware memory, and the sub-character strings are obtained by dividing from a regular expression (i.e. a character string to be matched) input by a user. After any regular expression is divided, at least one substring can be obtained. For each substring, corresponding match information may be determined as per table 1. The index is a label for each substring.

The initial mark is used for marking whether a certain substring is the first substring decomposed by the regular expression to be searched. The repetition flag is used to indicate whether a certain substring is allowed to occur repeatedly. The length of the substring is: the number of characters in the substring.

Specifically, the matching information of the substring may further include other information, which may specifically refer to table 1.

TABLE 1

In Table 1, the regular expression [ab][bc]+d?efc{2} is divided into 5 substrings: [ab], [bc]+, d?, ef and c{2}. [ab] is the first substring of the regular expression, so its initial flag is valid and is marked 1, and its end flag is invalid and is marked 0. [ab] represents either a or b and therefore has a length of 1. As the first substring, [ab] has no preceding substring, so its number of preceding substrings is 0. (BL, BU) is the value range of the character repetition count. For [ab], a character to be matched is considered matched whether it is a or b, so the repetition count range of this substring is (1, 1).

For the substring ef, d? may or may not be present, so both [bc]+ and d? are its preceding substrings; ef therefore has two preceding substrings, and its length is 2. The substring c{2} means that c repeats twice, so its repeat flag is valid and is marked 1. c{2} is the last substring of the regular expression [ab][bc]+d?efc{2}, so its initial flag is invalid and is marked 0, and its end flag is valid and is marked 1. Only one character c is matched at a time, so the length of c{2} is 1.
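The per-substring matching information described above could be held in a record such as the following Python sketch. The class and field names are hypothetical. Only values that the surrounding text states explicitly are filled in; every other field is left as None rather than guessed, and the preceding marks (the offsets 1 … k) are omitted for the same reason. The index values assume that the five substrings are numbered 0 to 4 in order of appearance.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SubstringInfo:
    """Matching information for one substring (hypothetical layout)."""
    index: int
    pattern: str
    length: Optional[int] = None                    # L[j]
    preceding_count: Optional[int] = None           # CS[j]
    repeat_range: Optional[Tuple[int, int]] = None  # (BL, BU)
    initial_flag: Optional[int] = None              # IV[j]
    end_flag: Optional[int] = None
    repeat_flag: Optional[int] = None               # RV[j]

# None marks a value that the description does not state in prose.
table = {
    "[ab]": SubstringInfo(0, "[ab]", length=1, preceding_count=0,
                          repeat_range=(1, 1), initial_flag=1, end_flag=0),
    "ef": SubstringInfo(3, "ef", length=2, preceding_count=2),
    "c{2}": SubstringInfo(4, "c{2}", length=1, initial_flag=0,
                          end_flag=1, repeat_flag=1),
}

print(table["ef"].length, table["ef"].preceding_count)
```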

The rest of the substrings in table 1 are analogized and will not be described herein.

In this embodiment, multiple regular expressions can be searched and matched in parallel. As shown in Table 1, j = 0 to j = 4 belong to the first regular expression, and j = 5 to j = 7 belong to the second regular expression. The two regular expressions share the characters to be matched. Referring to FIGS. 3-6, the matching process of the first regular expression starts from its first substring and continues until its end flag occurs. The matching process of the second regular expression likewise starts from its first substring and continues until its end flag occurs. The two matching processes are completely independent, but each input character (index i) is fed to both of them, which constitutes a parallel search.

S102, determining a counting enabling signal based on the initial mark, the target result, the repeated mark and the historical counting value range.

And S103, determining the current counting value range according to the counting enable signal.

The target result is the OR of several target values; the target values are the matching result values of each preceding substring of the substring against character i-L, where i is the index of the character to be matched and L is the length. The historical count range is the count range corresponding to the matching result of the character preceding the character to be matched against the substring.

The substring is denoted by j and the character to be matched is denoted by i, then in one embodiment, the calculation formula of the target result is:

TMP_j＝RMV[i-L][j-1]||RMV[i-L][j-2]||…||RMV[i-L][j-k]；

wherein L is the length of substring j, i is the index of the character to be matched, j is the index of the substring, and 1, 2 … k are the preceding marks of the respective preceding substrings of the substring; TMP_j is the target result; RMV[i-L][j-1] is the matching result value of the first preceding substring of the substring against character i-L; RMV[i-L][j-2] is the matching result value of the second preceding substring against character i-L; RMV[i-L][j-k] is the matching result value of the k-th preceding substring against character i-L. Character i-L is the character to be matched whose index is i-L.
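The formula for TMP_j can be modeled in software as below. This is a sketch under stated assumptions: `target_result` is a hypothetical name, `rmv[c][s]` stands for RMV of character index c and substring index s, the preceding substrings are assumed to sit at offsets 1 … k with k not larger than j, and an index i-L that falls before the first character is treated as "no match", which the application does not specify.

```python
from typing import List

def target_result(rmv: List[List[int]], i: int, j: int,
                  length: int, preceding_count: int) -> int:
    """OR together RMV[i-L][j-1] ... RMV[i-L][j-k] for substring j."""
    c = i - length
    if c < 0:
        return 0  # assumption: nothing precedes the start of the data
    tmp = 0
    for k in range(1, preceding_count + 1):
        tmp = tmp or rmv[c][j - k]
    return 1 if tmp else 0

# Four characters, five substrings; only the values needed here are set.
# rmv[1][1] = 1 models "[bc]+ matched at character 1".
rmv = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
# Substring ef: index 3, length 2, two preceding substrings.
print(target_result(rmv, 3, 3, 2, 2))
```

With i = 3 and L = 2 the function looks at character 1, where the second preceding substring ([bc]+, offset 2) matched, so the target result is valid even though d? (offset 1) did not match.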

In this embodiment, the count enable signal is determined from the actual substring length and the actual number of preceding substrings, so MAX and MIN can be calculated from the actual length of each substring and its actual number of preceding substrings. The method is therefore not limited to a single preceding substring or to a substring length of 1, and matching is more flexible.

Specifically, the counter needs to count the maximum value and the minimum value, so the count enable signal includes: a minimum enable signal and a maximum enable signal. Correspondingly, the current counting value range refers to an interval formed by a maximum value and a minimum value.

In one embodiment, determining a count enable signal based on the initial marker, the target outcome, the repeat marker, and the historical count value range includes: acquiring a minimum value and a maximum value in a historical counting value range; respectively determining a first mark corresponding to the minimum numerical value and a second mark corresponding to the maximum numerical value; performing AND operation on the first mark and the repeated mark to obtain a first AND operation result; performing AND operation on the second mark and the repeated mark to obtain a second AND operation result; performing OR operation on the first AND operation result, the initial mark and the target result to obtain a first OR operation result; performing OR operation on the second AND operation result, the initial mark and the target result to obtain a second OR operation result; the first or operation result is determined as a minimum enable signal among the count enable signals, and the second or operation result is determined as a maximum enable signal among the count enable signals.

In one embodiment, determining the first mark corresponding to the minimum value and the second mark corresponding to the maximum value respectively comprises: if the minimum value (i.e., MIN[i-1][j]) is greater than 0, determining that the first mark is 1; otherwise, determining that the first mark is 0; if the maximum value (i.e., MAX[i-1][j]) is greater than 0, determining that the second mark is 1; otherwise, determining that the second mark is 0.

For example, let the substring be denoted by j and the character to be matched by i. In the process of matching i against j, the minimum enable signal is denoted by MIN_EN[i][j] and the maximum enable signal by MAX_EN[i][j], and both can be calculated according to the following program.

for i = 0 to W-1 do
    for j = 0 to S-1 do
        TMP = 0;
        for k = 1 to CS[j] do
            TMP = TMP || RMV[i-L[j]][j-k];
        end for
        MIN_EN[i][j] = IV[j] || TMP || (MIN[i-1][j] > 0 & RV[j]);
        MAX_EN[i][j] = IV[j] || TMP || (MAX[i-1][j] > 0 & RV[j]);
    end for
end for

Here, W is the number of characters to be matched that are read in each clock cycle, and S is the total number of substrings generated in the compiling stage. TMP = 0 sets the initial target result to 0; because the first substring has no preceding substring, it has no corresponding RMV[i-L[j]][j-k], and its target result therefore stays 0. L[j] is the length of substring j, CS[j] is its number of preceding substrings, and k takes values from 1 to CS[j]. IV[j] is the initial mark of substring j, and RV[j] is the repeat mark of substring j. || is the OR operator and & is the AND operator. MAX[i-1][j] is the maximum value of the count range calculated when matching character i-1 against j. MIN[i-1][j] is the minimum value of the count range calculated when matching character i-1 against j.

From MIN_EN[i][j] = IV[j] || TMP || (MIN[i-1][j] > 0 & RV[j]) it follows that when the initial mark IV[j] is valid, MIN_EN[i][j] is valid (recorded as 1) regardless of the values of TMP and (MIN[i-1][j] > 0 & RV[j]). If MIN_EN[i][j] is invalid, it is recorded as 0.

Similarly, from MAX_EN[i][j] = IV[j] || TMP || (MAX[i-1][j] > 0 & RV[j]) it follows that when the initial mark IV[j] is valid, MAX_EN[i][j] is valid (recorded as 1) regardless of the values of TMP and (MAX[i-1][j] > 0 & RV[j]). If MAX_EN[i][j] is invalid, it is recorded as 0.

If MIN_EN[i][j] equals 1 or MAX_EN[i][j] equals 1, character i may be consumed by j, that is, i and j may match. Three cases indicate that i and j may match, so MIN_EN[i][j] is the logical OR of IV[j], TMP and (MIN[i-1][j] > 0 & RV[j]), and MAX_EN[i][j] is the logical OR of IV[j], TMP and (MAX[i-1][j] > 0 & RV[j]).

In the first case: j is the first substring decomposed by the regular expression to be searched, and the initial flag IV [ j ] is valid, which indicates that: the matching can now be started with any character in memory to be matched. MIN _ EN [ i ] [ j ] and MAX _ EN [ i ] [ j ] are valid at this time.

In the second case: among all possible preceding substrings of substring j, at least one has completed its match at the position of character i-L[j]. Character i may then be the last of the L[j] characters that substring j needs to consume for its first occurrence. Assuming that substring j has a length of L[j] and CS[j] preceding substrings, the match of each preceding substring of j against character i-L[j] must be determined, that is, RMV[i-L[j]][j-1], RMV[i-L[j]][j-2] … RMV[i-L[j]][j-k] are calculated respectively. The logical OR of these RMV values is the current TMP, which represents the OR of the matching results of each preceding substring of substring j against character i-L[j].

In the third case, character i-1 has been consumed by substring j (i.e., i-1 matches j) and substring j is allowed to repeat, i.e., RV[j] is valid and recorded as 1. Character i may then be a repeated match of substring j.
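As a non-limiting illustration, the three cases above can be sketched in Python. The function name `count_enable` and its argument names are assumptions made for this sketch and do not appear in the embodiment; `tmp` stands for the target result TMP, and `prev_min` and `prev_max` stand for MIN[i-1][j] and MAX[i-1][j].

```python
def count_enable(iv_j, tmp, prev_min, prev_max, rv_j):
    """Return (MIN_EN, MAX_EN) for one character i and one substring j.

    iv_j     -- initial flag IV[j]
    tmp      -- target result TMP (OR over the preceding substrings)
    prev_min -- MIN[i-1][j]
    prev_max -- MAX[i-1][j]
    rv_j     -- repeat flag RV[j]
    """
    # Case 1: IV[j] valid; case 2: TMP valid; case 3: previous count > 0 and repeat allowed.
    min_en = bool(iv_j or tmp or (prev_min > 0 and rv_j))
    max_en = bool(iv_j or tmp or (prev_max > 0 and rv_j))
    return min_en, max_en
```

With IV[j] valid, the sketch returns (True, True) whatever the other inputs are, which agrees with the first case.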

After determining whether MIN_EN[i][j] and MAX_EN[i][j] are valid, the current count value range can be determined from them. In one embodiment, determining the current count value range according to the count enable signal includes: if the minimum enable signal and the target vector (namely IMV[i][j]) are valid and the matching result of the character to be matched with the previous substring of the substring is invalid, determining that the minimum of the current count value range is the minimum value of the historical count value range incremented by 1; otherwise, determining that the minimum of the current count value range is 0. A valid target vector indicates that L consecutive characters all appear in the substring in order, where L is the length (i.e., the number of characters in the substring); the L characters comprise the character to be matched and the L-1 characters before it. If the maximum enable signal and the target vector are valid, determining that the maximum of the current count value range is the maximum value of the historical count value range incremented by 1; otherwise, determining that the maximum of the current count value range is 0.

For example, let j denote the substring and i the character to be matched. During the matching of i and j, the minimum of the current count value range is denoted MIN[i][j], which can be calculated according to the following program.

for i = 0 to W-1 do
    for j = 0 to S-1 do
        if IV[j] then
            MIN[i][j] = 0;
        else if MIN_EN[i][j] & IMV[i][j] then
            MIN[i][j] = RMV[i][j-1] ? 0 : MIN[i-1][j] + 1;
        else
            MIN[i][j] = 0;
        end if
    end for
end for

Here, if MIN_EN[i][j] and IMV[i][j] are valid and RMV[i][j-1] is invalid, MIN[i][j] is MIN[i-1][j] incremented by 1. If RMV[i][j-1] is valid, MIN[i][j] takes the value 0. A valid RMV[i][j-1] indicates that character i and substring j-1 have already matched, i.e., character i has been consumed by substring j-1 and cannot be consumed again by substring j, so MIN[i][j] is set to 0.
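The MIN update for a single character and a single substring can also be written as a small Python function. This is only a sketch of the program above: the name `next_min` and its arguments are assumptions made here, and `rmv_prev_sub` stands for RMV[i][j-1].

```python
def next_min(iv_j, min_en, imv, rmv_prev_sub, prev_min):
    """Return MIN[i][j] from MIN[i-1][j] (prev_min) and the current signals."""
    if iv_j:
        # The first substring of a regular expression always restarts at 0.
        return 0
    if min_en and imv:
        # Character i was already consumed by substring j-1: reset; otherwise count up.
        return 0 if rmv_prev_sub else prev_min + 1
    return 0
```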

For example, let j denote the substring and i the character to be matched. During the matching of i and j, the maximum of the current count value range is denoted MAX[i][j], which can be calculated according to the following program.

for i = 0 to W-1 do
    for j = 0 to S-1 do
        if MAX_EN[i][j] & IMV[i][j] then
            MAX[i][j] = MAX[i-1][j] + 1;
        else
            MAX[i][j] = 0;
        end if
    end for
end for

Here, if MAX_EN[i][j] and IMV[i][j] are valid, MAX[i][j] is MAX[i-1][j] incremented by 1; otherwise, MAX[i][j] is set to 0.
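For comparison with the MIN update, the MAX update can be sketched as follows. The function name `next_max` and its arguments are assumptions for illustration only; `prev_max` stands for MAX[i-1][j].

```python
def next_max(max_en, imv, prev_max):
    """Return MAX[i][j]: count up while enabled and matching, else reset to 0."""
    return prev_max + 1 if (max_en and imv) else 0
```

Unlike MIN, this update depends neither on the initial flag nor on RMV[i][j-1], which is consistent with the program above.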

S104: If the current count value range overlaps the character repetition count value range, determine that the character to be matched and the substring are successfully matched.

Example code for calculating RMV in this embodiment is as follows:

for i = 1 to W-1 do
    for j = 1 to S do
        RMV[i][j] = MAX[i][j] >= BL[i][j] & MIN[i][j] <= BU[i][j];
    end for
end for

Here, when the range formed by MIN[i][j] and MAX[i][j] intersects the range formed by BL and BU, RMV[i][j] is valid, indicating that j and i have completed one match. When the RMV of the last substring among all substrings decomposed from the regular expression to be searched is valid, the search and matching of the regular expression is complete.
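The overlap test can be illustrated with a short Python sketch. The function name `ranges_overlap` is an assumption made here; `bl` and `bu` stand for the lower and upper bounds BL and BU of the character repetition count value range.

```python
def ranges_overlap(min_count, max_count, bl, bu):
    """Return RMV: True when [min_count, max_count] intersects [bl, bu]."""
    return max_count >= bl and min_count <= bu
```

For instance, a count range of (2, 2) overlaps a repetition range of (2, 2), whereas a count range of (0, 1) does not.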

It should be noted that S101 to S104 describe the matching process for any one character and any one substring.

In a specific embodiment, the method further includes: if the current count value range does not overlap the character repetition count value range, determining that the character to be matched and the substring are not successfully matched.

After the current count value range is determined, it can be judged whether it overlaps the character repetition count value range. If so, i and j are successfully matched; otherwise, i and j are not successfully matched.

After the matching of i and j is completed, whether the search for the current regular expression is complete can be determined from the termination flag of the substring. In one embodiment, if the termination flag of the substring is invalid, the next round of matching is performed; if the termination flag of the substring is valid, the matching process of the current regular expression ends. The next round of matching may be the matching of i and j+1, or the matching of i+1 and j.

Therefore, in the string matching process, the embodiment of the application provides the following matching information for each substring: index, character repetition count value range, length, number of preceding substrings, preceding flag, repeat flag, and initial flag. The target result used to calculate the count enable signal is the OR of different target values, and the different target values include the matching result values of each preceding substring of the substring with character i-L, where i is the index of the character to be matched and L is the length of the substring. Accordingly, when the count value range (namely MAX and MIN) is calculated, MAX and MIN are computed from the actual length of each substring and its number of preceding substrings, without being limited to a single preceding substring or to a substring length of 1, so matching is more flexible. The application can thus support search matching with multiple preceding substrings and substrings longer than 1, which makes actual data search more flexible, supports more regular expressions, improves the flexibility and generality of string matching, and comes closer to a real grep algorithm.

As in the above embodiments, the architecture that the present application uses for regular expression search with the ACMU algorithm is similar to the existing HARE architecture and is implemented by a compiler together with a hardware part. The structure of the hardware part is shown in Fig. 2. The compiler parses the regular expression to be searched into substrings and their corresponding matching information, as shown in Table 1; MIN_EN, MAX_EN, MIN, MAX, RMV, and so on can then be calculated with the structure shown in Fig. 2.

In the ACMU algorithm, one substring consumes one character, so if a substring contains a repeated character, a corresponding number of characters to be matched is required, and the substring cannot be matched by a single character. For example, substring c{2} requires 2 occurrences of c, so consuming 2 consecutive characters c indicates a successful match of c{2}.

Referring to the two regular expressions listed in Table 1, let W be 4, i.e., 4 characters are processed per clock cycle. Assuming that the input text to be matched is abcefccgtrak, the first clock inputs abce, the second clock inputs fccg, and the third clock inputs trak. After processing by the CCU and the IMU, the input of each clock enters the ACMU as the corresponding IMV vectors.
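The division of the input text into per-clock windows can be reproduced with a few lines of Python. This is only an illustrative software sketch of the windowing, not of the hardware; the function name `split_windows` is an assumption made here.

```python
def split_windows(text, w):
    """Split text into consecutive windows of w characters (the last may be shorter)."""
    return [text[k:k + w] for k in range(0, len(text), w)]


# The example text with W = 4 yields the three clock inputs named above.
clocks = split_windows("abcefccgtrak", 4)
```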

The values of each IMV at these three clocks are shown in Fig. 3, where each cell records the IMV value for a given character and a given substring. For example, for character a and substring [ab], IMV([ab]) is 1; for character a and substring [bc], IMV([bc]) is 0.

The other cells follow by analogy.
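To make the meaning of the target vector concrete, the following Python sketch evaluates IMV in software. It assumes, purely for illustration, that a substring is modeled as a list of character sets, one set per position (so [ab] is one set and ef is two sets), and that positions before the start of the text never match; the function name `imv` is likewise an assumption.

```python
def imv(text, i, classes):
    """Return True when the L characters ending at index i match classes in order.

    classes -- list of sets of characters, one per position of the substring;
               L = len(classes) is the substring length.
    """
    length = len(classes)
    start = i - length + 1
    if start < 0:
        # Assumption of this sketch: no history before the first character.
        return False
    return all(text[start + k] in classes[k] for k in range(length))


# Character a (index 0 of "abce") against [ab] and against [bc].
imv_ab = imv("abce", 0, [set("ab")])
imv_bc = imv("abce", 0, [set("bc")])
```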

In the first clock cycle, the substrings [ab] and tr are the first substrings decomposed from the regular expressions [ab][bc]+d?efc{2} and tr[ae]ce, respectively, i.e., IV[0] (0 is the index of [ab]) and IV[5] (5 is the index of tr) are both 1. MIN_EN[i][0], MIN_EN[i][5], MAX_EN[i][0], and MAX_EN[i][5] are therefore all 1, as shown in Fig. 4.

In Fig. 4, each cell records the MIN_EN and MAX_EN values for a given character and a given substring. For example, the initial flag of substring [ab] is valid, so MIN_EN and MAX_EN are both 1 regardless of which character substring [ab] is matched against.

MIN_EN[1][1] (MIN_EN when character 1 is matched against substring 1) and MIN_EN[2][1] (MIN_EN when character 2 is matched against substring 1) are 1 because substring [bc] has length 1 and only one preceding substring, and RMV[0][0] and RMV[1][0] are 1, indicating that its preceding substring [ab] matched character 0 and character 1 successfully. The current character may therefore be the first character that the current substring needs to consume. Similarly, MAX_EN[1][1] and MAX_EN[2][1] are 1, as shown in Fig. 4.

In Fig. 5, each cell records the MIN and MAX values for a given character and a given substring. For example, when substring [ab] is matched against character a, MIN is 0 and MAX is 1.

In the second clock cycle, substring ef has length 2 and two preceding substrings: it can be preceded by either substring [bc] or substring d?. In calculating EN(ef), i.e., MIN_EN[0][3] and MAX_EN[0][3], the RMV values at i-2 of its two preceding substrings must be traversed. In fact, when the matching position of a preceding substring is calculated at i = 0 of the second clock, W should be added to i, giving i = 4; this is equivalent to reading the RMV value from the previous window (clock) of W characters.
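The index adjustment across windows can be sketched in Python as follows. The function name `locate_history` and the convention of returning a window offset (0 for the current clock, -1 for the previous clock) are assumptions of this sketch, not part of the embodiment.

```python
def locate_history(i, length, w):
    """Map position i - length to (window_offset, index_within_window).

    A window_offset of 0 means the current clock; -1 means the previous clock.
    """
    # Floor division moves negative positions into earlier windows.
    window_offset, index = divmod(i - length, w)
    return window_offset, index


# Substring ef (L = 2) at i = 0 of the second clock with W = 4.
offset, index = locate_history(0, 2, 4)
```

In this example the lookup lands on index 2 of the previous window, which is the position at which the RMV values of the preceding substrings are read.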

Traversing the RMV values of the two possible preceding substrings of substring ef gives RMV[2][2] | RMV[2][1]. Since RMV[2][1] equals 1, i.e., RMV([bc]) equals 1 at i = 2, EN(ef) equals (1,1). When EN(c) is calculated at i = 2, EN(c) equals (1,1), i.e., MIN_EN[2][4] and MAX_EN[2][4] are both 1, because MIN[1][4] is greater than 0 and RV[4] is 1, i.e., repetition is allowed. Since EN(c) and IMV(c) are valid at both i = 1 and i = 2, COUNT(c) is incremented at i = 1 and again at i = 2, so COUNT(c) equals (2,2) at i = 2, as shown in Fig. 5. This range falls exactly within the (BL, BU) range of substring c. RMV(c) therefore equals 1 at i = 2, i.e., RMV[2][4] equals 1, as shown in Fig. 6. Substring c is the last substring decomposed from the regular expression [ab][bc]+d?efc{2}, i.e., MP[4] equals 1 in Table 1. The regular expression [ab][bc]+d?efc{2} therefore completes one match at this point.
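The counting of substring c over the second clock (fccg) can be replayed with a small Python simulation of a single substring. This is a hedged sketch: the function name `run_substring`, the per-position input lists, and the values BL = BU = 2 for c{2} are assumptions made here, and the `tmp` and `rmv_prev` inputs assume that substring ef completes its match at i = 0 of this clock.

```python
def run_substring(imv, tmp, rmv_prev, iv, rv, bl, bu):
    """Replay MIN/MAX/RMV for one substring over one window.

    imv, tmp, rmv_prev -- per-position lists of IMV[i][j], TMP and RMV[i][j-1]
    iv, rv             -- initial flag and repeat flag of the substring
    bl, bu             -- lower and upper bound of the repetition count range
    Returns a list of (MIN, MAX, RMV) tuples, one per position.
    """
    prev_min = prev_max = 0
    out = []
    for imv_i, tmp_i, rmv_prev_i in zip(imv, tmp, rmv_prev):
        min_en = iv or tmp_i or (prev_min > 0 and rv)
        max_en = iv or tmp_i or (prev_max > 0 and rv)
        if iv:
            cur_min = 0
        elif min_en and imv_i:
            cur_min = 0 if rmv_prev_i else prev_min + 1
        else:
            cur_min = 0
        cur_max = prev_max + 1 if (max_en and imv_i) else 0
        out.append((cur_min, cur_max, cur_max >= bl and cur_min <= bu))
        prev_min, prev_max = cur_min, cur_max
    return out


# Window "fccg": c matches at i = 1 and i = 2; ef is assumed matched at i = 0.
trace = run_substring(
    imv=[False, True, True, False],
    tmp=[False, True, False, False],
    rmv_prev=[True, False, False, False],
    iv=False, rv=True, bl=2, bu=2,
)
```

Under these assumptions the count reaches (2, 2) at i = 2 and RMV becomes valid there, in line with the walkthrough above.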

In Fig. 6, each cell records the RMV value for a given character and a given substring. For example, when substring [ab] is matched against character a, RMV is 1.

During the third clock cycle, RMV(tr) and RMV([ae]) equal 1 at i = 1 and i = 2, respectively, i.e., RMV[1][5] and RMV[2][6] equal 1, as shown in Fig. 6. However, since RMV(ce) equals 0 at i = 3, i.e., RMV[3][7] equals 0, the regular expression tr[ae]ce does not complete a match.

It can be seen that this embodiment extends the MIN-MAX algorithm of the HARE architecture: when MIN_EN and MAX_EN are calculated, the initial flag, the repeat flag, the number CS of preceding substrings, and so on are introduced, so that the matching status of all preceding substrings can be traversed. In addition, the initial flag is introduced when MIN is calculated, so that MIN of the first substring decomposed from the regular expression to be searched is fixed at 0. This embodiment supports multiple preceding substrings and substrings longer than 1. It is therefore more flexible in actual data search, supports the combined use of special characters in more scenarios, supports more regular expressions, and comes closer to a real grep algorithm.

A character string matching apparatus provided by an embodiment of the present application is introduced below; the character string matching apparatus described below and the character string matching method described above may be referred to each other.

Referring to fig. 7, an embodiment of the present application discloses a character string matching apparatus, including:

an obtaining module 701, configured to obtain a character to be matched, a substring, and matching information of the substring; the matching information includes: index, character repetition count value range, length, number of preceding substrings, preceding flag, repeat flag, and initial flag;

a determining module 702, configured to determine a count enable signal based on the initial flag, the target result, the repeat flag, and the historical count value range;

a counting module 703, configured to determine a current counting value range according to the counting enable signal;

a matching module 704, configured to determine that the character to be matched and the substring are successfully matched if the current count value range overlaps the character repetition count value range.

In one embodiment, the determining module comprises:

an acquisition unit, configured to obtain a minimum value and a maximum value in the historical count value range;

a flag determination unit, configured to determine a first flag corresponding to the minimum value and a second flag corresponding to the maximum value, respectively;

an AND operation unit, configured to perform an AND operation on the first flag and the repeat flag to obtain a first AND operation result, and to perform an AND operation on the second flag and the repeat flag to obtain a second AND operation result;

an OR operation unit, configured to perform an OR operation on the first AND operation result, the initial flag, and the target result to obtain a first OR operation result, and to perform an OR operation on the second AND operation result, the initial flag, and the target result to obtain a second OR operation result;

and an enable signal determination unit, configured to determine the first OR operation result as the minimum enable signal of the count enable signal and the second OR operation result as the maximum enable signal of the count enable signal.

In one embodiment, the flag determination unit is specifically configured to:

if the minimum value is greater than 0, determine that the first flag is 1; otherwise, determine that the first flag is 0;

if the maximum value is greater than 0, determine that the second flag is 1; otherwise, determine that the second flag is 0.

In one embodiment, the counting module is specifically configured to:

if the minimum enable signal and the target vector are valid and the matching result of the character to be matched with the previous substring of the substring is invalid, determine that the minimum of the current count value range is the minimum value incremented by 1; otherwise, determine that the minimum of the current count value range is 0; a valid target vector indicates that L consecutive characters all appear in the substring in order, where L is the length; the L characters comprise the character to be matched and the L-1 characters before it;

if the maximum enable signal and the target vector are valid, determine that the maximum of the current count value range is the maximum value incremented by 1; otherwise, determine that the maximum of the current count value range is 0.

In a specific embodiment, the apparatus further comprises:

an iteration module, configured to perform the next round of matching if the termination flag of the substring is invalid;

and a termination module, configured to end the matching process if the termination flag of the substring is valid.

In a specific embodiment, the apparatus further comprises:

an output module, configured to determine that the character to be matched and the substring are not successfully matched if the current count value range does not overlap the character repetition count value range.

In one embodiment, the target result is calculated by the formula:

TMP_j = RMV[i-L][j-1] || RMV[i-L][j-2] || … || RMV[i-L][j-k];

where L is the length of the substring, i is the index of the character to be matched, j is the index of the substring, and 1, 2, …, k are the preceding flags of the preceding substrings of the substring;

TMP_j is the target result;

RMV[i-L][j-1] is the matching result value of the first preceding substring of the substring with character i-L;

RMV[i-L][j-2] is the matching result value of the second preceding substring of the substring with character i-L;

RMV[i-L][j-k] is the matching result value of the k-th preceding substring of the substring with character i-L.
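As a non-limiting software illustration of this formula, the OR over the preceding substrings can be sketched in Python. The function name `target_result`, the table layout `rmv[position][substring]`, the requirement that k does not exceed j, and the choice to return False when i-L falls before the start of the table are all assumptions of this sketch.

```python
def target_result(rmv, i, length, j, k):
    """Return TMP_j: OR of RMV[i-L][j-1], RMV[i-L][j-2], ..., RMV[i-L][j-k].

    rmv    -- table indexed as rmv[character position][substring index]
    length -- L, the length of substring j
    k      -- number of preceding substrings of substring j (assumed k <= j)
    """
    pos = i - length
    if pos < 0:
        # Assumption of this sketch: no history is available before position 0.
        return False
    return any(bool(rmv[pos][j - p]) for p in range(1, k + 1))
```

For example, with substring index j = 3, length 2, and two preceding substrings, a valid RMV[2][1] alone is enough to make the target result valid at i = 4.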

For more specific working processes of each module and unit in this embodiment, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not described here again.

Therefore, this embodiment provides a character string matching apparatus that supports search matching with multiple preceding substrings and substrings longer than 1, so that it is more flexible in actual data search, supports more regular expressions, improves the flexibility and generality of string matching, and comes closer to a real grep algorithm.

A character string matching device provided by an embodiment of the present application is introduced below; the character string matching device described below and the character string matching method and apparatus described above may be referred to each other.

Referring to Fig. 8, an embodiment of the present application discloses a character string matching device, including:

a memory 801 for storing a computer program;

a processor 802 for executing the computer program to implement the method disclosed by any of the above embodiments.

A readable storage medium provided by an embodiment of the present application is introduced below; the readable storage medium described below and the character string matching method, apparatus, and device described above may be referred to each other.

A readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the character string matching method disclosed in the foregoing embodiments. For the specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not described herein again.

References in this application to "first," "second," "third," "fourth," etc., if any, are intended to distinguish between similar elements and not necessarily to describe a particular order or sequence. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, or apparatus.

It should be noted that descriptions in this application referring to "first", "second", etc. are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present application.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of readable storage medium known in the art.

The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method for string matching, comprising:

2. The string matching method according to claim 1, wherein determining a count enable signal based on the initial flag, the target result, the repeat flag, and a historical count value range comprises:

3. The method according to claim 2, wherein determining the first flag corresponding to the minimum value and the second flag corresponding to the maximum value respectively comprises:

4. The method according to claim 3, wherein the determining a current count value range according to the count enable signal includes:

5. The character string matching method according to any one of claims 1 to 4, characterized by further comprising:

6. The character string matching method according to any one of claims 1 to 4, characterized by further comprising:

7. The character string matching method according to any one of claims 1 to 4, wherein the calculation formula of the target result is:

TMP_j = RMV[i-L][j-1] || RMV[i-L][j-2] || … || RMV[i-L][j-k];

TMP_j is the target result;

8. A character string matching apparatus, comprising:

9. A character string matching device, characterized by comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the string matching method as claimed in any one of claims 1 to 7.

10. A readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the string matching method according to any one of claims 1 to 7.