Detailed Description
Embodiments of this specification provide a method, an apparatus, and a device for identifying batch-generated character strings.
To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
Fig. 1 is a schematic flow chart of a method for identifying batch-generated character strings according to an embodiment of this specification. The flow includes:
Step 105, receiving a batch-generated character string to be identified;
In the embodiments of this specification, account names on large network platforms are taken as an example; an account name is a character string formed by concatenating characters. An account name generated automatically by a machine is typically a random concatenation of characters, such as 'iehfdjksyneyg', whereas most account names registered by ordinary users are character strings that carry some meaning, such as 'ilekkobe'. The degree of randomness of machine-generated account names is therefore far greater than that of account names registered by ordinary users.
As shown in step 220 of Fig. 2, a character string (a batch-generated character string to be identified) is input. In the embodiments of this specification, the character string input in step 220 is received; for example, the character string "ak, ti odoe dgza" is received.
Step 110, segmenting the character string to be identified to obtain at least one sub-character string of the character string to be identified;
Preferably, the character string to be identified "ak, ti odoe dgza" received in step 105 is first preprocessed: characters that cannot be used in account names, such as spaces and punctuation marks, are removed, giving the preprocessed character string "aktiodoedgza". The preprocessed character string is then segmented to obtain at least one substring, as shown in step 225 of Fig. 2.
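By way of illustration only, the preprocessing described above might be sketched in Python as follows. The function name preprocess and the rule that an account name may contain only ASCII letters and digits are assumptions made for this sketch; this specification only states that characters such as spaces and punctuation marks are removed.

```python
import string

# Assumption for this sketch: account names may contain only ASCII letters and digits.
ALLOWED_CHARS = set(string.ascii_letters + string.digits)


def preprocess(raw: str) -> str:
    """Remove every character that is not allowed in an account name."""
    return "".join(ch for ch in raw if ch in ALLOWED_CHARS)


print(preprocess("ak, ti odoe dgza"))  # aktiodoedgza
```

A platform with different naming rules would substitute its own set of allowed characters.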
In the embodiments of this specification, the preprocessed character string is segmented at intervals of a preset character length; for example, it is split once every two characters and/or once every three characters, to obtain at least one substring.
In the embodiments of this specification, if n = 2 is taken for the N-gram model, the preprocessed character string "aktiodoedgza" is segmented into the substrings "ak", "ti", "od", "oe", "dg" and "za"; if n = 3 is taken, it is segmented into the substrings "akt", "iod", "oed" and "gza".
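The fixed-interval segmentation in this example can be sketched as follows. The function name split_fixed is hypothetical, and keeping a trailing piece shorter than n is an assumption, since this specification does not say how a remainder is handled.

```python
def split_fixed(text: str, n: int) -> list:
    """Split text into consecutive, non-overlapping pieces of length n.

    Assumption: if len(text) is not a multiple of n, the shorter final
    piece is kept rather than discarded.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [text[i:i + n] for i in range(0, len(text), n)]


print(split_fixed("aktiodoedgza", 2))  # ['ak', 'ti', 'od', 'oe', 'dg', 'za']
print(split_fixed("aktiodoedgza", 3))  # ['akt', 'iod', 'oed', 'gza']
```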
Step 115, determining the occurrence probability of at least one sub-character string of the character string to be recognized, and determining the randomness degree of the character string to be recognized according to the occurrence probability of the sub-character string;
In the embodiments of this specification, a probability dictionary is first used to match the occurrence probabilities of the substrings "ak", "ti", "od", "oe", "dg" and "za" of the character string to be identified "ak, ti odoe dgza"; the probability dictionary contains the correspondence between sample substrings and the probabilities of those sample substrings. From the occurrence probabilities of the substrings, the occurrence probability P of the character string to be identified is calculated, and the degree of randomness R of the character string to be identified is then determined, as shown in step 230 of Fig. 2.
Specifically, where the probabilities that the substrings "ak", "ti", "od", "oe", "dg" and "za" occur individually are obtained as 0.79, 0.59, 0.63, 0.71, 0.56 and 0.68, respectively, the geometric mean of these values, 0.66, is taken as the occurrence probability P of the character string to be identified "ak, ti odoe dgza". The degree of randomness is R = 1 - P, so R is 0.34.
Alternatively, where the probabilities that at least two adjacent substrings among "ak", "ti", "od", "oe", "dg" and "za" occur simultaneously are obtained, the geometric mean of those probabilities is taken as the occurrence probability P of the character string to be identified. Taking the simultaneous occurrence of two adjacent substrings ("ak" and "ti", "ti" and "od", "od" and "oe", "oe" and "dg", "dg" and "za") as an example, the geometric mean of the obtained probabilities 0.69, 0.63, 0.71 and 0.66 is calculated as 0.68 and taken as the occurrence probability P of the character string to be identified "ak, ti odoe dgza". The degree of randomness is R = 1 - P, so R is 0.32.
Alternatively, where both the probabilities that the substrings of the character string to be identified "ak, ti odoe dgza" occur individually and the probabilities that two adjacent substrings occur simultaneously are obtained, the arithmetic mean of the geometric mean of the individual-occurrence probabilities and the geometric mean of the simultaneous-occurrence probabilities is taken as the occurrence probability P of the character string to be identified, which gives P = 0.67. From this probability of 0.67, the degree of randomness R of the character string to be identified is determined to be 0.33.
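The calculation of P and R described above can be sketched as follows. The function names are hypothetical, and returning 0.0 when any probability is not positive is an assumption of this sketch. The first example uses the six individual-occurrence probabilities from the text; the second combines the two geometric means as stated in the text (0.66 and 0.68) rather than recomputing them.

```python
import math


def geometric_mean(probabilities):
    """Geometric mean of a non-empty list of probabilities.

    Assumption: if any probability is not positive, the mean is taken as 0.0.
    """
    if not probabilities:
        raise ValueError("at least one probability is required")
    if any(p <= 0.0 for p in probabilities):
        return 0.0
    return math.exp(sum(math.log(p) for p in probabilities) / len(probabilities))


def randomness(p):
    """Degree of randomness R = 1 - P."""
    return 1.0 - p


def combined_probability(p_single, p_adjacent):
    """Arithmetic mean of the two geometric means (third alternative)."""
    return (p_single + p_adjacent) / 2.0


# Individual-occurrence probabilities of "ak", "ti", "od", "oe", "dg", "za".
single = [0.79, 0.59, 0.63, 0.71, 0.56, 0.68]
p_single = geometric_mean(single)
print(f"P = {p_single:.2f}, R = {randomness(p_single):.2f}")  # P = 0.66, R = 0.34

# Combination of the two geometric means stated in the text.
p_combined = combined_probability(0.66, 0.68)
print(f"P = {p_combined:.2f}, R = {randomness(p_combined):.2f}")  # P = 0.67, R = 0.33
```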
It should be noted that the probability dictionary is obtained before it is used to match the occurrence probabilities of the substrings "ak", "ti", "od", "oe", "dg" and "za" of the character string to be identified "ak, ti odoe dgza". In the embodiments of this specification, the type of the sample character string data is the same as the type of the batch-generated character strings to be identified. English magazines, English web pages, or other normally obtainable English articles are therefore taken as the sample character string data, as shown in step 205 of Fig. 2. The sample character string data is then segmented to obtain a plurality of sample substrings, as shown in step 210 of Fig. 2. The number of times each of the sample substrings occurs individually and/or the number of times at least two adjacent sample substrings occur simultaneously is counted, and the probability that each sample substring occurs individually and/or the probability that the at least two adjacent sample substrings occur simultaneously is calculated to obtain the probability dictionary, as shown in step 215 of Fig. 2. The probability dictionary comprises the sample substrings and the probabilities that they occur individually, and/or the at least two adjacent sample substrings and the probabilities that they occur simultaneously.
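One possible way to build such a probability dictionary is sketched below. This specification does not state how counts are converted into probabilities; the sketch assumes plain relative frequency (count divided by the total number of counted items), which is only one option. Lowercasing the sample text, keeping only letters, and the function names are likewise assumptions.

```python
from collections import Counter


def split_fixed(text, n):
    """Consecutive, non-overlapping pieces of length n."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def build_probability_dictionary(sample_text, n=2):
    """Return (single_prob, pair_prob) estimated from sample_text.

    Assumptions: text is lowercased, non-letters are dropped, and a
    probability is a relative frequency over all counted items.
    """
    cleaned = "".join(ch for ch in sample_text.lower() if ch.isalpha())
    chunks = split_fixed(cleaned, n)

    single_counts = Counter(chunks)                 # individual occurrences
    pair_counts = Counter(zip(chunks, chunks[1:]))  # two adjacent substrings together

    total_single = sum(single_counts.values())
    total_pairs = sum(pair_counts.values())

    single_prob = {s: c / total_single for s, c in single_counts.items()}
    pair_prob = {p: c / total_pairs for p, c in pair_counts.items()}
    return single_prob, pair_prob


single_prob, pair_prob = build_probability_dictionary("abab abcd")
print(single_prob)  # {'ab': 0.75, 'cd': 0.25}
```

In practice the sample text would be a large corpus of the same type as the character strings to be identified, as the paragraph above requires.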
Step 120, judging whether the character string to be identified is a randomly generated character string according to the degree of randomness of the character string to be identified.
In the embodiments of this specification, as shown in step 235 of Fig. 2, the degree of randomness R is compared with a preset random threshold. As shown in step 240 of Fig. 2, if the degree of randomness R of the character string "ak, ti odoe dgza" is greater than the preset random threshold, the character string "ak, ti odoe dgza" is determined to be a randomly generated character string. The preset random threshold = 1 - the preset probability threshold, where the preset probability threshold is the median of the probabilities that the sample substrings in the probability dictionary occur individually; or the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur simultaneously; or the arithmetic mean of these two medians. Taking a preset probability threshold of 0.7 as an example, the preset random threshold is 0.3. The degree of randomness R of the character string to be identified "ak, ti odoe dgza" obtained in step 115 is greater than the preset random threshold of 0.3, so the character string to be identified "ak, ti odoe dgza" is a randomly generated character string. As shown in step 245 of Fig. 2, if the degree of randomness R of a character string is not greater than the preset random threshold, the character string is a normal character string.
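The threshold rule described above can be sketched as follows. The function names and the small dictionary used in the example are hypothetical; only the rule itself (random threshold = 1 - probability threshold, with the probability threshold taken from medians of the dictionary) comes from the text.

```python
from statistics import median


def random_threshold(single_prob=None, pair_prob=None):
    """Preset random threshold = 1 - preset probability threshold.

    The probability threshold is the median of the individual-occurrence
    probabilities, the median of the simultaneous-occurrence probabilities,
    or the arithmetic mean of both medians when both dictionaries are given.
    """
    medians = []
    if single_prob:
        medians.append(median(single_prob.values()))
    if pair_prob:
        medians.append(median(pair_prob.values()))
    if not medians:
        raise ValueError("at least one non-empty probability dictionary is required")
    probability_threshold = sum(medians) / len(medians)
    return 1.0 - probability_threshold


def is_randomly_generated(r, threshold):
    """A string is judged random only if R is strictly greater than the threshold."""
    return r > threshold


# Hypothetical dictionary whose median is 0.7, matching the example threshold.
threshold = random_threshold({"aa": 0.6, "bb": 0.7, "cc": 0.8})
print(round(threshold, 2))                 # 0.3
print(is_randomly_generated(0.34, 0.3))    # True
```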
Further, in the embodiments of this specification, focused prevention and control is applied to the randomly generated character string "ak, ti odoe dgza": specifically, the permissions of the character string "ak, ti odoe dgza" are restricted, verification of the character string "ak, ti odoe dgza" is strengthened, or the character string "ak, ti odoe dgza" is forbidden from logging in to the network platform.
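As a purely illustrative sketch, the prevention and control measures above could be modeled as combinable flags. The names Control and controls_for, and the idea of passing in a configured policy, are assumptions of this sketch and are not part of this specification.

```python
from enum import Flag


class Control(Flag):
    """Prevention and control measures named in the text (combinable)."""
    RESTRICT_PERMISSIONS = 1
    STRENGTHEN_VERIFICATION = 2
    FORBID_LOGIN = 4


def controls_for(is_random: bool, policy: Control) -> Control:
    """Apply the configured policy only to strings judged randomly generated."""
    return policy if is_random else Control(0)


policy = Control.RESTRICT_PERMISSIONS | Control.STRENGTHEN_VERIFICATION
applied = controls_for(True, policy)
print(Control.STRENGTHEN_VERIFICATION in applied)  # True
print(Control.FORBID_LOGIN in applied)             # False
```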
Compared with the prior art, the technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects: the degree of randomness of a character string is determined from the occurrence probabilities of its substrings, and whether the character string is randomly generated is judged accordingly; no large amount of training data needs to be labeled manually at any point in the process, which saves labor cost; sample character string data can be selected specifically for the type of character string to be identified; and the recognition of character strings with a short overall length is improved.
Fig. 3 is a schematic structural diagram of an apparatus for identifying batch-generated character strings according to an embodiment of this specification. The apparatus includes: a receiving module 305, a segmentation module 310, a determining module 315 and a judging module 320;
the receiving module 305 is configured to receive a batch-generated character string to be identified;
the segmentation module 310 is configured to segment the character string to be identified to obtain at least one sub-character string of the character string to be identified;
the determining module 315 is configured to determine a probability of occurrence of at least one sub-string of the character string to be identified, and determine a degree of randomness of the character string to be identified according to the probability of occurrence of the sub-string;
the judging module 320 is configured to judge whether the character string to be recognized is a randomly generated character string according to the randomness degree of the character string to be recognized.
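The cooperation of the four modules can be sketched as a single class whose methods correspond to modules 305 to 320. The class name, the method names, the use of individual-occurrence probabilities only, and the fallback probability for substrings missing from the dictionary are all assumptions of this sketch; this specification does not say how unseen substrings are handled. Dictionary probabilities are assumed to be greater than 0.

```python
import math
from dataclasses import dataclass


@dataclass
class BatchStringIdentifier:
    probability_dictionary: dict      # sample substring -> probability (assumed > 0)
    n: int = 2                        # preset segment length
    random_threshold: float = 0.3     # preset random threshold
    unseen_probability: float = 1e-6  # assumption: fallback for unseen substrings

    def receive(self, raw: str) -> str:
        """Receiving module 305."""
        return raw

    def segment(self, raw: str) -> list:
        """Segmentation module 310: drop non-alphanumerics, then split every n characters."""
        cleaned = "".join(ch for ch in raw if ch.isalnum())
        return [cleaned[i:i + self.n] for i in range(0, len(cleaned), self.n)]

    def determine(self, substrings: list) -> float:
        """Determining module 315: R = 1 - geometric mean of substring probabilities."""
        if not substrings:
            raise ValueError("no substrings to score")
        probs = [self.probability_dictionary.get(s, self.unseen_probability)
                 for s in substrings]
        p = math.exp(sum(math.log(x) for x in probs) / len(probs))
        return 1.0 - p

    def judge(self, r: float) -> bool:
        """Judging module 320: random if R exceeds the preset random threshold."""
        return r > self.random_threshold

    def identify(self, raw: str) -> bool:
        return self.judge(self.determine(self.segment(self.receive(raw))))


identifier = BatchStringIdentifier(
    {"ak": 0.79, "ti": 0.59, "od": 0.63, "oe": 0.71, "dg": 0.56, "za": 0.68}
)
print(identifier.identify("ak, ti odoe dgza"))  # True (R is about 0.34, above 0.3)
```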
Preferably, the determining module 315 is specifically configured to match the occurrence probabilities of the substrings of the character string to be identified by using a probability dictionary, where the probability dictionary contains the correspondence between sample substrings and the probabilities of those sample substrings; and to determine the degree of randomness of the character string to be identified according to the occurrence probabilities of the substrings.
Preferably, the apparatus further comprises: a probability dictionary obtaining module, configured to segment sample character string data to obtain a plurality of sample substrings; count the number of times each of the sample substrings occurs individually and/or the number of times at least two adjacent sample substrings occur simultaneously; and calculate the probability that each sample substring occurs individually and/or the probability that the at least two adjacent sample substrings occur simultaneously, to obtain the probability dictionary. The probability dictionary comprises the sample substrings and the probabilities that they occur individually, and/or the at least two adjacent sample substrings and the probabilities that they occur simultaneously.
Preferably, the type of the sample character string data is the same as the type of the batch-generated character string to be identified.
Preferably, the determining module 315 is further specifically configured to determine the occurrence probability of the character string to be identified according to the occurrence probabilities of the substrings, and to determine the degree of randomness of the character string to be identified according to the occurrence probability of the character string to be identified.
More preferably, the determining module 315 is further specifically configured to: where the probabilities that the substrings of the character string to be identified occur individually are obtained, take the geometric mean of those probabilities as the occurrence probability P of the character string to be identified; or, where the probabilities that at least two adjacent substrings of the character string to be identified occur simultaneously are obtained, take the geometric mean of those probabilities as the occurrence probability P of the character string to be identified; or, where both the probabilities that the substrings occur individually and the probabilities that at least two adjacent substrings occur simultaneously are obtained, take the arithmetic mean of the geometric mean of the individual-occurrence probabilities and the geometric mean of the simultaneous-occurrence probabilities as the occurrence probability P of the character string to be identified.
Further, the determining module 315 is further specifically configured to determine the degree of randomness R of the character string to be identified as R = 1 - P, where P is the occurrence probability of the character string to be identified.
Preferably, the judging module 320 is specifically configured to determine that the character string to be identified is a randomly generated character string where the degree of randomness R of the character string to be identified is greater than a preset random threshold.
Preferably, the preset random threshold = 1 - a preset probability threshold, where the preset probability threshold is the median of the probabilities that the sample substrings in the probability dictionary occur individually; or the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur simultaneously; or the arithmetic mean of these two medians.
Preferably, the apparatus further comprises: a focused prevention and control module, configured to apply focused prevention and control to the randomly generated character string where the character string to be identified is determined to be a randomly generated character string; the focused prevention and control includes at least one of restricting permissions, strengthening verification, and forbidding login.
Embodiments of this specification further provide a device for identifying batch-generated character strings, comprising: a memory that stores a program configured to cause a processor to perform the following steps: receiving a batch-generated character string to be identified; segmenting the character string to be identified to obtain at least one substring of the character string to be identified; determining the occurrence probability of the at least one substring of the character string to be identified, and determining the degree of randomness of the character string to be identified according to the occurrence probability of the substring; and judging whether the character string to be identified is a randomly generated character string according to the degree of randomness of the character string to be identified.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include computer-readable media in the form of volatile memory, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely an example of the present specification and is not intended to limit the present specification. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.