CN109359274B - Method, device and equipment for identifying character strings generated in batch - Google Patents

Publication number
CN109359274B
Authority
CN
China
Prior art keywords
probability
character string
occurrence
character strings
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811074092.2A
Other languages
Chinese (zh)
Other versions
CN109359274A (en
Inventor
江大鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ANT Financial Hang Zhou Network Technology Co Ltd
Original Assignee
ANT Financial Hang Zhou Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ANT Financial Hang Zhou Network Technology Co Ltd filed Critical ANT Financial Hang Zhou Network Technology Co Ltd
Priority to CN201811074092.2A priority Critical patent/CN109359274B/en
Publication of CN109359274A publication Critical patent/CN109359274A/en
Application granted granted Critical
Publication of CN109359274B publication Critical patent/CN109359274B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The specification discloses a method, a device and equipment for identifying character strings generated in batches. The method comprises the following steps: receiving character strings to be identified which are generated in batches; dividing the character string to be identified to obtain at least one sub-character string of the character string to be identified; determining the occurrence probability of at least one sub-character string of the character string to be identified, and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string; judging whether the character string to be recognized is a randomly generated character string or not according to the randomness degree of the character string to be recognized.

Description

Method, device and equipment for identifying character strings generated in batch
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, and a device for identifying a character string generated in batch.
Background
With the development and popularization of internet technology, a growing share of the character strings on network platforms are generated automatically in batches by machines. Take batch-registered accounts as an example: such accounts can use the various functions of a platform, but because they are not used by ordinary users, they bring a large amount of junk content to the platform and can even cause a loss of resources. For example, on news and information applications, paid posters (a "water army") use many accounts to publish similar opinions within a short time, steering public opinion and degrading the experience of normal users. As another example, bargain hunters on e-commerce sites (the so-called "wool party") use batch-registered accounts to collect the site's subsidies, so that marketing funds are seriously wasted and the marketing effect is greatly reduced.
In the prior art, such accounts are identified with supervised classification algorithms such as LR or SVM. These algorithms require a large number of accounts to be manually labeled as ordinary accounts or random accounts to obtain training data; a classification model is trained on that data and then used to classify input accounts, which is labor-intensive. Moreover, because short character strings contain too little information, the classification model performs poorly on them and cannot recognize them well.
Disclosure of Invention
The embodiments of the specification provide a method, a device and equipment for identifying character strings generated in batches. They address the problems that manually labeling a large number of accounts is labor-intensive and that classification models perform poorly on short character strings.
In order to solve the above technical problems, the embodiments of the present specification are implemented as follows:
the embodiment of the specification provides a method for identifying character strings generated in batches, which comprises the following steps:
receiving character strings to be identified which are generated in batches;
dividing the character string to be identified to obtain at least one sub-character string of the character string to be identified;
determining the occurrence probability of at least one sub-character string of the character string to be identified, and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string;
judging whether the character string to be recognized is a randomly generated character string or not according to the randomness degree of the character string to be recognized.
The embodiment of the specification provides a device for identifying character strings generated in batches, which comprises: a receiving module, a segmentation module, a determining module and a judging module;
the receiving module is used for receiving the character strings to be identified which are generated in batches;
the segmentation module is used for segmenting the character string to be identified to obtain at least one sub-character string of the character string to be identified;
the determining module is used for determining the occurrence probability of at least one sub-character string of the character string to be identified, and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string;
the judging module is used for judging whether the character string to be identified is a randomly generated character string according to the randomness degree of the character string to be identified.
The equipment for identifying character strings generated in batches provided in the embodiment of the specification comprises a memory and a processor, wherein the memory stores a program that is configured to be executed by the processor to perform the above method for identifying character strings generated in batches.
At least one of the technical schemes adopted in the embodiments of the specification can achieve the following beneficial effects: the randomness degree of a character string is determined from the occurrence probabilities of its substrings, and the string is then judged to be randomly generated or not, so no large amount of training data needs to be labeled manually at any point and labor cost is saved; sample character string data can be selected specifically for the type of character string to be identified; and recognition of short character strings is improved.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for identifying batch-generated character strings according to an embodiment of the present disclosure;
FIG. 2 is another flowchart of a method for identifying batch-generated character strings according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of an apparatus for identifying batch-generated character strings according to an embodiment of the present disclosure.
Detailed Description
The embodiment of the specification provides a method, a device and equipment for identifying character strings generated in batches.
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
Fig. 1 is a schematic flowchart of a method for identifying batch-generated character strings according to an embodiment of the present disclosure. The flow includes:
Step 105, receiving character strings to be identified that are generated in batches;
In the embodiment of the present disclosure, the account names of major network platforms are taken as an example; an account name is a character string formed by concatenating characters. An account name generated automatically by a machine is a random string of concatenated characters, such as "iehfdjksyneyg", whereas most account names registered by ordinary users are strings that carry some meaning, such as "ilekkobe". The randomness degree of machine-generated strings is therefore far greater than that of account names registered by ordinary users.
As in step 220 of Fig. 2, a character string (a batch-generated string to be identified) is input. In the embodiment of the present disclosure, the string input in step 220 is received; for example, the string "ak, ti odoe dgza" is received.
Step 110, segmenting the character string to be identified to obtain at least one sub-character string of the character string to be identified;
Preferably, the character string to be identified "ak, ti odoe dgza" received in step 105 is preprocessed: characters that cannot be used in account names, such as spaces and punctuation marks, are removed, giving the preprocessed string "aktiodoedgza". The preprocessed string is then segmented to obtain at least one substring, as shown in step 225 of Fig. 2.
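The preprocessing step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess` is hypothetical, and treating only ASCII letters and digits as permitted account characters is an assumption (the description only says that spaces and punctuation are removed).

```python
import re

def preprocess(raw: str) -> str:
    """Drop every character that is not an ASCII letter or digit.

    Assumption: account names may contain only letters and digits.
    """
    return re.sub(r"[^A-Za-z0-9]", "", raw)

print(preprocess("ak, ti odoe dgza"))  # aktiodoedgza
```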
In this embodiment of the present disclosure, the preprocessed string is divided at intervals of a preset character length; for example, it is divided once every two characters and/or once every three characters to obtain at least one substring.
In the embodiment of the present disclosure, if N = 2 is taken for the N-gram model, the preprocessed string "aktiodoedgza" is divided into the substrings "ak", "ti", "od", "oe", "dg" and "za"; if N = 3 is taken, the preprocessed string "aktiodoedgza" is divided into the substrings "akt", "iod", "oed" and "gza".
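The fixed-interval segmentation described above can be sketched as below. The function name `split_fixed` is hypothetical, and keeping a shorter trailing piece when the length is not a multiple of N is an assumption; the description does not say how a remainder is handled.

```python
def split_fixed(s: str, n: int) -> list[str]:
    """Cut s into consecutive, non-overlapping pieces of n characters.

    Assumption: a shorter trailing piece is kept rather than dropped.
    """
    return [s[i:i + n] for i in range(0, len(s), n)]

print(split_fixed("aktiodoedgza", 2))  # ['ak', 'ti', 'od', 'oe', 'dg', 'za']
print(split_fixed("aktiodoedgza", 3))  # ['akt', 'iod', 'oed', 'gza']
```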
Step 115, determining the occurrence probability of at least one sub-character string of the character string to be recognized, and determining the randomness degree of the character string to be recognized according to the occurrence probability of the sub-character string;
In the embodiment of the present specification, a probability dictionary is first used to look up the occurrence probabilities of the substrings "ak", "ti", "od", "oe", "dg" and "za" of the string to be recognized, "ak, ti odoe dgza". From the occurrence probabilities of the substrings, the occurrence probability of the string to be recognized is calculated, and its randomness degree R is then determined, as shown in step 230 of Fig. 2. The probability dictionary contains the correspondence between sample substrings and their probabilities. Specifically, when the probabilities that the substrings "ak", "ti", "od", "oe", "dg" and "za" occur individually are obtained as 0.79, 0.59, 0.63, 0.71, 0.56 and 0.68, respectively, their geometric mean, 0.66, is taken as the occurrence probability P of the string to be recognized; since the randomness degree is R = 1 - P, R is 0.34. Alternatively, when the probabilities that at least two adjacent substrings among "ak", "ti", "od", "oe", "dg" and "za" occur together are obtained, the geometric mean of those joint probabilities is taken as the occurrence probability P of the string to be recognized.
Taking as an example the case in which the probabilities that the adjacent substring pairs "ak" and "ti", "ti" and "od", "od" and "oe", "oe" and "dg", and "dg" and "za" occur together are obtained, the geometric mean of 0.69, 0.63, 0.71 and 0.66 is calculated as 0.68 and taken as the occurrence probability P of the string to be recognized "ak, ti odoe dgza"; since R = 1 - P, the randomness degree R is 0.32. Alternatively, when both the probabilities that the substrings occur individually and the probabilities that two adjacent substrings occur together are obtained, the arithmetic mean of the geometric mean of the individual probabilities and the geometric mean of the joint probabilities is taken as the occurrence probability P of the string to be recognized, which gives P = 0.67. From this probability of 0.67, the randomness degree R of the string to be recognized is determined to be 0.33.
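The scoring in step 115 can be sketched as follows, under stated assumptions. The function names are hypothetical, every probability is assumed to be strictly positive (a zero would make the logarithm undefined), and the description does not say how substrings missing from the dictionary are handled. The example reuses the individual-occurrence probabilities quoted in the description.

```python
import math

def geometric_mean(probabilities):
    """Geometric mean of strictly positive probabilities."""
    return math.exp(sum(math.log(p) for p in probabilities) / len(probabilities))

def randomness(p_string):
    """Randomness degree R = 1 - P."""
    return 1.0 - p_string

def combined_probability(p_single, p_pair):
    """Arithmetic mean of the two geometric means (the third option in the text)."""
    return (p_single + p_pair) / 2.0

# Individual-occurrence probabilities quoted in the description.
single_probs = [0.79, 0.59, 0.63, 0.71, 0.56, 0.68]
p_single = geometric_mean(single_probs)
print(round(p_single, 2), round(randomness(p_single), 2))  # 0.66 0.34

# Third option, using the rounded values 0.66 and 0.68 from the description.
print(round(combined_probability(0.66, 0.68), 2))  # 0.67
```

Working in log space avoids underflow when a string has many substrings with small probabilities; the result is the same as taking the k-th root of the product.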
It should be noted that the probability dictionary must be obtained before it is used to look up the occurrence probabilities of the substrings "ak", "ti", "od", "oe", "dg" and "za" of the string to be recognized "ak, ti odoe dgza". In the embodiment of the present specification, the type of the sample string data is the same as the type of the batch-generated strings to be recognized. Therefore, as shown in step 205 of Fig. 2, normally available English magazines, English web pages or other English articles are taken as the sample string data. Further, the sample string data is segmented to obtain a plurality of sample substrings; as shown in step 210 of Fig. 2, the number of times each sample substring occurs individually and/or the number of times at least two adjacent sample substrings occur together is counted. The probability that each sample substring occurs individually and/or the probability that at least two adjacent sample substrings occur together is then calculated to obtain the probability dictionary, as shown in step 215 of Fig. 2. The probability dictionary contains the sample substrings and the probabilities that they occur individually, and/or the groups of at least two adjacent sample substrings and the probabilities that they occur together.
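Building the probability dictionary (steps 205 to 215) might look like the sketch below. The description does not give the formula that turns counts into probabilities, so plain relative frequency is used here purely as an assumption; it yields much smaller values than the 0.56 to 0.79 quoted in the worked example, so the patent's actual scaling may differ. The function name, the word-by-word segmentation, and the dropping of incomplete trailing pieces are likewise illustrative choices.

```python
import re
from collections import Counter

def build_probability_dictionary(sample_text: str, n: int = 2):
    """Return (single_prob, pair_prob) estimated from sample text.

    Assumptions: text is lower-cased, split into alphanumeric words,
    each word is cut into non-overlapping n-character pieces, and
    probability = count / total count (relative frequency).
    """
    singles = Counter()
    pairs = Counter()
    for word in re.findall(r"[a-z0-9]+", sample_text.lower()):
        # Only full-length pieces are kept; a short remainder is dropped.
        chunks = [word[i:i + n] for i in range(0, len(word) - n + 1, n)]
        singles.update(chunks)
        pairs.update(zip(chunks, chunks[1:]))  # adjacent pieces
    total_single = sum(singles.values())
    total_pair = sum(pairs.values())
    single_prob = {s: c / total_single for s, c in singles.items()}
    pair_prob = {p: c / total_pair for p, c in pairs.items()}
    return single_prob, pair_prob

# Tiny hypothetical sample; real sample data would be a large English corpus.
single_prob, pair_prob = build_probability_dictionary("banana band")
print(single_prob["ba"])  # 0.4
```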
And step 120, judging whether the character string to be recognized is a randomly generated character string according to the randomness degree of the character string to be recognized.
In the embodiment of the present disclosure, as shown in step 235 of Fig. 2, the randomness degree R is compared with a preset random threshold. As shown in step 240 of Fig. 2, if the randomness degree R of the string "ak, ti odoe dgza" is greater than the preset random threshold, the string is determined to be a randomly generated string. The preset random threshold = 1 - the preset probability threshold, where the preset probability threshold is the median of the probabilities that the sample substrings in the probability dictionary occur individually; or the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur together; or the arithmetic mean of those two medians. Taking a preset probability threshold of 0.7 as an example, the preset random threshold is 0.3. The randomness degree R of the string to be identified "ak, ti odoe dgza" obtained in step 115 above is greater than the preset random threshold of 0.3, so the string is a randomly generated string. As shown in step 245 of Fig. 2, when the randomness degree R of a string is not greater than the preset random threshold, the string is a normal string.
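The threshold test of steps 235 to 245 can be sketched as follows. The function names and the three dictionary probabilities are hypothetical; only the rule "random threshold = 1 - median probability" and the strict "greater than" comparison come from the description.

```python
from statistics import median

def random_threshold(dictionary_probabilities) -> float:
    """Preset random threshold = 1 - median of the dictionary probabilities."""
    return 1.0 - median(dictionary_probabilities)

def is_random(randomness_degree: float, threshold: float) -> bool:
    """A string is judged random only if R is strictly greater than the threshold."""
    return randomness_degree > threshold

# Hypothetical dictionary whose median probability is 0.7, as in the example.
threshold = random_threshold([0.6, 0.7, 0.8])
print(is_random(0.34, threshold))  # True
print(is_random(0.25, threshold))  # False
```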
Further, in the embodiment of the present disclosure, the randomly generated string "ak, ti odoe dgza" is subjected to focused prevention and control; specifically, the permissions of the account "ak, ti odoe dgza" are restricted, or verification of the account is strengthened, or the account is prohibited from logging in to the network platform.
Compared with the prior art, the technical scheme adopted by the embodiment of the specification can achieve the following beneficial effects: determining the randomness degree of the character string by determining the occurrence probability of the sub-character string of the character string, and further judging whether the character string is a randomly generated character string or not, wherein a large amount of training data is not required to be marked manually in the whole process, and the labor cost is saved; aiming at the type of the character string to be identified, sample character string data can be selected in a targeted manner; the effect of recognizing the character strings with smaller overall length is improved.
Fig. 3 is a schematic structural diagram of an apparatus for identifying batch-generated character strings according to an embodiment of the present disclosure. The apparatus includes: a receiving module 305, a segmentation module 310, a determining module 315 and a judging module 320;
the receiving module 305 is configured to receive a batch of generated character strings to be identified;
the segmentation module 310 is configured to segment the character string to be identified to obtain at least one sub-character string of the character string to be identified;
the determining module 315 is configured to determine a probability of occurrence of at least one sub-string of the character string to be identified, and determine a degree of randomness of the character string to be identified according to the probability of occurrence of the sub-string;
the judging module 320 is configured to judge whether the character string to be recognized is a randomly generated character string according to the randomness degree of the character string to be recognized.
Preferably, the determining module 315 is specifically configured to match probabilities of occurrence of sub-strings of the character string to be identified using a probability dictionary, where the probability dictionary includes correspondence between probabilities of sample sub-strings and sample sub-strings; and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string.
Preferably, the apparatus further comprises: the probability dictionary obtaining module is used for dividing the sample character string data to obtain a plurality of sample sub-character strings; counting the number of times that a plurality of sample substrings occur independently and/or the number of times that at least two adjacent sample substrings occur simultaneously; calculating the probability of the independent occurrence of the plurality of sample substrings and/or the probability of the simultaneous occurrence of the adjacent at least two sample substrings to obtain a probability dictionary; the probability dictionary comprises a plurality of sample substrings and the probability that the sample substrings appear independently and/or comprises at least two adjacent sample substrings and the probability that the sample substrings appear simultaneously.
Preferably, the type of the sample character string data is the same as the type of the character string to be identified generated in batch.
Preferably, the determining module 315 is further specifically configured to determine the probability of occurrence of the character string to be identified according to the probability of occurrence of the substring; and determining the randomness degree of the character strings to be identified according to the occurrence probability of the character strings to be identified.
More preferably, the determining module 315 is further specifically configured to, in a case where a probability that the substring of the character string to be recognized appears alone is obtained, use a geometric average of probabilities that the substring appears alone as the probability P of the character string to be recognized; or under the condition of obtaining the probability of the simultaneous occurrence of at least two adjacent substrings of the character string to be identified, taking the geometric mean value of the probability of the simultaneous occurrence of the at least two adjacent substrings as the probability P of the occurrence of the character string to be identified; or under the condition that the probability of the independent occurrence of the sub-character string of the character string to be recognized and the probability of the simultaneous occurrence of at least two adjacent sub-character strings of the character string to be recognized are obtained, taking the arithmetic average value of the probability geometric average value of the independent occurrence of the sub-character string and the probability geometric average value of the simultaneous occurrence of at least two adjacent sub-character strings as the probability P of the occurrence of the character string to be recognized.
Further, the determining module 315 is further specifically configured to determine the randomness degree R of the character string to be identified as R = 1 - P, where P is the occurrence probability of the character string to be identified.
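The three ways of computing P and the resulting randomness degree can be summarized in one place. The notation is introduced here for illustration only: p_1, ..., p_k are the probabilities that the k substrings occur individually, q_1, ..., q_m are the probabilities that the m groups of adjacent substrings occur together, and the subscripts on P merely distinguish the three options.

```latex
P_{\text{single}} = \Bigl(\prod_{i=1}^{k} p_i\Bigr)^{1/k}, \qquad
P_{\text{pair}} = \Bigl(\prod_{j=1}^{m} q_j\Bigr)^{1/m}, \qquad
P_{\text{both}} = \frac{P_{\text{single}} + P_{\text{pair}}}{2}, \qquad
R = 1 - P
```

Here P in the last expression stands for whichever of the three probabilities is used.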
Preferably, the judging module 320 is specifically configured to determine that the character string to be identified is a randomly generated character string in a case where its randomness degree R is greater than a preset random threshold.
Preferably, the preset random threshold = 1 - a preset probability threshold; the preset probability threshold is the median of the probabilities that the sample substrings in the probability dictionary occur individually; or the median of the probabilities that at least two adjacent sample substrings in the probability dictionary occur together; or the arithmetic mean of the median of the probabilities that the sample substrings occur individually and the median of the probabilities that at least two adjacent sample substrings occur together.
Preferably, the apparatus further comprises a focused prevention and control module, which is used for performing focused prevention and control on the randomly generated character string when the character string to be identified is determined to be randomly generated; the focused prevention and control includes at least one of restricting permissions, strengthening verification, and prohibiting login.
The embodiment of the specification also provides equipment for identifying character strings generated in batches, which comprises a memory and a processor, the memory storing a program that is configured to be executed by the processor to perform the following: receiving character strings to be identified that are generated in batches; dividing the character string to be identified to obtain at least one sub-character string of the character string to be identified; determining the occurrence probability of at least one sub-character string of the character string to be identified, and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string; and judging whether the character string to be identified is a randomly generated character string according to the randomness degree of the character string to be identified.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises an element.
The foregoing is merely an example of the present specification and is not intended to limit the present specification. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (19)

1. A method of identifying a batch of character strings, the method comprising:
receiving character strings to be identified which are generated in batches;
dividing the character string to be identified to obtain at least one sub-character string of the character string to be identified;
determining the occurrence probability of at least one sub-character string of the character string to be identified, and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string, wherein:
in the case where the probabilities that the sub-character strings of the character string to be identified occur independently are obtained, the geometric mean of the probabilities of independent occurrence of the sub-character strings is taken as the probability P of occurrence of the character string to be identified; or
in the case where the probabilities that at least two adjacent sub-character strings of the character string to be identified occur simultaneously are obtained, the geometric mean of the probabilities of simultaneous occurrence of the at least two adjacent sub-character strings is taken as the probability P of occurrence of the character string to be identified; or
in the case where both the probabilities of independent occurrence of the sub-character strings and the probabilities of simultaneous occurrence of at least two adjacent sub-character strings of the character string to be identified are obtained, the arithmetic mean of the geometric mean of the probabilities of independent occurrence and the geometric mean of the probabilities of simultaneous occurrence is taken as the probability P of occurrence of the character string to be identified;
judging whether the character string to be recognized is a randomly generated character string or not according to the randomness degree of the character string to be recognized.
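The three alternatives for computing P in claim 1 can be sketched as follows. This is only an illustrative reading, not the patented implementation: the function names, the dictionary layout, and the `floor` probability used for sub-character strings missing from the dictionary are assumptions that the claim does not specify.

```python
import math


def geometric_mean(probs):
    """Geometric mean, computed in log space to avoid underflow on long inputs."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


def string_probability(substrings, unigram=None, bigram=None, floor=1e-6):
    """Probability P of a string from its sub-character strings (claim 1).

    unigram: dict mapping a substring to its probability of occurring alone.
    bigram:  dict mapping a pair of adjacent substrings to their probability
             of occurring together.
    floor:   hypothetical fallback probability for entries not in a dictionary.
    """
    if not substrings:
        raise ValueError("at least one substring is required")
    means = []
    if unigram is not None:
        # Case 1: geometric mean of the independent-occurrence probabilities.
        means.append(geometric_mean([unigram.get(s, floor) for s in substrings]))
    if bigram is not None and len(substrings) >= 2:
        # Case 2: geometric mean of the adjacent-pair probabilities.
        pairs = list(zip(substrings, substrings[1:]))
        means.append(geometric_mean([bigram.get(p, floor) for p in pairs]))
    if not means:
        raise ValueError("no usable probabilities were supplied")
    # Case 3: with both available, the arithmetic mean of the two geometric
    # means; with only one available this reduces to that single mean.
    return sum(means) / len(means)
```

Working in log space is a design choice made here for numerical stability; multiplying many small probabilities directly would give the same geometric mean in exact arithmetic.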
2. The method for recognizing character strings generated in batch according to claim 1, wherein the determining the probability of occurrence of at least one sub-character string of the character string to be recognized, determining the degree of randomness of the character string to be recognized according to the probability of occurrence of the sub-character string, comprises:
matching the occurrence probability of the sub-character strings of the character strings to be identified by using a probability dictionary, wherein the probability dictionary comprises the corresponding relation between the sample sub-character strings and the probability of the sample sub-character strings;
and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string.
3. The method of claim 2, wherein before matching the probabilities of occurrence of substrings of the character string to be recognized using a probability dictionary, the method further comprises:
dividing sample character string data to obtain a plurality of sample substrings;
counting the number of times that a plurality of sample substrings occur independently and/or the number of times that at least two adjacent sample substrings occur simultaneously;
calculating the probability of the independent occurrence of the plurality of sample substrings and/or the probability of the simultaneous occurrence of the adjacent at least two sample substrings to obtain a probability dictionary;
the probability dictionary comprises a plurality of sample substrings and the probability that the sample substrings appear independently and/or comprises at least two adjacent sample substrings and the probability that the sample substrings appear simultaneously.
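The dictionary-building steps of claim 3 can be sketched as below. Two points are assumptions made for illustration: the segmentation rule (fixed-width chunks of `n` characters) and the use of relative frequency as the probability. The claim only requires dividing, counting, and calculating probabilities.

```python
from collections import Counter


def segment(s, n=2):
    """Split s into consecutive chunks of n characters (hypothetical scheme)."""
    return [s[i:i + n] for i in range(0, len(s), n)]


def build_probability_dictionary(samples, n=2):
    """Return (unigram, bigram) probability dictionaries from sample strings."""
    single = Counter()  # times each sample substring occurs on its own
    pair = Counter()    # times two adjacent sample substrings occur together
    for s in samples:
        parts = segment(s, n)
        single.update(parts)
        pair.update(zip(parts, parts[1:]))
    total_single = sum(single.values())
    total_pair = sum(pair.values())
    # Relative frequency is used as the probability estimate (an assumption).
    unigram = {k: c / total_single for k, c in single.items()}
    bigram = {k: c / total_pair for k, c in pair.items()}
    return unigram, bigram
```

For example, `build_probability_dictionary(["abab", "abcd"])` counts the chunk "ab" three times and "cd" once, so "ab" receives three quarters of the unigram probability mass.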
4. The method of identifying a batch of character strings according to claim 3, wherein the type of the sample character string data is the same as the type of the character strings to be identified that are generated in batches.
5. The method for recognizing character strings generated in batch according to claim 2, wherein the determining the degree of randomness of the character strings to be recognized according to the probability of occurrence of the sub-character strings comprises:
determining the occurrence probability of the character string to be identified according to the occurrence probability of the sub character string;
and determining the randomness degree of the character strings to be identified according to the occurrence probability of the character strings to be identified.
6. The method for recognizing character strings generated in batch according to claim 1, wherein the determining the randomness degree of the character string to be recognized according to the probability of occurrence of the character string to be recognized comprises: determining the randomness degree R of the character string to be recognized as R = 1 - P, where P is the probability of occurrence of the character string to be recognized.
7. The method for recognizing character strings generated in batch according to claim 6, wherein the judging whether the character string to be recognized is a randomly generated character string according to the degree of randomness of the character string to be recognized comprises:
determining that the character string to be identified is a randomly generated character string in the case where the randomness degree R of the character string to be identified is greater than a preset random threshold.
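Claims 6 and 7 together amount to computing R = 1 - P and comparing R with a threshold. A minimal sketch, with hypothetical function names and an added range check that the claims do not mention:

```python
def randomness_degree(p):
    """Randomness degree R = 1 - P for an occurrence probability P in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must be a probability between 0 and 1")
    return 1.0 - p


def is_randomly_generated(p, random_threshold):
    """True when R is strictly greater than the preset random threshold."""
    return randomness_degree(p) > random_threshold
```

A string whose sub-character strings are rare in the sample data has a small P and therefore an R close to 1, which is what pushes it over the threshold.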
8. The method for recognizing character strings generated in batch according to claim 7, wherein:
the preset random threshold = 1 - the preset probability threshold;
the preset probability threshold is the median of the probabilities of the single occurrence of a plurality of sample substrings in the probability dictionary; or the median of the probability of simultaneous occurrence of at least two adjacent sample substrings in the probability dictionary; or the median of the probabilities of the independent occurrence of a plurality of sample substrings in the probability dictionary and the median of the probabilities of the simultaneous occurrence of at least two adjacent sample substrings in the probability dictionary.
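The threshold rule of claim 8 can be sketched as follows. The claim does not say how the two medians are combined in its third alternative; averaging them is an assumption made here by analogy with claim 1, and the function name is hypothetical.

```python
from statistics import median


def preset_random_threshold(unigram=None, bigram=None):
    """Random threshold = 1 - probability threshold (claim 8).

    The probability threshold is the median of the unigram probabilities,
    the median of the bigram probabilities, or, when both dictionaries are
    given, the average of the two medians (an assumed combination rule).
    """
    medians = []
    if unigram:
        medians.append(median(unigram.values()))
    if bigram:
        medians.append(median(bigram.values()))
    if not medians:
        raise ValueError("at least one non-empty probability dictionary is required")
    probability_threshold = sum(medians) / len(medians)
    return 1.0 - probability_threshold
```

Deriving the threshold from the dictionary itself means it adapts to the sample data rather than being a hand-tuned constant.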
9. The method of identifying a batch of character strings according to claim 7, further comprising:
under the condition that the character string to be identified is a randomly generated character string, performing key prevention and control on the randomly generated character string;
wherein the key prevention and control includes at least one of restricting permissions, enforcing authentication, and disabling login.
10. An apparatus for identifying a batch of character strings, the apparatus comprising: the device comprises a receiving module, a dividing module, a determining module and a judging module;
the receiving module is used for receiving the character strings to be identified which are generated in batches;
the segmentation module is used for segmenting the character string to be identified to obtain at least one sub-character string of the character string to be identified;
the determining module is used for determining the occurrence probability of at least one sub-character string of the character string to be identified, and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string;
the judging module is used for judging whether the character string to be identified is a randomly generated character string according to the randomness degree of the character string to be identified;
the determining module is further specifically configured to, when the probability that the sub-string of the character string to be recognized appears alone is obtained, use a geometric average value of the probabilities that the sub-string appears alone as the probability P of the character string to be recognized; or under the condition of obtaining the probability of the simultaneous occurrence of at least two adjacent substrings of the character string to be identified, taking the geometric mean value of the probability of the simultaneous occurrence of the at least two adjacent substrings as the probability P of the occurrence of the character string to be identified; or under the condition that the probability of the independent occurrence of the sub-character string of the character string to be recognized and the probability of the simultaneous occurrence of at least two adjacent sub-character strings of the character string to be recognized are obtained, taking the arithmetic average value of the probability geometric average value of the independent occurrence of the sub-character string and the probability geometric average value of the simultaneous occurrence of at least two adjacent sub-character strings as the probability P of the occurrence of the character string to be recognized.
11. The apparatus for recognizing character strings generated in batch according to claim 10, wherein the determining module is specifically configured to match probabilities of occurrence of sub-character strings of the character strings to be recognized using a probability dictionary, the probability dictionary containing correspondence between sample sub-character strings and probabilities of sample sub-character strings; and determining the randomness degree of the character string to be identified according to the occurrence probability of the sub-character string.
12. The apparatus for identifying a batch of character strings as in claim 11, further comprising: the probability dictionary obtaining module is used for dividing the sample character string data to obtain a plurality of sample sub-character strings; counting the number of times that a plurality of sample substrings occur independently and/or the number of times that at least two adjacent sample substrings occur simultaneously; calculating the probability of the independent occurrence of the plurality of sample substrings and/or the probability of the simultaneous occurrence of the adjacent at least two sample substrings to obtain a probability dictionary; the probability dictionary comprises a plurality of sample substrings and the probability that the sample substrings appear independently and/or comprises at least two adjacent sample substrings and the probability that the sample substrings appear simultaneously.
13. The apparatus for recognizing character strings according to claim 12, wherein the sample character string data is the same type as the character string to be recognized in batch.
14. The apparatus for recognizing character strings generated in batch according to claim 11, wherein the determining module is further specifically configured to determine the probability of occurrence of the character string to be recognized according to the probability of occurrence of the sub-character string; and determining the randomness degree of the character strings to be identified according to the occurrence probability of the character strings to be identified.
15. The apparatus for recognizing character strings generated in batch according to claim 10, wherein the determining module is further specifically configured to determine the degree of randomness R of the character string to be recognized as R = 1 - P, where P is the probability of occurrence of the character string to be recognized.
16. The apparatus for recognizing character strings generated in batch according to claim 15, wherein the judging module is specifically configured to determine that the character string to be recognized is a randomly generated character string in a case where the degree of randomness R of the character string to be recognized is greater than a preset random threshold.
17. The apparatus for identifying character strings generated in batch according to claim 16, wherein the preset random threshold = 1-preset probability threshold; the preset probability threshold is the median of the probabilities of the single occurrence of a plurality of sample substrings in the probability dictionary; or the median of the probability of simultaneous occurrence of at least two adjacent sample substrings in the probability dictionary; or the median of the probabilities of the independent occurrence of a plurality of sample substrings in the probability dictionary and the median of the probabilities of the simultaneous occurrence of at least two adjacent sample substrings in the probability dictionary.
18. The apparatus for identifying a batch of character strings as in claim 16, further comprising: a key prevention and control module, used for performing key prevention and control on the randomly generated character string in the case where the character string to be identified is determined to be a randomly generated character string; wherein the key prevention and control includes at least one of restricting permissions, enforcing authentication, and disabling login.
19. An apparatus for identifying a batch of generated character strings, comprising: a processor and a memory, the memory storing a program configured to be executed by the processor to perform the method of identifying a batch-generated character string of any one of claims 1-9.
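The module structure of the apparatus claims (receiving, segmentation, determining, and judging modules) can be sketched as a single class with one method per module. This is a hypothetical composition for illustration only: the class name, the fixed-width segmentation, the `floor` probability for unseen sub-character strings, and the use of independent-occurrence probabilities alone are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class StringIdentifier:
    """Hypothetical composition of the four claimed modules."""
    unigram: dict            # probability dictionary: substring -> probability
    random_threshold: float  # preset random threshold
    chunk: int = 2           # assumed segmentation width
    floor: float = 1e-6      # assumed probability for unseen substrings

    def receive(self, strings):
        # Receiving module: accept the batch and drop empty entries.
        return [s for s in strings if s]

    def segment(self, s):
        # Segmentation module: fixed-width chunks (an assumed scheme).
        return [s[i:i + self.chunk] for i in range(0, len(s), self.chunk)]

    def determine(self, substrings):
        # Determining module: P is the geometric mean of the substring
        # probabilities, and the randomness degree is R = 1 - P.
        logs = [math.log(self.unigram.get(x, self.floor)) for x in substrings]
        p = math.exp(sum(logs) / len(logs))
        return 1.0 - p

    def judge(self, r):
        # Judging module: random if R exceeds the preset random threshold.
        return r > self.random_threshold

    def run(self, strings):
        # Chain the modules and report a verdict for each received string.
        return {s: self.judge(self.determine(self.segment(s)))
                for s in self.receive(strings)}
```

A caller would construct the object from a previously built probability dictionary and threshold, then pass each incoming batch to `run`.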
CN201811074092.2A 2018-09-14 2018-09-14 Method, device and equipment for identifying character strings generated in batch Active CN109359274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811074092.2A CN109359274B (en) 2018-09-14 2018-09-14 Method, device and equipment for identifying character strings generated in batch

Publications (2)

Publication Number Publication Date
CN109359274A CN109359274A (en) 2019-02-19
CN109359274B true CN109359274B (en) 2023-05-02

Family

ID=65350758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811074092.2A Active CN109359274B (en) 2018-09-14 2018-09-14 Method, device and equipment for identifying character strings generated in batch

Country Status (1)

Country Link
CN (1) CN109359274B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765973B (en) * 2019-10-31 2023-07-04 上海掌门科技有限公司 Account type identification method and device
WO2021106173A1 (en) * 2019-11-28 2021-06-03 日本電信電話株式会社 Labeling device and labeling program

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10207987A (en) * 1997-01-28 1998-08-07 Nec Telecom Syst Ltd Hand-written character recognition device
CN104462058A (en) * 2014-10-24 2015-03-25 腾讯科技(深圳)有限公司 Character string identification method and device
CN104750666A (en) * 2015-03-12 2015-07-01 明博教育科技有限公司 Text character encoding mode identification method and system
CN106899411A (en) * 2016-12-08 2017-06-27 阿里巴巴集团控股有限公司 A kind of method of calibration and device based on identifying code

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3887088B2 (en) * 1997-12-08 2007-02-28 富士通株式会社 Character recognition device, character recognition method, and computer-readable recording medium
JP5199901B2 (en) * 2009-01-21 2013-05-15 日本電信電話株式会社 Language model creation method, language model creation device, and language model creation program
JP5927955B2 (en) * 2012-02-06 2016-06-01 カシオ計算機株式会社 Information processing apparatus and program
CN103077389B (en) * 2013-01-07 2016-08-03 华中科技大学 A kind of combination character level classification and character string level classification text detection and recognition methods
CN106033416B (en) * 2015-03-09 2019-12-24 阿里巴巴集团控股有限公司 Character string processing method and device
CN107679401A (en) * 2017-09-04 2018-02-09 北京知道未来信息技术有限公司 A kind of malicious web pages recognition methods and device
CN108288078B (en) * 2017-12-07 2020-09-29 腾讯科技(深圳)有限公司 Method, device and medium for recognizing characters in image
CN108470126B (en) * 2018-03-19 2020-05-01 腾讯科技(深圳)有限公司 Data processing method, device and storage medium

Also Published As

Publication number Publication date
CN109359274A (en) 2019-02-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201010

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201010

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20230307

Address after: 801-10, Section B, 8th floor, 556 Xixi Road, Xihu District, Hangzhou City, Zhejiang Province

Applicant after: Ant financial (Hangzhou) Network Technology Co.,Ltd.

Address before: 27 Hospital Road, George Town, Grand Cayman ky1-9008

Applicant before: Innovative advanced technology Co.,Ltd.

GR01 Patent grant