CN109241523B - Method, device and equipment for identifying variant cheating fields - Google Patents

Info

Publication number
CN109241523B
CN109241523B CN201810907161.7A CN201810907161A CN109241523B CN 109241523 B CN109241523 B CN 109241523B CN 201810907161 A CN201810907161 A CN 201810907161A CN 109241523 B CN109241523 B CN 109241523B
Authority
CN
China
Prior art keywords
variant
cheating
character
digital
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810907161.7A
Other languages
Chinese (zh)
Other versions
CN109241523A (en
Inventor
陈玉焓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810907161.7A priority Critical patent/CN109241523B/en
Publication of CN109241523A publication Critical patent/CN109241523A/en
Application granted granted Critical
Publication of CN109241523B publication Critical patent/CN109241523B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, a device and equipment for identifying variant cheating fields. The method comprises the following steps: acquiring a text to be identified; extracting a number paragraph from the text to be identified; performing variant word conversion on the characters in the number paragraph and performing guide word matching; if a guide word is matched, judging the number paragraph to be a variant cheating field; if no guide word is matched, extracting variant features from the number paragraph and scoring according to the variant features to generate a score value; and if the score value is larger than a preset threshold value, judging the number paragraph to be a variant cheating field. This solves the problems that discontinuous number segments cannot be identified and that variant cheating fields cannot be identified when no guide word is matched, and improves the accuracy of identifying variant cheating fields.

Description

Method, device and equipment for identifying variant cheating fields
Technical Field
The invention relates to the technical field of internet, in particular to a method, a device and equipment for identifying variant cheating fields.
Background
With the rapid development of internet technology, networks have become the main way for people to communicate and release information. However, variant cheating fields carrying contact information often appear on the internet, such as "← → mike ← → 199 ← → 2638 ← → 723 →" and "jia dimension → I0230 burn 66183".
In the related technology, a first scheme matches guide words such as WeChat, telephone and mail, and identifies the variant cheating field according to the matching result. This scheme has low accuracy: for example, ordinary sentences such as 'I know him on WeChat' and 'he well believes' can be recognized as variant cheating fields, and when no guide word is matched, the variant cheating field cannot be recognized at all. A second scheme matches a text paragraph against regular expressions for WeChat IDs, telephone numbers and URL links; if the text paragraph matches a corresponding number field, the paragraph is identified as a variant cheating field. This scheme also has low accuracy and cannot match discontinuous number segments.
Disclosure of Invention
Embodiments of the present invention aim to address, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the embodiments of the present invention is to provide a method for identifying variant cheating fields, so as to solve the problems in the related art that discontinuous number segments cannot be identified and that variant cheating fields cannot be identified when no guide word is matched, thereby improving the accuracy of identifying variant cheating fields.
A second object of an embodiment of the present invention is to provide an apparatus for identifying variant cheating fields.
A third objective of an embodiment of the present invention is to provide a computer device.
It is a fourth object of embodiments of the invention to provide a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a method for identifying a variant cheating field, including:
acquiring a text to be identified;
extracting a number paragraph from the text to be recognized;
performing variant word conversion on the characters in the number paragraph and performing guide word matching;
if a guide word is matched, judging that the number paragraph is a variant cheating field;
if no guide word is matched, extracting variant features from the number paragraph, and scoring according to the variant features to generate a score value;
and if the score value is larger than a preset threshold value, judging that the number paragraph is a variant cheating field.
The method for identifying the variant cheating field first obtains a text to be identified and extracts a number paragraph from it. It then performs variant word conversion on the characters in the number paragraph and performs guide word matching; when a guide word is matched, the number paragraph is judged to be a variant cheating field. When no guide word is matched, variant features are extracted from the number paragraph and scored to generate a score value, and when the score value is larger than a preset threshold value, the number paragraph is judged to be a variant cheating field. In this embodiment, a number paragraph that includes discontinuous number segments can be extracted from the text to be identified, which solves the problem in the related technology that discontinuous number segments cannot be identified. Because variant features are extracted and scored when no guide word is matched, variant cheating fields can still be identified from the score value in that case. Moreover, scoring on the combined variant features yields a rule-based identification strategy, which improves the accuracy of the algorithm and of variant cheating field identification.
In addition, the identification method of the variant cheating field according to the above embodiment of the present invention may further have the following additional technical features:
Optionally, before extracting the number paragraph from the text to be recognized, the method further includes: performing digit variant normalization on the text to be recognized.
Optionally, performing digit variant normalization on the text to be recognized includes: performing digit variant normalization on the text to be recognized according to a variant contact information database.
Optionally, the method for identifying the variant cheating field further includes:
converting characters in the text to be recognized into character pictures;
comparing each character picture with digit pictures to generate a similarity value;
and converting the character corresponding to a character picture whose similarity value is larger than a preset similarity threshold into the digit of the corresponding digit picture.
Optionally, extracting the number paragraph from the text to be recognized includes:
acquiring the position of a special character in the text to be recognized;
and extracting the character strings before and after the special character that conform to a preset rule, and adding the special character and those character strings into the number paragraph.
Optionally, the preset rule is:
taking the special character as the center, judging forwards or backwards whether the character at a preset character interval is a digit;
and if so, adding the characters between the special character and the character at the preset character interval into the number paragraph.
Optionally, the method for identifying the variant cheating field further includes:
removing the interference symbols in the number paragraph.
Optionally, extracting variant features from the number paragraph and scoring according to the variant features to generate the score value includes:
performing pinyin normalization on the number paragraph;
extracting variant features from the pinyin-normalized number paragraph;
extracting abstract features from the pinyin-normalized number paragraph;
and scoring according to the variant features and the abstract features to generate the score value.
In order to achieve the above object, an embodiment of a second aspect of the present invention provides an apparatus for identifying variant cheating fields, including:
the acquisition module is used for acquiring a text to be recognized;
the extraction module is used for extracting a number paragraph from the text to be recognized;
the matching module is used for performing variant word conversion on the characters in the number paragraph and performing guide word matching;
the first judgment module is used for judging the number paragraph to be a variant cheating field if a guide word is matched;
the scoring module is used for extracting variant features from the number paragraph if no guide word is matched, and scoring according to the variant features to generate a score value;
and the second judgment module is used for judging the number paragraph to be a variant cheating field if the score value is greater than a preset threshold value.
The device for identifying the variant cheating field provided by the embodiment of the invention obtains the text to be identified and extracts the number paragraph from it, then performs variant word conversion on the characters in the number paragraph and performs guide word matching. When a guide word is matched, the device judges the number paragraph to be a variant cheating field. When no guide word is matched, it extracts variant features from the number paragraph, scores them to generate a score value, and judges the number paragraph to be a variant cheating field when the score value is larger than a preset threshold value. This solves the problems in the related technology that discontinuous number segments cannot be identified and that variant cheating fields cannot be identified when no guide word is matched, and improves the accuracy of identifying variant cheating fields.
In addition, the identification apparatus of the variant cheating field according to the above embodiment of the present invention may further have the following additional technical features:
Optionally, the device for identifying a variant cheating field further includes: a conversion module, used for performing digit variant normalization on the text to be recognized.
Optionally, the conversion module is specifically configured to: perform digit variant normalization on the text to be recognized according to a variant contact information database.
Optionally, the conversion module is specifically configured to:
convert characters in the text to be recognized into character pictures;
compare each character picture with digit pictures to generate a similarity value;
and convert the character corresponding to a character picture whose similarity value is larger than a preset similarity threshold into the digit of the corresponding digit picture.
Optionally, the extraction module is specifically configured to:
acquire the position of a special character in the text to be recognized;
and extract the character strings before and after the special character that conform to a preset rule, and add the special character and those character strings into the number paragraph.
Optionally, the preset rule is:
taking the special character as the center, judging forwards or backwards whether the character at a preset character interval is a digit;
and if so, adding the characters between the special character and the character at the preset character interval into the number paragraph.
Optionally, the device for identifying a variant cheating field further includes: a processing module, used for removing the interference symbols in the number paragraph.
Optionally, the scoring module is specifically configured to:
perform pinyin normalization on the number paragraph;
extract variant features from the pinyin-normalized number paragraph;
extract abstract features from the pinyin-normalized number paragraph;
and score according to the variant features and the abstract features to generate the score value.
To achieve the above object, an embodiment of a third aspect of the present invention provides a computer device, including a processor and a memory, wherein the processor runs a program corresponding to executable program code by reading the executable program code stored in the memory, so as to implement the method for identifying the variant cheating field according to the embodiment of the first aspect.
To achieve the above object, an embodiment of a fourth aspect of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method for identifying a variant cheating field according to the embodiment of the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a schematic flowchart of a method for identifying a variant cheating field according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another method for identifying a variant cheating field according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of yet another method for identifying a variant cheating field according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an apparatus for identifying a variant cheating field according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another apparatus for identifying a variant cheating field according to an embodiment of the present invention;
Fig. 6 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
The following describes a method, an apparatus and a device for identifying a variant cheating field according to an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a method for identifying a variant cheating field according to an embodiment of the present invention. As shown in Fig. 1, the method for identifying the variant cheating field includes:
Step 101, obtaining a text to be recognized.
In this embodiment, in order to identify the variant cheating field, the text to be recognized needs to be obtained first.
In one embodiment of the invention, articles, comments, pushed messages and the like can be acquired from the internet as the text to be recognized. For example, the WeChat Moments push message "XX benefit fitment: no APP needs to be installed; identify the WeChat applet and then apply for fitment staging." can be acquired as the text to be recognized. For another example, the comment "a meeting is a great joy, a meeting in mind, and even the most beautiful landscape of a life. ← → wei ware → 199 ← → 2638 ← → 723 ← →" can be acquired as the text to be recognized.
Step 102, extracting a number paragraph from the text to be recognized.
In practical applications, a variant cheating field usually includes contact information, and contact information (such as a WeChat ID or a telephone number) usually consists of digits. Therefore, to identify the variant cheating field, a number paragraph needs to be extracted from the text to be recognized. A number paragraph is a paragraph that contains digits; it may include digits, Chinese characters, letters, special symbols, and the like.
There are various implementations of extracting a number paragraph from a text to be recognized, for example, as follows:
as a possible implementation manner, the position of the special character in the text to be recognized can be obtained, and then whether the character at the preset character interval is a number is judged forwards or backwards by taking the special character as the center, if so, the characters between the special character and the character at the preset character interval are added into the number paragraph. The special character may be a single numeric character, may be a continuous numeric character segment, or may be a character that hits a regular position of the character string (for example, the beginning and the end of the character string). The preset character interval can be obtained according to a large amount of experimental data, and can be set by a person skilled in the art.
In this embodiment, a number paragraph including consecutive number segments, such as "plus little letter 12345678" and the like, may be extracted from the text to be recognized, and a number paragraph including non-consecutive number segments, such as "1.3-4. -little 8-9 letter-0 plus this", and the like, may also be extracted.
Step 103, performing variant word conversion on the characters in the number paragraph and performing guide word matching.
In an embodiment of the present invention, a variant word conversion list may be preset, in which the variant word conversion relationships are stored, so that variant word conversion is performed on the number paragraph by looking up the list. For example, the number paragraph "jia dimension 136919 encounter 71634" is converted into "micro 136919 encounter 71634" through variant word conversion.
The variant word conversion list can be built from variant word sample data collected online, and can also be set by persons skilled in the art as needed. For example, the variant word conversion list may include: "syndrome", "micro", "plus", "v", "micro", etc.
In this embodiment, after variant word conversion is performed on the number paragraph, guide word matching is performed so that the variant cheating field can be identified according to the matching result. As a possible implementation manner, a guide word list may be preset, and guide word matching is then performed on the converted number paragraph according to the guide word list. For example, the guide word may be "WeChat", "public number", etc.
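The table lookup and guide word matching described above can be sketched as follows. This is only an illustrative sketch: the entries of `VARIANT_WORD_MAP` and `GUIDE_WORDS` are hypothetical examples chosen here (微信 is WeChat, 公众号 is public account), not the lists used by the patent, and per-character substitution is an assumed simplification.

```python
# Hypothetical conversion table: variant character -> normal character.
# A real table would be built from collected variant word samples.
VARIANT_WORD_MAP = {"薇": "微", "威": "微", "v": "微", "V": "微"}

# Hypothetical guide word list.
GUIDE_WORDS = ["微信", "公众号"]


def convert_variant_words(paragraph: str) -> str:
    """Replace each variant character with its normal form, if one is listed."""
    return "".join(VARIANT_WORD_MAP.get(ch, ch) for ch in paragraph)


def match_guide_word(paragraph: str):
    """Return the first guide word contained in the paragraph, or None."""
    for word in GUIDE_WORDS:
        if word in paragraph:
            return word
    return None


converted = convert_variant_words("加薇信12345678")
matched = match_guide_word(converted)
```

Single-character substitution keeps the sketch short; multi-character variants would need a longest-match lookup instead.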
Step 104, if a guide word is matched, judging the number paragraph to be a variant cheating field.
In this embodiment, if a guide word is matched, the number paragraph is determined to be a variant cheating field. For example, if the number paragraph "plus WeChat 12345678" matches the guide word "WeChat", the number paragraph is determined to be a variant cheating field.
Step 105, if no guide word is matched, extracting variant features from the number paragraph, and scoring according to the variant features to generate a score value.
In one embodiment of the invention, variant features may be extracted directly from the number paragraph by a relevant feature extraction algorithm.
In one embodiment of the invention, the Chinese characters, digits and the like in the number paragraph can first be normalized into pinyin, and variant features can then be extracted from the pinyin-normalized number paragraph. For example, pinyin normalization changes "plus Wenxin 12345678" into "jiaweiixinyiiersinsiwuqiba".
The variant features can be derived from a large amount of experimental data, and can also be set by persons skilled in the art as needed. For example, the variant features may include the ratio of special abnormal symbols, the number of character strings, and the like.
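A minimal sketch of feature extraction along these lines is given below. The feature names and their exact definitions are assumptions made for illustration: "special symbol ratio" is taken to be the share of characters that are neither alphanumeric nor whitespace, and "number of character strings" is interpreted as the number of separate digit runs.

```python
import re


def extract_variant_features(paragraph: str) -> dict:
    """Compute a few illustrative variant features for a number paragraph."""
    total = len(paragraph)
    if total == 0:
        return {"special_symbol_ratio": 0.0, "digit_run_count": 0, "digit_count": 0}
    # Characters that are neither letters/digits (including CJK) nor whitespace.
    special = sum(1 for ch in paragraph if not ch.isalnum() and not ch.isspace())
    # Maximal runs of ASCII digits, e.g. "1-3-4" has three runs.
    digit_runs = re.findall(r"[0-9]+", paragraph)
    return {
        "special_symbol_ratio": special / total,
        "digit_run_count": len(digit_runs),
        "digit_count": sum(len(run) for run in digit_runs),
    }


features = extract_variant_features("1-3-4")
```

A paragraph whose digits are split into many short runs by separators scores high on both the ratio and the run count, which is the pattern the discontinuous examples in the text exhibit.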
In this embodiment, after the variant features are extracted from the number paragraphs, scores may be scored according to the variant features in a variety of ways to generate score values.
As a possible implementation manner, a scoring formula may be preset, and after extracting variant features from a digit number paragraph, the variant features are substituted into the scoring formula to score to generate a score value. The scoring formula can be obtained according to a large amount of experimental data, and can also be set by a person skilled in the art according to needs.
As another possible implementation manner, number paragraphs may be collected online as sample data, the parameters of a neural network model are trained on the sample data to generate a scoring model, and after variant features are extracted from a number paragraph, the variant features are input into the scoring model to generate a score value.
It should be noted that the above implementation manner of scoring according to the variant features to generate the score value is merely exemplary, and the score value may be generated by only one manner, or may be generated by combining a plurality of manners, which is not limited herein.
Step 106, if the score value is larger than a preset threshold value, judging that the number paragraph is a variant cheating field.
In this embodiment, the higher the score value is, the higher the possibility that the number paragraph is a variant cheating field is, whereas the lower the score value is, the lower the possibility that the number paragraph is a variant cheating field is.
Optionally, the obtained score value may be compared with a preset threshold, and then, it is determined that the digital number paragraph with the score value larger than the preset threshold is the variant cheating field, and the digital number paragraph with the score value smaller than or equal to the preset threshold is not the variant cheating field.
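The preset scoring formula and the threshold comparison might look like the sketch below. The linear form, the weights in `WEIGHTS` and the value of `THRESHOLD` are all hypothetical; the text only says that the formula and threshold come from experimental data or are set as needed.

```python
# Hypothetical weights for a linear scoring formula.
WEIGHTS = {"special_symbol_ratio": 2.0, "digit_run_count": 0.3, "digit_count": 0.1}

# Hypothetical preset threshold.
THRESHOLD = 1.5


def score(features: dict) -> float:
    """Weighted sum of the features; missing features count as zero."""
    return sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())


def is_variant_cheating_field(features: dict, threshold: float = THRESHOLD) -> bool:
    """Strictly greater than the threshold, as stated in step 106."""
    return score(features) > threshold


example = {"special_symbol_ratio": 0.4, "digit_run_count": 3, "digit_count": 3}
flagged = is_variant_cheating_field(example)
```

A score equal to the threshold is not flagged, matching the statement that paragraphs at or below the threshold are not variant cheating fields.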
In this embodiment, refining the number paragraph extraction process solves the problem in the related technology that discontinuous number segments cannot be identified. When no guide word is matched, variant features are extracted from the number paragraph and a score value is generated from them, so that variant cheating fields can still be identified by score. Moreover, scoring on the combined variant features yields a rule-based identification strategy, which improves the accuracy of the algorithm and of variant cheating field identification.
In summary, in the method for identifying the variant cheating field according to the embodiment of the present invention, the text to be identified is obtained, the number paragraph is further extracted from the text to be identified, the words in the number paragraph are further subjected to variant word conversion and subjected to leader word matching, when the leader word is matched, the number paragraph is determined as the variant cheating field, when the leader word is not matched, the variant feature is extracted from the number paragraph, and scoring is performed according to the variant feature to generate the score value, and when the score value is greater than the preset threshold value, the number paragraph is determined as the variant cheating field. Therefore, the problems that the discontinuous digital fragments cannot be identified and the variant cheating fields cannot be matched and identified without the guide words in the related technology are solved, and the accuracy of identifying the variant cheating fields is improved.
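The decision flow of steps 103 to 106 can be summarized in a short sketch. The function and parameter names are hypothetical, the conversion and scoring routines are passed in as callables rather than specified, and scoring the unconverted paragraph is an assumption, since the text does not say which form is scored.

```python
from typing import Callable, Iterable


def identify_variant_cheating_field(
    paragraph: str,
    convert_variant_words: Callable[[str], str],
    guide_words: Iterable[str],
    score_paragraph: Callable[[str], float],
    threshold: float,
) -> bool:
    """Return True if the number paragraph is judged a variant cheating field."""
    # Step 103: variant word conversion followed by guide word matching.
    converted = convert_variant_words(paragraph)
    # Step 104: a matched guide word decides the case immediately.
    if any(word in converted for word in guide_words):
        return True
    # Steps 105 and 106: otherwise fall back to feature scoring and thresholding.
    return score_paragraph(paragraph) > threshold
```

The scoring branch is only reached when guide word matching fails, which is what lets the method flag paragraphs that contain no guide word at all.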
For a clearer explanation of the present invention, the extraction of a number paragraph from the text to be recognized is described in detail below. Fig. 2 is a schematic flowchart of another method for identifying a variant cheating field according to an embodiment of the present invention. As shown in Fig. 2, after the text to be recognized is obtained, the method includes:
Step 201, performing digit variant normalization on the text to be recognized.
In this embodiment, before a number paragraph is extracted from the text to be recognized, digit variant normalization may be performed on the text, converting the digit variants in it into normal digits. This addresses the problems in the related art that digit variants (such as "1 -> one", "2 -> two", "8 -> 〥") cannot be recognized and that coverage is low.
In one embodiment of the invention, the text to be recognized may be subjected to numerical variant normalization according to a variant contact information database.
For example, variant contact information found online can be collected regularly as negative samples, and the correspondence between variant alphanumeric characters and normal alphanumeric characters is stored in the variant contact information database. The text to be recognized is then matched against the database, and the variant alphanumeric characters in the text are converted into normal alphanumeric characters according to the matching result and the correspondence.
The variant alphanumeric characters include, but are not limited to, Chinese numerals (一 -> 1), RMB (financial) numerals (壹 -> 1), circled numbers (Figure BDA0001760918870000071), variant letters (Figure BDA0001760918870000072), and the like. The correspondence between variant and normal alphanumeric characters may be stored in a trie structure, a dict structure, or other similar structures, without limitation.
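A dict-based version of this normalization could be sketched as below. The table contents are an assumed minimal example covering Chinese numerals, financial numerals and circled digits; the real variant contact information database is built from collected samples and is not reproduced here.

```python
# Hypothetical correspondence table: variant digit character -> normal digit.
VARIANT_TO_NORMAL = {"〇": "0"}
for value, ch in enumerate("零一二三四五六七八九"):    # Chinese numerals
    VARIANT_TO_NORMAL[ch] = str(value)
for value, ch in enumerate("零壹贰叁肆伍陆柒捌玖"):    # RMB (financial) numerals
    VARIANT_TO_NORMAL[ch] = str(value)
for value in range(1, 10):                            # circled digits ① to ⑨
    VARIANT_TO_NORMAL[chr(0x2460 + value - 1)] = str(value)

# Precompute a translation table for single-character replacement.
_TRANSLATION = str.maketrans(VARIANT_TO_NORMAL)


def normalize_digit_variants(text: str) -> str:
    """Replace every listed variant digit character with its normal digit."""
    return text.translate(_TRANSLATION)


normalized = normalize_digit_variants("一③捌")
```

Unconditional replacement also rewrites ordinary uses of characters such as 一, so a practical system would likely restrict the conversion by context; that refinement is outside this sketch.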
In an embodiment of the present invention, the characters in the text to be recognized may also be converted into character pictures, and then the character pictures are compared with the digital pictures in terms of similarity to generate a similarity value, and further, the characters corresponding to the character pictures with the similarity value greater than a preset similarity threshold are converted into the numbers corresponding to the corresponding digital pictures.
For example, the text to be recognized may be converted into a Unicode-encoded segment, each character in the segment converted into a character picture, and the character picture compared with preset digit pictures. The similarity value of the pictures is calculated through a perceptual hash algorithm (pHash) and compared with a preset similarity threshold; a character whose character picture has a similarity value greater than the preset similarity threshold is converted into the digit of the corresponding digit picture.
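The picture comparison step can be illustrated with a deliberately simplified sketch. It uses an average hash, a simpler relative of the DCT-based pHash named in the text, and it assumes the character and digit pictures are already available as small grayscale pixel grids; rendering characters to pictures is omitted. All names and the 0.9 threshold are hypothetical.

```python
def average_hash(pixels):
    """Flatten a grayscale grid and mark each pixel as above (1) or not above (0) the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def similarity(hash_a, hash_b):
    """Fraction of positions at which two equal-length hashes agree."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a == b for a, b in zip(hash_a, hash_b)) / len(hash_a)


def match_digit(char_pixels, digit_templates, threshold=0.9):
    """Return the best-matching digit if its similarity exceeds the threshold, else None."""
    char_hash = average_hash(char_pixels)
    best_digit, best_sim = None, 0.0
    for digit, template in digit_templates.items():
        sim = similarity(char_hash, average_hash(template))
        if sim > best_sim:
            best_digit, best_sim = digit, sim
    return best_digit if best_sim > threshold else None


# Toy 3x3 templates standing in for rendered digit pictures.
TEMPLATES = {
    "1": [[0, 255, 0], [0, 255, 0], [0, 255, 0]],
    "7": [[255, 255, 255], [0, 0, 255], [0, 0, 255]],
}
result = match_digit([[0, 250, 0], [10, 240, 0], [0, 255, 5]], TEMPLATES)
```

The advantage of comparing pictures is that visually similar glyphs can be caught even when they are not listed in any conversion table.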
Step 202, acquiring the position of the special character in the text to be recognized.
The special characters can be single numeric characters, continuous numeric character segments, characters hitting the regular positions of the character strings, and the like.
As a possible implementation manner, matching may be performed on the text to be recognized in a regular matching manner, so as to obtain the special characters in the text to be recognized, and obtain the positions of the special characters.
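Locating special characters by regular matching could be sketched as follows, assuming for illustration that the special characters are runs of ASCII digits; the function name and the pattern are not taken from the patent.

```python
import re

# Assumed pattern: one or more consecutive ASCII digits.
DIGIT_RUN = re.compile(r"[0-9]+")


def find_special_characters(text: str):
    """Return (start, end, matched_text) for every digit run in the text."""
    return [(m.start(), m.end(), m.group()) for m in DIGIT_RUN.finditer(text)]


positions = find_special_characters("ab12c3")
```

Each tuple gives a half-open index range, so the positions can be used directly as centers for the extraction rule of step 203.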
Step 203, extracting the character strings before and after the special character according with the preset rule, and adding the special character and the character strings before and after the special character according with the preset rule into the digital number paragraph.
In one embodiment of the present invention, the preset rule may be: taking the special character as the center, judging forwards or backwards whether the character at the preset character interval is a digit, and if so, adding the characters between the special character and the character at the preset character interval into the number paragraph.
For example, the preset character interval is 5, the text to be recognized is 'plus-1-2-3', and the special character is 'plus'. The character at an interval of 5 characters after the special character 'plus' is '3'; because '3' is a digit, 'plus-1-2-3' is added to the number paragraph.
For another example, the preset character interval is 10. If the character at an interval of 10 after the special character is not a digit, but the character at an interval of 9 is a digit, then the special character and the characters up to the interval of 9 are all added to the number paragraph.
The preset character interval can be obtained according to a large amount of experimental data, and can also be set by a person skilled in the art.
Therefore, the number paragraphs including continuous number segments can be extracted from the text to be recognized, and the number paragraphs including non-continuous number segments can also be extracted.
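The extraction rule of steps 202 and 203 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names `extend` and `extract_number_paragraph`, the use of `re.search` to locate a single special character, and the choice to shrink the interval one character at a time (following the interval-10 example above) are not taken from the patent text.

```python
import re


def extend(text, pos, interval, step):
    """Return the farthest index within `interval` characters of `pos`
    (in direction `step`, +1 or -1) that holds a digit; `pos` if none."""
    for k in range(interval, 0, -1):  # try the full interval first, then shrink
        j = pos + step * k
        if 0 <= j < len(text) and text[j].isdigit():
            return j
    return pos


def extract_number_paragraph(text, special_pattern, interval):
    """Locate the special character(s) with a regular expression and grow
    the span forward and backward according to the preset rule."""
    match = re.search(special_pattern, text)
    if match is None:
        return None
    start = extend(text, match.start(), interval, -1)
    end = extend(text, match.end() - 1, interval, +1)
    return text[start:end + 1]


# Hypothetical usage: the word "plus" stands in for the special character.
print(extract_number_paragraph("call plus-1-2-3 now", r"plus", 6))
```

Because the interval is tried from largest to smallest, both continuous and interleaved digit runs end up inside the extracted paragraph.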
In one embodiment of the present invention, after the number paragraph is acquired, the interference symbols in the number paragraph can be removed. The interference symbols include, but are not limited to, abnormal interference symbols and special symbols, such as ︻, ☆, and the like (two further example symbols appear only as inline images, BDA0001760918870000081 and BDA0001760918870000082, in the original publication).
Alternatively, the interference symbols in the number paragraph may be identified based on Unicode code ranges. That is, if a character falls within a Unicode code range of some special symbol set, the character is treated as an interference symbol, removed, and counted in the calculation of the special symbol ratio. The Unicode code ranges of the special symbol set may be, for example, (\u2600-\u26FF, \u2700-\u27BF).
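A minimal Python sketch of this removal step is given below. It assumes the two Unicode ranges named above are the whole special symbol set, and the function name `strip_interference` is hypothetical. Note that a symbol such as ︻ (U+FE3B) lies outside these two ranges, so a real symbol set would presumably need additional ranges.

```python
import re

# Character class built from the two example ranges given in the text.
INTERFERENCE = re.compile("[\u2600-\u26FF\u2700-\u27BF]")


def strip_interference(paragraph):
    """Remove interference symbols and return (cleaned text, special symbol ratio)."""
    removed = len(INTERFERENCE.findall(paragraph))
    cleaned = INTERFERENCE.sub("", paragraph)
    # Ratio of removed symbols to all symbols in the original paragraph.
    ratio = removed / len(paragraph) if paragraph else 0.0
    return cleaned, ratio


cleaned, ratio = strip_interference("1☆2☆3")
print(cleaned, ratio)
```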
The method for identifying variant cheating fields of this embodiment identifies variant numbers by performing digital variant normalization on the text to be recognized. Furthermore, by acquiring the position of the special character in the text to be recognized, extracting the character strings before and after the special character that conform to the preset rule, and adding the special character and those character strings into the number paragraph, number paragraphs containing either continuous or discontinuous number segments can be extracted from the text to be recognized. This solves the problem in the related art that discontinuous number segments cannot be recognized, and refines the number paragraph extraction process. In addition, interference symbols in the number paragraph can be removed, which further improves the accuracy of identifying variant cheating fields.
Based on the embodiment, further, the variant feature and the abstract feature can be extracted from the pinyin normalized number paragraph, the score value is obtained according to the variant feature and the abstract feature, and the variant cheating field is identified according to the score value.
Fig. 3 is a schematic flowchart of another variant cheating field identification method according to an embodiment of the present invention, and as shown in fig. 3, after a number paragraph is extracted from a text to be identified, the variant cheating field identification method further includes:
step 301, performing variant character conversion on characters in the number paragraph and performing leading word matching.
In one embodiment of the present invention, a guide word list may be preset and a guide word matching feature M_guide may be established. Guide word matching is then performed, according to the guide word list, on the number paragraph after variant character conversion: when a guide word is matched, M_guide = 1; when no guide word is matched, M_guide = 0.
It should be noted that the explanation of the variant word conversion performed on the number segment in the foregoing embodiment is also applicable to step 301, and is not described herein again.
Step 302, performing pinyin normalization on the number paragraphs.
In an embodiment of the present invention, a database may be preset that stores the correspondence between Chinese characters, numbers, and pinyin, and the Chinese characters and numbers in the number paragraph are converted into pinyin according to this correspondence. For example, a number paragraph reading "add WeChat 12345678" in Chinese is pinyin-normalized to "jiaweixinyiersansiwuliuqiba".
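The lookup described above can be sketched with a small in-memory table. The table `PINYIN_TABLE` and the function `pinyin_normalize` are hypothetical stand-ins for the preset database, and the three Chinese characters in the table are assumed sample entries, not data from the patent.

```python
# Hypothetical excerpt of the character/number-to-pinyin correspondence.
PINYIN_TABLE = {
    "加": "jia", "微": "wei", "信": "xin",
    "0": "ling", "1": "yi", "2": "er", "3": "san", "4": "si",
    "5": "wu", "6": "liu", "7": "qi", "8": "ba", "9": "jiu",
}


def pinyin_normalize(paragraph):
    """Replace every character found in the table by its pinyin;
    characters without an entry are kept unchanged."""
    return "".join(PINYIN_TABLE.get(ch, ch) for ch in paragraph)


print(pinyin_normalize("加微信12345678"))
```

A production table would also need entries for homophone variants, since the homophone features of step 303 depend on them.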
Step 303, extracting variant features from the pinyin normalized number paragraphs.
Wherein the variant features are exemplified as follows:
E-distance_guide: variant guide word edit distance. The edit distance (also called Levenshtein distance) is the minimum number of editing operations required to change one string into another. For example, if the text before pinyin normalization is "Weixin" and the guide word is "WeChat", the edit distance is 1.
Spec_ratio: proportion of special abnormal symbols, calculated as Spec_ratio = len(spec)/len(comment), where len(spec) is the number of special symbols and len(comment) is the total number of symbols.
Guide_pinyin: number of characters that are pinyin homophones of guide word characters. For example (Gal -> jia, Wei -> wei, Xin -> xin).
Digit_pinyin: number of digit homophones. For example (two -> er, take it a walk -> liuliuqi).
Distance_gd: distance between the guide word and the number string, for example the distance between the pinyin guide word "weixin" and the number string, calculated as Distance_gd = Pos_guide - Pos_digit-seq.
Match_g: guide word connection degree after pinyin normalization. For example, if the pinyin-normalized text hits a complete word such as "weixin", "jiawei", "gongzhonghao", or "jiaq", the connection degree is 1; if it hits only a single character such as "wei", "dian", "jia", or "q", the connection degree is 0.
E-distance_digit: number string edit distance. For example, if a number string is "one 37 eight 77", the number string edit distance is 3.
Seq_number: number of number strings. A number string Seq is defined as a run of consecutive alphanumeric segments (separated by fewer than 2 Chinese characters) containing between 5 and 11 alphanumeric characters.
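The edit distance used by E-distance_guide and E-distance_digit is the standard Levenshtein distance, which can be computed with the usual dynamic programming recurrence. The sketch below is a generic implementation and not code from the patent; the function name `levenshtein` and the sample strings are chosen for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# A one-letter omission from the pinyin guide word gives distance 1.
print(levenshtein("weixin", "weixn"))
```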
Step 304, abstract features are extracted from the pinyin normalized number paragraphs.
In this embodiment, the number segment after pinyin normalization may be further abstracted and normalized into an abstract string, and abstract features may be extracted from the abstract string.
As an example, the normalization rule is: a character that hits the pinyin of a guide word is mapped to 2, a character that hits the pinyin of a numeric character is mapped to 1, a character that hits a special symbol is mapped to 3, and every other character is mapped to 0. For example, if the number paragraph is "Jia Ci ィ [ 0 Lin Tri Ji wine lacquer 4 years old ] Perkin is Jiji", the abstract string is "2223111111111300000000".
The abstract string is denoted Seq-ab, and the abstract features are described as follows:
var-w: contact information feature variance, calculated as var(Seq-ab).
Digit-var: number dispersion, calculated as var(pos1), the variance of the positions at which 1 (representing a numeric or alphabetic character) appears in the abstract string.
Guide-var: guide word dispersion, calculated as var(pos2), the variance of the positions at which 2 (representing the guide word) appears in the abstract string.
Spec-var: special symbol dispersion, calculated as var(pos3), the variance of the positions at which 3 (representing a special symbol) appears in the abstract string.
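The four abstract features can be computed directly from an abstract string, as in the following sketch. It assumes that var denotes the population variance and that a label which never occurs yields a dispersion of 0; the patent text specifies neither, and the function names are hypothetical.

```python
from statistics import pvariance


def dispersion(abstract, label):
    """Population variance of the positions at which `label` occurs
    (0.0 when the label does not occur; an assumption of this sketch)."""
    positions = [i for i, ch in enumerate(abstract) if ch == label]
    return pvariance(positions) if positions else 0.0


def abstract_features(abstract):
    """Compute var-w, Digit-var, Guide-var, and Spec-var for an abstract string."""
    values = [int(ch) for ch in abstract]
    return {
        "var-w": pvariance(values) if values else 0.0,
        "Digit-var": dispersion(abstract, "1"),
        "Guide-var": dispersion(abstract, "2"),
        "Spec-var": dispersion(abstract, "3"),
    }


print(abstract_features("2223111111111300000000"))
```

Intuitively, a tightly clustered run of 1s gives a small Digit-var, while digits scattered through the paragraph give a large one.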
Step 305, scoring according to the variant features and the abstract features to generate a score value.
Step 306, if the score value is greater than the preset threshold, determining that the number paragraph is a variant cheating field.
In this embodiment, the number segment may be scored in combination with a scoring formula and an xgboost model.
In an embodiment of the present invention, a score may be computed from the variant features and the guide word matching feature by a formula to estimate the likelihood that the number paragraph contains contact information. The scoring formula is as follows:
[Scoring formula: image BDA0001760918870000111 in the original publication; not reproduced in this text.]
If score is greater than or equal to 2, it is judged that the number paragraph contains contact information.
In an embodiment of the invention, variant cheating fields containing normal or variant contact information can be selected as negative samples, and normal fields without contact information can be selected as positive samples; the xgboost model is trained on this sample data, and the score value is generated by the xgboost model according to the variant features and the abstract features.
The variant features and the abstract features are the features extracted in step 303 and step 304, namely:
var-w: contact information feature variance.
Digit-var: number dispersion.
Guide-var: guide word dispersion.
Spec-var: special symbol dispersion.
E-distance_guide: variant guide word edit distance.
Spec_ratio: proportion of special abnormal symbols.
Guide_pinyin: number of pinyin homophones of guide word characters.
Digit_pinyin: number of digit homophones.
Distance_gd: distance between the guide word and the number string.
Match_g: guide word connection degree after pinyin normalization.
E-distance_digit: number string edit distance.
Seq_number: number of number strings.
In this embodiment, when the score values of the scoring formula and the xgboost model are both greater than the preset threshold, the digital number paragraph may be determined as the variant cheating field.
It should be noted that the above explanation of scoring according to the variant features and the abstract features to generate the score value is only exemplary. In another embodiment of the present invention, when the guide word is matched, the number paragraph may be determined to be a variant cheating field; when the guide word is not matched, the number paragraph may be pinyin-normalized, the variant features and the abstract features may be extracted from the normalized number paragraph, scoring may be performed according to these features through the relevant formula and model, and the variant cheating field may be identified according to the score value.
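The two-stage decision of this alternative embodiment can be sketched as follows. The function `is_variant_cheating`, the plain substring test for guide word matching, and the `score_fn` callable (standing in for the scoring formula and/or the trained xgboost model) are all assumptions made for illustration.

```python
def is_variant_cheating(paragraph, guide_words, score_fn, threshold):
    """Return True if the number paragraph is judged a variant cheating field."""
    # Stage 1: guide word matching feature M_guide (1 if any guide word is hit).
    m_guide = int(any(word in paragraph for word in guide_words))
    if m_guide == 1:
        return True
    # Stage 2: no guide word matched, so fall back to the feature-based score.
    return score_fn(paragraph) > threshold


# Hypothetical usage with a dummy scorer that always returns 0.9.
print(is_variant_cheating("yiersansiwu", ["weixin"], lambda p: 0.9, 0.5))
```

Keeping the scorer behind a callable lets the formula-based score and the model-based score be swapped or combined without touching the decision logic.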
The method for identifying the variant cheating field in the embodiment of the invention performs pinyin normalization on the digital number paragraph and extracts variant characteristics and abstract characteristics from the pinyin normalized digital number paragraph. And furthermore, by combining the matching characteristics, the variant characteristics and the abstract characteristics of the guide words and integrating two scoring modes of a formula and a model to generate scoring values, the variant cheating fields are identified according to the scoring values, and the accuracy of identifying the variant cheating fields is further improved. In the embodiment, the regularized variant identification strategy is realized through two different scoring modes, the algorithm accuracy is improved, and the algorithm generalization capability is greatly improved by starting from the aspects of special symbol proportion calculation, variant normalization, number and guide word position distribution abstraction, model automatic training and the like.
In order to implement the foregoing embodiment, the present invention further provides a device for identifying variant cheating fields, fig. 4 is a schematic structural diagram of the device for identifying variant cheating fields provided in the embodiment of the present invention, and as shown in fig. 4, the device for identifying variant cheating fields includes: the system comprises an acquisition module 100, an extraction module 200, a matching module 300, a first judgment module 400, a scoring module 500 and a second judgment module 600.
The obtaining module 100 is configured to obtain a text to be recognized.
An extracting module 200, configured to extract a number paragraph from the text to be recognized.
The matching module 300 is configured to perform variant word conversion on the characters in the number paragraph and perform leader word matching.
The first judging module 400 is configured to judge that the number paragraph is the variant cheating field if the leading word is matched.
And the scoring module 500 is used for extracting variant features from the number paragraphs if the guide words are not matched, and scoring according to the variant features to generate scoring values.
A second determining module 600, configured to determine that the digital number paragraph is a variant cheating field if the score value is greater than a preset threshold.
On the basis of fig. 4, the identification apparatus of the variant cheating field shown in fig. 5 further includes: a conversion module 700 and a processing module 800.
The conversion module 700 is configured to perform digital variant normalization on the text to be recognized.
Further, the conversion module 700 is specifically configured to: and carrying out digital variant normalization on the text to be recognized according to the variant contact information database.
Further, the conversion module 700 is specifically configured to:
converting characters in a text to be recognized into character pictures;
comparing the similarity of the character picture and the digital picture to generate a similarity value;
and converting characters corresponding to the character pictures with the similarity values larger than the preset similarity threshold value into numbers corresponding to the corresponding digital pictures.
Further, the extraction module 200 is specifically configured to:
acquiring the position of a special character in a text to be recognized;
extracting character strings before and after the special characters according with preset rules, and adding the special characters and the character strings before and after the special characters according with the preset rules into the digital number paragraph.
Further, the preset rule is as follows: judging whether characters with preset character intervals are numbers forwards or backwards by taking the special characters as centers; if yes, adding characters between the special characters and characters with preset character intervals into the number paragraph.
A processing module 800, configured to remove an interference symbol in a number segment.
Further, the scoring module 500 is specifically configured to:
performing pinyin normalization on the digital number paragraphs;
extracting variant features from the pinyin normalized number paragraph;
abstract features are extracted from the pinyin normalized number paragraph;
scoring is performed based on the variant features and the abstract features to generate scoring values.
It should be noted that the explanation of the identification method for the variant cheating field in the foregoing embodiment is also applicable to the identification apparatus for the variant cheating field in this embodiment, and details are not repeated here.
In summary, the device for identifying variant cheating fields according to the embodiments of the present invention extracts a number paragraph from a text to be identified by obtaining the text to be identified, further performs variant word conversion on words in the number paragraph and performs leader word matching, determines that the number paragraph is the variant cheating field when the leader word is matched, extracts variant features from the number paragraph when the leader word is not matched, performs scoring according to the variant features to generate a score value, and determines that the number paragraph is the variant cheating field when the score value is greater than a preset threshold value. Therefore, the problems that the discontinuous digital fragments cannot be identified and the variant cheating fields cannot be matched and identified without the guide words in the related technology are solved, and the accuracy of identifying the variant cheating fields is improved.
In order to implement the above embodiments, the present invention further provides a computer device, including a processor and a memory; wherein the processor runs a program corresponding to the executable program code by reading the executable program code stored in the memory, so as to implement the identification method of the variant cheating field according to any one of the preceding embodiments.
In order to implement the above embodiments, the present invention further proposes a computer program product, wherein instructions of the computer program product, when executed by a processor, implement the method for identifying a variant cheating field according to any of the preceding embodiments.
In order to implement the above embodiments, the present invention further proposes a non-transitory computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for identifying a variant cheating field according to any of the preceding embodiments.
FIG. 6 illustrates a block diagram of an exemplary computer device suitable for use to implement embodiments of the present invention. The computer device 12 shown in FIG. 6 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 6, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus, to name a few.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, and commonly referred to as a "hard drive"). Although not shown in FIG. 6, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, for example, implementing the methods mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (14)

1. A method for identifying variant cheating fields, comprising:
acquiring a text to be identified;
acquiring the position of a special character in the text to be recognized, and, if the character located a preset character interval forward or backward from the special character, taking the special character as a center, is a number, adding the characters between the special character and the character at the preset character interval into a number paragraph;
carrying out variant character conversion on characters in the digital number paragraph and carrying out leading word matching;
if the guide word is matched, judging that the digital number paragraph is a variant cheating field;
if the guide word is not matched, extracting variant features from the number paragraph, and scoring according to the variant features to generate a scoring value;
and if the score value is larger than a preset threshold value, judging that the digital number paragraph is a variant cheating field.
2. The method of identifying a variant cheating field according to claim 1, further comprising:
and carrying out digital variant normalization on the text to be recognized.
3. The method for identifying variant cheating fields according to claim 2, wherein said numerically variant normalizing said text to be identified comprises:
and carrying out digital variant normalization on the text to be recognized according to a variant contact information database.
4. The method of identifying a variant cheating field according to claim 3, further comprising:
converting characters in the text to be recognized into character pictures;
comparing the character picture with the digital picture to generate a similarity value;
and converting the characters corresponding to the character pictures with the similarity values larger than the preset similarity threshold value into the numbers corresponding to the corresponding digital pictures.
5. The method of identifying a variant cheating field according to claim 1, further comprising:
and removing the interference symbols in the number paragraph.
6. The method of identifying variant cheating fields according to claim 1, wherein said extracting variant features from said number paragraph and scoring based on said variant features to generate a score value comprises:
performing pinyin normalization on the digital number paragraph;
extracting variant features from the pinyin normalized number paragraph;
abstract features are extracted from the pinyin normalized number paragraph;
scoring according to the variant features and the abstract features to generate the score value.
7. An apparatus for identifying variant cheating fields, comprising:
the acquisition module is used for acquiring a text to be recognized;
the extraction module is used for acquiring the position of a special character in the text to be recognized, and, if the character located a preset character interval forward or backward from the special character, taking the special character as a center, is a number, adding all the characters between the special character and the character at the preset character interval into a number paragraph;
the matching module is used for performing variant character conversion on characters in the digital number paragraph and performing leading word matching;
the first judgment module is used for judging the digital number paragraph as a variant cheating field if the guide word is matched;
the scoring module is used for extracting variant features from the digital number paragraphs if the guide words are not matched, and scoring according to the variant features to generate scoring values;
and the second judging module is used for judging the digital number paragraph as a variant cheating field if the score value is greater than a preset threshold value.
8. The apparatus for identifying variant cheating fields of claim 7, further comprising:
and the conversion module is used for carrying out digital variant normalization on the text to be recognized.
9. The apparatus for identifying variant cheating fields of claim 8, wherein the conversion module is specifically configured to:
and carrying out digital variant normalization on the text to be recognized according to a variant contact information database.
10. The apparatus for identifying variant cheating fields of claim 9, wherein the conversion module is specifically configured to:
converting characters in the text to be recognized into character pictures;
comparing the character picture with the digital picture to generate a similarity value;
and converting the characters corresponding to the character pictures with the similarity values larger than the preset similarity threshold value into the numbers corresponding to the corresponding digital pictures.
11. The apparatus for identifying variant cheating fields of claim 7, further comprising:
and the processing module is used for removing the interference symbols in the digital number paragraph.
12. The apparatus for identifying variant cheating fields of claim 7, wherein said scoring module is specifically configured to:
performing pinyin normalization on the digital number paragraph;
extracting variant features from the pinyin normalized number paragraph;
abstract features are extracted from the pinyin normalized number paragraph;
scoring according to the variant features and the abstract features to generate the score value.
13. A computer device comprising a processor and a memory;
wherein the processor executes a program corresponding to the executable program code by reading the executable program code stored in the memory for implementing the variant cheating field identification method of any of claims 1-6.
14. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements a method of identifying variant cheating fields according to any of claims 1-6.
CN201810907161.7A 2018-08-10 2018-08-10 Method, device and equipment for identifying variant cheating fields Active CN109241523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810907161.7A CN109241523B (en) 2018-08-10 2018-08-10 Method, device and equipment for identifying variant cheating fields

Publications (2)

Publication Number Publication Date
CN109241523A CN109241523A (en) 2019-01-18
CN109241523B true CN109241523B (en) 2020-12-11

Family

ID=65070547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810907161.7A Active CN109241523B (en) 2018-08-10 2018-08-10 Method, device and equipment for identifying variant cheating fields

Country Status (1)

Country Link
CN (1) CN109241523B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110085224B (en) * 2019-04-10 2021-06-01 深圳康佳电子科技有限公司 Intelligent terminal whole-course voice control processing method, intelligent terminal and storage medium
CN110298020B (en) * 2019-05-30 2023-05-16 北京百度网讯科技有限公司 Text anti-cheating variant reduction method and equipment, and text anti-cheating method and equipment
CN110717328B (en) * 2019-07-04 2021-06-18 北京达佳互联信息技术有限公司 Text recognition method and device, electronic equipment and storage medium
CN112784592A (en) * 2019-11-11 2021-05-11 四川睿象科技有限公司 Method for extracting effective alarm data based on natural language features
CN113282746B (en) * 2020-08-08 2023-05-23 西北工业大学 Method for generating variant comment countermeasure text of network media platform
CN112201225B (en) * 2020-09-30 2024-02-02 北京大米科技有限公司 Corpus acquisition method and device, readable storage medium and electronic equipment
CN113408270B (en) * 2021-06-10 2023-02-10 广州三七极创网络科技有限公司 Variant text recognition method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729520A (en) * 2008-10-28 2010-06-09 北京大学 Method and device for detecting sensitive information
CN102184188A (en) * 2011-04-15 2011-09-14 百度在线网络技术(北京)有限公司 Method and equipment for determining sensitivity of target text
CN102591854A (en) * 2012-01-10 2012-07-18 凤凰在线(北京)信息技术有限公司 Advertisement filtering system and advertisement filtering method specific to text characteristics
CN103064850A (en) * 2011-10-20 2013-04-24 腾讯科技(深圳)有限公司 Method and system of digging cheating data
CN103514174A (en) * 2012-06-18 2014-01-15 北京百度网讯科技有限公司 Text categorization method and device
CN104050556A (en) * 2014-05-27 2014-09-17 哈尔滨理工大学 Feature selection method and detection method of junk mails
CN106407324A (en) * 2016-08-31 2017-02-15 北京城市网邻信息技术有限公司 Method and device for recognizing contact information

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
CN101729520A (en) * 2008-10-28 2010-06-09 北京大学 Method and device for detecting sensitive information
CN102184188A (en) * 2011-04-15 2011-09-14 百度在线网络技术(北京)有限公司 Method and equipment for determining sensitivity of target text
CN103064850A (en) * 2011-10-20 2013-04-24 腾讯科技(深圳)有限公司 Method and system of digging cheating data
CN102591854A (en) * 2012-01-10 2012-07-18 凤凰在线(北京)信息技术有限公司 Advertisement filtering system and advertisement filtering method specific to text characteristics
CN103514174A (en) * 2012-06-18 2014-01-15 北京百度网讯科技有限公司 Text categorization method and device
CN104050556A (en) * 2014-05-27 2014-09-17 哈尔滨理工大学 Feature selection method and detection method of junk mails
CN106407324A (en) * 2016-08-31 2017-02-15 北京城市网邻信息技术有限公司 Method and device for recognizing contact information

Non-Patent Citations (1)

Title
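Bayesian mail filtering model based on Chinese variant-word matching (基于中文变形词匹配的贝叶斯邮件过滤模型); Wang Xia et al.; Computer Applications and Software (《计算机应用与软件》); 2010-01-31; Vol. 27, No. 1; pp. 105-107, 130 *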

Also Published As

Publication number Publication date
CN109241523A (en) 2019-01-18

Similar Documents

Publication Publication Date Title
CN109241523B (en) Method, device and equipment for identifying variant cheating fields
WO2019184217A1 (en) Hotspot event classification method and apparatus, and storage medium
CN108984530B (en) Detection method and detection system for network sensitive content
WO2019153605A1 (en) Identification method for sensitive information in text, electronic device, and readable storage medium
US8380488B1 (en) Identifying a property of a document
JP5173141B2 (en) Efficient language identification
CN109271641B (en) Text similarity calculation method and device and electronic equipment
CN108304377B (en) Extraction method of long-tail words and related device
CN101785050B (en) Voice recognition correlation rule learning system, voice recognition correlation rule learning program, and voice recognition correlation rule learning method
Hossain et al. Auto-correction of English to Bengali transliteration system using Levenshtein distance
CA2500467A1 (en) Scalable neural network-based language identification from written text
CN109271524B (en) Entity linking method in knowledge base question-answering system
WO2017005207A1 (en) Input method, input device, server and input system
CN111354340B (en) Data annotation accuracy verification method and device, electronic equipment and storage medium
CN113988061A (en) Sensitive word detection method, device and equipment based on deep learning and storage medium
CN111191008A (en) Password guessing method based on numerical factor reverse order
CN111444905B (en) Image recognition method and related device based on artificial intelligence
CN112329390A (en) Chinese word similarity detection algorithm based on sound, shape and meaning
CN114861635B (en) Chinese spelling error correction method, device, equipment and storage medium
CN111797217A (en) Information query method based on FAQ matching model and related equipment thereof
CN111339778A (en) Text processing method, device, storage medium and processor
CN113553410B (en) Long document processing method, processing device, electronic equipment and storage medium
CN114548075A (en) Text processing method, text processing device, storage medium and electronic equipment
CN107861941B (en) User nickname authenticity evaluation method, storage medium, electronic device and system
CN113793611A (en) Scoring method, scoring device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant