CN109657228B - Sensitive text determining method and device - Google Patents

Publication number
CN109657228B
Authority
CN
China
Prior art keywords
target text
matching
sensitive
text
preset
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811290233.4A
Other languages
Chinese (zh)
Other versions
CN109657228A (en
Inventor
袁喆
张晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN201811290233.4A priority Critical patent/CN109657228B/en
Publication of CN109657228A publication Critical patent/CN109657228A/en
Application granted granted Critical
Publication of CN109657228B publication Critical patent/CN109657228B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present disclosure provide a sensitive text determining method and device. The method comprises: determining whether at least one character in the target text belongs to a preset blacklist; when no character belongs to the preset blacklist, matching the target text against a preset whitelist and counting the total length of the matched characters; determining a matching parameter between the target text and the whitelist from the total length of the matched characters and the length of the target text; and, when the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text. Because the matching parameter is calculated from the matched length and the text length, and sensitivity is determined from that parameter, the accuracy of sensitive text recognition is improved; and because sensitivity can be determined from a whitelist and a blacklist that contain relatively little data, recognition speed is effectively improved.

Description

Sensitive text determining method and device
Technical Field
The embodiment of the disclosure relates to the technical field of text matching, in particular to a sensitive text determining method and device.
Background
The commodity network sales platform is convenient for people to live. In order to ensure the healthy development of the platform and reduce the operation risk, sensitive information in commodity information needs to be identified and filtered.
In the prior art, improved sensitive text determination schemes achieve better matching efficiency than identifying sensitive information by full-text matching. They recognize sensitive information in a text mainly by means of a string matching algorithm. For example, the KMP algorithm determines whether a target string contains a reference string by repeatedly shifting the reference string along the target string. When the reference string is identical to some fragment of the target string, the target string is determined to contain the reference string and the match succeeds; when the reference string is not identical to any fragment of the target string, the target string is determined not to contain the reference string.
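As a point of reference for the prior-art approach, the following is a minimal sketch of KMP substring matching. The function names `build_failure_table` and `kmp_contains` are chosen for this illustration and do not come from the patent.

```python
def build_failure_table(pattern: str) -> list[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_contains(target: str, reference: str) -> bool:
    """Return True if `reference` is identical to some fragment of `target`."""
    if not reference:
        return True
    failure = build_failure_table(reference)
    k = 0  # number of reference characters matched so far
    for ch in target:
        # On a mismatch, shift the reference string using the failure table
        # instead of restarting the comparison from scratch.
        while k > 0 and ch != reference[k]:
            k = failure[k - 1]
        if ch == reference[k]:
            k += 1
        if k == len(reference):
            return True
    return False
```

A scheme built only on this primitive reports a text as sensitive on the first successful match, which is the limitation the next paragraph describes.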
It can be seen that the above scheme treats a text as sensitive as soon as a match succeeds. The algorithm is simple, so its accuracy is low; in addition, because the text is matched against the reference strings one by one, recognition is slow.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for determining a sensitive text, which are beneficial to improving the accuracy of determining the sensitive text.
According to a first aspect of embodiments of the present disclosure, there is provided a sensitive text determination method, the method comprising:
determining whether at least one character in the target text belongs to a preset blacklist;
under the condition that no character belongs to the preset blacklist, matching the target text according to a preset whitelist, and counting the total length of the matched characters;
determining matching parameters of the target text and the white list according to the total length of the matched characters and the length of the target text;
and under the condition that the matching parameter is larger than a preset matching parameter threshold value, determining that the target text is a non-sensitive text.
According to a second aspect of embodiments of the present disclosure, there is provided a sensitive text determining apparatus, the apparatus comprising:
the blacklist matching module is used for determining whether at least one character in the target text belongs to a preset blacklist;
the white list matching module is used for matching the target text according to a preset white list and counting the total length of the matched characters under the condition that no characters belong to the preset black list;
the matching parameter determining module is used for determining the matching parameters of the target text and the white list according to the total length of the matched characters and the length of the target text;
and the sensitivity determining module is used for determining that the target text is a non-sensitive text under the condition that the matching parameter is larger than a preset matching parameter threshold value.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the aforementioned sensitive text determination method when executing the program.
According to a fourth aspect of embodiments of the present disclosure, there is provided a readable storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the aforementioned sensitive text determination method.
The embodiments of the present disclosure provide a sensitive text determining method and device. The method comprises: determining whether at least one character in the target text belongs to a preset blacklist; when no character belongs to the preset blacklist, matching the target text against a preset whitelist and counting the total length of the matched characters; determining a matching parameter between the target text and the whitelist from the total length of the matched characters and the length of the target text; and, when the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text. Because the matching parameter is calculated from the matched length and the text length, and sensitivity is determined from that parameter, the accuracy of sensitive text recognition is improved; and because sensitivity can be determined from a whitelist and a blacklist that contain relatively little data, recognition speed is effectively improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present disclosure; a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart showing specific steps of a sensitive text determining method according to a first embodiment of the present disclosure;
FIG. 2 is a flowchart showing specific steps of a sensitive text determining method according to a second embodiment of the present disclosure;
FIG. 3 is a block diagram of a sensitive text determining apparatus according to a third embodiment of the present disclosure;
FIG. 4 is a block diagram of a sensitive text determining apparatus according to a fourth embodiment of the present disclosure;
FIG. 5 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this disclosure without inventive effort fall within the scope of protection of this disclosure.
Example 1
Referring to fig. 1, a flowchart illustrating specific steps of a sensitive text determination method according to an embodiment of the present disclosure is shown.
Step 101, determining whether at least one character in the target text belongs to a preset blacklist.
The blacklist stores severely sensitive objects, for example words related to terrorism, drugs, or violence.
Because blacklist entries are sensitive regardless of the business scenario and their impact is severe, they are filtered first.
Specifically, each word in the blacklist is matched with a target text, if at least one blacklist object exists in the target text, the matching is successful, and the target text is confirmed to be a sensitive text; if the blacklist object does not exist in the target text, the matching fails, and the target text needs to be continuously processed to identify whether the target text is sensitive text or not.
Step 102, under the condition that no character belongs to the preset blacklist, matching the target text according to a preset whitelist, and counting the total length of the matched characters.
The whitelist stores text information that is exempt from sensitivity review; it can be set manually or mined from a standard database. For example, in a typical application scenario of the embodiments of the invention, a network sales platform performs sensitivity identification on commodity text. Commodities that the platform deems certain to contain no sensitive information, or commodities uploaded by merchants that cooperate with the platform frequently, may be added to a standard commodity library, and the whitelist can then be mined from that library.
It can be understood that the simplest white list matching method is to match each object in the white list with the target text, and if at least one object in the white list exists in the target text, the matching is successful; otherwise, the matching fails.
Step 103, determining the matching parameter between the target text and the whitelist according to the total length of the matched characters and the length of the target text.
In the embodiment of the invention, the degree of matching is described by a matching parameter. For example, when matching fails, the matching parameter is 0; when matching succeeds, the more objects are matched and the longer they are, the larger the matching parameter; the fewer objects are matched and the shorter they are, the smaller the matching parameter.
It should be noted that the embodiment of the present invention may be applied to various scenes of text matching, and is not limited to sensitive recognition of commodity text.
Step 104, under the condition that the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text.
The matching parameter threshold is used to judge whether sensitivity recognition needs to be performed on the target text. It can be set according to the actual application scenario, and the embodiment of the invention does not limit it.
It follows that when the matching parameter is smaller than the matching parameter threshold, the target text is not exempt from sensitivity recognition, and whether it is sensitive text needs to be judged further; when the matching parameter is greater than or equal to the matching parameter threshold, the target text is exempt from sensitivity recognition and is directly treated as non-sensitive text.
For a typical application scenario of the embodiment of the invention, when the network sales platform determines that the target text of the commodity is a sensitive text, the commodity audit is not passed, and the merchant is prompted to modify commodity information; and when the target text about the commodity is not the sensitive text, allowing the commodity to enter the platform after the commodity is checked and approved.
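The decision flow of steps 101 to 104 can be sketched as follows. This is only an illustration under stated assumptions: the function name `classify`, the three result labels, and the use of the simplest whitelist matching (every whitelist object found in the text contributes its full length) are choices made for this sketch, and the strict comparison follows the wording of step 104.

```python
def classify(target_text: str, blacklist: set[str],
             whitelist: set[str], threshold: float) -> str:
    """Return 'sensitive', 'non_sensitive', or 'needs_further_check'."""
    # Step 101: any blacklist object in the text makes it sensitive.
    if any(word in target_text for word in blacklist):
        return "sensitive"
    if not target_text:
        return "needs_further_check"
    # Step 102: total length of whitelist objects found in the text.
    # Assumption: overlapping objects are each counted in full.
    matched_len = sum(len(entry) for entry in whitelist if entry in target_text)
    # Step 103: matching parameter as matched length over text length.
    match_para = matched_len / len(target_text)
    # Step 104: above the threshold the text is exempt from further checks.
    if match_para > threshold:
        return "non_sensitive"
    return "needs_further_check"
```

For instance, with the hypothetical whitelist `{"mineral water", "500 ml"}` and threshold 0.8, the text "mineral water 500 ml" has 19 of its 20 characters matched and is classified as non-sensitive.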
In summary, the embodiments of the present disclosure provide a method for determining sensitive text, including: determining whether at least one character in the target text belongs to a preset blacklist; when no character belongs to the preset blacklist, matching the target text against a preset whitelist and counting the total length of the matched characters; determining a matching parameter between the target text and the whitelist from the total length of the matched characters and the length of the target text; and, when the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text. Because the matching parameter is calculated from the matched length and the text length, and sensitivity is determined from that parameter, the accuracy of sensitive text recognition is improved; and because sensitivity can be determined from a whitelist and a blacklist that contain relatively little data, recognition speed is effectively improved.
Example 2
Referring to fig. 2, a flowchart illustrating specific steps of a method for determining a sensitive text according to a second embodiment of the present disclosure is shown.
Step 201, determining whether at least one character in the target text belongs to a preset blacklist.
This step may refer to the detailed description of step 101, and will not be described herein.
Step 202, when there is no character belonging to the preset blacklist and the subject information is not in the subject whitelist, the total length of the matched characters is 0.
The subject information is key information for distinguishing different target texts, for example, for a target text based on a commodity, the commodity identification or the name is subject information, so that the commodity identification or the name free from sensitive identification is stored in a subject white list.
It will be appreciated that when the subject information is not within the whitelist, the target text fails to match the whitelist.
Optionally, in another embodiment of the present disclosure, the subject information includes a name, the subject whitelist includes a name whitelist, the associated information includes brands and specifications, and the associated whitelist includes a brand whitelist and a specification whitelist.
Specifically, the whitelist may be further divided into categories according to the application scenario; for commodities, for example, it may be divided into a commodity whitelist, a brand whitelist, and a specification whitelist. The commodity whitelist identifies commodities that do not need sensitivity identification, such as mineral water, rice, and other commodities that cannot carry sensitive information; the brand whitelist identifies brands that do not need sensitivity identification, e.g., famous brands such as Master and Wangwang, or inspection-exempt brands approved by other countries; the specification whitelist may be set in terms of weight, quantity, volume, and so on, for example 20 kg or less, a count of 100 or fewer, or 500 ml or less.
Step 203, matching the associated information according to the associated white list to obtain successfully matched associated information when no character belongs to a preset black list and the subject information is in the subject white list.
The associated information is information in the text other than the subject information; for a commodity, for example, it may be the brand, the specification, and the like. Accordingly, the associated whitelist includes a brand whitelist, which stores brands exempt from sensitivity identification, and a specification whitelist, which stores specifications exempt from sensitivity identification.
Specifically, each piece of associated information in the associated whitelist is matched against the target text. If the target text contains the associated information, that associated information is successfully matched; if the target text does not contain it, that associated information fails to match.
Step 204, calculating the sum of the lengths of the successfully matched associated information and the subject information to obtain the total length of the matched characters.
It will be appreciated that the length is expressed in terms of the number of characters.
Specifically, for a target text, the total length MatchLen of the matched characters can be obtained according to the following calculation formula:
$$\mathrm{MatchLen} = \mathrm{MLen} + \sum_{i=1}^{M} \mathrm{Len1}_i \qquad (1)$$
where MLen is the length of the successfully matched subject information, M is the number of pieces of successfully matched associated information, and $\mathrm{Len1}_i$ is the length of the i-th piece of successfully matched associated information.
It is understood that when several pieces of subject information match the target text of the commodity at the same time, the length of the subject information may also be the sum of the lengths of those pieces of subject information.
Step 205, calculating the ratio of the total length of the matched characters to the length of the target text to obtain the matching parameter.
Specifically, the matching parameter MatchPara may be calculated according to the following formula:
$$\mathrm{MatchPara} = \frac{\mathrm{MatchLen}}{L} \qquad (2)$$
where L is the length of the target text and may be represented by the number of characters.
It can be understood that in practical application, the ratio can be further transformed to serve as a matching parameter, so that the value range of the matching parameter can be flexibly adjusted.
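Steps 202 to 205 can be illustrated with a short sketch. It assumes that the subject information (for example a commodity name) has already been extracted from the target text; the function name `match_parameter` and the sample whitelists are hypothetical.

```python
def match_parameter(target_text: str, subject: str,
                    subject_whitelist: set[str],
                    associated_whitelist: list[str]) -> float:
    """Ratio of the matched length to the target text length."""
    # Step 202: a subject outside the subject whitelist gives matched length 0.
    if not target_text or subject not in subject_whitelist:
        return 0.0
    # Step 203: associated information (brand, specification) found in the text.
    matched = [info for info in associated_whitelist if info in target_text]
    # Step 204: subject length plus the lengths of matched associated information.
    match_len = len(subject) + sum(len(info) for info in matched)
    # Step 205: ratio to the length of the target text, in characters.
    return match_len / len(target_text)
```

With the target text "Acme mineral water 500 ml" (25 characters), subject "mineral water" (13 characters), and matched associated information "Acme" (4) and "500 ml" (6), the matched length is 23 and the matching parameter is 23 / 25 = 0.92.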
Step 206, under the condition that the matching parameter is greater than a preset matching parameter threshold, determining that the target text is non-sensitive text.
This step may refer to the detailed description of step 104, and will not be described herein.
Step 207, when the matching parameter is less than or equal to a preset matching parameter threshold, word segmentation is performed on the pinyin characters in the target text by using a pre-generated pinyin probability matrix.
The matching parameter threshold is used for judging the matching degree of the target text and the white list, and can be set according to the actual application scene, and the value of the matching parameter threshold is not limited in the embodiment of the invention.
The pinyin probability matrix gives the probability that two syllables are spliced together in practical use; for example, the probability of splicing "h" and "u" into "hu" is 0.7, and the probability of splicing "h" and "ang" into "hang" is 0.8. Word segmentation can therefore be performed by choosing the splice with the highest probability, so "hang" is selected as the word segmentation result.
Optionally, in another embodiment of the present disclosure, the step 207 comprises sub-steps 2071 to 2073:
Sub-step 2071, performing word segmentation on the pinyin characters in the target text to obtain word segmentation groups, wherein each word segmentation group comprises at least one syllable group formed by splicing syllables.
In particular, a syllable concatenation table may be used to determine possible concatenation results. For example, for "huanghe", the possible word segmentation results are "hu ang he", "huang he", while "hu an ghe", "h ua ng he" are all impossible word segmentation results.
Wherein, for "huang he", two syllable groups "huang" and "he" are separated; for "hu ang he", the syllable groups are three syllable groups of "hu", "ang", "he".
Sub-step 2072, for each word segmentation group, determining the word segmentation accuracy of the group by using the pre-generated pinyin probability matrix.
Specifically, for each syllable group in a word segmentation group, the probability of splicing the first syllable with the second is looked up in the pinyin probability matrix, then the probability of splicing the second with the third, and so on; the splicing probabilities of adjacent syllables are multiplied to obtain the probability of the syllable group. Finally, the probabilities of all syllable groups are multiplied to obtain the word segmentation accuracy of the word segmentation group.
Sub-step 2073, replacing the pinyin character with the word segmentation group with the greatest word segmentation accuracy.
In practical application, the character groups corresponding to different word segmentations, or only those with higher word segmentation accuracy, can also be added after the pinyin characters. If any one of these character groups is matched successfully, this counts as one successful match; if none of them matches, the match fails.
Step 208, performing generalization processing on the target text, where the generalization processing includes: character merging, character splitting, character order expansion, and character conversion.
In the embodiment of the invention, in order to improve the accuracy of sensitivity identification, the target text is generalized before sensitivity identification is performed, which eliminates inaccurate information in it. The length of the target text after generalization may differ from its length before. After generalization the target text is in a unified pinyin format, and non-standard characters that may exist in it are corrected.
Stop words and invalid characters in the target text may also be filtered prior to the generalization process.
A stop word is a character or word that carries little meaning and is automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency; examples include modal particles and pause words.
An invalid character is a character or word, other than a stop word, that does not affect the meaning of the text. Which characters are invalid may differ between scenarios.
In practical applications, a stop word library may be used to identify stop words in the target text. Different invalid character libraries can likewise be set up for different application scenarios.
In the embodiment of the invention, the interference of the stop words and the invalid characters on the text matching can be reduced by filtering the stop words and the invalid characters, and the efficiency and the accuracy of the text matching can be effectively improved.
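A minimal sketch of this filtering step is shown below. The stop word library and invalid character library are passed in as plain Python sets; their contents, like the function name `filter_text`, are assumptions made for illustration.

```python
def filter_text(text: str, stop_words: set[str], invalid_chars: set[str]) -> str:
    """Remove stop words, then drop individual invalid characters."""
    # Remove longer stop words first so that a shorter stop word contained
    # in a longer one does not leave fragments behind.
    for word in sorted(stop_words, key=len, reverse=True):
        text = text.replace(word, "")
    return "".join(ch for ch in text if ch not in invalid_chars)
```

For example, `filter_text("well, fresh rice!!", {"well, "}, {"!"})` yields `"fresh rice"`.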
Specifically, the step of character merging includes: merging adjacent Chinese characters in the target text by using a word splitting dictionary, and adding the merged character to the target text when merging succeeds.
The word splitting dictionary records Chinese characters that can be split into two or three parts, and is divided into two types: vertical splitting and horizontal splitting. For example, "贾" can be split vertically into "西" (west) and "贝" (shellfish), and "楼" (building) can be split horizontally into "木" (wood) and "娄".
Specifically, adjacent Chinese characters can be merged left-to-right or top-to-bottom, and the merged character is checked against the word splitting dictionary. If it is in the dictionary, merging succeeds and the merged character is added to the target text, for example after the corresponding characters before merging.
The step of character splitting includes: splitting each Chinese character in the target text by using the word splitting dictionary, and adding the split characters to the target text when splitting succeeds.
Specifically, it may be determined whether a Chinese character exists in the word splitting dictionary; if so, the characters obtained by splitting it are added to the target text, for example after the character before splitting.
In addition, the split or merged characters can be marked, so that during matching, if the characters before splitting or merging and the characters after splitting or merging are both matched successfully, this counts as a single successful match, and the characters before splitting or merging are taken as the successfully matched characters.
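The merging and splitting steps can be sketched with a toy word splitting dictionary. The two entries below ("贾" as "西" over "贝", "楼" as "木" beside "娄") are illustrative entries chosen for this sketch, the sketch handles two-part characters only, and the marking of added characters described above is omitted.

```python
# Toy word splitting dictionary: whole character -> its two parts.
SPLIT_DICT = {"贾": ("西", "贝"), "楼": ("木", "娄")}
# Reverse view used for merging: pair of parts -> whole character.
MERGE_DICT = {parts: whole for whole, parts in SPLIT_DICT.items()}


def add_split_characters(text: str) -> str:
    """Append the parts of each splittable character right after it."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in SPLIT_DICT:
            out.extend(SPLIT_DICT[ch])
    return "".join(out)


def add_merged_characters(text: str) -> str:
    """Append the merged character right after each mergeable adjacent pair."""
    out = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in MERGE_DICT:
            out.extend(pair)              # keep the original two characters
            out.append(MERGE_DICT[pair])  # then add the merged character
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Because both functions only add characters, the original text remains matchable and the processed text may be longer than the original.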
The step of character order expansion includes: for adjacent Chinese characters in the target text, expanding them into groups of Chinese characters in different orders and adding these groups to the target text.
Specifically, several Chinese characters are selected as a group and reordered. For example, for "yellow crane" it can be extended to "yellow crane", "Huang Lou", "yellow crane".
In the embodiment of the invention, the number of the recombined Chinese characters can be selected according to the actual application scene, and typically, two to three adjacent Chinese characters can be selected.
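Character order expansion over a sliding window of adjacent characters can be sketched with `itertools.permutations`. The function name `expand_order` and the choice to return the reordered groups as a list (to be appended to the target text by the caller) are assumptions of this sketch.

```python
from itertools import permutations


def expand_order(text: str, group_size: int = 2) -> list[str]:
    """Return every reordering of each run of `group_size` adjacent
    characters, excluding the original order and duplicates."""
    variants: list[str] = []
    for start in range(len(text) - group_size + 1):
        window = text[start:start + group_size]
        for perm in permutations(window):
            reordered = "".join(perm)
            if reordered != window and reordered not in variants:
                variants.append(reordered)
    return variants
```

A group of n characters has n! orderings, which is one practical reason to keep the group to two or three characters.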
It will be appreciated that the order of the above word segmentation, Chinese character splitting and merging, and character order expansion may be adjusted as appropriate, or only one or more of them may be selected.
The step of character conversion includes: A1, replacing emoticons in the target text with the corresponding Chinese characters.
The emoticons may include graphics such as a smiling face, calling, and the like.
In practical application, each emoji can be assigned with its corresponding kanji character when defined, and an emoji library is generated. Thus, when the user inputs the Chinese character, the corresponding emoticons can be associated, or the corresponding Chinese character can be found through the emoticons.
A2, replacing the Chinese characters in the target text with the corresponding pinyin characters.
Specifically, the pinyin characters corresponding to a Chinese character may be looked up in a dictionary.
It is understood that the Chinese characters here include both the original Chinese characters in the target text and the Chinese characters converted from emoticons in step A1.
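Steps A1 and A2 amount to two table lookups applied in sequence. The sketch below uses two tiny hypothetical tables; a real emoticon library and pinyin dictionary would be far larger, and this sketch ignores characters with more than one reading.

```python
# Hypothetical emoticon library: emoticon -> corresponding Chinese character.
EMOTICON_TO_HANZI = {"😊": "笑"}
# Hypothetical pinyin dictionary: Chinese character -> toneless pinyin.
HANZI_TO_PINYIN = {"笑": "xiao", "黄": "huang", "河": "he"}


def convert_characters(text: str) -> str:
    """A1: emoticons to Chinese characters; A2: Chinese characters to pinyin."""
    hanzi_text = "".join(EMOTICON_TO_HANZI.get(ch, ch) for ch in text)
    # Characters without a dictionary entry are kept unchanged.
    return "".join(HANZI_TO_PINYIN.get(ch, ch) for ch in hanzi_text)
```

Running A1 before A2 means the Chinese characters produced from emoticons are converted to pinyin together with the original ones.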
Step 209, matching the target text against a preset sensitive database to obtain the successfully matched sensitive words.
It is understood that, in keeping with step 208, the sensitive information in the sensitive database is represented by pinyin characters.
Specifically, matching a pinyin character in a sensitive database with a target text, if the pinyin character exists in the target text, the matching is successful, and taking a Chinese character corresponding to the pinyin character as a successfully matched sensitive word; if the pinyin character does not exist in the target text, the matching fails, the Chinese character corresponding to the pinyin character is not a sensitive word which is successfully matched, and other pinyin characters are continuously matched.
Step 210, determining the sensitive parameter of the target text according to the total length of the successfully matched sensitive words.
The sensitive parameter is related to the number and length of the successfully matched sensitive words: the more sensitive words are matched and the longer they are, the larger the sensitive parameter; the fewer sensitive words are matched and the shorter they are, the smaller the sensitive parameter.
It will be appreciated that when the number of sensitive words that match successfully is 0, the sensitive parameter may be 0.
Optionally, in another embodiment of the present invention, the step 210 includes sub-steps 2101 to 2102:
Sub-step 2101, calculating the sum of the lengths of the successfully matched sensitive words to obtain the sensitive length.
Specifically, for a target text, the sensitivity length SenLen can be obtained according to the following calculation formula:
$$\mathrm{SenLen} = \sum_{j=1}^{N} \mathrm{len2}_j$$
where N is the number of successfully matched sensitive words and len2_j is the length of the j-th successfully matched sensitive word.
Sub-step 2102, calculating the ratio of the sensitive length to the length of the target text to obtain the sensitive parameter of the target text.
Specifically, the sensitive parameter SenPara can be calculated according to the following formula:
$$\mathrm{SenPara} = \frac{\mathrm{SenLen}}{L}$$
where L is the same as L in formula (2): the length of the target text before any processing, which may be expressed as a number of characters.
Step 211, determining that the target text is a sensitive text if the sensitive parameter is greater than a preset sensitive parameter threshold.
The sensitive parameter threshold is used for determining whether the target text is a sensitive text or not, and can be set according to the actual application scene.
It can be appreciated that when the sensitive parameter is greater than or equal to the sensitive parameter threshold, the target text can be considered a sensitive text; and when the sensitive parameter is smaller than the sensitive parameter threshold, the target text is considered a non-sensitive text.
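Steps 210 and 211 amount to a length ratio followed by a threshold comparison, which might be sketched as below. The sketch assumes that the length of a sensitive word is counted in characters of the matched Chinese word, and it uses the strict "greater than" comparison stated in step 211; both choices, along with the function names, the sample text, and the threshold value 0.3, are illustrative assumptions.

```python
def sensitive_parameter(matched_words: list[str], original_text: str) -> float:
    """SenPara = SenLen / L, where SenLen sums the matched word lengths."""
    if not original_text:
        return 0.0  # assumption: an empty text yields a parameter of 0
    sen_len = sum(len(word) for word in matched_words)  # SenLen
    return sen_len / len(original_text)                 # L: unprocessed length


def is_sensitive(matched_words: list[str], original_text: str, threshold: float) -> bool:
    """Step 211: sensitive when the parameter exceeds the preset threshold."""
    return sensitive_parameter(matched_words, original_text) > threshold


text = "请加微信联系"                     # hypothetical target text, L = 6
print(sensitive_parameter(["微信"], text))  # SenLen = 2, so 2 / 6
print(is_sensitive(["微信"], text, 0.3))    # 0.3 is a hypothetical threshold
```

When no sensitive word is matched, the sum is 0 and the parameter is 0, which agrees with the remark above about zero matches.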
In summary, the embodiments of the present disclosure provide a method for determining sensitive text, including: determining whether at least one character in the target text belongs to a preset blacklist; when no character belongs to a preset blacklist and main body information is not in the main body whitelist, the total length of the matched character is 0; under the condition that no character belongs to a preset blacklist and the main body information is in the main body whitelist, matching the associated information according to the associated whitelist to obtain successfully matched associated information; calculating the sum of the lengths of the associated information and the main body information which are successfully matched to obtain the total length of the matched characters; calculating the ratio of the total length of the matched characters to the length of the target text to obtain a matching parameter; under the condition that the matching parameter is larger than a preset matching parameter threshold value, determining that the target text is a non-sensitive text; when the matching parameter is smaller than or equal to a preset matching parameter threshold value, word segmentation is carried out on pinyin characters in the target text by adopting a pre-generated pinyin probability matrix; performing generalization processing on the target text, wherein the generalization processing comprises: character merging, character splitting, character sequence expansion and character conversion; matching the target text by adopting a preset sensitive database to obtain a successfully matched sensitive word; determining the sensitive parameters of the target text according to the total length of the successfully matched sensitive words; and under the condition that the sensitive parameter is larger than a preset sensitive parameter threshold value, determining the target text as the sensitive text. 
Matching parameters can be calculated according to the matching length and the text length, and whether the text is a sensitive text is determined according to the matching parameters, which improves the recognition accuracy for sensitive text; and whether the text is sensitive can be determined using a whitelist and a blacklist with a small data volume, which effectively improves recognition speed. In addition, the method can also perform word segmentation, Chinese character splitting, Chinese character merging, and character sequence expansion on the target text, and finally uses pinyin characters uniformly for sensitivity determination, which helps to further improve recognition accuracy.
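The whitelist stage summarized above can be illustrated with a short sketch. It assumes the subject information and the associated information have already been extracted from the target text (the extraction itself is not shown), and that whitelist membership is an exact lookup. The function name, the whitelist contents, and the sample product text are hypothetical.

```python
def matching_parameter(
    text: str,
    subject: str,
    associated: list[str],
    subject_whitelist: set[str],
    associated_whitelist: set[str],
) -> float:
    """Ratio of whitelisted characters to the length of the target text."""
    # Subject not on the subject whitelist: total matched length is 0.
    if not text or subject not in subject_whitelist:
        return 0.0
    # Keep only the associated items found on the associated whitelist.
    matched = [item for item in associated if item in associated_whitelist]
    total = len(subject) + sum(len(item) for item in matched)
    return total / len(text)


# Hypothetical example: name "手机", brand "苹果", specification "128G".
param = matching_parameter(
    "苹果手机128G", "手机", ["苹果", "128G"], {"手机"}, {"苹果", "128G"}
)
print(param)
```

A text whose matching parameter exceeds the preset matching parameter threshold would be treated as non-sensitive without consulting the sensitive database, which is where the speed benefit described above comes from.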
Example III
Referring to fig. 3, a structural diagram of a sensitive text determining apparatus according to a third embodiment of the present disclosure is shown, which is specifically as follows.
The blacklist matching module 301 is configured to determine whether at least one character in the target text belongs to a preset blacklist.
The white list matching module 302 is configured to match the target text according to a preset white list and count the total length of the matched characters when no character belongs to the preset black list.
The matching parameter determining module 303 is configured to determine a matching parameter of the target text and the white list according to the total length of the matched characters and the length of the target text.
The sensitivity determination module 304 is configured to determine that the target text is a non-sensitive text if the matching parameter is greater than a preset matching parameter threshold.
In summary, an embodiment of the present disclosure provides a sensitive text determining apparatus, including: the blacklist matching module is used for determining whether at least one character in the target text belongs to a preset blacklist; the white list matching module is used for matching the target text according to a preset white list and counting the total length of the matched characters under the condition that no characters belong to the preset black list; the matching parameter determining module is used for determining the matching parameters of the target text and the white list according to the total length of the matched characters and the length of the target text; and the sensitivity determining module is used for determining that the target text is a non-sensitive text under the condition that the matching parameter is larger than a preset matching parameter threshold value. Matching parameters can be calculated according to the matching length and the text length, and whether the text is a sensitive text or not is determined according to the matching parameters, so that the recognition accuracy of the sensitive text is improved; and whether the text is sensitive or not can be determined according to the white list and the black list with smaller data quantity, so that the recognition speed can be effectively improved.
The third embodiment is a device embodiment corresponding to the first method embodiment, and the detailed description may refer to the first embodiment, which is not repeated herein.
Example IV
Referring to fig. 4, a structural diagram of a sensitive text determining apparatus according to a fourth embodiment of the present disclosure is shown, which is specifically as follows.
The blacklist matching module 401 is configured to determine whether at least one character in the target text belongs to a preset blacklist.
The white list matching module 402 is configured to match the target text according to a preset white list and count the total length of the matched characters when no characters belong to the preset black list; optionally, in an embodiment of the present disclosure, the whitelist matching module 402 includes:
A matching failure submodule 4021, configured to set the total length of the matched characters to 0 in a case where the subject information is not in the subject whitelist.
The association information matching submodule 4022 is configured to, when the subject information is on the subject whitelist, match the association information according to the association whitelist, and obtain successfully matched association information.
The matching length calculating submodule 4023 is configured to calculate a sum of lengths of the associated information and the main body information that are successfully matched to obtain a total length of the matched characters.
A matching parameter determining module 403, configured to determine a matching parameter of the target text and the whitelist according to the total length of the matched characters and the length of the target text; optionally, in an embodiment of the present disclosure, the matching parameter determining module 403 includes:
and a matching parameter calculation submodule 4031, configured to calculate a ratio of the total length of the matched character to the length of the target text, so as to obtain a matching parameter.
And the sensitivity determining module 404 is configured to determine that the target text is a non-sensitive text if the matching parameter is greater than a preset matching parameter threshold.
And the word segmentation module 405 is configured to segment the pinyin characters in the target text by using a pre-generated pinyin probability matrix when the matching parameter is less than or equal to a preset matching parameter threshold.
A generalization processing module 406, configured to generalize the target text, where the generalization processing includes: character merging, character splitting, character sequence expansion, character conversion.
And the sensitive word matching module 407 is configured to match the target text by using a preset sensitive database, so as to obtain a successfully matched sensitive word.
And the sensitive parameter determining module 408 is configured to determine a sensitive parameter of the target text according to the total length of the sensitive words successfully matched.
The second sensitivity determining module 409 is configured to determine that the target text is a sensitive text if the sensitivity parameter is greater than a preset sensitivity parameter threshold.
Optionally, in another embodiment of the present disclosure, the above-mentioned subject information includes: a name; the subject whitelist includes: a name whitelist; the association information includes: brands and specifications; and the association whitelist includes: brand whitelists and specification whitelists.
Optionally, in another embodiment of the disclosure, the word segmentation module 405 includes:
the word segmentation group generation sub-module is used for carrying out word segmentation on the pinyin characters in the target text to obtain word segmentation groups, and the word segmentation groups comprise syllable groups spliced by at least one syllable.
The word segmentation accuracy determining sub-module is used for determining word segmentation accuracy of each word segmentation group by adopting a pre-generated pinyin probability matrix.
And the word segmentation sub-module is used for replacing the pinyin characters with word segmentation groups with the maximum word segmentation accuracy.
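The three sub-modules of the word segmentation module 405 could be sketched as below. The form of the pinyin probability matrix is not given in this passage, so the sketch assumes it stores the probability of one syllable following another, with a hypothetical start marker `<s>`, and scores a segmentation as the product of those probabilities. The syllable table, the probabilities, and all function names are invented for illustration.

```python
from math import prod

SYLLABLES = {"xi", "an", "xian"}  # hypothetical, tiny syllable table
PROB = {                           # hypothetical pinyin probability matrix
    ("<s>", "xi"): 0.2,
    ("xi", "an"): 0.5,
    ("<s>", "xian"): 0.3,
}


def candidate_segmentations(pinyin: str, syllables: set[str]) -> list[list[str]]:
    """Enumerate every way to split the string into known syllables."""
    if not pinyin:
        return [[]]
    results = []
    for end in range(1, len(pinyin) + 1):
        head = pinyin[:end]
        if head in syllables:
            for tail in candidate_segmentations(pinyin[end:], syllables):
                results.append([head] + tail)
    return results


def accuracy(segmentation: list[str], prob: dict, floor: float = 1e-6) -> float:
    """Assumed score: product of syllable-to-syllable probabilities."""
    pairs = zip(["<s>"] + segmentation[:-1], segmentation)
    return prod(prob.get(pair, floor) for pair in pairs)


def best_segmentation(pinyin: str, syllables: set[str], prob: dict) -> list[str]:
    """Pick the candidate group with the maximum assumed accuracy."""
    candidates = candidate_segmentations(pinyin, syllables)
    if not candidates:
        return [pinyin]  # assumption: leave unsegmentable input unchanged
    return max(candidates, key=lambda seg: accuracy(seg, prob))


print(best_segmentation("xian", SYLLABLES, PROB))
```

With these made-up numbers the single syllable "xian" scores 0.3 against 0.2 × 0.5 = 0.1 for "xi" + "an", so the unsplit reading is selected; a different matrix could reverse that outcome.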
In summary, an embodiment of the present disclosure provides a sensitive text determining apparatus, including: the blacklist matching module is used for determining whether at least one character in the target text belongs to a preset blacklist; the white list matching module is used for matching the target text according to a preset white list and counting the total length of the matched characters under the condition that no characters belong to the preset black list; the white list matching module includes: a matching failure sub-module, configured to set the total length of the matched characters to 0 when the main body information is not in the main body whitelist; the association information matching submodule is used for matching the association information according to the association white list to obtain successfully matched association information under the condition that the main body information is in the main body white list; the matching length calculation sub-module is used for calculating the sum of the lengths of the associated information and the main body information which are successfully matched to obtain the total length of the matched characters; the matching parameter determining module is used for determining the matching parameters of the target text and the white list according to the total length of the matched characters and the length of the target text; the above-mentioned matching parameter determination module includes: the matching parameter calculation sub-module is used for calculating the ratio of the total length of the matched characters to the length of the target text to obtain matching parameters; the sensitivity determining module is used for determining that the target text is a non-sensitive text under the condition that the matching parameter is larger than a preset matching parameter threshold value; the word segmentation module is used for segmenting the pinyin characters in the target text by adopting a pre-generated pinyin probability matrix under the condition that the matching parameter is smaller than or equal to a preset matching parameter threshold value; the generalization processing module is used for generalizing the target text, and the generalization processing comprises: character merging, character splitting, character sequence expansion and character conversion; the sensitive word matching module is used for matching the target text by adopting a preset sensitive database to obtain a successfully matched sensitive word; the sensitive parameter determining module is used for determining the sensitive parameter of the target text according to the total length of the successfully matched sensitive words; and the second sensitivity determining module is used for determining that the target text is a sensitive text under the condition that the sensitive parameter is larger than a preset sensitive parameter threshold value. The matching parameters can be calculated according to the matching length and the text length, and whether the text is the sensitive text or not is determined according to the matching parameters, so that the recognition accuracy of the sensitive text is improved. In addition, the apparatus can also perform word segmentation, Chinese character splitting, Chinese character merging, and character sequence expansion on the target text, and finally uses pinyin characters uniformly for sensitivity determination, which helps to further improve recognition accuracy.
The fourth embodiment is a device embodiment corresponding to the second method embodiment, and the detailed description may refer to the second embodiment, which is not repeated herein.
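The way the modules of the fourth embodiment hand off to one another can be pictured with the following sketch. It is only a control-flow outline under stated assumptions: the whitelist and sensitive-word stages are passed in as callables that return a parameter for the text, and a blacklist hit is reported as its own outcome because this passage does not say how such a hit is handled downstream. The names `Verdict` and `determine` and the threshold values are hypothetical.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    BLACKLIST_HIT = "blacklist_hit"   # handling is left to the caller
    NON_SENSITIVE = "non_sensitive"
    SENSITIVE = "sensitive"


def determine(
    text: str,
    blacklist: set[str],
    whitelist_param: Callable[[str], float],
    sensitive_param: Callable[[str], float],
    match_threshold: float,
    sensitive_threshold: float,
) -> Verdict:
    # Blacklist matching module: does any character belong to the blacklist?
    if any(ch in blacklist for ch in text):
        return Verdict.BLACKLIST_HIT
    # Whitelist matching, matching parameter, and sensitivity determining modules.
    if whitelist_param(text) > match_threshold:
        return Verdict.NON_SENSITIVE
    # Segmentation, generalization, and sensitive word matching are assumed
    # to be wrapped inside sensitive_param.
    if sensitive_param(text) > sensitive_threshold:
        return Verdict.SENSITIVE
    return Verdict.NON_SENSITIVE


# Hypothetical stand-ins for the two scoring stages.
verdict = determine("苹果手机", {"#"}, lambda t: 1.0, lambda t: 0.0, 0.8, 0.3)
print(verdict)
```

Passing the two stages as callables keeps the outline independent of how whitelists and the sensitive database are actually stored.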
The embodiment of the disclosure further provides an electronic device, referring to fig. 5, including: a processor 501, a memory 502 and a computer program 5021 stored on the memory 502 and executable on the processor 501, the processor 501 implementing the aforementioned sensitive text determination method when executing the program.
The disclosed embodiments also provide a readable storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the aforementioned sensitive text determination method.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for the construction of such a system is apparent from the description above. In addition, the present disclosure is not directed to any particular programming language. It should be appreciated that the present disclosure described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present disclosure.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules in the apparatus of an embodiment may be adaptively changed and disposed in one or more apparatuses different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and they may be further divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Various component embodiments of the present disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in a sensitive text determining device according to embodiments of the present disclosure may be implemented in practice using microprocessors or Digital Signal Processors (DSPs). The present disclosure may also be implemented as a device or apparatus program for performing part or all of the methods described herein. Such a program embodying the present disclosure may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the disclosure, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The disclosure may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order. These words may be interpreted as names.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the systems, apparatuses and units described above may refer to the corresponding procedures in the foregoing method embodiments, and are not repeated here.
The foregoing description of the preferred embodiments of the present disclosure is not intended to limit the disclosure, but is intended to cover any modifications, equivalents, and alternatives falling within the spirit and principles of the present disclosure.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope of the disclosure shall fall within the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. A method of sensitive text determination, the method comprising:
determining whether at least one character in the target text belongs to a preset blacklist;
under the condition that no character belongs to the preset blacklist, matching the target text according to a preset whitelist, and counting the total length of the matched characters;
determining matching parameters of the target text and the white list according to the total length of the matched characters and the length of the target text;
and under the condition that the matching parameter is larger than a preset matching parameter threshold value, determining that the target text is a non-sensitive text.
2. The method according to claim 1, wherein the target text includes subject information and associated information, the whitelist includes a subject whitelist and an associated whitelist, and the step of matching the target text according to a preset whitelist and counting total lengths of the matched characters includes:
in the case that the subject information is not in the subject whitelist, the total length of the matched characters is 0;
under the condition that the main body information is in the main body white list, matching the associated information according to the associated white list to obtain successfully matched associated information;
and calculating the sum of the lengths of the associated information and the main body information which are successfully matched to obtain the total length of the matched characters.
3. The method of claim 1, wherein the step of determining the matching parameters of the target text and the whitelist based on the total length of the matched characters and the length of the target text comprises:
and calculating the ratio of the total length of the matched characters to the length of the target text to obtain the matching parameters.
4. The method according to claim 1, wherein the method further comprises:
under the condition that the matching parameter is smaller than or equal to a preset matching parameter threshold value, matching the target text by adopting a preset sensitive database to obtain a successfully matched sensitive word;
determining the sensitive parameters of the target text according to the total length of the successfully matched sensitive words;
and under the condition that the sensitive parameter is larger than a preset sensitive parameter threshold value, determining the target text as the sensitive text.
5. The method of claim 4, further comprising, prior to the step of matching the target text using a preset sensitive database to obtain a successfully matched sensitive word:
and word segmentation is carried out on the pinyin characters in the target text by adopting a pre-generated pinyin probability matrix.
6. The method of claim 5, wherein the step of word segmentation of pinyin characters in the target text using a pre-generated pinyin probability matrix comprises:
word segmentation is carried out on the Pinyin characters in the target text to obtain word segmentation groups, and the word segmentation groups comprise syllable groups spliced by at least one syllable;
for each word segmentation group, determining word segmentation accuracy of the word segmentation group by adopting a pre-generated pinyin probability matrix;
and replacing the pinyin characters with the word segmentation groups with the maximum word segmentation accuracy.
7. The method of claim 4, further comprising, prior to the step of matching the target text using a preset sensitive database to obtain a successfully matched sensitive word:
performing generalization processing on the target text, wherein the generalization processing comprises: character merging, character splitting, character sequence expansion, character conversion.
8. The method of claim 2, wherein the subject information comprises: a name; the subject whitelist comprises: a name whitelist; the associated information comprises: brands and specifications; and the associated whitelist comprises: brand whitelists and specification whitelists.
9. A sensitive text determining apparatus, the apparatus comprising:
the blacklist matching module is used for determining whether at least one character in the target text belongs to a preset blacklist;
the white list matching module is used for matching the target text according to a preset white list and counting the total length of the matched characters under the condition that no characters belong to the preset black list;
the matching parameter determining module is used for determining the matching parameters of the target text and the white list according to the total length of the matched characters and the length of the target text;
and the sensitivity determining module is used for determining that the target text is a non-sensitive text under the condition that the matching parameter is larger than a preset matching parameter threshold value.
10. An electronic device, comprising:
processor, memory and computer program stored on the memory and executable on the processor, characterized in that the processor implements the sensitive text determination method according to any of claims 1 to 8 when executing the program.
11. A readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the sensitive text determination method according to any one of the method claims 1 to 8.
CN201811290233.4A 2018-10-31 2018-10-31 Sensitive text determining method and device Active CN109657228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811290233.4A CN109657228B (en) 2018-10-31 2018-10-31 Sensitive text determining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811290233.4A CN109657228B (en) 2018-10-31 2018-10-31 Sensitive text determining method and device

Publications (2)

Publication Number Publication Date
CN109657228A CN109657228A (en) 2019-04-19
CN109657228B true CN109657228B (en) 2023-06-06

Family

ID=66110662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811290233.4A Active CN109657228B (en) 2018-10-31 2018-10-31 Sensitive text determining method and device

Country Status (1)

Country Link
CN (1) CN109657228B (en)

Also Published As

Publication number Publication date
CN109657228A (en) 2019-04-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant