CN117725161A - Method and system for identifying variant words in text and extracting sensitive words

Info

Publication number: CN117725161A
Application number: CN202311772589.2A
Authority: CN
Inventors: 黄智坤
Original assignee: Weijin Investment Co ltd
Current assignee: Weijin Investment Co ltd
Priority date: 2023-12-21
Filing date: 2023-12-21
Publication date: 2024-03-19

Abstract

A method and a system for identifying variant words in text and extracting sensitive words relate to the technical field of information processing. The method comprises the following steps: acquiring data to be verified on a shared financial platform; extracting the text content of the data to be verified, and comparing the text content with a preset sensitive word stock to obtain a first sensitive word recognition result; determining a variant word set according to each sensitive word in the first sensitive word recognition result, wherein the variant word set comprises a plurality of similar words with the same meaning, which are obtained by converting characters of each sensitive word; comparing the variant word set with the text content of the data to be verified to obtain a second sensitive word recognition result; and matching a target replacement word according to the first sensitive word recognition result and the second sensitive word recognition result, and updating the data to be verified according to the target replacement word. The effect of improving the verification accuracy of the sensitive words is achieved.

Description

Method and system for identifying variant words in text and extracting sensitive words

Technical Field

The application relates to the technical field of information processing, in particular to a method and a system for identifying variant words in text and extracting sensitive words.

Background

With the rapid development of the Internet, shared financial and tax platforms are increasingly widely used, and the text information on financial and tax matters generated by massive numbers of users may contain sensitive words that damage national interests, public interests, or the legitimate rights and interests of others. To effectively prevent such risks, auditing text information on shared financial and tax platforms has become an urgent need.

Currently, sensitive words in a text are identified mainly by establishing a fixed word stock in advance and then by means of word matching. However, in practical application, simple word matching is performed through a fixed word stock, and the accuracy of sensitive word verification is low.

Disclosure of Invention

The method and the system for identifying variant words in the text and extracting the sensitive words have the effect of improving verification accuracy of the sensitive words.

In a first aspect, the present application provides a method for identifying variant words in text and extracting sensitive words, including:

acquiring data to be verified on a shared financial platform;

extracting the text content of the data to be verified, and comparing the text content with a preset sensitive word stock to obtain a first sensitive word recognition result;

determining a variant word set according to each sensitive word in the first sensitive word recognition result, wherein the variant word set comprises a plurality of similar words with the same meaning, which are obtained by converting characters of each sensitive word;

comparing the variant word set with the text content of the data to be verified to obtain a second sensitive word recognition result;

and matching a target replacement word according to the first sensitive word recognition result and the second sensitive word recognition result, and updating the data to be verified according to the target replacement word.

By adopting the technical scheme, the data to be verified on the shared financial platform is acquired, the sensitive words in the text are extracted through two rounds of recognition, the recognition range is enlarged by matching against the variant word set, and the recognized sensitive words are finally replaced with neutral words, so that the shared data is verified safely and effectively. Specifically, the first round matches the text against the preset sensitive word stock, which identifies the obvious risk words in the text. A variant word set containing a large number of approximate words semantically similar to the sensitive words is then constructed and used for a second round of matching, which can identify attempts to hide information by using variant words and greatly improves audit coverage. All identified sensitive words are replaced with target replacement words, which removes the risk information while keeping the text complete and coherent. In this way, the sensitive information in the text can be identified accurately, the potential safety hazard of shared data is reduced, secure sharing and use of information are balanced, and the verification accuracy of sensitive words is improved.
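As an illustrative, non-limiting sketch, the two-round flow described above could look roughly like the following. The toy lexicon, variant table, replacement table, and the function name `audit_text` are hypothetical and do not appear in the application, and plain substring search stands in for whatever matching a real platform would use.

```python
# Hypothetical toy data; a real deployment would load these from a database.
SENSITIVE_LEXICON = {"bribe"}
VARIANTS = {"bribe": {"br1be", "b-r-i-b-e", "kickback"}}
REPLACEMENTS = {"bribe": "payment"}


def audit_text(text):
    """Return (updated text, first-round hits, second-round hits)."""
    # Round 1: direct match against the preset sensitive word stock.
    first = {word for word in SENSITIVE_LEXICON if word in text}

    # Build the variant word set only from the words found in round 1.
    variant_to_source = {
        variant: word for word in first for variant in VARIANTS.get(word, ())
    }

    # Round 2: match the variant word set against the same text.
    second = {variant for variant in variant_to_source if variant in text}

    # Replace every hit with the neutral word chosen for its source word.
    updated = text
    for word in first:
        updated = updated.replace(word, REPLACEMENTS[word])
    for variant in second:
        updated = updated.replace(variant, REPLACEMENTS[variant_to_source[variant]])
    return updated, first, second
```

For instance, calling `audit_text("He offered a bribe, then a second br1be.")` finds "bribe" in the first round and "br1be" in the second, and both are replaced with "payment".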

Optionally, detecting the information type of the data to be verified; if the information type of the data to be verified is text information, traversing the character string of the data to be verified, and determining the text content of the data to be verified; if the information type of the data to be verified is video information, determining a plurality of frame images according to the video stream in the data to be verified, and extracting text content in each frame image.

By adopting the technical scheme, the information type of the data to be verified is detected. If the data is of the text type, the text content is determined by traversing the character string; if the data is of the video type, multiple frame images are first extracted from the video stream, and the characters recognized in each frame image are taken as the content. Adding the information type detection and differentiated processing steps gives the scheme the flexibility to process multiple types of data: ordinary text can be checked, and risk identification can also be performed on non-text information such as videos and images, which widens the application range of the scheme. Automatic security auditing and recognition and replacement of risk words are thus realized for different types of shared information content, improving the applicability of shared information security assurance.

Optionally, reading each keyword in the text content, comparing each keyword with each sensitive word in the preset sensitive word stock to obtain at least one matched target sensitive word, and taking the target sensitive word as the first sensitive word recognition result.

By adopting the technical scheme, key words are extracted from the text contents to be verified, then the key words are compared with a preset sensitive word stock one by one, and the matching relation among the words is judged, so that the matched target sensitive words are obtained. Therefore, high false positive rate caused by full-text scanning can be avoided, the accuracy of first-round identification can be effectively improved by only comparing keywords, false alarm quantity is reduced, the identification result is more accurate, the excessive limitation on shared information can be reduced on the premise of guaranteeing the identification effect, and the balance of information security guarantee and utilization efficiency is realized.

Optionally, acquiring the pinyin of each sensitive word, and determining harmonic words (homophones) of each sensitive word according to the pinyin; matching synonyms and near-synonyms of each sensitive word according to a preset word stock; obtaining a plurality of character variant words through cross combination according to the character string length of each sensitive word and the special characters in the preset word stock; and using the harmonic words, the synonyms, the near-synonyms and the character variant words of each sensitive word as the approximate words in the variant word set.

By adopting the technical scheme, the pinyin of each sensitive word is acquired to determine its harmonic words, so that similar-sounding words can be identified. Synonyms and near-synonyms are obtained, so that vocabulary variants with similar semantics can be identified. Character variant words containing special characters are generated by character cross combination. The constructed variant word set therefore has wider coverage and can identify various means of hiding information through variations in word sense or spelling, which greatly improves the effect of the second round of recognition. Hidden risk words in the shared information are identified comprehensively and accurately, and the miss rate of the information security audit is minimized.
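One possible reading of the "cross combination" of special characters with a sensitive word is sketched below. It is an assumption for illustration only: the special character list and the function name `character_variants` are hypothetical, and only two patterns are generated (a special character in a single gap, and a special character in every gap), whereas the application does not specify which combinations are used.

```python
# Hypothetical special characters; the application stores these in a preset word stock.
SPECIAL_CHARS = ["*", "-"]


def character_variants(word, specials=SPECIAL_CHARS):
    """Interleave special characters into the gaps between a word's characters."""
    variants = set()
    for special in specials:
        # A word of length n has n - 1 gaps; insert the character into each one.
        for gap in range(1, len(word)):
            variants.add(word[:gap] + special + word[gap:])
        # Also insert the character into every gap at once.
        if len(word) > 1:
            variants.add(special.join(word))
    return variants
```

The number of variants grows with the string length of the word, which is one reason the length of each sensitive word matters when the set is built.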

Optionally, comparing the harmonic words, the synonyms, the near-synonyms and the character variant words in the variant word set with the keywords in the text content to obtain the matching degree with each keyword; and if the matching degree between at least one target variant word in the variant word set and the keywords is greater than the preset matching degree, taking the target variant word as the second sensitive word recognition result.

By adopting the technical scheme, the constructed variant word set is compared with the text keywords one by one to obtain the matching degree. And setting a matching degree threshold, judging that the matching is successful when the matching degree of the variant word and the keyword exceeds the threshold, and identifying the variant word as a second-round sensitive word result. The matching degree calculating mechanism can avoid false alarms caused by too wide matching. Only when the vocabulary relativity is high enough, the successful matching is judged, and the accuracy of the second-round recognition is improved. The method and the device realize accurate identification of hidden risk information in the text, reduce misjudgment rate of variant word matching, and achieve balance of information security guarantee and utilization efficiency.
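The application does not state how the matching degree is computed. As a minimal sketch under that caveat, the example below uses the similarity ratio of Python's standard `difflib.SequenceMatcher` as a stand-in; the threshold value and the function names are hypothetical.

```python
from difflib import SequenceMatcher

PRESET_MATCHING_DEGREE = 0.8  # hypothetical threshold, not taken from the application


def matching_degree(variant, keyword):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, variant, keyword).ratio()


def second_round(variants, keywords, threshold=PRESET_MATCHING_DEGREE):
    """Keep the variant words whose best match against any keyword exceeds the threshold."""
    hits = []
    for variant in variants:
        if any(matching_degree(variant, keyword) > threshold for keyword in keywords):
            hits.append(variant)
    return hits
```

A higher threshold trades recall for fewer false alarms, which is the balance the paragraph above describes.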

Optionally, if the matching degree of the target variant word and each keyword is greater than the preset matching degree, acquiring the vocabulary information of each keyword in the preset character string length interval; identifying word usage scenes corresponding to the keywords according to the vocabulary information, and determining the sensitivity level of the keywords in the word usage scenes; and if the target keywords with the sensitivity level larger than a preset sensitivity level exist in the keywords, the target keywords are used as the second sensitive word recognition result.

By adopting the technical scheme, when the keywords cannot be successfully matched with the variant words, related words of the keywords are obtained, and the language scene to which the keywords belong is judged. And setting the sensitivity level of the keywords according to different scenes. And if the sensitivity level of the keyword is higher than the threshold value, judging the keyword as a second round of sensitive word result. The semantic sensitivity judgment based on the vocabulary information can effectively reduce the miss judgment rate of the second-round recognition, and even if variant words are not successfully matched, hidden risk information can be recognized through word sense analysis, so that the recognition comprehensiveness is improved. The dual judgment mechanism combining vocabulary matching and semantic analysis is realized, the risk content in the shared information can be more accurately identified, the semantic-level auditing effect is achieved, and the information security is effectively ensured.
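The scene-based judgment can be pictured with a small sketch. Everything concrete here is an assumption made for illustration: the window size standing in for the "preset character string length interval", the cue words used to guess the usage scene, the sensitivity table, and the function names.

```python
WINDOW = 20  # hypothetical context size, in characters, on each side of the keyword

# Hypothetical cue words for each usage scene.
SCENE_CUES = {
    "finance": ["account", "funds"],
    "household": ["clothes", "shirts"],
}

# Hypothetical sensitivity level of a keyword within a scene.
SENSITIVITY = {("launder", "finance"): 3, ("launder", "household"): 0}


def context_window(text, keyword, window=WINDOW):
    """Return the text surrounding the first occurrence of the keyword."""
    index = text.find(keyword)
    if index < 0:
        return ""
    return text[max(0, index - window): index + len(keyword) + window]


def identify_scene(context):
    """Pick the first scene whose cue words appear in the context."""
    for scene, cues in SCENE_CUES.items():
        if any(cue in context for cue in cues):
            return scene
    return "unknown"


def sensitive_by_scene(text, keywords, preset_level=1):
    """Keep keywords whose sensitivity level in their scene exceeds the preset level."""
    hits = []
    for keyword in keywords:
        scene = identify_scene(context_window(text, keyword))
        if SENSITIVITY.get((keyword, scene), 0) > preset_level:
            hits.append(keyword)
    return hits
```

With this toy data, "launder" is flagged in "They plan to launder the funds via a shell account." but not in "Please launder the shirts before Friday.".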

Optionally, determining a replacement object of the data to be verified according to the first sensitive word recognition result and the second sensitive word recognition result; determining a replacement position according to the replacement object; matching the target replacement word according to the word sense of the replacement object; and updating the replacement object in the data to be verified into the target replacement word according to the replacement position.

By adopting the technical scheme, the replacement object vocabulary is determined according to the two-round recognition results, the position of each replacement word in the text is positioned, and the target replacement word is matched according to the word sense, so that the replacement is performed at the corresponding position. The accurate positioning replacement mode can avoid excessive damage to the original text, realizes accurate risk vocabulary replacement, and furthest reserves the integrity and consistency of the text. The method and the device have the advantages that the risk of sharing information is effectively reduced, meanwhile, the maintenance of the text utilization value is also considered, the balance of information security guarantee and utilization efficiency is achieved, and the reliability and the practicability of a text auditing technology are improved.
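A minimal sketch of position-based replacement follows. The function names and the replacement table are hypothetical, and the sketch assumes that the located spans do not overlap.

```python
def find_positions(text, targets):
    """Return sorted (start, end, word) spans for every occurrence of each target."""
    spans = []
    for word in targets:
        start = text.find(word)
        while start != -1:
            spans.append((start, start + len(word), word))
            start = text.find(word, start + len(word))
    return sorted(spans)


def replace_at_positions(text, spans, replacement_for):
    """Replace each span with its target replacement word.

    Spans are applied from right to left so that the offsets of the
    spans not yet processed remain valid. Spans are assumed not to overlap.
    """
    for start, end, word in sorted(spans, reverse=True):
        text = text[:start] + replacement_for[word] + text[end:]
    return text
```

Only the located spans are rewritten, so the rest of the original text is left untouched.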

In a second aspect of the present application, a system for identifying variant words in text and extracting sensitive words is provided. The system includes the following modules.

The data acquisition module is used for acquiring data to be verified on the shared financial platform;

the first recognition module is used for extracting the text content of the data to be verified, and comparing the text content with a preset sensitive word stock to obtain a first sensitive word recognition result;

the second recognition module is used for determining a variant word set according to each sensitive word in the first sensitive word recognition result, wherein the variant word set comprises a plurality of similar words with the same meaning, which are obtained by converting characters of each sensitive word; comparing the variant word set with the text content of the data to be verified to obtain a second sensitive word recognition result;

and the sensitive word replacement module is used for matching a target replacement word according to the first sensitive word recognition result and the second sensitive word recognition result, and updating the data to be verified according to the target replacement word.

In a third aspect of the present application, an electronic device is provided.

A system for identifying variant words in text and extracting sensitive words comprises a memory, a processor and a program stored in the memory and capable of running on the processor, wherein the program can be loaded and executed by the processor to realize a method for identifying variant words in text and extracting sensitive words.

In a fourth aspect of the present application, a computer-readable storage medium is provided.

A computer readable storage medium storing a computer program which when executed by a processor causes the processor to implement a method of identifying variant words in text and extracting sensitive words.

In summary, one or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:

1. According to the present application, the data to be verified on the shared financial platform is acquired, the sensitive words in the text are extracted through two rounds of recognition, the recognition range is enlarged by matching against the variant word set, and the recognized sensitive words are finally replaced with neutral words, so that the shared data is verified safely and effectively. Specifically, the first round matches the text against the preset sensitive word stock, which identifies the obvious risk words in the text. A variant word set containing a large number of approximate words semantically similar to the sensitive words is then constructed and used for a second round of matching, which can identify attempts to hide information by using variant words and greatly improves audit coverage. All identified sensitive words are replaced with target replacement words, which removes the risk information while keeping the text complete and coherent. In this way, the sensitive information in the text can be identified accurately, the potential safety hazard of shared data is reduced, secure sharing and use of information are balanced, and the verification accuracy of sensitive words is improved.

2. According to the method, key words are extracted from the text content to be verified, then the key words are compared with a preset sensitive word stock one by one, and the matching relation among the words is judged, so that the matched target sensitive words are obtained. Therefore, high false positive rate caused by full-text scanning can be avoided, the accuracy of first-round identification can be effectively improved by only comparing keywords, false alarm quantity is reduced, the identification result is more accurate, the excessive limitation on shared information can be reduced on the premise of guaranteeing the identification effect, and the balance of information security guarantee and utilization efficiency is realized.

3. According to the present application, the pinyin of each sensitive word is acquired to determine its harmonic words, so that similar-sounding words can be identified. Synonyms and near-synonyms are obtained, so that vocabulary variants with similar semantics can be identified. Character variant words containing special characters are generated by character cross combination. The constructed variant word set therefore has wider coverage and can identify various means of hiding information through variations in word sense or spelling, which greatly improves the effect of the second round of recognition. Hidden risk words in the shared information are identified comprehensively and accurately, and the miss rate of the information security audit is minimized.

Drawings

FIG. 1 is a flow chart of a method for recognizing variant words and extracting sensitive words in text according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a system for recognizing variant words and extracting sensitive words in text according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.

Reference numerals illustrate: 300. an electronic device; 301. a processor; 302. a communication bus; 303. a user interface; 304. a network interface; 305. a memory.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present application, but not all embodiments.

In the description of embodiments of the present application, words such as "exemplary" or "for example" are used to indicate examples, illustrations or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete fashion.

In the description of the embodiments of the present application, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of screen terminals means two or more screen terminals. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.

In order to facilitate understanding of the methods and systems provided in the embodiments of the present application, a description of the background of the embodiments of the present application is provided before the description of the embodiments of the present application.

The embodiment of the application discloses a method and a system for identifying variant words in text and extracting sensitive words. The text content of data to be verified is compared with a preset sensitive word stock to identify sensitive words; the sensitive words are then transformed, for example into synonyms, near-synonyms, or forms with special characters inserted, to obtain a variant word set; a second round of identification is performed to obtain the variant word recognition result in the data to be verified; and the two sensitive word recognition results are then combined for replacement. The method is mainly used to solve the problem that the accuracy of sensitive word verification is low when simple word matching is performed through a fixed word stock.

With the background above, those skilled in the art will understand the problems existing in the prior art. The technical solutions in the embodiments of the present application are described in detail below with reference to the drawings in the embodiments of the present application; the described embodiments are only some embodiments of the present application, not all embodiments.

Referring to fig. 1, a method for identifying variant words and extracting sensitive words in text includes steps S10 to S40, specifically includes the following steps:

S10: Acquiring the data to be verified on the shared financial platform.

Specifically, an access interface to the shared financial platform is preset, through which newly added and updated data released by the platform, in formats such as text, pictures and videos, can be captured periodically and stored in a local database. During acquisition, newly added and updated data on the shared financial platform is actively captured through the access interface at a preset time interval, characteristics such as its information content and format are extracted, and the extracted data is stored in the designated local database of data to be verified. In this way, the data to be verified of the shared financial and tax platform can be acquired cyclically and automatically. Acquiring the data to be verified provides the original input for subsequently identifying and extracting sensitive words in the text and is the first step of text auditing. By actively acquiring the shared platform data, financial and tax text information can be monitored comprehensively and dynamically, and newly released data that may carry risk can be found and checked in time, which effectively prevents illegal and non-compliant information from spreading over the network and ensures the safety and compliance of shared financial and tax information.
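The periodic capture in step S10 might be sketched as below. The `Record` fields, the `fetch_all` callable that stands in for the platform's access interface, and the dictionary used as the local database are all hypothetical; a real system would call this function once per preset time interval.

```python
from dataclasses import dataclass


@dataclass
class Record:
    record_id: str
    updated_at: float  # time of the last change, in seconds since the epoch
    fmt: str           # "text", "picture" or "video"
    payload: bytes


def capture_new_records(fetch_all, local_db, last_poll):
    """Copy records changed after last_poll into local_db.

    fetch_all is a stand-in for the platform's access interface.
    Returns the timestamp to use as last_poll on the next cycle.
    """
    newest = last_poll
    for record in fetch_all():
        if record.updated_at > last_poll:
            local_db[record.record_id] = record
            newest = max(newest, record.updated_at)
    return newest
```

Carrying the returned timestamp into the next cycle means each poll stores only the records added or updated since the previous one.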

S20: Extracting the text content of the data to be verified, and comparing the text content with a preset sensitive word stock to obtain a first sensitive word recognition result.

The preset sensitive word stock is configured in advance with the sensitive words required for text auditing. The words are organized and stored in a database, and are classified and preprocessed, which provides the vocabulary basis for subsequent text auditing.

Specifically, the format type of the data to be verified is detected. If the data is in a text format, the character string is traversed directly to extract the text content; if the data is in a picture format, OCR text recognition is performed on the picture to extract the text information in it; if the data is in a video format, video frames are extracted and OCR is performed on each frame image to obtain the text content in the video frames. In this way, various data formats are covered and the text information in the data to be verified is acquired comprehensively. The extracted text content is then matched against the pre-established sensitive word stock, which contains a large number of high-risk sensitive words that may appear in text information, such as words relating to violations of public order and good morals. By comparison with the words in the sensitive word stock, known sensitive words appearing in the text can be found and taken as the first-round sensitive word recognition result. This recognition process quickly locates obvious risk words in the text and can be used for preliminary screening of data on the shared financial and tax platform, which effectively improves text auditing efficiency. However, matching known sensitive words alone may miss hidden sensitive information expressed through vocabulary variants, so accuracy needs to be further improved through variant word recognition.

On the basis of the above embodiment, the specific steps of extracting text content further include S21 to S22:

S21: Detecting the information type of the data to be verified; if the information type of the data to be verified is text information, traversing the character string of the data to be verified, and determining the text content of the data to be verified.

Illustratively, by analyzing characteristics of the data file such as its multimedia attributes and header information, it can be determined whether the data is in a text format, a picture format, a video format, or another format. If the data to be verified is determined to be in a text format, the character string can be traversed directly and the text content extracted character by character. A text file stores text information as character codes, so all of the text content can be obtained completely by decoding the character codes. Compared with pictures and videos, whose characters must be extracted through image recognition, text-format data is simple and efficient to process. Therefore, after the data is determined to be in a text format, this embodiment directly traverses the character string, decodes it, and extracts the text content, which quickly yields all the vocabulary information in the text and provides the text source for the subsequent comparison and recognition of sensitive words. Determining the data format first and then selecting a suitable text extraction mode improves extraction efficiency and allows text content to be obtained uniformly from data to be verified in different formats. This provides the input for text-based sensitive word recognition, enlarges the application range of the method, and allows the various data on a shared financial and tax platform to be checked more comprehensively.
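Header-based type detection might be sketched as follows. The check relies on well-known file signatures (the PNG signature, the JPEG start-of-image marker, and the `ftyp` box near the start of an MP4 file); the function names and the fallback of treating decodable UTF-8 bytes as text are assumptions of this sketch, not details given in the application.

```python
def detect_type(data):
    """Guess the information type of raw bytes from their header."""
    if data.startswith(b"\x89PNG\r\n\x1a\n") or data.startswith(b"\xff\xd8\xff"):
        return "picture"  # PNG signature or JPEG start-of-image marker
    if data[4:8] == b"ftyp":
        return "video"    # MP4-family files carry an 'ftyp' box at offset 4
    try:
        data.decode("utf-8")
        return "text"     # assumption: text files are UTF-8 encoded
    except UnicodeDecodeError:
        return "unknown"


def extract_text(data):
    """Return the text content for text data; other types need OCR (not shown)."""
    if detect_type(data) == "text":
        return data.decode("utf-8")
    raise NotImplementedError("picture and video data require an OCR step")
```

A production system would likely check more signatures and encodings; the point here is only that the extraction path is chosen after the type is known.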

S22: If the information type of the data to be verified is video information, determining a plurality of frame images according to the video stream in the data to be verified, and extracting text content in each frame image.

For example, for data to be verified in a video format, the video images may also contain important text information, so after determining that the data is in a video format, this embodiment extracts video frame images and obtains the text contained in them for sensitive word recognition and matching. Specifically, the video data stream in the video file is parsed and the encoded image frame data is extracted. A fixed time interval is set, for example one image frame is extracted every 0.5 seconds, which avoids the excessive repeated text caused by overly dense sampling. In this way, a plurality of representative frame images are captured from the video stream at a stable rate. Optical character recognition (OCR) is performed on the captured video frame images to recognize the Chinese characters contained in each frame image and output the recognized text content. OCR is repeated for all extracted frame images, and the text content recognized in each frame is aggregated as the text information of the video data for subsequent sensitive word recognition and matching. By extracting video frames and recognizing the text in them, static text appearing in the video is covered and a more complete text source is obtained, which prevents sensitive content in video data from being missed, expands the range of data that can be recognized, and improves the comprehensiveness of data auditing on the shared financial and tax platform.
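The fixed-interval sampling can be illustrated with a small index calculation. Decoding the video and running OCR are outside this sketch; the function names are hypothetical, and dropping repeated frame texts while aggregating is an assumption added here, since the application only says that the per-frame text is summarized.

```python
def sample_frame_indices(total_frames, fps, interval_s=0.5):
    """Indices of the frames to extract, one every interval_s seconds."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


def merge_frame_texts(frame_texts):
    """Aggregate per-frame OCR output, dropping empty and repeated strings.

    The order of first appearance is preserved.
    """
    return list(dict.fromkeys(text for text in frame_texts if text))
```

For a 3-second clip at 30 frames per second (90 frames), a 0.5-second interval selects frames 0, 15, 30, 45, 60 and 75.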

In an optional embodiment of the present application, the specific process of comparing the text content with a preset sensitive word stock to obtain the first sensitive word recognition result includes the following. Word segmentation is performed on the extracted text content to obtain each keyword in the text. The keywords are read one by one and compared with the sensitive words in the sensitive word stock to determine whether a matching item can be found. The preset sensitive word stock records a large number of high-risk sensitive words that may appear in text, so matching against it finds the known sensitive words that occur in the text. The matching process uses text matching algorithms such as regular expressions and edit distance to improve matching accuracy. The successfully matched sensitive words are output as the first-round sensitive word recognition result. This quickly locates obvious risk words in the text and provides a preliminary examination of the data. However, matching known sensitive words alone may miss hidden sensitive information expressed through vocabulary variants; to further improve recognition accuracy, the present application captures hidden sensitive content related in word sense through variant word recognition in the subsequent steps.
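A minimal sketch of first-round matching with a regular expression and edit distance is given below. Splitting on `\w+` is only a stand-in for Chinese word segmentation, which needs a dedicated segmenter; the function names and the `max_distance` parameter are hypothetical.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                      # delete char_a
                cur[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b)  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def first_round(text, lexicon, max_distance=0):
    """Return the lexicon words within max_distance edits of some keyword."""
    keywords = re.findall(r"\w+", text.lower())  # stand-in for word segmentation
    return sorted({
        word for word in lexicon for keyword in keywords
        if levenshtein(keyword, word) <= max_distance
    })
```

With `max_distance=0` this is exact keyword matching; a small positive value tolerates minor misspellings at the cost of more false alarms.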

S30: Determining a variant word set according to each sensitive word in the first sensitive word recognition result, wherein the variant word set comprises a plurality of similar words with the same meaning, which are obtained by converting characters of each sensitive word.

Specifically, after the first round of sensitive word matching, a variant word set of the sensitive words is generated to provide the vocabulary required by the second round of recognition. Variant word recognition is performed because approximate words or morphological changes are often used in text to express sensitive content, and recognizing only a known word stock easily leads to omissions. In implementation, all sensitive words in the first-round recognition result are collected. For each word, vocabulary variants with approximately the same meaning and a slightly changed form are generated automatically by adding, deleting and replacing characters, so that a plurality of character-transformed approximate words can be derived from each identified sensitive word. All the approximate words are combined into a set of potential sensitive words related in word sense, and the second round of matching is performed between this set and the original text. Through variant word detection, the recognition rate can be greatly improved, the omissions that occur with simple word matching alone are prevented, hidden sensitive information in the text is captured comprehensively, and the safety of shared financial and tax information is ensured.
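Generating variants by adding, deleting and replacing characters might look like the sketch below. The look-alike substitution table, the filler characters, and the function name `edit_variants` are hypothetical; the application does not say which characters are added or substituted.

```python
# Hypothetical tables; a real system would hold these in its preset word stock.
LOOKALIKES = {"i": "1", "o": "0", "e": "3"}
FILLERS = ["_"]


def edit_variants(word):
    """Variants of a word that differ from it by a single character edit."""
    variants = set()
    for i, char in enumerate(word):
        # Delete the character at position i.
        variants.add(word[:i] + word[i + 1:])
        # Replace it with a look-alike, if one is defined.
        if char in LOOKALIKES:
            variants.add(word[:i] + LOOKALIKES[char] + word[i + 1:])
    # Add a filler character between two adjacent characters.
    for i in range(1, len(word)):
        for filler in FILLERS:
            variants.add(word[:i] + filler + word[i:])
    variants.discard(word)
    variants.discard("")
    return variants
```

Deleting a character can produce ordinary words by accident, so in practice such variants would be filtered or scored before being used in the second round.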

On the basis of the above embodiment, the specific step of determining the variant word set further includes S31 to S33:

S31: acquiring the pinyin of each sensitive word, and determining harmonic words (homophones) of each sensitive word according to the pinyin.

Illustratively, when variant words of the sensitive words are generated, harmonic words are derived from the pinyin of each sensitive word. The pinyin is obtained and harmonic words are determined because Chinese contains a large number of words that are pronounced similarly but have different meanings, and such similar-sounding harmonic words can be used to express sensitive content obscurely. In implementation, a Chinese pinyin annotation system is called and the pinyin representation of each sensitive word is obtained directly from the sensitive word dictionary; words with the same or similar pinyin are then determined as its harmonic words. These harmonic words serve as variant words of the sensitive word in the second round of text matching. Generating harmonic words from pinyin enlarges the recognition vocabulary, captures hidden sensitive content expressed through harmonic words in the text, and avoids the missed judgments that occur when relying only on a limited set of known sensitive words. The overall ability to recognize risk information in financial and tax text is thereby improved.
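One way to picture the pinyin-based generation of harmonic words is the sketch below. It is only an illustration under stated assumptions: the five-character `PINYIN` table is a hypothetical stand-in for a full pinyin dictionary, tones are ignored, the function name `harmonic_words` is invented here, and the neutral word 发票 (invoice) is used purely as sample input.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy table of toneless pinyin; a real system would rely on a
# complete pinyin dictionary or annotation library.
PINYIN = {"发": "fa", "票": "piao", "法": "fa", "漂": "piao", "飘": "piao"}

# Invert the table: pinyin syllable -> characters sharing that syllable.
BY_SOUND = defaultdict(list)
for char, syllable in PINYIN.items():
    BY_SOUND[syllable].append(char)

def harmonic_words(word: str) -> list[str]:
    """Generate same-sounding spellings by swapping each character for every
    character with the same pinyin. Unknown characters are kept unchanged."""
    options = [BY_SOUND.get(PINYIN.get(ch, ""), [ch]) for ch in word]
    candidates = ("".join(chars) for chars in product(*options))
    return [c for c in candidates if c != word]

print(harmonic_words("发票"))
# ['发漂', '发飘', '法票', '法漂', '法飘']
```

The number of candidates grows multiplicatively with word length, so a practical system would probably filter the output against a dictionary of real words or observed spellings.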

S32: matching synonyms and near-synonyms of each sensitive word according to a preset word stock.

Illustratively, in constructing the variant word set of the sensitive words, the coverage of the recognition vocabulary needs to be expanded by matching synonyms and near-synonyms. Synonyms and near-synonyms are matched because text often uses semantically similar words to express information about the same concept or topic. Specifically, a large-scale synonym dictionary and a near-synonym dictionary are established in advance. For each sensitive word identified in the first round, the embodiment searches the synonym forest for its synonyms and searches the near-synonym dictionary for its near-synonyms. Because these semantically related words can also express the sensitive information in the text, they are used as variant words in the second round of matching, which expands the vocabulary covered by recognition and improves the ability to recognize sensitive information in the text accurately.
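The dictionary lookup in S32 might look like the short sketch below. The two dictionaries, their entries and the function name `semantic_variants` are hypothetical placeholders for the pre-established synonym and near-synonym (paraphrase) dictionaries mentioned in the text.

```python
# Hypothetical, hand-written dictionaries standing in for large-scale resources.
SYNONYMS = {"bribe": ["kickback", "payoff"]}
NEAR_SYNONYMS = {"bribe": ["inducement", "payoff"]}

def semantic_variants(word: str) -> list[str]:
    """Collect synonyms first, then near-synonyms, dropping duplicates
    and the word itself while keeping dictionary order."""
    variants: list[str] = []
    for source in (SYNONYMS, NEAR_SYNONYMS):
        for candidate in source.get(word, []):
            if candidate != word and candidate not in variants:
                variants.append(candidate)
    return variants

print(semantic_variants("bribe"))
# ['kickback', 'payoff', 'inducement']
```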

S33: obtaining a plurality of character variant words through cross combination according to the character string length of each sensitive word and the special characters in the preset word stock; and taking the harmonic words, synonyms, near-synonyms and character variant words of each sensitive word as the similar words in the variant word set.

Illustratively, in the process of constructing the variant word set, a plurality of character-varied similar words of each sensitive word are generated by cross-combining characters. Character cross combination is performed because, in actual text, sensitive information is often expressed with altered spellings, harmonic words or abbreviations, and such forms are difficult to recognize by simple character substitution alone. Specifically, each known sensitive word is split according to its character string length, a special character stock is preset, and the characters are cross-combined to generate new words automatically. These words serve as similar words of the original sensitive word in the second round of matching. Through this process, a variant word set containing harmonic words, synonyms, near-synonyms and a large number of character variants can be constructed. This vocabulary represents the hidden variants of sensitive words that may appear in the text, is used for the second round of recognition, and improves the coverage with which text risk information is captured.
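The text does not spell out the exact cross-combination rule, so the sketch below adopts one plausible reading as an explicit assumption: each gap between adjacent characters of the word is filled with either nothing or one special character, and all combinations are enumerated. The function name `character_variants` and the sample special characters are hypothetical.

```python
from itertools import product

def character_variants(word: str, specials: tuple[str, ...] = ("*", "-")) -> list[str]:
    """Insert special characters into the gaps of `word` (assumed reading of
    'cross combination'). A word of length n has n - 1 gaps."""
    if len(word) < 2:
        return []
    choices = ("",) + tuple(specials)       # "" means leave the gap empty
    variants = []
    for fill in product(choices, repeat=len(word) - 1):
        if not any(fill):
            continue                        # skip the unchanged word
        pieces = [word[0]]
        for separator, char in zip(fill, word[1:]):
            pieces.append(separator + char)
        variants.append("".join(pieces))
    return variants

print(character_variants("abc", specials=("*",)))
# ['ab*c', 'a*bc', 'a*b*c']
```

With k special characters and a word of length n this yields (k + 1)^(n - 1) - 1 variants, which shows why the character string length of the sensitive word matters for bounding the set.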

S40: comparing the variant word set with the text content of the data to be verified to obtain a second sensitive word recognition result.

Specifically, after the variant word set is constructed, it needs to be matched against the original text content in a second round to identify hidden sensitive information expressed with variant words. The second round is performed because matching only known sensitive words is easily bypassed by disguising information through word modification, which creates a risk of missed judgments. All words in the constructed variant word set are matched one by one against the text content of the data to be verified. The matching again uses algorithms such as regular expressions and edit distance, and tolerates a certain degree of character substitution. Any word in the text that successfully matches the variant word set is output as a result of the second round of sensitive word recognition. The second round of variant word recognition improves the recognition rate of hidden sensitive information in the text, captures content that hides information through changes of word form, and avoids the missed judgments that arise from relying on the known word stock alone, thereby supporting security monitoring of the shared financial information content.

On the basis of the above embodiment, the specific step of determining the second sensitive word recognition result further includes S41 to S45:

S41: comparing the harmonic words, synonyms, near-synonyms and character variant words in the variant word set with the keywords in the text content to obtain the matching degree of each keyword.

Illustratively, in the second round of variant word matching, the matching degree between each text keyword and each word in the variant word set needs to be calculated. The matching degree is calculated because text keywords and variant words are often not exactly equal but differ by some character replacements or deletions, so the similarity between words must be computed to judge whether they match. In implementation, the text to be verified is first segmented and all keywords are extracted. The words in the variant word set, including harmonic words, synonyms and so on, are then taken one by one and compared with each keyword to calculate the matching degree. The comparison may use an edit distance algorithm, which counts the character insertions, deletions and substitutions needed to turn one word into the other and uses that count as the basis of the similarity measure. Word pairs whose matching degree is higher than the set threshold are judged to be successfully matched and are output as the second-round sensitive word recognition result. Calculating the matching degree allows the correlation between words to be judged accurately, improves the accuracy of the variant word recognition result, and avoids the missed judgments or misjudgments caused by direct comparison.
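A minimal sketch of an edit-distance-based matching degree is given below. The normalization `1 - distance / max(len)` and the threshold value 0.75 are assumptions chosen for illustration; the text only states that an edit distance algorithm may be used and that pairs above a set threshold count as matches. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def matching_degree(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical words, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def match_variants(keywords, variants, threshold=0.75):
    """Return (keyword, variant, degree) for pairs above the threshold."""
    return [(k, v, matching_degree(k, v))
            for k in keywords for v in variants
            if matching_degree(k, v) > threshold]

hits = match_variants(["fr4ud", "report"], ["fraud"])
print([(k, v) for k, v, _ in hits])
# [('fr4ud', 'fraud')]
```

Normalizing by the longer word keeps the score in the range 0 to 1, so a single threshold can be applied to words of different lengths.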

S42: if the variant word set contains at least one target variant word whose matching degree with a keyword is greater than the preset matching degree, taking the target variant word as the second sensitive word recognition result.

Illustratively, after the matching degree between the text keywords and the variant words is obtained, the successfully matched variant words need to be determined from the matching degree results. A matching threshold is preset because establishing that two words are correlated is not enough to declare a match; it must also be judged whether the correlation exceeds the expected level. Specifically, the embodiment traverses the matching degree results and finds the word pairs whose matching degree is greater than a preset threshold, wherein the preset threshold can be determined through a vocabulary similarity algorithm. When a matching degree is higher than the threshold, the text keyword is judged to have successfully matched the corresponding variant word. All successfully matched variant words are collected and output as the second-round recognition result. Setting a matching degree threshold avoids misjudgments caused by overly loose matching, improves the accuracy of variant word recognition, and allows hidden sensitive content in the text to be recognized comprehensively and accurately.

S43: if the variant word set contains no target variant word whose matching degree with a keyword is greater than the preset matching degree, acquiring vocabulary information of each keyword within a preset character string length interval.

For example, in the second round of variant word matching, some text keywords may fail to match the variant word set. In that case, vocabulary information related to those keywords needs to be obtained in order to enrich the vocabulary sources and compare again. Specifically, for each keyword that was not successfully matched, related words are obtained within a preset word length interval. Expanding the range of related words increases the likelihood of a subsequent match. The related words can be obtained through word vector technology, by calculating the semantic distance between words and taking the words nearest in the vector space. Related words can also be obtained through dictionaries and other tools. After the expanded vocabulary is obtained, the embodiment performs variant word matching again until the keyword is successfully matched or the threshold on the number of matching attempts is reached. This scheme improves the probability that keywords match variant words, reduces the miss rate of the second-round recognition, and recognizes hidden sensitive content in the text more comprehensively.
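The expansion and re-matching loop can be sketched as below. Everything concrete here is an assumption for illustration: the two-dimensional `VECTORS` are made-up toy embeddings, cosine similarity stands in for the unspecified semantic distance, and `related_words` and `rematch` are hypothetical names. A real system would load trained word vectors.

```python
import math

# Hypothetical toy embeddings; real word vectors have hundreds of dimensions.
VECTORS = {"payoff": (0.9, 0.1), "bribe": (1.0, 0.0),
           "kickback": (0.8, 0.2), "weather": (0.0, 1.0)}

def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def related_words(keyword, vectors, length_range=(2, 8), top_k=2):
    """Nearest neighbours of `keyword` whose length lies in `length_range`."""
    if keyword not in vectors:
        return []
    low, high = length_range
    scored = [(cosine(vectors[keyword], vec), word)
              for word, vec in vectors.items()
              if word != keyword and low <= len(word) <= high]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:top_k]]

def rematch(keyword, variant_set, vectors, max_rounds=2):
    """Expand the keyword and retry matching until a hit or the round limit."""
    frontier = [keyword]
    for _ in range(max_rounds):
        expanded = [w for k in frontier for w in related_words(k, vectors)]
        hits = [w for w in expanded if w in variant_set]
        if hits:
            return hits
        frontier = expanded
    return []

print(rematch("payoff", {"bribe"}, VECTORS))
# ['bribe']
```

The `max_rounds` parameter plays the role of the matching attempt threshold: it bounds how far the expansion can drift from the original keyword.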

S44: identifying the word scene corresponding to each keyword according to the vocabulary information, and determining the sensitivity level of each keyword in that word scene.

The word scene refers to the use environment and the context of the vocabulary in the language or the text, and the semantic scene of the vocabulary is deduced by judging the related words and the context information of the vocabulary.

The sensitivity level is a risk level rating of words or segments of speech in the text. The sensitivity level of the same vocabulary in different word scenes can be different.

For example, after the vocabulary information related to a keyword is obtained, the word scene to which the keyword belongs needs to be determined, and the sensitivity level of the keyword is determined according to that scene. The word scene is judged because the same keyword expresses different degrees of risk in different usage scenes. In implementation, NLP technology is used to judge the word scene corresponding to the keyword from its related words. The keyword has different preset risk levels in different scenes. A keyword judged to be in a high-risk scene is given a higher sensitivity level even if no variant word was successfully matched, whereas a keyword in a neutral scene may be given a lower sensitivity level even if it partially matches a variant word. Obtaining keyword semantics through word scene recognition improves the recognition accuracy for words that were not successfully matched, avoids misjudgments caused by ignoring context, and allows text risk to be graded more accurately so that content with different degrees of risk is correctly identified.
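As a rough illustration of scene-dependent sensitivity levels, the sketch below replaces the unspecified NLP technology with a simple cue-word count. The scene names, cue words, level table and function names are all hypothetical assumptions, not part of the application.

```python
# Hypothetical cue words per word scene and preset levels per (keyword, scene).
SCENE_CUES = {
    "tax_filing": {"invoice", "declare", "refund"},
    "casual": {"lunch", "weather", "weekend"},
}
LEVELS = {("offset", "tax_filing"): 3, ("offset", "casual"): 1}

def infer_scene(related_words: list[str]) -> str:
    """Pick the scene whose cue words occur most often among the related
    words; return 'unknown' when no cue word is present."""
    counts = {scene: sum(word in cues for word in related_words)
              for scene, cues in SCENE_CUES.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"

def sensitivity_level(keyword: str, related_words: list[str], default: int = 0) -> int:
    """Look up the preset level of `keyword` in its inferred scene."""
    return LEVELS.get((keyword, infer_scene(related_words)), default)

print(sensitivity_level("offset", ["invoice", "refund"]))  # 3
print(sensitivity_level("offset", ["lunch"]))              # 1
```

The same keyword receives level 3 in the tax-filing scene and level 1 in the casual scene, which mirrors the point that sensitivity depends on the scene rather than on the word alone.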

S45: if the keywords include a target keyword whose sensitivity level is greater than the preset sensitivity level, taking the target keyword as the second sensitive word recognition result.

Illustratively, after the sensitivity level of each keyword is determined, the final second-round sensitive words need to be extracted according to the level results. A sensitivity level threshold is set because the tolerable sensitivity level differs between risk scenarios, so a dynamic judgment standard is required. In implementation, the risk level results of all keywords are traversed and compared with a preset level threshold, which can be adjusted dynamically for different industries and word scenes. When the sensitivity level obtained for a keyword is higher than the current threshold, the keyword is judged to carry higher risk. All keywords above the sensitivity level threshold are extracted and output as second-round sensitive word recognition results, supplementing the variant word matching. Low-risk keywords that did not successfully match a variant word are filtered out, which reduces the false alarm rate. In this way, hidden sensitive content that may pose a risk is identified accurately and misjudgments caused by ignoring context are avoided, improving the accuracy of the security audit of shared information.
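The threshold comparison in S45 can be sketched as a filter with a per-scene threshold. The threshold values, scene names and the function `select_sensitive` are hypothetical; the text only says that the threshold may be adjusted for different industries and word scenes.

```python
DEFAULT_THRESHOLD = 2                                # assumed fallback value
SCENE_THRESHOLDS = {"tax_filing": 1, "casual": 3}    # assumed per-scene values

def select_sensitive(scored_keywords):
    """Keep keywords whose level exceeds the threshold of their scene.

    `scored_keywords` is a list of (keyword, scene, level) tuples."""
    selected = []
    for keyword, scene, level in scored_keywords:
        threshold = SCENE_THRESHOLDS.get(scene, DEFAULT_THRESHOLD)
        if level > threshold:
            selected.append(keyword)
    return selected

scored = [("offset", "tax_filing", 3), ("offset", "casual", 1), ("shell", "unknown", 2)]
print(select_sensitive(scored))
# ['offset']
```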

S50: matching a target replacement word according to the first sensitive word recognition result and the second sensitive word recognition result, and updating the data to be verified according to the target replacement word.

The target replacement word refers to a neutral word used to replace an identified sensitive word during text security audit. Sensitive words are replaced rather than deleted because directly deleting them may destroy the semantic structure and coherence of the text. Target replacement words come from a pre-established vocabulary mapping table, which binds neutral replacement words to sensitive words of different risk levels and categories.

Specifically, after the first-round and second-round sensitive word recognition results are obtained, the recognized words are replaced so as to update the content of the text data. Replacement is performed because directly deleting the identified sensitive words may destroy the integrity and coherence of the text semantics. In implementation, for each identified sensitive word, a preset vocabulary mapping table is called to match the corresponding replacement word. The vocabulary mapping table can preset neutral replacement words for sensitive words of different categories and risk levels. The position of each identified sensitive word in the text data is replaced with the corresponding neutral word to complete the update. The updated text no longer carries the risk information while its semantic expression remains coherent. Replacing sensitive words in this way removes risk-sensitive information from the text, reduces the security risks of information sharing, and preserves the readability and integrity of the text as far as possible, achieving a safe and efficient text audit.

On the basis of the above embodiment, the specific steps of determining the target replacement word and updating the data to be verified further include S51 to S53:

S51: determining a replacement object of the data to be verified according to the first sensitive word recognition result and the second sensitive word recognition result.

For example, after the first-round and second-round sensitive word recognition results are obtained, the objects to be replaced in the text, that is, all recognized sensitive words, need to be determined first, because performing the replacement operation directly on the original text carries a risk of incorrect replacement. In implementation, all sensitive words in the first-round and second-round recognition results are traversed and extracted, and a definite list of replacement objects is formed after filtering out duplicates. The risk category and level of each word are also marked to facilitate the subsequent replacement operation. When the replacement is performed, the embodiment checks one by one whether the word at each position in the text is consistent with the replacement object before executing the replacement, which reduces the risk of misoperation. Confirming the sensitive word recognition results to form the replacement objects improves the accuracy of word replacement and avoids unnecessary damage to the original text, so that content auditing and sensitive word filtering are carried out more accurately and effectively.

S52: determining a replacement position according to the replacement object; and matching the target replacement word according to the word sense of the replacement object.

Illustratively, after the sensitive words to be replaced are obtained, the specific position of each word in the text needs to be determined, and a suitable target replacement word needs to be matched according to its word sense. The position and the word sense are determined because a correct replacement is possible only when both items of information are obtained accurately. In implementation, the position coordinates of each replacement object in the text, such as its paragraph and sentence, are analyzed one by one to obtain the exact replacement position. The word sense attributes of each word are analyzed through NLP technology, for example to determine that a word belongs to an economic scene. A target replacement word matching the word sense of the replacement object is then looked up in the preset vocabulary mapping table, for example a neutral word from the economic scene. Once the replacement positions and target words are obtained, the corresponding sensitive words can be replaced accurately in the text according to their positions to complete the filtering operation. This ensures the accuracy of the replacement, yields filtered text that reads smoothly and retains its detail, and improves the security of information sharing.

S53: updating the replacement object in the data to be verified to the target replacement word according to the replacement position.

Illustratively, after the replacement position and the target replacement word are determined, the replacement operation needs to be performed to update each identified sensitive word to a neutral word without negative semantics. The replacement update is performed to eliminate the risk information hidden in the text while keeping the text semantically coherent. In implementation, the text content to be verified is accessed and the determined position coordinates are located. The original sensitive word at that position is directly replaced with the target replacement word according to the vocabulary matching result. The replacement operation is repeated until every identified sensitive word to be replaced has been processed. The final output is an updated text that has passed the content security audit and in which the risk words have been neutralized. Position-based replacement effectively eliminates sensitive information in the text while preserving its continuity and readability as far as possible, achieving a text audit that balances information security with sharing efficiency.
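Steps S51 to S53 can be pictured with the sketch below, which locates each replacement object by character offset and then substitutes the mapped neutral word. It is an illustration under stated assumptions: the mapping table contents and the function name `replace_sensitive` are hypothetical, offsets stand in for the paragraph and sentence coordinates mentioned in the text, and the word-sense matching of S52 is reduced to a direct table lookup.

```python
import re

def replace_sensitive(text: str, mapping: dict[str, str]) -> str:
    """Replace every occurrence of each sensitive word with its neutral
    counterpart, working from recorded positions."""
    spans = []  # (start, end, replacement), kept non-overlapping
    # Longer words first, so a short word never claims part of a longer one.
    for word in sorted(mapping, key=len, reverse=True):
        if not word:
            continue
        for match in re.finditer(re.escape(word), text):
            overlaps = any(start < match.end() and match.start() < end
                           for start, end, _ in spans)
            if not overlaps:
                spans.append((match.start(), match.end(), mapping[word]))
    # Apply right to left so earlier offsets remain valid after each edit.
    for start, end, replacement in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text

mapping = {"bribe": "payment", "kickback": "rebate"}  # hypothetical table
print(replace_sensitive("the bribe and the kickback", mapping))
# the payment and the rebate
```

Recording positions before editing, and editing from the end of the text backwards, is one simple way to keep the located coordinates consistent with the replacement objects while the text changes length.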

Referring to fig. 2, a system for identifying variant words and extracting sensitive words in text according to an embodiment of the present application includes: the system comprises a data acquisition module, a first identification module, a second identification module and a sensitive word replacement module, wherein:

the second recognition module is used for determining a variant word set according to each sensitive word in the first sensitive word recognition result, wherein the variant word set comprises a plurality of similar words with the same meaning, obtained by applying character transformations to each sensitive word; and comparing the variant word set with the text content of the data to be verified to obtain a second sensitive word recognition result;

and the sensitive word replacement module is used for matching the target replacement word according to the first sensitive word recognition result and the second sensitive word recognition result, and updating the data to be verified according to the target replacement word.

On the basis of the embodiment, the first identification module is further used for detecting the information type of the data to be verified; if the information type of the data to be verified is text information, traversing the character strings of the data to be verified, and determining the text content of the data to be verified; if the information type of the data to be verified is video information, determining a plurality of frame images according to the video stream in the data to be verified, and extracting text content in each frame image.

On the basis of the embodiment, the first recognition module is further used for reading each keyword in the text content, comparing each keyword with each sensitive word in the preset sensitive word stock to obtain at least one matched target sensitive word, and taking the target sensitive word as the first sensitive word recognition result.

On the basis of the embodiment, the second recognition module is further used for obtaining the pinyin of each sensitive word and determining harmonic words of each sensitive word according to the pinyin; matching synonyms and near-synonyms of each sensitive word according to a preset word stock; obtaining a plurality of character variant words through cross combination according to the character string length of each sensitive word and the special characters in the preset word stock; and taking the harmonic words, synonyms, near-synonyms and character variant words of each sensitive word as the similar words in the variant word set.

On the basis of the embodiment, the second recognition module is further used for comparing the harmonic words, synonyms, near-synonyms and character variant words in the variant word set with the keywords in the text content to obtain the matching degree of each keyword; and if the variant word set contains at least one target variant word whose matching degree with a keyword is greater than the preset matching degree, taking the target variant word as the second sensitive word recognition result.

On the basis of the above embodiment, the second recognition module is further used for acquiring vocabulary information of each keyword within a preset character string length interval if the variant word set contains no target variant word whose matching degree with a keyword is greater than the preset matching degree; identifying the word scene corresponding to each keyword according to the vocabulary information, and determining the sensitivity level of each keyword in that word scene; and if the keywords include a target keyword whose sensitivity level is greater than the preset sensitivity level, taking the target keyword as the second sensitive word recognition result.

On the basis of the embodiment, the sensitive word replacement module is further used for determining a replacement object of the data to be verified according to the first sensitive word recognition result and the second sensitive word recognition result; determining a replacement position according to the replacement object; matching the target replacement word according to the word sense of the replacement object; and updating the replacement object in the data to be verified to the target replacement word according to the replacement position.

It should be noted that: in the device provided in the above embodiment, when implementing the functions thereof, only the division of the above functional modules is used as an example, in practical application, the above functional allocation may be implemented by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the embodiments of the apparatus and the method provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the embodiments of the method are detailed in the method embodiments, which are not repeated herein.

The application also discloses electronic equipment. Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to the disclosure in an embodiment of the present application. The electronic device 300 may include: at least one processor 301, at least one network interface 304, a user interface 303, a memory 305, at least one communication bus 302.

Wherein the communication bus 302 is used to enable connected communication between these components.

The user interface 303 may include a display screen (Display) interface and a camera (Camera) interface; optionally, the user interface 303 may further include a standard wired interface and a standard wireless interface.

The network interface 304 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.

Wherein the processor 301 may include one or more processing cores. The processor 301 uses various interfaces and lines to connect the various parts of the electronic device, and performs the functions of the device and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 305 and by invoking data stored in the memory 305. Optionally, the processor 301 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA) and programmable logic array (Programmable Logic Array, PLA). The processor 301 may integrate one or a combination of several of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing the content to be displayed by the display screen; the modem is used to handle wireless communications. It will be appreciated that the modem may also not be integrated into the processor 301 and may instead be implemented by a separate chip.

The memory 305 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). Optionally, the memory 305 includes a non-transitory computer-readable storage medium. The memory 305 may be used to store instructions, programs, code, code sets or instruction sets. The memory 305 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the above-described method embodiments, etc.; the data storage area may store the data involved in the above method embodiments. Optionally, the memory 305 may also be at least one storage device located remotely from the aforementioned processor 301. Referring to fig. 3, the memory 305, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an application program of a method of identifying variant words in text and extracting sensitive words.

In the electronic device 300 shown in fig. 3, the user interface 303 is mainly used for providing an input interface for a user, and acquiring data input by the user; and processor 301 may be used to invoke an application in memory 305 that stores a method of identifying variant words in text and extracting sensitive words, which when executed by one or more processors 301, causes electronic device 300 to perform the method as in one or more of the embodiments described above. It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided herein, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, such as a division of units, merely a division of logic functions, and there may be additional divisions in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some service interface, device or unit indirect coupling or communication connection, electrical or otherwise.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk or an optical disk.

The above are merely exemplary embodiments of the present disclosure and are not intended to limit its scope. Equivalent changes and modifications made in accordance with the teachings of this disclosure fall within the scope of the present disclosure. Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure.

This application is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles, including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. The specification and examples are to be considered exemplary only, with the true scope and spirit of the disclosure being indicated by the claims.

Claims

1. A method for identifying variant words and extracting sensitive words in text, comprising:

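acquiring data to be verified on a shared financial platform;

extracting the text content of the data to be verified, and comparing the text content with a preset sensitive word stock to obtain a first sensitive word recognition result;

determining a variant word set according to each sensitive word in the first sensitive word recognition result, wherein the variant word set comprises a plurality of similar words with the same meaning, which are obtained by converting characters of each sensitive word;

comparing the variant word set with the text content of the data to be verified to obtain a second sensitive word recognition result;

and matching a target replacement word according to the first sensitive word recognition result and the second sensitive word recognition result, and updating the data to be verified according to the target replacement word.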

2. The method for identifying variant words in text and extracting sensitive words according to claim 1, wherein said extracting the text content of the data to be verified comprises:

detecting the information type of the data to be verified;

if the information type of the data to be verified is text information, traversing the character string of the data to be verified, and determining the text content of the data to be verified;

if the information type of the data to be verified is video information, determining a plurality of frame images according to the video stream in the data to be verified, and extracting text content in each frame image.

3. The method for identifying variant words in text and extracting sensitive words according to claim 2, wherein said comparing the text content with a preset sensitive word stock to obtain a first sensitive word recognition result comprises:

reading each keyword in the text content, comparing each keyword with each sensitive word in the preset sensitive word stock to obtain at least one matched target sensitive word, and taking the target sensitive word as the first sensitive word recognition result.

4. The method for identifying variant words in text and extracting sensitive words according to claim 1, wherein said determining a variant word set according to each sensitive word in the first sensitive word recognition result comprises:

acquiring the pinyin of each sensitive word, and determining harmonic words (homophones) of each sensitive word according to the pinyin;

matching synonyms and paraphrasing words of the sensitive words according to a preset word stock;

obtaining a plurality of character variant words by cross-combining each sensitive word with the special characters in the preset word stock according to the character string length of each sensitive word;

and using the harmonic words, the synonyms, the paraphrasing words and the character variant words of the sensitive words as the similar words in the variant word set.
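
As an illustration only, the steps recited in claim 4 might be sketched in Python as follows. The claim does not specify data structures, so everything named here is an assumption: `pinyin_of`, `homophones_by_pinyin` and `synonyms_of` are hypothetical lookup tables standing in for the preset word stock, and "cross combination" is read as inserting zero or one special character into each gap between adjacent characters of the sensitive word.

```python
from itertools import product


def character_variants(word, special_chars):
    """Insert zero or one special character into every gap between characters.

    Assumed reading of "cross combination"; the original word itself is excluded.
    """
    if len(word) < 2:
        return set()
    gaps = len(word) - 1
    variants = set()
    for fillers in product([""] + list(special_chars), repeat=gaps):
        if not any(fillers):
            continue  # all gaps empty: this is the unmodified word
        pieces = [word[0]]
        for filler, ch in zip(fillers, word[1:]):
            pieces.append(filler + ch)
        variants.add("".join(pieces))
    return variants


def build_variant_set(word, pinyin_of, homophones_by_pinyin, synonyms_of, special_chars):
    """Collect harmonic words, synonyms and character variants for one sensitive word."""
    variants = set()
    pinyin = pinyin_of.get(word)  # hypothetical word -> pinyin table
    if pinyin is not None:
        variants.update(homophones_by_pinyin.get(pinyin, ()))
    variants.update(synonyms_of.get(word, ()))  # synonyms and paraphrasing words
    variants.update(character_variants(word, special_chars))
    variants.discard(word)  # the sensitive word itself is not a variant
    return variants
```

Note that the number of character variants grows exponentially with the word length, so a practical system would likely bound the word length or the number of special characters.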

5. The method for identifying variant words in text and extracting sensitive words according to claim 4, wherein said comparing the variant word set with the text content of the data to be verified to obtain a second sensitive word recognition result comprises:

comparing the harmonic words, the synonyms, the paraphrasing words and the character variant words in the variant word set with the keywords in the text content to obtain the matching degree of the keywords;

and if the matching degree between at least one target variant word in the variant word set and each keyword is greater than a preset matching degree, taking the target variant word as the second sensitive word recognition result.
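
The threshold comparison of claim 5 can be illustrated with a short Python sketch. The claim does not define how the matching degree is computed; the sketch assumes the similarity ratio from the standard library's `difflib.SequenceMatcher`, and the threshold value of 0.8 is likewise a placeholder. It also assumes that a variant word counts as matched when at least one keyword exceeds the threshold, which is only one possible reading of the claim language.

```python
from difflib import SequenceMatcher


def matching_degree(a, b):
    """Similarity in [0, 1]; an assumed stand-in for the claim's matching degree."""
    return SequenceMatcher(None, a, b).ratio()


def second_pass_match(variant_set, keywords, threshold=0.8):
    """Return the variant words whose matching degree with some keyword exceeds the threshold."""
    hits = []
    for variant in sorted(variant_set):  # sorted for a deterministic result order
        if any(matching_degree(variant, kw) > threshold for kw in keywords):
            hits.append(variant)
    return hits
```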

6. The method for identifying variant words in text and extracting sensitive words according to claim 5, wherein said comparing the harmonic words, the synonyms, the paraphrasing words and the character variant words in the variant word set with the keywords in the text content further comprises:

if the matching degree between the target variant word and each keyword is greater than the preset matching degree, acquiring the vocabulary information around each keyword within a preset character string length interval;

identifying word usage scenes corresponding to the keywords according to the vocabulary information, and determining the sensitivity level of the keywords in the word usage scenes;

and if a target keyword whose sensitivity level is greater than a preset sensitivity level exists among the keywords, taking the target keyword as the second sensitive word recognition result.
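
A minimal Python sketch of the context check in claim 6 is given below, under stated assumptions. The claim does not say how a word usage scene is identified or how sensitivity levels are stored; here `scene_cues` (scene name to cue words) and `levels` (keyword and scene to level) are hypothetical tables, the "preset character string length interval" is modelled as a fixed `radius` of characters on each side of the keyword's first occurrence, and the scene is chosen by counting cue words.

```python
def context_window(text, keyword, radius):
    """Return the text within `radius` characters of the keyword's first occurrence."""
    idx = text.find(keyword)
    if idx == -1:
        return ""
    start = max(0, idx - radius)
    end = min(len(text), idx + len(keyword) + radius)
    return text[start:end]


def identify_scene(window, scene_cues):
    """Pick the scene whose cue words occur most often in the window, else 'unknown'."""
    best_scene, best_count = "unknown", 0
    for scene, cues in scene_cues.items():
        count = sum(window.count(cue) for cue in cues)
        if count > best_count:
            best_scene, best_count = scene, count
    return best_scene


def filter_by_sensitivity(text, keywords, scene_cues, levels, radius=10, min_level=1):
    """Keep keywords whose sensitivity level in their usage scene exceeds min_level."""
    hits = []
    for kw in keywords:
        scene = identify_scene(context_window(text, kw, radius), scene_cues)
        if levels.get((kw, scene), 0) > min_level:  # unlisted pairs default to level 0
            hits.append(kw)
    return hits
```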

7. The method for identifying variant words in text and extracting sensitive words according to claim 1, wherein said matching a target replacement word according to the first sensitive word recognition result and the second sensitive word recognition result, and updating the data to be verified according to the target replacement word, comprises:

determining a replacement object of the data to be verified according to the first sensitive word recognition result and the second sensitive word recognition result;

determining a replacement position according to the replacement object;

matching the target replacement word according to the word sense of the replacement object;

and updating the replacement object in the data to be verified into the target replacement word according to the replacement position.
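
The four steps of claim 7 can be sketched in Python as follows; this is an illustration under assumptions, not the claimed implementation. The mapping `replacement_for` is a hypothetical table standing in for matching by word sense, the fallback `"***"` is a placeholder, and overlapping matches are resolved by keeping the earliest (and, at the same position, the longest) one.

```python
def find_positions(text, word):
    """Return the start index of every non-overlapping occurrence of word in text."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        start = text.find(word, start + len(word))
    return positions


def replace_sensitive(text, first_result, second_result, replacement_for, default="***"):
    """Replace every recognised sensitive word or variant with its target replacement."""
    # Step 1: replacement objects are the union of both recognition results.
    objects = list(dict.fromkeys(list(first_result) + list(second_result)))
    spans = []
    for obj in objects:
        if not obj:
            continue  # ignore empty strings
        target = replacement_for.get(obj, default)  # step 3: hypothetical sense lookup
        for pos in find_positions(text, obj):  # step 2: replacement positions
            spans.append((pos, pos + len(obj), target))
    # Earliest span first; at equal start, the longer span wins.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    # Step 4: rebuild the text, skipping spans that overlap one already applied.
    out, cursor = [], 0
    for start, end, target in spans:
        if start < cursor:
            continue
        out.append(text[cursor:start])
        out.append(target)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```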

8. A system for identifying variant words in text and extracting sensitive words, said system comprising:

9. An electronic device comprising a processor, a memory, a user interface and a network interface, the memory being used for storing instructions, the user interface and the network interface being used for communicating with other devices, and the processor being used for executing the instructions stored in the memory to cause the electronic device to perform the method for identifying variant words in text and extracting sensitive words according to any one of claims 1-7.

10. A computer-readable storage medium storing instructions which, when executed, perform the steps of the method for identifying variant words in text and extracting sensitive words according to any one of claims 1-7.