CN113472686A - Information identification method, device, equipment and storage medium - Google Patents

Publication number
CN113472686A
Authority
CN
China
Prior art keywords
preset
identification
text content
text
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110761153.8A
Other languages
Chinese (zh)
Other versions
CN113472686B (en
Inventor
齐文杰
刘志诚
汪亚军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Lexin Software Technology Co Ltd
Original Assignee
Shenzhen Lexin Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Lexin Software Technology Co Ltd filed Critical Shenzhen Lexin Software Technology Co Ltd
Priority to CN202110761153.8A priority Critical patent/CN113472686B/en
Publication of CN113472686A publication Critical patent/CN113472686A/en
Application granted granted Critical
Publication of CN113472686B publication Critical patent/CN113472686B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses an information identification method, an information identification device, information identification equipment and a storage medium. The method comprises the following steps: converting network traffic data into text content; determining whether the text content meets a preset identification requirement according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content; and if so, performing identification processing for preset information on the text content based on a preset identification rule. By adopting this technical scheme, the embodiment of the invention can improve the timeliness of preset information identification, effectively reduce false alarms, and improve the operational efficiency of information leakage prevention.

Description

Information identification method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to an information identification method, an information identification device, information identification equipment and a storage medium.
Background
With the rapid development of internet technology, data interaction has become increasingly convenient. However, network traffic data may contain user sensitive information, which needs to be identified in a targeted manner to prevent network data leakage.
At present, many network data leakage prevention schemes at home and abroad generate a large number of false alarms, so that data leakage prevention operators are exhausted by handling false alarms and cannot focus on real data leakage events.
Therefore, existing information identification schemes based on network traffic data are still imperfect and need to be improved.
Disclosure of Invention
The embodiment of the invention provides an information identification method, an information identification device, information identification equipment and a storage medium, which can optimize existing information identification schemes based on network traffic data.
In a first aspect, an embodiment of the present invention provides an information identification method, including:
converting network traffic data into text content;
determining whether the text content meets a preset identification requirement according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content;
and if so, performing identification processing for preset information on the text content based on a preset identification rule.
In a second aspect, an embodiment of the present invention provides an information identification apparatus, including:
a text conversion module, configured to convert network traffic data into text content;
an identification requirement judging module, configured to determine whether the text content meets a preset identification requirement according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content;
and an identification processing module, configured to perform identification processing for preset information on the text content based on a preset identification rule when the judgment result of the identification requirement judging module is that the requirement is met.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the information identification method according to the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the information identification method provided by the embodiment of the present invention.
According to the information identification scheme provided by the embodiment of the invention, network traffic data is converted into text content, whether the text content meets the preset identification requirement is determined according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content, and if so, identification processing for the preset information is performed on the text content based on the preset identification rule. With this technical scheme, before the preset information contained in the network traffic is identified, whether the text content meets the preset identification requirement is first judged from a plurality of dimensions, and identification processing is performed only if the requirement is met. This reduces unnecessary identification operations, improves the timeliness of preset information identification, effectively reduces false alarms, and improves the operational efficiency of information leakage prevention.
Drawings
Fig. 1 is a schematic flowchart of an information identification method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another information identification method according to an embodiment of the present invention;
Fig. 3 is a block diagram of an information identification apparatus according to an embodiment of the present invention;
Fig. 4 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Fig. 1 is a schematic flow chart of an information identification method according to an embodiment of the present invention. The method is applicable to scenarios in which preset information in network traffic is identified, and specifically to privacy protection scenarios in which leakage of the preset information is to be prevented. The preset information may be preset sensitive information, such as user sensitive information, which may include, but is not limited to, a user name, an identification number, a bank card number, a telephone number, a mobile phone number, a personal address, a family relationship, an educational background, a gender, a school name, and the like.
As shown in fig. 1, the method includes:
Step 101, converting the network traffic data into text content.
The network traffic data can be understood as data transmitted over a network. In the embodiment of the invention, the specific source of the network traffic data is not limited: it may be imported from the outside, actively acquired, and so on. After the network traffic data is obtained, it may be converted into text content. The specific conversion manner is not limited, and the amount of network traffic data converted at a time may be set according to the actual situation.
For example, different network traffic data may correspond to different network protocols. A target protocol type corresponding to the network traffic data that currently needs text conversion may be identified, the network traffic data may be parsed based on the target protocol type to obtain the actual transmission content, and the transmission content may then be converted into text content. Optionally, for convenience of identification, different types of transmission content or transmitted files may be converted into text content in a unified format. The unified format may be set according to the actual situation; for example, it may be the UTF-8 format, where UTF stands for Unicode Transformation Format and 8 indicates 8-bit code units.
For example, when a large amount of network traffic data is acquired at a time, for example when the network traffic data includes a file, the file may be converted into file content in the unified format, the file content may be divided to obtain a plurality of text contents, and each text content may be identified one by one.
Step 102, determining whether the text content meets a preset identification requirement according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content.
For example, the total text length may be the total number of characters contained in the text content. The preset character number may be the number of preset characters in the text content; the preset characters may be set according to actual identification requirements and may include, for example, American Standard Code for Information Interchange (ASCII) characters and uncommon Chinese characters, and may further include other characters, which is not specifically limited. ASCII is a character encoding system based on the Latin alphabet, mainly used to display modern English and other Western European languages. Uncommon Chinese characters may be defined with reference to GB 2312-1980, a national standard issued by the national standards administration of China and implemented on May 1, 1981. GB 2312 includes 6763 Chinese characters, which cover 99.75% of the usage frequency in mainland China, so Chinese characters outside GB 2312 may be regarded as uncommon Chinese characters. The word segmentation number may be understood as the number of word segments obtained after word segmentation is performed on the text content; the specific word segmentation manner is not limited and may, for example, be based on Natural Language Processing (NLP) technology. The named entity number may be understood as the number of named entities identified based on Named Entity Recognition (NER) technology.
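The character-level dimensions described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names are hypothetical, the restriction of "Chinese character" to the basic CJK block U+4E00 to U+9FFF is an assumption, and "uncommon" is approximated as "cannot be encoded with Python's `gb2312` codec". The word segmentation number and the named entity number would come from an NLP toolkit and are not computed here.

```python
def is_uncommon_chinese(ch: str) -> bool:
    """Return True for a CJK ideograph that lies outside GB 2312.

    Assumption: only the basic CJK block (U+4E00..U+9FFF) is treated
    as "Chinese characters" for this check.
    """
    if not "\u4e00" <= ch <= "\u9fff":
        return False
    try:
        ch.encode("gb2312")
        return False  # encodable, so it is one of the GB 2312 characters
    except UnicodeEncodeError:
        return True


def count_character_features(text: str) -> dict:
    """Count the character-level dimensions of one text content."""
    return {
        "total_length": len(text),
        "ascii_count": sum(1 for ch in text if ch.isascii()),
        "uncommon_chinese_count": sum(1 for ch in text if is_uncommon_chinese(ch)),
    }


features = count_character_features("abc中文龘")
print(features)
```

In this sketch the three returned counts correspond to the total text length and the two kinds of preset characters named in the description.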
For example, the preset identification requirement may be set in combination with the above dimensions. The preset identification requirement can be used to preliminarily estimate the probability that the text content contains the preset information, and also the probability that the text content is garbled text or the proportion of garbled text it contains. If the probability that the text content contains the preset information is high, subsequent identification can be performed to determine accurately whether the text content contains the preset information; if the probability is low, subsequent identification can be considered unnecessary. In this way unnecessary identification operations are reduced, the timeliness of preset information identification is improved, and false alarms are effectively reduced. If the probability that the text content is garbled text is high, or the proportion of garbled text it contains is high, the text content may be encrypted ciphertext or other meaningless garbled text. Such content has poor readability, is unlikely to carry plaintext sensitive information, and generally does not involve leakage of sensitive information, so subsequent identification can be considered unnecessary; otherwise, subsequent identification can be considered necessary. The preset identification requirement may be set according to the actual situation, such as the source of the network traffic data, and is not specifically limited. This step can be regarded as a process of denoising the file content, that is, text content with a high probability of containing the preset information is screened out of the whole file content for subsequent accurate identification.
Optionally, in addition to the above dimensions, the preset identification requirement may be set in combination with other dimensions, which may include, for example, parts of speech, word senses, document encoding methods, keywords, and the like.
Step 103, if so, performing identification processing for the preset information on the text content based on a preset identification rule.
Illustratively, once the preset identification requirement is determined to be met, noise reduction has been completed, and fine-grained accurate identification can be performed to determine accurately whether the text content contains the preset information. It should be noted that, in this step, the identification processing for the preset information may be performed on all or part of the text content. For example, if a small amount of garbled text or other unreadable characters exists in the text content, the identification processing may be performed only on the remaining characters, so as to further improve identification efficiency.
Optionally, if the text content does not contain the preset information, it is determined that the current text content does not involve user sensitive information, and subsequent operations such as alarm condition judgment need not be performed.
Optionally, after the identification processing for the preset information is performed on the text content based on the preset identification rule, the method further includes: if the text content is determined to contain the preset information, judging whether a preset alarm condition is met, and if so, performing alarm processing on the network traffic data. The advantage of this arrangement is that, when the preset information is determined to be present, sensitive information may have been leaked and an alarm judgment is needed; when the alarm condition is met, alarm processing can be performed in time, a desensitized alarm can be raised for the event, and operators can be notified in real time to respond to the event, so that leakage of sensitive information is prevented. The alarm condition may be set according to the actual situation, for example in consideration of the type or the quantity of the preset information. Optionally, the file content may be taken as a unit, and a comprehensive judgment may be made based on how the preset information is distributed across all text contents in the file content, so as to determine whether to trigger the alarm.
The information identification method provided by the embodiment of the invention converts network traffic data into text content, determines whether the text content meets the preset identification requirement according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content, and, if so, performs identification processing for the preset information on the text content based on the preset identification rule. With this technical scheme, before the preset information contained in the network traffic is identified, whether the text content meets the preset identification requirement is first judged from a plurality of dimensions, and identification processing is performed only if the requirement is met. This reduces unnecessary identification operations, improves the timeliness of preset information identification, effectively reduces false alarms, and improves the operational efficiency of information leakage prevention.
In some embodiments, the preset characters include ASCII characters and uncommon Chinese characters. Determining whether the text content meets the preset identification requirement according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content includes: determining whether the text content meets the preset identification requirement according to at least one of the following items: the ratio of the number of ASCII characters to the total text length, the ratio of the number of uncommon Chinese characters to a first preset value, the ratio of the number of word segments to the first preset value, and the number of named entities, where the first preset value is the difference between the total text length and the number of ASCII characters. The advantage of this arrangement is that the preset identification requirement can be set more reasonably and data noise reduction can be performed accurately. Optionally, a corresponding threshold may be set for one or more of the above items, and whether the text content meets the preset identification requirement is then determined according to the result of comparing the value of each item with its threshold.
In some embodiments, the text content is determined to meet the preset identification requirement when at least one of the following items is satisfied: first, the ratio of the number of ASCII characters to the total text length is greater than a first preset threshold, and the number of named entities is greater than a second preset threshold; second, the ratio of the number of uncommon Chinese characters to the first preset value is smaller than a third preset threshold, and the ratio of the number of word segments to the first preset value is smaller than a fourth preset threshold. The advantage of this arrangement is that the preset identification requirement can be set more reasonably and data noise reduction can be performed accurately. The first, second, third and fourth preset thresholds may be set according to the actual situation, and their specific values are not limited.
For example, when the first item is satisfied, the proportion of ASCII characters is large and the number of named entities is large, so the text content is considered likely to contain the preset information. When the second item is satisfied, the proportion of uncommon Chinese characters is small and the proportion of word segments is small, so the text content is considered unlikely to contain garbled text and likely to contain the preset information. Since the two items evaluate the likelihood of the preset information from different angles, they may be used alternatively or in combination according to the actual situation. For example, the text content may be considered to meet the preset identification requirement when the first item or the second item is satisfied. As another example, the text content may be considered to meet the preset identification requirement only when the first item and the second item are satisfied simultaneously, that is, the text content is considered not to meet the requirement when either item is not satisfied.
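The two items and their alternative or combined use can be sketched as follows. This is a minimal sketch under stated assumptions: the threshold values are hypothetical placeholders (the description leaves them open), the names `Thresholds` and `meets_identification_requirement` are illustrative, and treating the second item as not satisfied when the first preset value is zero is an assumption made here to avoid division by zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All four values are hypothetical placeholders, not values from the source.
    ascii_ratio: float = 0.6     # first preset threshold
    entity_count: int = 1        # second preset threshold
    uncommon_ratio: float = 0.1  # third preset threshold
    segment_ratio: float = 0.8   # fourth preset threshold


def meets_identification_requirement(
    total_length: int,
    ascii_count: int,
    uncommon_count: int,
    segment_count: int,
    entity_count: int,
    thresholds: Thresholds = Thresholds(),
    require_both: bool = False,
) -> bool:
    """Evaluate the first and second items and combine them (OR by default)."""
    # First item: high ASCII proportion and enough named entities.
    first = (
        total_length > 0
        and ascii_count / total_length > thresholds.ascii_ratio
        and entity_count > thresholds.entity_count
    )
    # First preset value: total text length minus the number of ASCII characters.
    first_preset_value = total_length - ascii_count
    # Second item: few uncommon Chinese characters and few word segments,
    # both relative to the first preset value (assumed unmet if it is zero).
    second = (
        first_preset_value > 0
        and uncommon_count / first_preset_value < thresholds.uncommon_ratio
        and segment_count / first_preset_value < thresholds.segment_ratio
    )
    return (first and second) if require_both else (first or second)


print(meets_identification_requirement(100, 80, 0, 30, 3))
```

Setting `require_both=True` corresponds to the combined use of the two items, while the default corresponds to the alternative use.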
In some embodiments, the preset information includes a name. Performing identification processing for the preset information on the text content based on the preset identification rule includes: judging, by using named entity recognition technology, whether the word segments obtained after word segmentation of the text content contain a person entity, and if so, determining that the text content contains the preset information, where the word segmentation is performed using NLP technology. The advantage of this arrangement is that a user's name can be identified more accurately than in the related art, which relies only on matching surname keywords.
In the related art, Chinese names are generally identified by matching character strings that begin with a surname. Because there are more than 5000 Chinese surnames and only 6763 commonly used Chinese characters (GB2312), a large number of ordinary Chinese sentences contain surname characters. Concluding that a sentence contains a Chinese name merely because a surname keyword matches is therefore inaccurate and prone to false alarms. In the embodiment of the invention, NLP technology may first be used to segment the text content, NER processing is then performed on each word segment, and whether a person (PERSON) entity is identified is judged directly. If a PERSON entity is identified, it can be accurately determined that the text content contains a name.
In some embodiments, the preset information includes a preset character string, wherein the preset character string includes at least one of an identification number, a bank card number, a telephone number and a mobile phone number; the identification processing aiming at the preset information is carried out on the text content based on the preset identification rule, and the identification processing comprises the following steps: identifying target content in the text content by using a regular expression corresponding to preset information, judging whether adjacent characters of the target content are numbers or letters, and if not, determining that the text content contains the preset information. This has the advantage that strings representing user sensitive information can be identified more accurately.
In the related art, preset character strings are generally identified using regular expressions alone. However, many character strings containing numbers, such as the digits of pi or a timestamp, may match the regular expression, so a large number of false alarms may occur. In the embodiment of the invention, a judgment on adjacent characters is added on top of the regular expression, which can greatly improve identification accuracy and reduce the false alarm rate. A regular expression may be set for each type of preset character string, the target content can be understood as the content matched by the regular expression, and the adjacent characters may include the characters immediately to the left and/or right of the target content.
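The combination of a regular expression with an adjacent-character check can be illustrated with a short Python sketch. The pattern below for an 11-digit mobile phone number is an assumption chosen for illustration (the description does not give its regular expressions), the function names are hypothetical, and "numbers or letters" is interpreted here as ASCII digits and letters so that neighbouring Chinese characters or punctuation do not suppress a match.

```python
import re

# Hypothetical pattern for an 11-digit mobile phone number; the patent does
# not specify its regular expressions, so this is only an illustration.
MOBILE_PATTERN = re.compile(r"1[3-9][0-9]{9}")


def _is_ascii_alnum(ch):
    """True if ch is an ASCII digit or letter (None means no neighbour)."""
    return ch is not None and ch.isascii() and ch.isalnum()


def find_preset_strings(text, pattern=MOBILE_PATTERN):
    """Return regex matches whose left and right neighbours are not alphanumeric."""
    hits = []
    for match in pattern.finditer(text):
        left = text[match.start() - 1] if match.start() > 0 else None
        right = text[match.end()] if match.end() < len(text) else None
        # Reject matches embedded in a longer run of digits or letters,
        # such as the digits of pi or a millisecond timestamp.
        if not _is_ascii_alnum(left) and not _is_ascii_alnum(right):
            hits.append(match.group())
    return hits


print(find_preset_strings("tel: 13812345678, thanks"))
print(find_preset_strings("pi=3.14159265358979"))
print(find_preset_strings("ts=1625558400123"))
```

In the second and third inputs the pattern alone matches an 11-digit substring, but the match is followed by another digit, so the adjacent-character check discards it.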
In some embodiments, converting the network traffic data into text content includes: acquiring network traffic data; determining a target protocol type according to the traffic port number corresponding to the network traffic data; parsing the network traffic data based on the target protocol type, and identifying a target file contained in the network traffic data; performing content identification on the target file according to the file type corresponding to the target file, and converting the target file into file content in a preset text format according to the content identification result; and dividing the file content in a preset dividing manner to obtain text content. The preset dividing manner may be, for example, dividing by a preset number of sentences as a unit, where the preset number may be one or more and may be set freely.
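Two of these steps, choosing a protocol type from a port number and dividing file content by a preset number of sentences, can be sketched as below. This is an illustrative sketch only: the port table uses the conventional default ports of protocols named elsewhere in the description, which is an assumption about deployment, and the sentence-ending punctuation set and function names are hypothetical choices.

```python
import re

# Conventional default ports (an assumption; real deployments may differ).
PORT_TO_PROTOCOL = {21: "FTP", 25: "SMTP", 80: "HTTP", 445: "SMB"}


def protocol_for_port(port):
    """Map a traffic port number to a protocol tag, or "UNKNOWN"."""
    return PORT_TO_PROTOCOL.get(port, "UNKNOWN")


# A sentence is a run of characters up to an optional terminator.
# The terminator set is a hypothetical choice for this sketch.
_SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]?")


def divide_content(file_content, sentences_per_text=1):
    """Split file content into text contents of N sentences each."""
    sentences = [s for s in _SENTENCE.findall(file_content) if s.strip()]
    return [
        "".join(sentences[i:i + sentences_per_text])
        for i in range(0, len(sentences), sentences_per_text)
    ]


print(protocol_for_port(445))
print(divide_content("你好。今天天气好！Call me?", 2))
```

Each returned text content would then be checked against the preset identification requirement one by one.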
In some embodiments, dividing the file content in a preset dividing manner to obtain the text content includes: judging whether the ratio of the number of target type characters contained in the file content to the total number of characters of the file content exceeds a preset ratio value, and if so, dividing the file content in the preset dividing manner to obtain the text content, where the target type characters include ASCII characters and Chinese Internal Code Extension Specification (GBK) characters. The advantage of this arrangement is that a preliminary content screening can be performed before data noise reduction, so that file content that is very unlikely to contain the preset information is quickly filtered out, which further improves the timeliness of preset information identification and effectively reduces false alarms.
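This preliminary screening can be sketched in Python as follows. The sketch assumes that "GBK character" can be approximated as "encodable with Python's `gbk` codec"; the preset ratio value of 0.5 and the function names are hypothetical.

```python
def is_target_type(ch):
    """True for an ASCII character or a character encodable in GBK."""
    if ch.isascii():
        return True
    try:
        ch.encode("gbk")
        return True
    except UnicodeEncodeError:
        return False


def readable_ratio(file_content):
    """Ratio of target type characters to all characters (0.0 if empty)."""
    if not file_content:
        return 0.0
    return sum(is_target_type(ch) for ch in file_content) / len(file_content)


def should_divide(file_content, preset_ratio=0.5):
    """Decide whether the file content is readable enough to be divided.

    preset_ratio is a hypothetical value; the description leaves it open.
    """
    return readable_ratio(file_content) > preset_ratio


print(readable_ratio("abc中文"))
print(should_divide("a\ufffd\ufffd\ufffd"))
```

File content that fails this check is skipped before any per-sentence processing takes place.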
Fig. 2 is a schematic flow chart of another information identification method provided in an embodiment of the present invention, which is optimized based on the above optional embodiments, specifically, the method includes the following steps:
Step 201, obtaining network traffic data, determining a target protocol type according to the traffic port number corresponding to the network traffic data, parsing the network traffic data based on the target protocol type, and identifying a target file contained therein.
For example, a dedicated traffic access module may be configured to receive externally imported traffic information to obtain network traffic data. For example, communication traffic may be captured as network traffic data from a high-speed network egress or another network egress. For convenience of description, the embodiment of the present invention takes a single traffic acquisition as an example; in a specific implementation, parallel processing may be adopted, for example, communication traffic may be captured from a plurality of network egresses in parallel and subsequently identified in parallel. After the network traffic data is obtained, the corresponding network protocol may be identified first. Specifically, the network traffic data may be classified by protocol according to the traffic port; for example, a corresponding protocol tag may be determined according to the port number, and the protocol tag reflects the corresponding protocol type. Common protocol types may include, for example, the Server Message Block (SMB) protocol, File Transfer Protocol (FTP), Simple Mail Transfer Protocol (SMTP), and HyperText Transfer Protocol (HTTP).
For example, a protocol parsing module may be provided, where a plurality of protocol parsing sub-modules may be provided below the protocol parsing module, different protocol parsing sub-modules correspond to different protocol types, and after a target protocol type corresponding to current network traffic data is determined, the network traffic data may be parsed based on the protocol parsing sub-module corresponding to the target protocol type, and a target file and a corresponding target file type included in the network traffic data may be identified. For example, the recognition result may be a file type of pdf, xlsx, txt, docx, html, or binary, etc.
For example, when the flow returns to step 201 after the subsequent steps have been executed, this can be understood as continuing to execute the subsequent steps on new network traffic data. If new network traffic data can no longer be acquired, or the user actively turns off the identification function, the process may end.
Step 202, performing content identification on the target file according to the file type corresponding to the target file, and converting the target file into file content in a preset text format according to a content identification result.
For example, common file encodings may include UTF-8, ASCII, GBK, Unicode, GB2312, and the like, where GBK denotes the Chinese Internal Code Extension Specification. For convenience of identification, the file content may be uniformly converted into the UTF-8 text format for output, that is, the preset text format may be UTF-8. For files of types such as pdf, jpg, jpeg, or png, Optical Character Recognition (OCR) technology may be used for content identification.
Step 203, judging whether the ratio of the number of the target type characters contained in the file content to the total number of the characters of the file content exceeds a preset proportional value, if so, executing step 204; otherwise, return to execute step 201.
Illustratively, the target type characters can be understood as readable characters, specifically ASCII characters and GBK characters. If the file content contains a large proportion of readable characters, the subsequent identification can continue; if it contains few or no readable characters, the file can be preliminarily regarded as unreadable or garbled, the probability that it carries plaintext sensitive information is very low, and the process can return to step 201.
Step 204, dividing the file content by adopting a preset dividing mode to obtain a plurality of text contents.
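The readable-character check of step 203 could be sketched as follows. The treatment of ASCII control characters as unreadable, the use of GBK encodability as the test for a GBK character, and the preset ratio of 0.6 are all assumptions made for this example; the function names are hypothetical.

```python
def is_readable(ch: str) -> bool:
    """Assumed definition: printable ASCII (plus tab/newline) or a GBK character."""
    if ch.isascii():
        return ch.isprintable() or ch in "\t\n\r"
    try:
        ch.encode("gbk")
    except UnicodeEncodeError:
        return False
    return True


def readable_ratio(content: str) -> float:
    """Ratio of readable characters to the total number of characters."""
    if not content:
        return 0.0
    return sum(is_readable(ch) for ch in content) / len(content)


def passes_step_203(content: str, preset_ratio: float = 0.6) -> bool:
    """True if the content is readable enough to continue to step 204."""
    return readable_ratio(content) > preset_ratio
```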
Step 205, determining whether the text content meets a first preset identification requirement according to the total length of the text, the number of ASCII characters and the number of named entities corresponding to the text content, and if so, executing step 207; otherwise, step 206 is performed.
For example, the total text length may be denoted as s (generally greater than 5), the number of ASCII characters as s1, the number of uncommon Chinese characters as s2, the number of segmented words as s3 (0 < s3 <= s), and the number of named entities as s4.
The first preset identification requirement may be expressed as s1/s > X && s4 > 1, where X is a probability value indicating that the text contains readable information and may be recorded as the first preset threshold; its value may range, for example, from 0.5 to 0.8. In this expression the second preset threshold takes the value 1; other values, such as 2, may also be used. When the expression is true, the string is considered likely to contain sensitive content and subsequent identification can be performed, that is, step 207 is executed. When the expression is false, the second preset identification requirement is judged next.
Step 206, for the current text content, determining whether the text content meets a second preset identification requirement according to the total text length, the number of uncommon characters, and the number of segmented words corresponding to the text content; if so, executing step 207; otherwise, executing step 208.
The second preset identification requirement may be expressed as s2/(s-s1) < A && s3/(s-s1) < B, where A and B are probability values indicating that the text contains garbled characters and may be recorded as the third preset threshold and the fourth preset threshold, respectively. The values of A and B may differ between network traffic environments; generally, A may be set to 0.2 and B to 0.3. When the expression is false, the text segment may contain many uncommon characters and be poorly readable, so it is unlikely to contain sensitive information such as names, mobile phone numbers, addresses, or identity card numbers; no subsequent identification is needed and the next text content can be judged, that is, step 208 is executed. When the expression is true, the text can be considered to contain a larger share of readable plaintext and further identification is required, that is, step 207 is executed.
Alternatively, the complementary expression s2/(s-s1) >= A || s3/(s-s1) >= B may be used, in which case step 208 is executed when the expression is true and step 207 is executed when it is false.
It should be noted that the order of the determination of the first preset identification requirement and the determination of the second preset identification requirement may be interchanged, for example, first determining whether the second preset identification requirement is met, if so, executing step 207, otherwise, continuously determining whether the first preset identification requirement is met, if so, executing step 207, otherwise, executing step 208.
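The two judgments of steps 205 and 206 can be sketched as follows, using the symbols s, s1, s2, s3, and s4 defined above. The default thresholds follow the example values in the description (X chosen as 0.6 from the stated range 0.5 to 0.8, second threshold 1, A = 0.2, B = 0.3). The handling of empty text and of text with no non-ASCII characters (s - s1 = 0) is not specified in the description and is an assumption here, as are the function names.

```python
def meets_first_requirement(s: int, s1: int, s4: int,
                            x: float = 0.6, min_entities: int = 1) -> bool:
    """First requirement: s1/s > X and s4 > second preset threshold."""
    if s <= 0:
        return False  # assumption: empty text never qualifies
    return s1 / s > x and s4 > min_entities


def meets_second_requirement(s: int, s1: int, s2: int, s3: int,
                             a: float = 0.2, b: float = 0.3) -> bool:
    """Second requirement: s2/(s-s1) < A and s3/(s-s1) < B."""
    non_ascii = s - s1
    if non_ascii <= 0:
        return False  # assumption: undefined ratio is treated as not satisfied
    return s2 / non_ascii < a and s3 / non_ascii < b


def needs_identification(s: int, s1: int, s2: int, s3: int, s4: int) -> bool:
    """True means go to step 207; False means go to step 208."""
    return (meets_first_requirement(s, s1, s4)
            or meets_second_requirement(s, s1, s2, s3))
```

Because the result is a logical OR of the two requirements, evaluating them in either order gives the same outcome, which matches the remark that the order of the two judgments may be interchanged.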
Step 207, using a named entity recognition technique to judge whether the segmented words obtained by word segmentation of the current text content include a person entity; and identifying target content in the text content using a regular expression corresponding to the preset information, and judging whether the characters adjacent to the target content are digits or letters.
In this step, sensitive information identification can be performed on the current text content for names, identity card numbers, bank card numbers, telephone numbers, and mobile phone numbers.
(1) For the identification of Chinese names, the sentences of the current text content are first segmented into words using NLP, NER is then performed on each segmented word, and each recognized PERSON entity is taken directly as a name.
(2) For the identification of mobile phone numbers, currently disclosed technical schemes rely on a regular expression such as 1[0-9]{10}, which matches 11-digit domestic mobile phone numbers beginning with 1. The obvious problem is that any long digit string, such as an identity card number, the digits of pi, or a timestamp, may be regarded as containing a mobile phone number, causing a large number of false alarms. Through research on the patterns of sensitive information in a large amount of network traffic, the inventors found that the characters immediately before and after a mobile phone number are not letters or digits; combining this characteristic with the mobile phone number regular expression can greatly improve the accuracy of mobile phone number identification. The identification of telephone numbers is similar to that of mobile phone numbers.
(3) For the identification of identity card numbers, the verification method can be similar to that for mobile phone numbers. The regular expression (^[1-9]\d{5}(18|19|([23]\d))\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]$)|(^[1-9]\d{5}\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{2}$) can be used for matching, and it is then verified whether the two characters adjacent to the left and right of the matched target content are digits or letters; if so, the target content is determined not to be an identity card number, and if not, it is determined to be an identity card number.
(4) For the identification of bank card numbers, a scheme based on a single regular expression is inaccurate, because the debit card numbers and credit card numbers of each bank differ and have their own characteristics. The credit cards and debit cards of the banks to be identified can be collected and compiled into corresponding regular expressions, which are then used for extraction. There are currently about 10,000 financial card types in China, so a sufficiently large library of bank card regular expressions can be built for targeted identification. Finally, it is verified whether the two characters adjacent to the bank card number are digits or letters, so that accurate bank card numbers can be selected.
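The combination of a regular expression with the adjacent-character check described in items (2) and (3) can be sketched as follows. The mobile pattern is the one quoted above; the identity card pattern is an unanchored form of the 18-digit alternative so that it can be searched inside running text, which is an adaptation made for this example. The names PATTERNS and find_sensitive are hypothetical, and bank card patterns are omitted because they depend on a per-bank library.

```python
import re

PATTERNS = {
    "mobile": re.compile(r"1[0-9]{10}"),
    # Unanchored 18-digit identity card pattern (adapted for searching in text).
    "id_card": re.compile(
        r"[1-9][0-9]{5}(?:18|19|[23][0-9])[0-9]{2}(?:0[1-9]|10|11|12)"
        r"(?:[0-2][1-9]|10|20|30|31)[0-9]{3}[0-9Xx]"
    ),
}


def _is_ascii_alnum(ch: str) -> bool:
    # str.isalnum() is also True for Chinese characters, so restrict to ASCII.
    return ch.isascii() and ch.isalnum()


def find_sensitive(text: str, kind: str) -> list:
    """Return matches of the given kind whose neighbours are not letters/digits."""
    hits = []
    for m in PATTERNS[kind].finditer(text):
        before = text[m.start() - 1] if m.start() > 0 else ""
        after = text[m.end()] if m.end() < len(text) else ""
        if _is_ascii_alnum(before) or _is_ascii_alnum(after):
            continue  # part of a longer alphanumeric run: likely a false alarm
        hits.append(m.group())
    return hits
```

With this check, the digits of pi or an identity card number are no longer reported as mobile phone numbers, while a number delimited by punctuation, whitespace, or Chinese characters is still found.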
Step 208, judging whether unidentified text content exists, if so, returning to execute step 205; otherwise, step 209 is performed.
Illustratively, if there is an unrecognized text content, the next text content is taken as the current text content, and step 205 is executed.
Step 209, determining whether a preset alarm condition is met according to the recognition results of all the text contents, if so, executing step 210; otherwise, return to execute step 201.
It should be noted that, if the text content already identified contains a considerable amount of preset information before all the text contents have been identified, the preset alarm condition may also be judged in advance to improve the timeliness of the alarm, which is not limited in the embodiment of the present invention. Optionally, each time a text content has undergone identification processing for the preset information, whether the preset alarm condition is met is judged according to the identification results of the text contents identified so far, and if so, step 210 is executed; otherwise, the process returns to step 201, that is, the remaining unidentified text content need not be identified, and new network traffic data continues to be acquired for identification, thereby improving the overall identification efficiency.
Step 210, performing alarm processing on the network traffic data.
For example, an alarm processing mode may be set according to actual requirements, such as sending alarm information to related operators.
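The loop over text contents with an early alarm judgment can be sketched as follows. The alarm condition used here (a minimum number of identified items) is an assumption, since the embodiment leaves the preset alarm condition open; the function and parameter names are hypothetical, and the two callables stand in for the requirement judgment of steps 205 and 206 and the identification of step 207.

```python
from typing import Callable, Iterable, List


def scan_text_contents(
    text_contents: Iterable[str],
    needs_identification: Callable[[str], bool],
    identify: Callable[[str], List[str]],
    alarm_threshold: int = 3,
) -> bool:
    """Return True as soon as the assumed alarm condition is met, else False."""
    hits: List[str] = []
    for text in text_contents:
        if not needs_identification(text):
            continue  # step 208: move on to the next text content
        hits.extend(identify(text))  # step 207
        if len(hits) >= alarm_threshold:
            return True  # alarm judged early; remaining contents are skipped
    return False
```

Returning as soon as the threshold is reached means later text contents are never passed to the identification callable, which is where the efficiency gain described above comes from.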
The information identification method provided by the embodiment of the invention is based on natural language processing and named entity recognition, and combines technical features such as garbled-text identification, encoding identification, and the characteristics of Chinese names, telephone and mobile phone numbers, identity card numbers, and bank card and credit card numbers. Applied to the field of sensitive content identification in network data, it can reduce unnecessary identification operations, improve the timeliness of preset information identification, effectively reduce false alarms, and greatly improve the operating efficiency of data security operators in preventing network data leakage.
Fig. 3 is a block diagram of an information recognition apparatus according to an embodiment of the present invention, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in a computer device such as a server, and may perform information recognition by executing an information recognition method. As shown in fig. 3, the apparatus includes:
a text conversion module 301, configured to convert the network traffic data into text content;
the recognition requirement judging module 302 is configured to determine whether the text content meets a preset recognition requirement according to the total text length, the number of preset characters, the number of segmented words, and the number of named entities corresponding to the text content;
and the identification processing module 303 is configured to perform identification processing on the text content according to a preset identification rule based on preset information when the determination result of the identification requirement determining module is satisfied.
The information recognition device provided by the embodiment of the invention converts the network traffic data into text content, then determines whether the text content meets the preset recognition requirement according to the total text length, the number of preset characters, the number of segmented words, and the number of named entities corresponding to the text content, and if so, performs recognition processing for the preset information on the text content based on the preset recognition rule. With this technical scheme, before the preset information contained in the network traffic is identified, whether the text content meets the preset recognition requirement is first judged from multiple dimensions, and identification processing is performed only if it does. This reduces unnecessary identification operations, improves the timeliness of preset information identification, effectively reduces false alarms, and improves the operating efficiency of information leakage prevention.
Optionally, the preset characters comprise American Standard Code for Information Interchange (ASCII) characters and uncommon Chinese characters;
the identification requirement determining module is specifically configured to:
determining whether the text content meets a preset identification requirement according to at least one of the following items:
a ratio of the number of ASCII characters to the total text length, a ratio of the number of uncommon Chinese characters to a first preset value, a ratio of the number of segmented words to the first preset value, and the number of named entities, wherein the first preset value is the difference between the total text length and the number of ASCII characters.
Optionally, the identification requirement determining module is specifically configured to: determining that the text content meets a preset identification requirement when at least one of the following is met:
the ratio of the number of ASCII characters to the total text length is greater than a first preset threshold, and the number of named entities is greater than a second preset threshold; and the ratio of the number of uncommon Chinese characters to the first preset value is smaller than a third preset threshold, and the ratio of the number of segmented words to the first preset value is smaller than a fourth preset threshold.
Optionally, the preset information includes a name; the identification processing aiming at the preset information is carried out on the text content based on the preset identification rule, and the identification processing comprises the following steps:
judging, by using a named entity recognition technique, whether the segmented words obtained after word segmentation of the text content contain a person entity, and if so, determining that the text content contains the preset information; wherein the word segmentation is performed using a Natural Language Processing (NLP) technique.
Optionally, the preset information includes a preset character string, where the preset character string includes at least one of an identification number, a bank card number, a telephone number, and a mobile phone number; the identification processing aiming at the preset information is carried out on the text content based on the preset identification rule, and the identification processing comprises the following steps:
identifying target content in the text content by using a regular expression corresponding to preset information, judging whether adjacent characters of the target content are numbers or letters, and if not, determining that the text content contains the preset information.
Optionally, the text conversion module is specifically configured to:
acquiring network flow data;
determining a target protocol type according to a flow port number corresponding to the network flow data;
analyzing the network flow data based on the target protocol type, and identifying a target file contained in the network flow data;
performing content identification on the target file according to the file type corresponding to the target file, and converting the target file into file content in a preset text format according to a content identification result;
dividing the file content by adopting a preset dividing mode to obtain text content;
after the text content is subjected to recognition processing aiming at preset information based on a preset recognition rule, the method further comprises the following steps:
if the text content is determined to contain preset information, judging whether preset alarm conditions are met, and if so, carrying out alarm processing on the network traffic data.
Optionally, the dividing the file content by using a preset dividing manner to obtain the text content includes:
judging whether the ratio of the number of target type characters contained in the file content to the total number of characters of the file content exceeds a preset ratio value, and if so, dividing the file content in a preset dividing mode to obtain text content, wherein the target type characters comprise ASCII characters and Chinese Internal Code Extension Specification (GBK) characters.
The embodiment of the invention provides computer equipment, wherein the information identification device provided by the embodiment of the invention can be integrated in the computer equipment. Fig. 4 is a block diagram of a computer device according to an embodiment of the present invention. The computer device 400 may include: a memory 401, a processor 402 and a computer program stored on the memory 401 and executable by the processor 402, wherein the processor 402 executes the computer program to implement the information identification method according to the embodiment of the present invention.
According to the computer equipment provided by the embodiment of the invention, before the preset information contained in the network flow is identified, whether the text content meets the preset identification requirement is judged from multiple dimensions, and if the text content meets the preset identification requirement, the identification processing is carried out, so that unnecessary identification operation can be reduced, the timeliness of the identification of the preset information is improved, the false alarm can be effectively reduced, and the operation efficiency of information leakage prevention is improved.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method for information identification, the method including:
converting the network flow data into text content;
determining whether the text content meets a preset identification requirement or not according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content;
and if so, carrying out recognition processing aiming at preset information on the text content based on a preset recognition rule.
Storage medium: any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory or magnetic media (e.g., a hard disk), or optical storage; registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in a first computer system in which the program is executed, or may be located in a different second computer system connected to the first computer system through a network (such as the internet). The second computer system may provide program instructions to the first computer system for execution. The term "storage medium" may include two or more storage media that may reside in different locations, such as in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the information identification operation described above, and may also perform related operations in the information identification method provided by any embodiment of the present invention.
The information identification device, the equipment, and the storage medium provided in the above embodiments can execute the information identification method provided in any embodiment of the present invention, and have the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the information identification method provided in any embodiment of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. An information identification method, comprising:
converting the network flow data into text content;
determining whether the text content meets a preset identification requirement or not according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content;
and if so, carrying out recognition processing aiming at preset information on the text content based on a preset recognition rule.
2. The method of claim 1, wherein the preset characters include American Standard Code for Information Interchange (ASCII) characters and uncommon Chinese characters;
determining whether the text content meets a preset identification requirement according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content, including:
determining whether the text content meets a preset identification requirement according to at least one of the following items:
a ratio of the number of ASCII characters to the total text length, a ratio of the number of uncommon Chinese characters to a first preset value, a ratio of the number of segmented words to the first preset value, and the number of named entities, wherein the first preset value is the difference between the total text length and the number of ASCII characters.
3. The method of claim 2, wherein the text content is determined to meet a preset identification requirement when at least one of:
the ratio of the number of ASCII characters to the total text length is greater than a first preset threshold, and the number of named entities is greater than a second preset threshold; and
the ratio of the number of uncommon Chinese characters to the first preset value is smaller than a third preset threshold, and the ratio of the number of segmented words to the first preset value is smaller than a fourth preset threshold.
4. The method of claim 1, wherein the preset information includes a name; the identification processing aiming at the preset information is carried out on the text content based on the preset identification rule, and the identification processing comprises the following steps:
judging, by using a Named Entity Recognition (NER) technique, whether the segmented words obtained after word segmentation of the text content include a person entity, and if so, determining that the text content includes the preset information; wherein the word segmentation is performed using a Natural Language Processing (NLP) technique.
5. The method of claim 1, wherein the predetermined information comprises a predetermined string, wherein the predetermined string comprises at least one of an identification number, a bank card number, a telephone number, and a cell phone number; the identification processing aiming at the preset information is carried out on the text content based on the preset identification rule, and the identification processing comprises the following steps:
identifying target content in the text content by using a regular expression corresponding to preset information, judging whether adjacent characters of the target content are numbers or letters, and if not, determining that the text content contains the preset information.
6. The method of any of claims 1-5, wherein converting the network traffic data into textual content comprises:
acquiring network flow data;
determining a target protocol type according to a flow port number corresponding to the network flow data;
analyzing the network flow data based on the target protocol type, and identifying a target file contained in the network flow data;
performing content identification on the target file according to the file type corresponding to the target file, and converting the target file into file content in a preset text format according to a content identification result;
dividing the file content by adopting a preset dividing mode to obtain text content;
after the text content is subjected to recognition processing aiming at preset information based on a preset recognition rule, the method further comprises the following steps:
if the text content is determined to contain preset information, judging whether preset alarm conditions are met, and if so, carrying out alarm processing on the network traffic data.
7. The method according to claim 6, wherein the dividing the file content by a preset dividing manner to obtain text content comprises:
judging whether the ratio of the number of target type characters contained in the file content to the total number of characters of the file content exceeds a preset ratio value, and if so, dividing the file content in a preset dividing mode to obtain text content, wherein the target type characters comprise ASCII characters and Chinese Internal Code Extension Specification (GBK) characters.
8. An information identifying apparatus, comprising:
the text conversion module is used for converting the network flow data into text contents;
the recognition requirement judging module is used for determining whether the text content meets preset recognition requirements or not according to the total text length, the preset character number, the word segmentation number and the named entity number corresponding to the text content;
and the identification processing module is used for carrying out identification processing on the text content aiming at preset information based on a preset identification rule when the judgment result of the identification requirement judging module is satisfied.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110761153.8A 2021-07-06 2021-07-06 Information identification method, device, equipment and storage medium Active CN113472686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110761153.8A CN113472686B (en) 2021-07-06 2021-07-06 Information identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110761153.8A CN113472686B (en) 2021-07-06 2021-07-06 Information identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113472686A true CN113472686A (en) 2021-10-01
CN113472686B CN113472686B (en) 2024-03-08

Family

ID=77878400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110761153.8A Active CN113472686B (en) 2021-07-06 2021-07-06 Information identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113472686B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048907A (en) * 2022-05-31 2022-09-13 北京深言科技有限责任公司 Text data quality determination method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209892A (en) * 2019-04-17 2019-09-06 深圳壹账通智能科技有限公司 Sensitive information recognition methods, device, electronic equipment and storage medium
CN110222170A (en) * 2019-04-25 2019-09-10 平安科技(深圳)有限公司 A kind of method, apparatus, storage medium and computer equipment identifying sensitive data
CN111539206A (en) * 2020-04-27 2020-08-14 中国银行股份有限公司 Method, device and equipment for determining sensitive information and storage medium
US20200336501A1 (en) * 2019-04-19 2020-10-22 Microsoft Technology Licensing, Llc Sensitive data detection in communication data
CN112434331A (en) * 2020-11-20 2021-03-02 百度在线网络技术(北京)有限公司 Data desensitization method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048907A (en) * 2022-05-31 2022-09-13 北京深言科技有限责任公司 Text data quality determination method and device
CN115048907B (en) * 2022-05-31 2024-02-27 北京深言科技有限责任公司 Text data quality determining method and device

Also Published As

Publication number Publication date
CN113472686B (en) 2024-03-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant