CN112613031A - Data stream detection method and device - Google Patents

Info

Publication number
CN112613031A
Authority
CN
China
Prior art keywords
dlp
data stream
nlp
data
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011350261.8A
Other languages
Chinese (zh)
Inventor
杜满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou H3C Technologies Co Ltd
New H3C Technologies Co Ltd
Original Assignee
Hangzhou H3C Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou H3C Technologies Co Ltd filed Critical Hangzhou H3C Technologies Co Ltd
Priority to CN202011350261.8A priority Critical patent/CN112613031A/en
Publication of CN112613031A publication Critical patent/CN112613031A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/554 Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present specification provides a method and an apparatus for detecting a data stream. The method comprises: acquiring a collected sample of a data stream; processing the collected sample using natural language processing (NLP) to obtain a preprocessed text; sending the preprocessed text to a data leakage prevention (DLP) module and obtaining a detection result output by the DLP; and judging, according to the detection result, whether the data stream is a leaked data stream. The method avoids information leakage that would otherwise go undetected because the data contain metaphorical words.

Description

Data stream detection method and device
Technical Field
The present disclosure relates to the field of information security, and in particular, to a method and an apparatus for detecting a data stream.
Background
With the arrival of the age of comprehensive enterprise informatization, all business data are processed by systems that quickly produce the documents an enterprise needs, such as drawings, code, contracts and client data, and these documents become core assets of the enterprise. In the face of intense market competition, protecting these core data is of great importance. To protect them, most enterprises adopt traditional security devices (such as firewalls, IDS and IPS) to prevent hackers and the like from intruding and stealing the enterprise's internal core data; these security systems mainly guard against external access. Among leakage incidents, however, leaks originating inside the enterprise pose a greater threat than external attacks and cause more damage and cost.
Existing schemes do address internal data leakage, namely the DLP (data leakage prevention) solutions widely used in the market. Their core is content identification, which is mainly realized by means of keywords, regular expressions, database fingerprints and the like. These means can recognize most sensitive data, but only data that are static and have fixed characteristic attributes, so recognition is limited and lacks intelligence and accuracy. Metaphorical words in a document, or words whose meaning is clear only in context, cannot be accurately detected by these means.
Disclosure of Invention
The embodiments of this specification provide a method and an apparatus for detecting a data stream, which avoid information leakage caused by metaphorical words contained in the data.
An embodiment of the present specification provides a method for detecting a data stream, where the method includes:
acquiring a collected sample of a data stream, and processing the collected sample using natural language processing (NLP) to obtain a preprocessed text;
sending the preprocessed text to data leakage prevention (DLP), and obtaining a detection result output by the DLP;
and judging whether the data stream is a leaked data stream according to the detection result.
Optionally, acquiring the collected sample of the data stream and processing the collected sample using NLP to obtain the preprocessed text specifically includes:
collecting, through the DLP, a data stream sent by a client to obtain a pre-collected sample;
performing data restoration on the pre-collected sample to obtain the collected sample;
and calling the NLP by the DLP, so that the NLP processes the collected sample to obtain the preprocessed text.
Optionally, the DLP calling the NLP specifically includes: the DLP calls the NLP through a REST API.
Specifically, the preprocessed text includes high-frequency words.
Specifically, sending the preprocessed text to the DLP and obtaining the detection result output by the DLP specifically includes:
the DLP performs leakage prevention detection according to the high-frequency words in the preprocessed text and outputs the detection result.
Through this embodiment, the collected samples of the data stream can be deeply analyzed by the DLP together with the NLP, so that information leakage caused by metaphorical words contained in the data can be avoided.
An embodiment of the present specification further provides a device for detecting a data stream, where the device includes:
an acquisition module, configured to acquire a collected sample of the data stream;
a processing module, configured to process the collected sample using natural language processing (NLP) to obtain a preprocessed text;
the processing module being further configured to send the preprocessed text to data leakage prevention (DLP) and obtain a detection result output by the DLP;
and a judging module, configured to judge whether the data stream is a leaked data stream according to the detection result.
Optionally, the acquisition module is specifically configured to collect, through the DLP, a data stream sent by a client to obtain a pre-collected sample;
perform data restoration on the pre-collected sample to obtain the collected sample;
and have the DLP call the NLP, so that the NLP processes the collected sample to obtain the preprocessed text.
Specifically, the processing module is specifically configured such that the DLP calls the NLP through a REST API.
Specifically, the preprocessed text includes high-frequency words.
Specifically, the judging module is specifically configured such that the DLP performs leakage prevention detection according to the high-frequency words in the preprocessed text and outputs the detection result.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a schematic model diagram of a DLP module according to an embodiment of the present disclosure;
Fig. 2 is a schematic flowchart of a data stream detection method according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a model for text classification by NLP according to an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a model for performing emotion analysis by NLP according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "at the time of …", "when …" or "in response to a determination", depending on the context.
As shown in Fig. 1, a data leakage prevention (DLP) module is deployed in bypass mode at the boundary of an enterprise network and is used to detect data streams sent by clients. In operation, devices such as switches and firewalls mirror the data stream sent by a client to the DLP module for data restoration and data analysis (the DLP receives the mirrored data stream as network data and performs data restoration on it to restore the network data to the original data).
During data analysis, content identification is mainly keyword-based: identification succeeds when the analyzed text content contains a keyword. However, an actual leaker may insert separators into a keyword to evade detection. For example, if the keyword is "transaction amount" and the text content is "a transaction amount of 300 million was completed this year", the text is detected as sensitive data; if the leaker modifies the text to "a transaction% amount of 300 million was completed this year", it can no longer be detected.
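The evasion described above can be illustrated with a minimal Python sketch. It assumes that keyword identification is plain substring matching; the function name `contains_keyword` and the English sample sentences are illustrative assumptions, not part of the original disclosure.

```python
from typing import List


def contains_keyword(text: str, keywords: List[str]) -> bool:
    """Plain substring matching, standing in for keyword-based content identification."""
    return any(keyword in text for keyword in keywords)


keywords = ["transaction amount"]
original = "a transaction amount of 300 million was completed this year"
evasive = "a transaction% amount of 300 million was completed this year"

print(contains_keyword(original, keywords))  # True: the keyword appears verbatim
print(contains_keyword(evasive, keywords))   # False: the inserted "%" breaks the match
```

A single inserted character is enough to defeat the match, which is the weakness the method below is meant to address.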
To solve the above technical problem, an embodiment of the present specification provides a method for detecting a data stream, as shown in fig. 2, the method includes:
S201, acquiring a collected sample of a data stream, and processing the collected sample using natural language processing (NLP) to obtain a preprocessed text;
S202, sending the preprocessed text to data leakage prevention (DLP), and obtaining a detection result output by the DLP;
S203, judging whether the data stream is a leaked data stream according to the detection result.
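The three steps can be read as a small pipeline. The following Python sketch only shows that control flow; `detect_data_stream`, `toy_nlp` and `toy_dlp` are hypothetical names, the two toy functions are placeholders for the real NLP and DLP components, and treating a `"high"` result as a leak is an assumption made for illustration.

```python
import re
from typing import Callable, List


def detect_data_stream(sample: str,
                       nlp: Callable[[str], List[str]],
                       dlp: Callable[[List[str]], str]) -> bool:
    """Run S201 to S203 and report whether the stream is judged to be leaked."""
    preprocessed_text = nlp(sample)            # S201: NLP produces the preprocessed text
    detection_result = dlp(preprocessed_text)  # S202: DLP outputs a detection result
    return detection_result == "high"          # S203: judge according to the result


def toy_nlp(sample: str) -> List[str]:
    """Placeholder NLP: strip punctuation-like separators, then look up one known term."""
    cleaned = re.sub(r"[^\w\s]", "", sample.lower())
    cleaned = re.sub(r"\s+", " ", cleaned)
    return [term for term in ("transaction amount",) if term in cleaned]


def toy_dlp(words: List[str]) -> str:
    """Placeholder DLP: any recognized term yields a high-risk result."""
    return "high" if words else "none"


leaked = detect_data_stream(
    "A transaction% amount of 300 million was completed this year", toy_nlp, toy_dlp)
print(leaked)
```

Passing the NLP and DLP stages in as callables keeps the sketch independent of how either component is actually implemented.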
In step S201, when the sample is processed by NLP, text classification, emotion analysis, and intention recognition may be performed on the sample.
Specifically, text classification includes text preprocessing and feature extraction, after which the text is classified according to the extracted features.
After the text classification result is obtained, emotion analysis is performed. Emotion analysis mainly includes the following steps: sentence input, automatic word segmentation, emotion dictionary matching and emotion classification.
After the emotion classification is obtained, intention recognition is performed. Intention recognition is mainly based either on a dictionary and template rules or on a classification model.
As shown in fig. 3, the main process of text classification includes:
a. When the collected sample is classified, it is first preprocessed by means of a training set. The preprocessing includes dictionary-based Chinese word segmentation, mid-level Chinese word segmentation and stop word removal.
b. After preprocessing succeeds, features are extracted from the preprocessed information, for example with a bag-of-words model using one-hot representation, TF-IDF text feature extraction, or feature extraction based on word vectors.
c. The preprocessed information is classified by a KNN classification model, a fastText deep learning model or a TextCNN deep learning model.
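Steps a to c can be sketched in plain Python for one of the listed combinations, TF-IDF features with a KNN classifier. This is a minimal illustration under stated assumptions: whitespace tokenization of English text stands in for Chinese word segmentation, the stop-word list, the smoothed IDF formula, the cosine similarity measure and the tiny training set are all hypothetical choices, and none of the function names come from the original disclosure.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

STOP_WORDS = {"the", "a", "of", "is", "this", "was", "for", "and"}  # illustrative list


def tokenize(text: str) -> List[str]:
    """Step a: segment into words (whitespace split here) and remove stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def to_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Step b: TF-IDF weights for one document; terms unseen in training are dropped."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def knn_classify(text: str, train: List[Tuple[str, str]], k: int = 3) -> str:
    """Step c: label the text by majority vote among its k most similar training texts."""
    docs = [tokenize(sample) for sample, _ in train]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + len(docs)) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [to_vector(doc, idf) for doc in docs]
    query = to_vector(tokenize(text), idf)
    ranked = sorted(zip(vectors, (label for _, label in train)),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


TRAIN = [
    ("quarterly transaction amount report", "finance"),
    ("stock and fund transaction summary", "finance"),
    ("fund transaction amount for the year", "finance"),
    ("team lunch menu for friday", "general"),
    ("office lunch and holiday schedule", "general"),
    ("holiday party menu", "general"),
]

print(knn_classify("transaction amount completed this year", TRAIN))  # finance
```

The fastText and TextCNN alternatives named in step c would replace `knn_classify` with a trained model; they are not sketched here.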
Emotion analysis of the classified information mainly includes the following steps:
a. The classified information is acquired and automatically segmented into words.
b. After word segmentation succeeds, a pre-constructed emotion dictionary is loaded and the words are matched against it.
c. After dictionary processing, emotion classification is performed according to a matching algorithm to distinguish positive, negative and neutral emotion.
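A dictionary-based version of steps a to c might look like the following Python sketch. The emotion dictionary entries, the regular-expression word segmentation and the sum-of-polarities matching rule are assumptions chosen for illustration; the disclosure does not specify the dictionary contents or the matching algorithm.

```python
import re
from typing import List

# Hypothetical pre-constructed emotion dictionary: word -> polarity.
EMOTION_DICT = {
    "good": 1, "excellent": 1, "gain": 1,
    "bad": -1, "terrible": -1, "loss": -1,
}


def segment(sentence: str) -> List[str]:
    """Step a: automatic word segmentation (alphabetic runs, for English text)."""
    return re.findall(r"[a-z]+", sentence.lower())


def classify_emotion(sentence: str) -> str:
    """Steps b and c: match words against the dictionary and classify by total polarity."""
    score = sum(EMOTION_DICT.get(word, 0) for word in segment(sentence))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_emotion("The results were excellent"))  # positive
print(classify_emotion("A terrible loss"))             # negative
print(classify_emotion("The meeting is at noon"))      # neutral
```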
Finally, based on the results of the two preceding steps, intention recognition is performed on the input text using a rule template method, a statistical method, a machine learning method or a deep learning method. The information obtained by intention recognition includes high-frequency words. For example, if the input text is "How do I make Chinese chive steamed stuffed buns", the intention is to make steamed stuffed buns, not Chinese chives; the filling is then used to narrow this down to steamed stuffed buns with Chinese chive filling. As another example, if the input text is "the transaction% amount completed this year is 300 million", NLP processing yields intentions such as "transaction amount" and "300 million".
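The dictionary-and-template variant of intention recognition can be sketched as follows. This is an assumption-laden illustration: `INTENT_DICTIONARY`, `AMOUNT_TEMPLATE`, `compact` and `recognize_intents` are hypothetical names, and matching on text with every non-alphanumeric character removed is just one simple way to see through inserted separators.

```python
import re
from typing import List

INTENT_DICTIONARY = ["transaction amount", "stock", "fund"]  # illustrative dictionary
# Template rule for monetary amounts such as "300 million" (illustrative).
AMOUNT_TEMPLATE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:million|billion)\b")


def compact(text: str) -> str:
    """Drop everything except letters and digits, so inserted separators vanish."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def recognize_intents(text: str) -> List[str]:
    """Return dictionary terms and template matches found in the text."""
    squeezed = compact(text)
    intents = [term for term in INTENT_DICTIONARY if compact(term) in squeezed]
    intents += [match.group(0) for match in AMOUNT_TEMPLATE.finditer(text.lower())]
    return intents


print(recognize_intents("The transaction% amount completed this year is 300 million"))
# ['transaction amount', '300 million']
```

Compact matching also fires across word boundaries (for example "fund" inside "refund"), so a real system would need segmentation or a classification model to reduce false positives.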
Then, in step S202, the preprocessed text containing the intentions is sent to the DLP, and the DLP identifies the intention content in the preprocessed text to obtain the detection result of the leakage prevention detection.
In step S203, whether the data stream is a leaked data stream is determined according to the detection result.
In this embodiment, the DLP is usually deployed in bypass mode at the enterprise network boundary, and the collected samples of the data stream may be acquired for the NLP as follows:
The data stream sent by the client is collected through the DLP to obtain a pre-collected sample, and the DLP then performs data restoration on the pre-collected sample to obtain the collected sample.
The DLP calls the NLP through a REST API or a similar interface and sends the collected sample to the NLP, so that the NLP analyzes the collected sample to obtain the preprocessed text.
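A REST call from the DLP to the NLP could be sketched with the Python standard library as below. The endpoint URL, the JSON field name `sample` and the shape of the response are hypothetical; the disclosure states only that a REST API is used and does not define the interface.

```python
import json
import urllib.request

# Hypothetical address of the NLP service; not specified in the disclosure.
NLP_ENDPOINT = "http://nlp.example.internal/api/v1/preprocess"


def build_nlp_request(sample: str) -> urllib.request.Request:
    """Wrap the collected sample in a JSON POST request addressed to the NLP service."""
    body = json.dumps({"sample": sample}).encode("utf-8")
    return urllib.request.Request(
        NLP_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_nlp(sample: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON reply (assumed to carry the preprocessed text)."""
    with urllib.request.urlopen(build_nlp_request(sample), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_nlp_request("a transaction% amount of 300 million was completed this year")
print(request.get_method(), request.full_url)
```

Separating `build_nlp_request` from `call_nlp` lets the request format be inspected without a running NLP service.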
The preprocessed text includes high-frequency words. These can be set by an administrator in the NLP model and are typically related to the environment to be detected; in the financial field, for example, the high-frequency words may include "transaction amount", "stock", "fund" and the like.
In step S202, the DLP may perform leakage prevention detection according to the high-frequency words in the preprocessed text and output the detection result.
For example, when "transaction amount" occurs, it may be regarded as high-risk vocabulary, and a high-risk detection result may be output.
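The DLP side of this step can be sketched as a lookup of the received high-frequency words against administrator-configured word sets. Only the "transaction amount" to high-risk mapping comes from the text above; the `"medium"` level for "stock" and "fund", the `"none"` result and the function names are assumptions added for illustration.

```python
from typing import Iterable

HIGH_RISK_WORDS = {"transaction amount"}   # from the example in the text
MEDIUM_RISK_WORDS = {"stock", "fund"}      # assumed level, not stated in the disclosure


def dlp_detect(high_frequency_words: Iterable[str]) -> str:
    """Return a detection result based on the words found in the preprocessed text."""
    found = set(high_frequency_words)
    if found & HIGH_RISK_WORDS:
        return "high"
    if found & MEDIUM_RISK_WORDS:
        return "medium"
    return "none"


def is_leaked(detection_result: str) -> bool:
    """Judge the data stream as leaked for any result other than 'none' (an assumption)."""
    return detection_result != "none"


result = dlp_detect(["transaction amount", "300 million"])
print(result, is_leaked(result))  # high True
```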
In this embodiment, combining the NLP with the DLP solves the technical problem that the DLP alone cannot perform leakage prevention detection on a data stream containing metaphorical words, such as keywords broken up by separators or with scrambled word order.
Based on the foregoing method embodiment, this specification further provides an apparatus for detecting a data stream, where the apparatus may be deployed in a leakage prevention detection server, and the apparatus includes:
an acquisition module, configured to acquire a collected sample of the data stream;
a processing module, configured to process the collected sample using natural language processing (NLP) to obtain a preprocessed text;
the processing module being further configured to send the preprocessed text to data leakage prevention (DLP) and obtain a detection result output by the DLP;
and a judging module, configured to judge whether the data stream is a leaked data stream according to the detection result.
The acquisition module is specifically configured to collect, through the DLP, a data stream sent by a client to obtain a pre-collected sample;
perform data restoration on the pre-collected sample to obtain the collected sample;
and have the DLP call the NLP, so that the NLP processes the collected sample to obtain the preprocessed text.
The processing module is specifically configured such that the DLP calls the NLP through a REST API.
The judging module is specifically configured such that the DLP performs leakage prevention detection according to the high-frequency words in the preprocessed text and outputs the detection result.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (10)

1. A method for detecting a data stream, the method comprising:
acquiring a collected sample of a data stream, and processing the collected sample using natural language processing (NLP) to obtain a preprocessed text;
sending the preprocessed text to data leakage prevention (DLP), and obtaining a detection result output by the DLP;
and judging whether the data stream is a leaked data stream according to the detection result.
2. The method according to claim 1, wherein acquiring the collected sample of the data stream and processing the collected sample using NLP to obtain the preprocessed text specifically comprises:
collecting, through the DLP, a data stream sent by a client to obtain a pre-collected sample;
performing data restoration on the pre-collected sample to obtain the collected sample;
and calling the NLP by the DLP, so that the NLP processes the collected sample to obtain the preprocessed text.
3. The method according to claim 2, wherein the DLP calling the NLP specifically comprises:
the DLP calls the NLP through a REST API.
4. The method of claim 1, wherein the preprocessed text comprises high-frequency words.
5. The method according to claim 4, wherein sending the preprocessed text to the DLP and obtaining the detection result output by the DLP specifically comprises:
the DLP performs leakage prevention detection according to the high-frequency words in the preprocessed text and outputs the detection result.
6. An apparatus for detecting a data stream, the apparatus comprising:
an acquisition module, configured to acquire a collected sample of the data stream;
a processing module, configured to process the collected sample using natural language processing (NLP) to obtain a preprocessed text;
the processing module being further configured to send the preprocessed text to data leakage prevention (DLP) and obtain a detection result output by the DLP;
and a judging module, configured to judge whether the data stream is a leaked data stream according to the detection result.
7. The apparatus of claim 6, wherein
the acquisition module is specifically configured to collect, through the DLP, a data stream sent by a client to obtain a pre-collected sample;
perform data restoration on the pre-collected sample to obtain the collected sample;
and have the DLP call the NLP, so that the NLP processes the collected sample to obtain the preprocessed text.
8. The apparatus of claim 7, wherein
the processing module is specifically configured such that the DLP calls the NLP through a REST API.
9. The apparatus of claim 6, wherein the preprocessed text comprises high-frequency words.
10. The apparatus of claim 9, wherein
the judging module is specifically configured such that the DLP performs leakage prevention detection according to the high-frequency words in the preprocessed text and outputs the detection result.
CN202011350261.8A 2020-11-26 2020-11-26 Data stream detection method and device Pending CN112613031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011350261.8A CN112613031A (en) 2020-11-26 2020-11-26 Data stream detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011350261.8A CN112613031A (en) 2020-11-26 2020-11-26 Data stream detection method and device

Publications (1)

Publication Number Publication Date
CN112613031A true CN112613031A (en) 2021-04-06

Family

ID=75225373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011350261.8A Pending CN112613031A (en) 2020-11-26 2020-11-26 Data stream detection method and device

Country Status (1)

Country Link
CN (1) CN112613031A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506545A (en) * 2014-12-30 2015-04-08 北京奇虎科技有限公司 Data leakage prevention method and data leakage prevention device
CN110990836A (en) * 2019-12-18 2020-04-10 南京富士通南大软件技术有限公司 Code leakage detection system and method based on natural language processing technology
CN111079184A (en) * 2019-12-19 2020-04-28 北京明朝万达科技股份有限公司 Method, system, device and storage medium for protecting data leakage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210406