CN112613031A

CN112613031A - Data stream detection method and device

Info

Publication number: CN112613031A
Application number: CN202011350261.8A
Authority: CN
Inventors: 杜满
Original assignee: Hangzhou H3C Technologies Co Ltd
Current assignee: Hangzhou H3C Technologies Co Ltd; New H3C Technologies Co Ltd
Priority date: 2020-11-26
Filing date: 2020-11-26
Publication date: 2021-04-06

Abstract

The present specification provides a method and an apparatus for detecting a data stream, the method comprising: acquiring a collection sample of a data stream, processing the collection sample by using Natural Language Processing (NLP) to obtain a preprocessed text, sending the preprocessed text to a Data Leakage Prevention (DLP) module, acquiring a detection result output by the DLP, and judging whether the data stream is a divulged data stream according to the detection result. By this method, information leakage caused by metaphorical words contained in the data can be avoided.

Description

Data stream detection method and device

Technical Field

The present disclosure relates to the field of information security, and in particular, to a method and an apparatus for detecting a data stream.

Background

With the advent of the era of comprehensive enterprise informatization, all business data are processed by systems to quickly form the various documents required by the enterprise, such as drawings, code, contracts and client data, and these documents become core assets of the enterprise. In the face of intense market competition, protecting this core data is of great importance. To protect these data, most enterprises adopt traditional security devices (such as firewalls, IDS, IPS, etc.) to prevent hackers and the like from intruding and stealing the internal core data of the enterprise, but these security systems mainly protect against external access. Among all divulgence events, divulgence from the inside is more frequent than threats from the outside and causes more damage and cost.

At present, existing schemes that address internal data leakage, namely the Data Leakage Prevention (DLP) technical schemes widely used in the market, have content identification at their core, and content identification is mainly realized by means of keywords, regular expressions, database fingerprints and the like. These means can recognize most sensitive data, but they generally target static data with fixed characteristic attributes, so recognition is limited and both intelligence and accuracy are insufficient. Metaphorical words in documents, or words whose meaning is clear only in combination with the context, cannot be accurately detected by these means.

Disclosure of Invention

The embodiment of the specification provides a method and a device for detecting data streams, and by the method, the situation that information is leaked due to the fact that metaphorical words are contained in data can be avoided.

An embodiment of the present specification provides a method for detecting a data stream, where the method includes:

acquiring a collection sample of a data stream, processing the collection sample by using a Natural Language Processing (NLP), and acquiring a preprocessed text;

sending the preprocessed text to a data leakage-proof DLP, and obtaining a detection result output by the DLP;

and judging whether the data stream is a leaked data stream according to the detection result.

Optionally, the acquiring a sample of the data stream, processing the acquired sample by using natural language processing NLP, and obtaining a preprocessed text specifically includes:

acquiring a data stream sent by a client through the DLP to obtain a pre-acquired sample;

performing data restoration on the pre-collected sample to obtain a collected sample;

and the DLP calls the NLP so that the NLP processes the acquisition sample and obtains a preprocessed text.

Optionally, the DLP invoking NLP specifically includes: and the DLP calls the NLP through REST API.

Specifically, the preprocessed text includes high-frequency words.

Specifically, the sending the preprocessed text to a data leakage prevention DLP and obtaining a detection result of DLP output specifically includes:

the DLP performs leakage prevention detection according to the high-frequency words in the preprocessed text and outputs the detection result.

Through this embodiment, the collected samples of the data stream can be deeply analyzed by the DLP and the NLP in the scheme, so that information leakage caused by metaphorical words contained in the data can be avoided.

An embodiment of the present specification further provides a device for detecting a data stream, where the device includes:

an acquisition module for acquiring acquisition samples of the data stream;

the processing module is used for processing the collected sample by using Natural Language Processing (NLP) and obtaining a preprocessed text;

the processing module is also used for sending the preprocessed text to the data leakage-proof DLP and obtaining a detection result output by the DLP;

and the judging module is used for judging whether the data stream is a divulged data stream according to the detection result.

Optionally, the obtaining module is specifically configured to collect a data stream sent by a client through the DLP, so as to obtain a pre-collection sample;

Specifically, the processing module is specifically configured to invoke the NLP by the DLP through a REST API.

Specifically, the preprocessed text includes high-frequency words.

Specifically, the judging module is specifically configured to have the DLP perform leakage prevention detection according to the high-frequency words in the preprocessed text and output the detection result.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.

Fig. 1 is a schematic model diagram of a DLP module according to an embodiment of the present disclosure;

Fig. 2 is a schematic flowchart of a data stream detection method according to an embodiment of the present disclosure;

Fig. 3 is a schematic diagram of a model for text classification by NLP according to an embodiment of the present disclosure;

Fig. 4 is a schematic diagram of a model for performing emotion analysis by NLP according to an embodiment of the present disclosure.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.

The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.

As shown in fig. 1, a Data Leakage Prevention (DLP) module is deployed in bypass mode at the boundary of an enterprise network and is used for detecting data streams sent by clients. In application, equipment such as a switch or a firewall mirrors the data stream sent by a client to the DLP module for data restoration and data analysis (the DLP obtains the mirrored data stream as network data and performs data restoration on the network data to restore it to the original data).

During data analysis, content identification is mainly based on keywords: identification succeeds when the analyzed text content contains a keyword. However, an actual divulger may insert separators into a keyword to avoid detection. For example, if the keyword is "transaction amount" and the text content is "3 hundred million of transaction amount completed this year", the text is detected as sensitive data; if the divulger modifies the text into "3 hundred million of transaction% completed this year", keyword matching can no longer detect it.
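The evasion described above can be illustrated with a minimal sketch of verbatim keyword matching. The function name `keyword_hit` and the keyword list are assumptions made for illustration; the two sample strings are the English renderings quoted in the text.

```python
KEYWORDS = ["transaction amount"]  # hypothetical sensitive-keyword list


def keyword_hit(text, keywords=KEYWORDS):
    """Return True if any keyword occurs verbatim in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


original = "3 hundred million of transaction amount completed this year"
evasive = "3 hundred million of transaction% completed this year"

print(keyword_hit(original))  # True: the keyword appears verbatim
print(keyword_hit(evasive))   # False: the separator breaks the match
```

Because the match is purely literal, any change to the character sequence of the keyword is enough to bypass it.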

To solve the above technical problem, an embodiment of the present specification provides a method for detecting a data stream, as shown in fig. 2, the method includes:

S201, acquiring a collection sample of a data stream, processing the collection sample by using Natural Language Processing (NLP), and acquiring a preprocessed text;

S202, sending the preprocessed text to a data leakage prevention DLP, and obtaining a detection result output by the DLP;

S203, judging whether the data stream is a leaked data stream according to the detection result.
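The three steps can be sketched end to end as follows. This is only an illustration under stated assumptions: the NLP stage is replaced by a simple separator-stripping stand-in, and the names `nlp_preprocess`, `dlp_detect`, `is_leak`, the term list and the sample string are all hypothetical rather than taken from the specification.

```python
import re
from typing import Dict, List

HIGH_RISK_TERMS = {"transaction amount"}  # hypothetical DLP policy


def nlp_preprocess(sample: str) -> List[str]:
    """S201 stand-in: strip separator characters, then return the policy terms found."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", sample.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return [term for term in sorted(HIGH_RISK_TERMS) if term in cleaned]


def dlp_detect(preprocessed: List[str]) -> Dict[str, object]:
    """S202 stand-in: the DLP inspects the preprocessed text and outputs a result."""
    return {"risk": "high" if preprocessed else "none", "matched": preprocessed}


def is_leak(result: Dict[str, object]) -> bool:
    """S203: judge whether the data stream is a leaked data stream."""
    return result["risk"] == "high"


def detect_stream(sample: str) -> bool:
    return is_leak(dlp_detect(nlp_preprocess(sample)))


# Hypothetical sample with a separator inserted inside the keyword.
print(detect_stream("3 hundred million of trans%action amount completed this year"))
```

A real NLP stage would do far more than strip separators; the point of the sketch is only the order of the stages and the data handed from one to the next.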

In step S201, when the sample is processed by NLP, text classification, emotion analysis, and intention recognition may be performed on the sample.

Specifically, the text classification includes: text preprocessing and feature extraction, with the text then classified according to the extracted features.

After a text classification result is obtained, emotion analysis is executed, wherein the emotion analysis mainly includes the following steps: sentence input, automatic word segmentation, emotion dictionary matching and emotion classification.

After the emotion classification is obtained, intention recognition is performed, wherein the intention recognition mainly includes: intention recognition based on a dictionary and template rules, or intention recognition based on a classification model.

As shown in fig. 3, the main process of text classification includes:

a. When the collected sample is classified, it is preprocessed through a training set, wherein the preprocessing includes: dictionary-based Chinese word segmentation, mid-level Chinese word segmentation and stop word removal.

b. After the preprocessing succeeds, features are extracted from the preprocessed information, for example with a bag-of-words model using one-hot representation, TF-IDF text feature extraction, or feature extraction based on word vectors.

c. The preprocessed information is classified through a KNN classification model, a fastText deep learning model or a TextCNN deep learning model.
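Steps a to c can be sketched in plain Python as stop word removal, TF-IDF feature extraction and a KNN vote over cosine similarity. This is a simplified illustration, not the model of the embodiment: whitespace tokenization stands in for Chinese word segmentation, and the stop word list, training texts and labels are invented for the example.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "this", "is", "for", "to", "on"}  # illustrative


def tokenize(text):
    """Whitespace tokenization plus stop word removal (stand-in for word segmentation)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_classify(text, train_texts, train_labels, k=3):
    """Label the text by majority vote among its k most similar training documents."""
    vectors, idf = tfidf_vectors([tokenize(t) for t in train_texts])
    tokens = tokenize(text)
    tf = Counter(tokens)
    query = {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}
    ranked = sorted(zip(vectors, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


TRAIN_TEXTS = [
    "transaction amount report quarterly revenue",
    "stock fund transaction contract",
    "client contract revenue fund",
    "team lunch menu friday",
    "office party music friday",
    "holiday lunch party plan",
]
TRAIN_LABELS = ["finance"] * 3 + ["social"] * 3

print(knn_classify("fund transaction amount this year", TRAIN_TEXTS, TRAIN_LABELS))
```

The fastText and TextCNN options named in step c would replace `knn_classify` with a trained neural model; KNN is used here only because it needs no training loop.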

The process of emotion analysis on the information classified by the text mainly comprises the following steps:

a. The classified information is acquired and automatically segmented into words.

b. After word segmentation succeeds, a pre-constructed emotion dictionary is loaded and matched against the vocabulary.

c. After dictionary analysis, emotion classification is performed according to a matching algorithm to distinguish positive emotion, negative emotion and neutral emotion.
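A dictionary-based version of these three steps might look like the sketch below. The emotion dictionary, its scores and the sum-of-scores matching rule are assumptions chosen for illustration; the specification does not state which matching algorithm is used.

```python
# Hypothetical pre-constructed emotion dictionary: word -> polarity score
EMOTION_DICT = {
    "profit": 1, "growth": 1, "excellent": 1,
    "loss": -1, "decline": -1, "risk": -1,
}


def segment(sentence):
    """Step a: automatic word segmentation (whitespace split as a stand-in)."""
    return [w.strip(".,!?;:").lower() for w in sentence.split()]


def classify_emotion(sentence):
    """Steps b and c: match words against the dictionary, then classify by total score."""
    score = sum(EMOTION_DICT.get(word, 0) for word in segment(sentence))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_emotion("Excellent growth this quarter."))  # positive
print(classify_emotion("The loss reflects a decline."))    # negative
print(classify_emotion("The meeting is on Friday."))       # neutral
```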

Finally, according to the results obtained in the two preceding steps, intention recognition is performed on the input text based on a rule template method, a statistical method, a machine learning method or a deep learning method, wherein the information after intention recognition includes high-frequency words. For example, if the input text is "how I want to make Chinese chive steamed stuffed bun", the intention is to make steamed stuffed buns, not Chinese chives; the steamed stuffed bun is then further identified by its stuffing as a Chinese chive steamed stuffed bun. For another example, if the input text is "3 hundred million of transaction% completed this year", intentions such as "transaction amount" and "3 hundred million" can be obtained after processing by NLP.
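The rule template option can be sketched with a single regular expression and a small dictionary, applied to the steamed stuffed bun example. The template, the `DISHES` dictionary and the output fields are hypothetical; a statistical or learned recognizer would replace them in practice.

```python
import re

# Hypothetical template rule: "how ... make <object>"
TEMPLATE = re.compile(r"\bhow\b.*?\bmake\s+(?P<object>.+?)[?.!]*$", re.IGNORECASE)
DISHES = {"steamed stuffed bun"}  # hypothetical dictionary of intention targets


def recognize_intent(text):
    """Return the action, the target and any stuffing modifier, or None if no rule fires."""
    match = TEMPLATE.search(text.strip())
    if not match:
        return None
    obj = match.group("object").strip()
    for dish in DISHES:
        if obj.lower().endswith(dish):
            # Whatever precedes the dish name is treated as the stuffing.
            filling = obj[: len(obj) - len(dish)].strip()
            return {"action": "make", "target": dish, "filling": filling or None}
    return {"action": "make", "target": obj.lower(), "filling": None}


print(recognize_intent("how I want to make Chinese chive steamed stuffed bun"))
```

The target is the dish and the Chinese chive is only a modifier, which mirrors the distinction drawn in the example.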

Then, in step S202, the pre-processed text containing the intention is sent to the data leakage prevention DLP, and the DLP recognizes the intention content in the pre-processed text, so as to obtain a detection result of leakage prevention detection.

Whether the data flow is a divulged data flow can be determined according to the detection result through step S203.

In this embodiment, the DLP is usually deployed in bypass mode at the enterprise network boundary, and the method by which the NLP acquires the collected samples of a data stream may include:

the data stream sent by the client is collected through the DLP to obtain a pre-collected sample, and after the pre-collected sample is obtained, the DLP can restore the data of the pre-collected sample to obtain a collected sample.

The DLP calls the NLP through a REST API or a similar method and sends the collected samples to the NLP, so that the NLP analyzes the collected samples to obtain the preprocessed text.
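The REST call from the DLP to the NLP might be built as in the sketch below, which only constructs the request and parses a response without contacting any server. The endpoint URL, the JSON field names `sample` and `high_frequency_words`, and both function names are assumptions; the specification does not define the API.

```python
import json
import urllib.request

NLP_ENDPOINT = "http://nlp.example.internal/api/v1/analyze"  # hypothetical URL


def build_nlp_request(sample, endpoint=NLP_ENDPOINT):
    """Build (but do not send) the POST request carrying the collected sample."""
    body = json.dumps({"sample": sample}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_nlp_response(raw):
    """Extract the preprocessed text (high-frequency words) from a JSON response body."""
    return json.loads(raw.decode("utf-8")).get("high_frequency_words", [])


request = build_nlp_request("3 hundred million of transaction% completed this year")
print(request.get_method(), request.full_url)
print(parse_nlp_response(b'{"high_frequency_words": ["transaction amount"]}'))
```

In a deployment, the request would be sent with `urllib.request.urlopen(request)` and the response body passed to `parse_nlp_response`.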

The preprocessed text includes high-frequency words. These words can be set by the administrator in the NLP model and are typically related to the environment to be detected; for example, in the financial field the high-frequency words may include "transaction amount", "stock", "fund" and the like.

In step S202, the DLP may perform leakage prevention detection according to the high-frequency words in the preprocessed text and output the detection result.

For example, when "transaction amount" occurs, it may be considered as belonging to a high risk vocabulary, and a high risk detection result may be output.
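The mapping from high-frequency words to a detection result can be sketched as a lexicon lookup. The lexicon below follows the financial example in the text, but the "medium" levels for "stock" and "fund", the level ordering and the result format are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical administrator-configured lexicon for a financial deployment
RISK_LEXICON = {"transaction amount": "high", "stock": "medium", "fund": "medium"}
RISK_ORDER = {"none": 0, "medium": 1, "high": 2}


def detect(high_frequency_words):
    """Return the highest risk level among matched words, plus per-word hit counts."""
    counts = Counter(w for w in high_frequency_words if w in RISK_LEXICON)
    level = "none"
    for word in counts:
        if RISK_ORDER[RISK_LEXICON[word]] > RISK_ORDER[level]:
            level = RISK_LEXICON[word]
    return {"risk": level, "hits": dict(counts)}


print(detect(["transaction amount", "stock", "3 hundred million"]))
print(detect(["lunch"]))
```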

According to this embodiment, combining the NLP with the DLP solves the technical problem that the DLP alone cannot perform leakage prevention detection on a data stream containing metaphorical words, such as inserted separators or scrambled word order.

Based on the foregoing method embodiment, this specification further provides an apparatus for detecting data flow, where the apparatus may be deployed in an anti-leakage detection server, and the apparatus includes:

an acquisition module for acquiring acquisition samples of the data stream;

The acquisition module is specifically configured to acquire a data stream sent by a client through the DLP to obtain a pre-acquisition sample;

The processing module is specifically configured to invoke the NLP by the DLP through a REST API.

The judgment module is specifically configured to have the DLP perform leakage prevention detection according to the high-frequency words in the preprocessed text and output the detection result.

The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.

It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.

The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims

1. A method for detecting a data stream, the method comprising:

2. The method according to claim 1, wherein the obtaining of the collected samples of the data stream, the processing of the collected samples with Natural Language Processing (NLP), and the obtaining of the preprocessed text, specifically comprises:

3. The method according to claim 2, wherein the DLP invoking NLP specifically comprises:

and the DLP calls the NLP through REST API.

4. The method of claim 1, wherein the preprocessed text comprises: high-frequency words.

5. The method according to claim 4, wherein the sending the preprocessed text to a data leakage prevention DLP and obtaining a detection result of a DLP output, specifically comprises:

6. An apparatus for detecting a data stream, the apparatus comprising:

an acquisition module for acquiring acquisition samples of the data stream;

7. The apparatus of claim 6,

the acquisition module is specifically used for acquiring a data stream sent by a client through the DLP to obtain a pre-acquisition sample;

8. The apparatus of claim 7,

9. The apparatus of claim 6, wherein the preprocessed text comprises: high-frequency words.

10. The apparatus of claim 9,

and the judgment module is specifically configured to have the DLP perform leakage prevention detection according to the high-frequency words in the preprocessed text and output the detection result.