CN110765329A - Data clustering method and electronic equipment - Google Patents

Info

Publication number
CN110765329A
Authority
CN
China
Prior art keywords
data
clustering
generate
attribute
sets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911030402.5A
Other languages
Chinese (zh)
Other versions
CN110765329B (en)
Inventor
张首斌
薛智慧
潘季明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd
Priority to CN201911030402.5A
Publication of CN110765329A
Application granted
Publication of CN110765329B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data clustering method and electronic equipment. The method comprises: acquiring target data and classifying the target data to generate a data group containing multiple types of data; performing a first clustering operation on the data group based on clustering characteristics to generate a plurality of data sets with different clustering characteristics; and performing a second clustering operation on the data sets based on the attribute characteristics of the data sets to generate a plurality of data subsets with different attribute characteristics. The method acquires target data comprehensively and in depth, quickly performs a preliminary division of the data group, and then finely divides the resulting data sets to generate accurate data subsets that intuitively reflect the distribution and condition of the data.

Description

Data clustering method and electronic equipment
Technical Field
The application relates to the technical field of internet big data and security, in particular to a data clustering method and electronic equipment.
Background
With the spread of internet information technology, intelligent manufacturing, and internet of things technology, and with the construction of network infrastructure platforms and systems, the brands and types of internet equipment have become increasingly complex. Internet applications and services are increasingly widespread, and the industries and users involved are increasingly broad: one product or service may target different user groups, and one user may use a variety of products and services. In an internet big data environment, quickly discovering data, accurately clustering and dividing it, and exploring the explicit and implicit relations among the users, enterprises, communities, products, equipment, and application services in a network can help enterprises, operators, and government decision-making bodies better understand the current network environment. This helps improve and optimize the network operating environment and protect network security.
In one prior art approach, data acquisition and division mainly rely on active detection, which can discover networked equipment of a specified type and brand in a target area. In another prior art approach, crawler technology captures webpage text and link information, and the text information of each user is classified and counted, so that each user's interests and focus can be analyzed and network community user groups can be divided by shared interest. These approaches have the following drawbacks: the data source uses only active detection or only passive detection, so the source is single; equipment detection and service monitoring target only specific application scenarios, so network service coverage is low; and the data acquisition and division methods are limited to a single technique, cannot quickly divide multiple types of data, and produce division results that are not fine enough.
Disclosure of Invention
In order to solve the above technical problem, an embodiment of the present application provides a data clustering method applied to an electronic device, the method comprising:
acquiring target data, and carrying out classification processing on the target data to generate a data group containing multiple types of data;
performing a first clustering operation on the data group based on the clustering characteristics to generate a plurality of data sets with different clustering characteristics;
and performing a second clustering operation on the data set based on the attribute characteristics of the data set to generate a plurality of data subsets with different attribute characteristics.
Preferably, the performing a first clustering operation on the data group based on the clustering characteristics to generate a plurality of data sets with different clustering characteristics includes:
preprocessing the data group so that the data group containing multiple types of data has the required standard attributes, wherein the preprocessing comprises:
reading data units with the same or similar categories in the data group containing multiple types of data, and aggregating the data units with the same or similar categories according to the density of each data unit to generate a plurality of data sets with different clustering characteristics.
Preferably, the performing a second clustering operation on the data set based on the attribute characteristics of the data set to generate a plurality of data subsets with different attribute characteristics includes:
clustering and dividing the data sets by using a second-order clustering algorithm to generate a plurality of data sets to be divided and outlier data, and further dividing the data sets to be divided based on the attribute characteristics of the data sets to be divided to generate a plurality of required data subsets;
wherein the outlier data has an attribute characteristic different from an attribute characteristic of the plurality of datasets to be partitioned.
Preferably, the method further comprises:
performing source tracing analysis on the outlier data;
updating the attribute characteristics of the outlier data based on the source tracing analysis result of the outlier data;
updating the data set based on the updated attribute characteristics of the outlier data so as to classify the updated data set.
Preferably, the target data includes network data, and the acquiring the target data includes:
acquiring network data by active detection and/or passive detection;
wherein the active detection comprises: performing port detection on the network space corresponding to the network data, and capturing the network data according to the detection result.
Preferably, the classifying the target data to generate a data group containing multiple types of data includes:
performing traffic message classification and/or extended data processing on the target data, and integrating the target data into a data group containing multiple types of data based on the category attributes of the target data.
Preferably, the method further comprises:
performing visualization processing on the generated data subsets.
The present application further provides an electronic device, comprising:
the classification module is used for acquiring target data, classifying the target data and generating a data group containing multiple types of data;
the preprocessing module is used for carrying out first clustering operation on the data group based on the clustering characteristics to generate a plurality of data sets with different clustering characteristics;
and the clustering module is used for performing a second clustering operation on the data set based on the attribute characteristics of the data set to generate a plurality of data subsets with different attribute characteristics.
Preferably, the preprocessing module is further configured to: preprocess the data group so that the data group containing multiple types of data has the required standard attributes, wherein the preprocessing comprises: reading data units with the same or similar categories in the data group containing multiple types of data, and aggregating the data units with the same or similar categories according to the density of each data unit to generate a plurality of data sets with different clustering characteristics.
Preferably, the clustering module is further configured to: cluster and divide the data sets by using a second-order clustering algorithm to generate a plurality of data sets to be divided and outlier data, and further divide the data sets to be divided based on their attribute characteristics to generate a plurality of required data subsets; wherein the outlier data has an attribute characteristic different from the attribute characteristics of the plurality of data sets to be divided.
Compared with the prior art, the embodiments of the present application have the following beneficial effects:
The method can acquire target data comprehensively and in depth using different modes, preliminarily classify the target data, and quickly perform a preliminary division of the data group to generate a plurality of broad-category data sets. The data sets are then finely divided to generate data subsets. The data subsets are the final division result and reflect the distribution and condition of the data more accurately and intuitively, so that enterprises and network operation and supervision organizations can evaluate and analyze them comprehensively. The second-order clustering algorithm can effectively analyze and process complex network data, can divide network data with mixed-type attributes into groups in real time, and can discover new network data types from the outlier data. Data mining is deepened and broadened, so the data division covers more dimensions at a finer granularity. The service types and portal types are highly extensible, group discovery is no longer limited to a particular device or product, and development and maintenance are convenient.
Drawings
FIG. 1 is a schematic flow chart illustrating a data clustering method according to an embodiment of the present application;
FIG. 2 is another schematic flow chart of a data clustering method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an electronic device in an embodiment of the application;
FIG. 4 is a schematic diagram of a specific framework of a data clustering method in an embodiment of the present application;
FIG. 5 is a schematic diagram of a data clustering method according to an embodiment of the present application, in which data is obtained by active probing;
FIG. 6 is a schematic diagram illustrating a classification process performed on target data by a data clustering method in an embodiment of the present application;
FIG. 7 is a schematic diagram of a clustering process for data in the embodiment of the present application.
Detailed Description
Specific embodiments of the present application will be described in detail below with reference to the accompanying drawings, but the present application is not limited thereto.
It will be understood that various modifications may be made to the embodiments disclosed herein. The following description is, therefore, not to be taken in a limiting sense, but is made merely as an exemplification of embodiments. Those skilled in the art will envision other modifications within the scope and spirit of the application.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the application and, together with a general description of the application given above and the detailed description of the embodiments given below, serve to explain the principles of the application.
These and other characteristics of the present application will become apparent from the following description of preferred forms of embodiment, given as non-limiting examples, with reference to the attached drawings.
It should also be understood that, although the present application has been described with reference to some specific examples, a person of skill in the art shall certainly be able to achieve many other equivalent forms of application, having the characteristics as set forth in the claims and hence all coming within the field of protection defined thereby.
The above and other aspects, features and advantages of the present application will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.
Specific embodiments of the present application are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely examples of the application, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail to avoid obscuring the application with unnecessary or redundant detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present application in virtually any appropriately detailed structure.
Hereinafter, embodiments of the present application will be described in detail with reference to the accompanying drawings.
As shown in fig. 1, an embodiment of the present application provides a data clustering method, including:
S1: acquiring target data, and carrying out classification processing on the target data to generate a data group containing multiple types of data;
Specifically, in the present embodiment, target data is first acquired in different manners. The target data may be network data, local data, historical input data, and the like. The acquisition methods include web crawler technology, internet deep analysis technology, and the like, so that the target network data is acquired and mined comprehensively and in depth. The acquired target data is then preliminarily classified to facilitate the subsequent clustering operations. The classification may be performed according to the type of the target data, for example by classifying the messages of acquired internet service data. Classifying the acquired target data generates a data group, which is a mixed set containing data of multiple different types. For example, when the internet service data is enterprise service data, classifying it generates a data group containing data related to enterprise service information.
S2: performing a first clustering operation on the data group based on the clustering characteristics to generate a plurality of data sets with different clustering characteristics;
Specifically, in this embodiment, the first clustering operation clusters and divides the generated data group based on clustering characteristics. The clustering characteristics include the structure, content, and objects of the data, and the like; the data in the data group have different clustering characteristics, such as WEB application classification, server type, and the industry classification of the data. The first clustering operation preprocesses the generated data group and performs a preliminary division, so that the data group is quickly divided into a plurality of broad-category data sets. For example, when the data group contains data related to enterprise service information, the first clustering operation may divide it into an equipment type data set, a service type data set, a region data set, a WEB portal data set, an enterprise organization data set, and the like.
S3: performing a second clustering operation on the data set based on the attribute characteristics of the data set to generate a plurality of data subsets with different attribute characteristics.
Specifically, in the present embodiment, the generated data sets are further divided by a second clustering operation based on the attribute characteristics of the data sets. The attribute characteristics include the attribution, specific type, and refined category of the data, and the like. The second clustering operation is performed on data with different attribute characteristics within a data set and is finer than the first clustering operation. For example, when the generated data sets are an equipment type data set, a service type data set, a region data set, a WEB portal data set, and an enterprise organization data set, the second clustering operation divides them by attribute characteristics into a plurality of data subsets. The data subsets differ from one another in attribute characteristics, and the data in each subset has attribute categories refined relative to the data set. Examples include a vendor data subset, a model data subset, an industrial control data subset, and a financial data subset. The data subsets are accurate and intuitively reflect the distribution and condition of the data.
The data clustering method acquires target data comprehensively and in depth using different modes, preliminarily classifies the target data, and quickly performs a preliminary division of the data group to generate a plurality of broad-category data sets. The data sets are then finely divided to generate data subsets, which are the final division result. The data subsets reflect the distribution and condition of the data more accurately and intuitively, so that enterprises and network operation and supervision organizations can evaluate and analyze them comprehensively.
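The flow of steps S2 and S3 can be sketched as a coarse split followed by a fine split. This is only an illustration under strong assumptions: exact-match grouping stands in for both clustering operations, and the record fields (`kind`, `vendor`, `model`, `protocol`) and the helper names are hypothetical, not taken from the application.

```python
from collections import defaultdict

# Hypothetical records standing in for an already classified data group (S1).
RECORDS = [
    {"kind": "device", "vendor": "VendorA", "model": "X1"},
    {"kind": "device", "vendor": "VendorB", "model": "Y2"},
    {"kind": "service", "protocol": "http", "vendor": "VendorA"},
    {"kind": "service", "protocol": "ftp", "vendor": "VendorB"},
]

def group_by(records, key):
    """Partition records by the value of `key` (missing values go under None)."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(key)].append(record)
    return dict(groups)

def cluster_pipeline(records, clustering_feature, attribute_feature):
    """Coarse split by a clustering feature (S2), then fine split by an attribute (S3)."""
    data_sets = group_by(records, clustering_feature)
    return {
        name: group_by(members, attribute_feature)
        for name, members in data_sets.items()
    }

result = cluster_pipeline(RECORDS, "kind", "vendor")
print(sorted(result))            # names of the coarse data sets
print(sorted(result["device"]))  # names of the data subsets inside one data set
```

The nested dictionary mirrors the data group, data set, data subset hierarchy; a real implementation would replace `group_by` with the clustering operations described in the following embodiments.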
In an embodiment of the present application, as shown in FIG. 1, FIG. 4 and FIG. 7, the data clustering method further includes the following steps:
preprocessing the data group so that the data group containing multiple types of data has the required standard attributes, wherein the preprocessing comprises:
reading data units with the same or similar categories in the data group containing multiple types of data, and aggregating the data units with the same or similar categories according to the density of each data unit to generate a plurality of data sets with different clustering characteristics.
Specifically, in this embodiment, the first clustering operation clusters and divides the generated data group based on clustering characteristics. It preprocesses the data group, and the preprocessing performs a preliminary division of the data group. First, the multi-dimensional data group undergoes preliminary processing, such as missing value handling, noise removal, and merging of redundant feature attributes. The preprocessing gives the data group the required standard attributes, which the user customizes for the usage environment, for example an equipment category standard, a service category standard, and a WEB portal standard. The preprocessing then proceeds as follows. Data units with the same or similar categories are read from the data group containing multiple types of data; a data unit is a small set of data in the data group whose categories are the same or similar. The data group is standardized and normalized after the preliminary processing, so that the data units are read in a unified dimension.
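The missing value handling and normalization mentioned above can be illustrated on a single continuous attribute. The sketch below assumes mean imputation and min-max scaling, which are common choices but are not specified by the application; the column values and function names are hypothetical.

```python
def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max(values):
    """Scale values to the range [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical "service release time" column with one missing entry.
release_hours = [0, None, 10, 20]
cleaned = min_max(fill_missing(release_hours))
print(cleaned)
```

Scaling every continuous attribute to the same range keeps one attribute from dominating the distance computations used in the later aggregation step.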
The data units with the same or similar categories are then aggregated according to the density of each data unit. The multidimensional data in the data units is read item by item. The data units include an application service data unit, a position data unit, a WEB service data unit, and the like. Different data units contain continuous attributes, such as service release time and position area number, and categorical attributes, such as WEB application classification, server type, middleware category, and the industry classification of the text content. Finally, the aggregation generates a plurality of data sets with different clustering characteristics. For example, when the data group contains data related to enterprise service information, the first clustering operation may divide it into an equipment type data set, a service type data set, a region data set, a WEB type data set, an enterprise organization data set, and the like. Each data set contains data whose clustering characteristics differ from those of the other data sets.
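The application does not name a specific density-based procedure. As one possible reading, the sketch below uses DBSCAN, a well-known density-based clustering algorithm, as a stand-in: points with at least `min_pts` neighbours within distance `eps` seed a cluster, and isolated points are labelled as noise. The sample points and parameter values are hypothetical.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
    return labels

# Hypothetical normalized data units: two dense regions and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

Each non-negative label corresponds to one aggregated data set; how the application actually measures density is not stated, so the Euclidean distance used here is an assumption.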
In an embodiment of the present application, as shown in fig. 1, 4 and 7, the method for clustering data further includes the following steps:
clustering and dividing the plurality of data sets by using a second-order clustering algorithm to generate a plurality of data sets to be divided and outlier data, and further dividing the data sets to be divided based on the attribute characteristics of the data sets to be divided to generate a plurality of required data subsets;
wherein the outlier data has an attribute characteristic different from an attribute characteristic of the plurality of datasets to be partitioned.
Specifically, in the present embodiment, the data sets are merged one by one using an agglomerative method until the desired number of subsets is obtained. The resulting subsets are the plurality of data subsets with different attribute characteristics.
In this embodiment, the data set is used as the input of the algorithm module for second-order clustering modeling. The model outputs a plurality of data sets to be divided and outlier data. The outlier data is extracted, and the data sets to be divided are further divided to obtain a plurality of data subsets as the clustering result. The number of groups and the division granularity can be determined by combining the coarse estimation and fine estimation in the algorithm. The division result depends on the input data; for example, for acquired enterprise and manufacturing data, the group division may output an industrial control group, a light industry group, an energy group, and the like, depending on the division granularity. After the second clustering operation, the data sets are divided by attribute characteristics into a plurality of data subsets. The data subsets differ from one another in attribute characteristics, and the data in each subset has attribute categories refined relative to the data set. Examples include a vendor data subset, a model data subset, an industrial control data subset, and a financial data subset. The data subsets are accurate and intuitively reflect the distribution and condition of the data. In this embodiment, the outlier data reveals an emerging security industry group concentrated mainly among camera manufacturers. New feature attributes are then determined: for example, Hikvision-Webs, DVRDVS-Webs, and the like are added at the traffic protocol layer of the application service data set, and Hikvision, Dahua, and the like are added to the WEB service data.
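The two stages described above, a coarse pass that also isolates outliers and an agglomerative pass that merges until the desired number of subsets remains, can be sketched as follows. This is a simplified illustration and not the algorithm of the application: the first stage is a threshold-based sequential pre-clustering, sub-clusters smaller than `min_size` are treated as outliers, and the second stage merges the closest centroids. All parameter names and sample points are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def pre_cluster(points, threshold):
    """Stage 1: put each point in the first sub-cluster whose centroid is within threshold."""
    subclusters = []
    for p in points:
        for members in subclusters:
            if math.dist(p, centroid(members)) <= threshold:
                members.append(p)
                break
        else:
            subclusters.append([p])
    return subclusters

def merge_until(subclusters, k):
    """Stage 2: repeatedly merge the two closest sub-clusters until k remain."""
    clusters = [list(c) for c in subclusters]
    while len(clusters) > k:
        pairs = [(math.dist(centroid(clusters[a]), centroid(clusters[b])), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters

def two_stage(points, threshold, k, min_size=2):
    """Return (clusters, outliers); sub-clusters below min_size count as outliers."""
    subclusters = pre_cluster(points, threshold)
    outliers = [p for c in subclusters if len(c) < min_size for p in c]
    dense = [c for c in subclusters if len(c) >= min_size]
    return merge_until(dense, k), outliers

points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.2, 0.0),
          (10.0, 10.0), (10.2, 10.0), (50.0, 50.0)]
clusters, outliers = two_stage(points, threshold=0.5, k=2)
print([len(c) for c in clusters], outliers)
```

Separating the outliers before the merge step keeps an isolated record from being forced into an unrelated subset, which is what allows it to be examined later as a possible new group.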
As shown in fig. 2 and fig. 7, an embodiment of the present application provides a method for clustering data, where the method further includes:
S4: performing source tracing analysis on the outlier data;
updating the attribute characteristics of the outlier data based on the source tracing analysis result of the outlier data;
and updating the data set based on the attribute characteristics of the updated outlier data so as to classify the updated data set.
In this embodiment, for outlier data, that is, data output as outliers, the original sample data to which the outliers belong is first retrieved by tracing. Each attribute feature and the original record information of the outliers are then analyzed, and it is manually determined whether the outliers belong to an emerging group. Data determined to belong to an emerging group is used as a sample of a newly discovered group; the feature attributes to be extracted for the new group are re-determined and fed back to the data processing stage, so that data of the emerging group can be processed in the next division. For example, the outlier data reveals an emerging security industry group concentrated mainly among camera manufacturers, and new feature attributes are determined: Hikvision-Webs, DVRDVS-Webs, and the like are added at the traffic protocol layer of the application service data set, and Hikvision, Dahua, and the like are added to the WEB service data. The source tracing analysis thus determines the attribute characteristics of the outlier data and updates the data set. The outlier data of the emerging security industry group can then be aggregated again until the desired number of data subsets containing its attribute features is generated.
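The feedback loop described above can be illustrated with a toy feature list. The sketch assumes that each sample carries a `banner` string, that division is a simple substring match against known features, and that the manual review step is represented by appending the confirmed features; the sample ids, field names, and the `divide` helper are hypothetical.

```python
# Hypothetical raw samples keyed by id; "banner" stands in for a fingerprint field.
RAW_SAMPLES = {
    1: {"banner": "nginx/1.18"},
    2: {"banner": "Hikvision-Webs"},
    3: {"banner": "Apache/2.4"},
    4: {"banner": "DVRDVS-Webs"},
}

def divide(samples, features):
    """Group sample ids by the first matching feature; unmatched ids are outliers."""
    groups, outliers = {}, []
    for sample_id, sample in samples.items():
        banner = sample["banner"].lower()
        match = next((f for f in features if f.lower() in banner), None)
        if match is None:
            outliers.append(sample_id)
        else:
            groups.setdefault(match, []).append(sample_id)
    return groups, outliers

features = ["nginx", "apache"]
groups, outliers = divide(RAW_SAMPLES, features)  # first division

# Source tracing: look up the raw sample behind each outlier for manual review.
traced = {sample_id: RAW_SAMPLES[sample_id] for sample_id in outliers}

# Assumption: the reviewer confirms an emerging group and supplies these features.
features += ["Hikvision-Webs", "DVRDVS-Webs"]
new_groups, new_outliers = divide(RAW_SAMPLES, features)  # next division
print(outliers, new_outliers)
```

After the feature list is updated, the records that were outliers in the first division fall into their own groups in the next one, which is the effect the embodiment describes for the emerging security industry group.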
As shown in fig. 4 and 5, in an embodiment of the present application, the target data includes network data, and the acquiring the target data includes the following steps:
acquiring network data by active detection and/or passive detection;
wherein the active detection comprises: performing port detection on the network space corresponding to the network data, and capturing the network data according to the detection result.
Specifically, in this embodiment, network data collection covers active service detection, traffic message collection, and historical logs and extended data. Active detection can collect data from distributed nodes and relies on a fingerprint information base. First, the data to be detected is input and the port set is screened: the ports to be detected are assembled according to the fingerprint information. Next, port liveness detection and protocol detection are performed on a specified IP address space or a specified global address space, as required. If a port is found to be alive, a specific detection message is constructed from the equipment and service characteristics in the fingerprint base and sent to initiate detection. Finally, the response message data from the server side is collected, and the network data is captured according to the detection result. Network traffic can also be collected passively; the collector is typically deployed in bypass mode at an enterprise gateway, an operator gateway, or another network egress. The collected messages and the active detection message data are classified and processed in the next step. The historical logs and extended data mainly serve to enlarge the scale of data acquisition. They may contain one or more of IP, URL, and domain name information, which is extracted in the data classification stage and used for subsequent address library and domain name library queries.
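The port set screening and port liveness steps of active detection can be sketched as below. The fingerprint table, its contents, and the function names are hypothetical; the liveness check is a plain TCP connect using the standard library, and it is passed in as a parameter so that the example itself can run with a stand-in and send no traffic.

```python
import socket

# Hypothetical fingerprint base: service name -> ports worth probing.
FINGERPRINTS = {"http": [80, 8080], "ftp": [21], "modbus": [502]}

def ports_to_probe(fingerprints, services):
    """Port set screening: merge the ports of the requested services."""
    return sorted({port for name in services for port in fingerprints.get(name, [])})

def tcp_alive(host, port, timeout=1.0):
    """Port liveness check: True if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe(hosts, ports, is_alive=tcp_alive):
    """Return the (host, port) pairs that respond to the liveness check."""
    return [(h, p) for h in hosts for p in ports if is_alive(h, p)]

ports = ports_to_probe(FINGERPRINTS, ["http", "modbus"])

def fake_alive(host, port):
    """Stand-in liveness check so the example opens no connections."""
    return port == 80

live = probe(["192.0.2.10"], ports, is_alive=fake_alive)
print(ports, live)
```

In a real deployment the default `tcp_alive` would be used, followed by the protocol-specific detection messages built from the fingerprint base; those messages are not sketched here because the application does not describe their format.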
In an embodiment of the present application, as shown in Fig. 4 and Fig. 6, classifying the target data to generate a data group containing multiple types of data includes the following step:
performing traffic packet classification and/or extended data processing on the target data, and integrating the target data into a data group containing multiple types of data based on the category attributes of the target data.
Specifically, in this embodiment, the acquired target data is classified, which includes traffic processing, log processing, and extension data processing. For example, packet classification first sorts the raw packet data into traditional services (such as SSL, HTTP, FTP, Telnet, DNS, Samba and the like). Port-based identification can be used during this classification to improve processing efficiency, and it can be combined with application identification to improve accuracy; data of unknown services is classified as TCP or UDP. After classification, the application identification module performs a second, deep scan on all service data that carries an upper-layer application, such as TCP/UDP flows, HTTP, SSL and the like, and labels each session with its service category (IM, file transfer, traditional protocol, encrypted tunnel, remote control, industrial control and the like) and its application protocol name (QQ, Baidu Netdisk, Thunder encryption, TeamViewer, Modbus and the like). After identification, protocol decoders are constructed for each service and protocol in combination with the fingerprint library; application service fingerprint identification and application service protocol field parsing are performed, and the identified service data and application message data are mined further for items such as software versions, WEB service types and versions (nginx, Apache and the like), and device information (manufacturer, firmware version, software version, specification and model, configuration and the like). Finally, the device data, application data and service data are aggregated.
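The port-based first pass, with its TCP/UDP fallback for unknown services, can be sketched as follows. This is only an illustration: the port table uses the well-known port assignments of the services named above, the function name `classify_packet` is hypothetical, and the deep application identification stage is not modeled.

```python
from typing import Dict

# Well-known ports of the traditional services named in the text.
PORT_TO_SERVICE: Dict[int, str] = {
    21: "ftp",
    23: "telnet",
    53: "dns",
    80: "http",
    443: "ssl",
    445: "samba",
}


def classify_packet(transport: str, dst_port: int) -> str:
    """Label a packet by destination port; unknown services fall back to tcp/udp."""
    service = PORT_TO_SERVICE.get(dst_port)
    if service is not None:
        return service
    return transport.lower()  # "tcp" or "udp"
```

A port lookup is fast but can mislabel services that run on non-standard ports, which is why the text combines it with application identification.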
The protocol-parsed traffic data, the logs and the extension data are sent to an IP/URL/Domain extraction module, which extracts any IP, URL or Host/Domain information of the server that may be present. The IP address library or URL address library is then queried to obtain the current network service location, attribution, and associated WEB portal information. In the final stage of data classification, all the collected application service data, location data and WEB service data are deduplicated and screened to complete the division and aggregation of the data. The target data is thereby integrated into a data group containing multiple types of data based on the category attributes of the target data.
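A minimal sketch of the extraction and deduplication step, assuming the input is plain text such as a log line or a decoded payload. The regular expressions, the function name `extract_indicators` and the returned dictionary layout are illustrative assumptions; the address library and URL library queries are not modeled.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s\"'<>]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_indicators(text: str) -> dict:
    """Extract de-duplicated URLs, host names and IPv4 addresses from raw text."""
    urls = sorted(set(URL_RE.findall(text)))
    # keep only dotted quads whose four parts are valid octets
    ips = {ip for ip in IPV4_RE.findall(text)
           if all(int(part) <= 255 for part in ip.split("."))}
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host and host not in ips:
            hosts.add(host)
    return {"urls": urls, "hosts": sorted(hosts), "ips": sorted(ips)}
```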
As shown in Fig. 2 and Fig. 4, an embodiment of the present application provides a data clustering method, wherein the method further includes:
S5, performing visualization processing on the generated data subsets.
In this embodiment, the finally generated data subsets are fine-grained and have distinct attributes. They can be stored and visualized, so that a user can inspect them directly and use them for further processing. The visualization is an application of the data storage and service interface in OLAP (On-Line Analytical Processing, which explores and mines the value of data to support decision-making). It mainly uses the open-source Zeppelin framework to provide data-driven, interactive and collaborative data reports, and it can present the data subsets of different types and levels in the network service by category.
The second-order clustering algorithm is explained in detail below with reference to a specific embodiment. Its principle is as follows. Let the input data set be $X = \{x_1, \dots, x_N\}$, which contains $N$ data items $x_1, \dots, x_N$; these may correspond to application type data, device type data, service type data, middleware data and the like. Each data item is characterized by $D$ attributes, such as a port attribute, a protocol attribute, an IP attribute, a device fingerprint attribute and a middleware object attribute, of which $D_1$ are continuous attributes and $D_2$ are categorical attributes, so that

$$x_n = (\tilde{x}_{n1}, \dots, \tilde{x}_{nD_1}, \ddot{x}_{n1}, \dots, \ddot{x}_{nD_2}), \qquad D = D_1 + D_2,$$

where $\tilde{x}_{ns}$ denotes the value of the $n$-th data item under the $s$-th continuous attribute and $\ddot{x}_{nt}$ denotes the value of the $n$-th data item under the $t$-th categorical attribute. The $t$-th categorical attribute is known to have $\epsilon_t$ possible values; for example, the protocol attribute may take any of 29 industrial control protocols, such as Modbus, DNP3, Profinet, OPCUA, Omron_fins and Siemens_S7. $C^J = \{C_1, \dots, C_J\}$ denotes a clustering of the input data set $X$ into $J$ subsets (clusters), where $C_j$ is the $j$-th subset (cluster) of $C^J$ and contains $N_j$ data items [formula image BDA0002249973030000112].
In the pre-clustering stage, the data in the data set $X$ are first inserted one by one into a clustering feature tree (CF tree), so that the CF tree grows. When the size of the CF tree exceeds a set limit, the potential outlier data (outliers) on the current CF tree are removed, the space threshold is increased, and the CF tree is slimmed (rebuilt); the potential outliers that can be absorbed without increasing the size of the slimmed CF tree are then inserted back into it. After all data have been traversed, the potential outliers that still cannot be inserted into the CF tree are the true outliers (that is, elements that remain potential outliers after all data in $X$ have been inserted into the CF tree are regarded as the final outlier data). Finally, the clustering features of the subsets corresponding to the leaf entries of the final CF tree are output to the next stage of the algorithm.
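The source does not spell out what a clustering feature contains. A minimal sketch, assuming the sufficient statistics commonly used by CF-tree methods for mixed data (item count, per-attribute sum and sum of squares for continuous attributes, and per-value counts for categorical attributes), shows why a CF tree can absorb data without storing individual points: two features merge by simple addition. The class and field names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ClusterFeature:
    """Additive summary of a sub-cluster (assumed layout, not taken from the source)."""
    n: int = 0
    lin_sum: List[float] = field(default_factory=list)       # sum per continuous attribute
    sq_sum: List[float] = field(default_factory=list)        # sum of squares per continuous attribute
    cat_counts: List[Counter] = field(default_factory=list)  # value counts per categorical attribute

    @classmethod
    def from_point(cls, cont: Sequence[float], cat: Sequence[str]) -> "ClusterFeature":
        return cls(1, list(cont), [v * v for v in cont], [Counter([c]) for c in cat])

    def merge(self, other: "ClusterFeature") -> "ClusterFeature":
        """Return the feature of the union of two disjoint sub-clusters."""
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        return ClusterFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.lin_sum, other.lin_sum)],
            [a + b for a, b in zip(self.sq_sum, other.sq_sum)],
            [a + b for a, b in zip(self.cat_counts, other.cat_counts)],
        )

    def mean(self) -> List[float]:
        return [s / self.n for s in self.lin_sum]

    def variance(self) -> List[float]:
        # population variance recovered from the two running sums
        return [q / self.n - (s / self.n) ** 2 for s, q in zip(self.lin_sum, self.sq_sum)]
```

Because the summary is additive, inserting a data item into a leaf entry or merging two entries during tree rebuilding costs the same small, fixed amount of work regardless of how many items the entries already represent.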
The input of the clustering stage is the subsets (sub-clusters) corresponding to the leaf entries of the final CF tree output by the pre-clustering stage, denoted $C_1, \dots, C_{J_0}$. In fact the input is not the subsets with their specific data points, but the clustering features of the subsets, $CF_1, \dots, CF_{J_0}$. The work of this stage is therefore to perform a second round of clustering on the subsets $C_1, \dots, C_{J_0}$ based on the input clustering features $CF_1, \dots, CF_{J_0}$, and finally to obtain a clustering with the desired number of subsets. Automatically determining the optimal number of subsets is one of the distinguishing features of the second-order clustering algorithm. The optimal number is determined in two steps, a coarse estimate followed by a fine determination. The coarse estimate uses the Bayesian Information Criterion (BIC) to find the approximate range of the optimal number of subsets. The fine determination starts from this initial estimate and locates the optimal number of subsets precisely, according to the ratio of the distances between the nearest subsets in two clusterings.
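The two-step choice of the number of subsets can be sketched under simplifying assumptions. Here the BIC values and the nearest-subset distances are taken as already computed for each candidate number of subsets; the coarse step simply takes the number with the lowest BIC as the upper end of the search range, and the fine step picks the number with the largest jump in nearest-subset distance. The function name and both selection rules are assumptions made for illustration and are not stated in this form in the source.

```python
from typing import Dict


def choose_num_subsets(bic: Dict[int, float], d_min: Dict[int, float]) -> int:
    """Pick the number of subsets J.

    bic[J]   : BIC of the clustering with J subsets (lower is better).
    d_min[J] : distance between the two closest subsets in the J-subset clustering.
    """
    # Coarse estimate (simplified): the J with the lowest BIC bounds the search range.
    j_coarse = min(bic, key=bic.get)

    # Fine determination: a large ratio d_min[J] / d_min[J + 1] means that the
    # (J + 1)-subset clustering still contained a comparatively close pair,
    # while the J subsets left after merging that pair are well separated.
    best_j, best_ratio = j_coarse, 0.0
    for j in range(2, j_coarse + 1):
        if j in d_min and (j + 1) in d_min and d_min[j + 1] > 0:
            ratio = d_min[j] / d_min[j + 1]
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
    return best_j
```

If no distance information is available, the sketch falls back to the coarse estimate.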
Through the above calculation, the final clustering [formula image BDA00022499730300001111] is obtained. However, it is not yet known which data each subset of the clustering contains; only the clustering features of each subset are known. This step therefore assigns each data item in the data set $X$ to its corresponding subset. Since outliers are considered in the implementation, a threshold is set [formula not reproduced], where $\rho_s$ represents the value range of the $s$-th continuous attribute and $\epsilon_t$ represents the number of values of the $t$-th categorical attribute. For a data item $x_n$, if [formula not reproduced] and [formula image BDA0002249973030000121], then $x_n$ is assigned to the subset [formula image BDA0002249973030000124]; otherwise $x_n$ is regarded as an outlier.
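The threshold formula and the assignment condition are missing from the text as extracted. A hedged sketch, assuming the rule used in the publicly documented two-step clustering procedure (a threshold equal to the logarithm of the product of all $\rho_s$ and all $\epsilon_t$, and assignment to the nearest subset only when the distance is below that threshold), could look as follows. The distance function is passed in as a parameter because the source does not define it here; all names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, TypeVar

P = TypeVar("P")  # a data item
C = TypeVar("C")  # a subset summary, for example its clustering feature


def outlier_threshold(ranges: Sequence[float], n_values: Sequence[int]) -> float:
    """Assumed threshold: log of the product of all rho_s and all epsilon_t."""
    return sum(math.log(r) for r in ranges) + sum(math.log(e) for e in n_values)


def assign(points: Sequence[P], subsets: Sequence[C],
           distance: Callable[[P, C], float], threshold: float) -> List[Optional[int]]:
    """Return, for each point, the index of its subset, or None for an outlier."""
    labels: List[Optional[int]] = []
    for p in points:
        dists = [distance(p, c) for c in subsets]
        j = min(range(len(subsets)), key=dists.__getitem__)  # nearest subset
        labels.append(j if dists[j] < threshold else None)
    return labels
```

In the toy call `assign([1.0, 9.0, 50.0], [0.0, 10.0], lambda p, c: abs(p - c), outlier_threshold([10.0], [2]))`, the absolute difference is only a stand-in for the real distance between a data item and a subset.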
An embodiment of the present application further provides an electronic device, including: a classification module configured to acquire target data and classify the target data to generate a data group containing multiple types of data; a preprocessing module configured to perform a first clustering operation on the data group based on clustering characteristics to generate a plurality of data sets with different clustering characteristics; and a clustering module configured to perform a second clustering operation on the data sets based on the attribute characteristics of the data sets to generate a plurality of data subsets with different attribute characteristics.
Specifically, as shown in Fig. 3, in this embodiment the classification module acquires target data in different ways; the target data may be network data, local data, historical input data and the like. Acquisition methods include web crawler technology, Internet deep analysis technology and the like, by which the target network data is collected and mined comprehensively and in depth. The acquired target data is then preliminarily classified to facilitate the subsequent clustering operations. The classification may be performed according to the type of the target data, for example by packet classification of the acquired Internet service data. Classifying the acquired target data generates a data group, that is, a mixed collection containing data of multiple different types. For example, when the Internet service data is enterprise service data, classifying it generates a data group containing data related to enterprise service information.
Specifically, in this embodiment, the preprocessing module performs a first clustering operation on the data group based on clustering characteristics, thereby dividing the generated data group into clusters. The clustering characteristics include data structure, content, object and the like; the data in the generated data group carry different clustering characteristics, such as WEB application category, server type and the industry category of the data. The first clustering operation preprocesses the generated data group and divides it preliminarily and rapidly, according to the data with different clustering characteristics that it contains, into a plurality of broad data sets. For example, when the data group consists of data related to enterprise service information, the first clustering operation may divide it into a device type data set, a service type data set, a region data set, a WEB portal data set, an enterprise organization data set and the like.
Specifically, in this embodiment, in order to divide the generated data sets further, the clustering module performs a second clustering operation on the data sets based on their attribute characteristics, which include the attribution, specific type, refined category and the like of the data. The second clustering operation is performed on the data with different attribute characteristics in the data sets and is finer than the first clustering operation. For example, when the generated data sets are a device type data set, a service type data set, a region data set, a WEB portal data set and an enterprise organization data set, the second clustering operation may divide them, according to their attribute characteristics, into a plurality of data subsets. A data subset is a set of data with a particular attribute characteristic, and its data have attribute categories that are refined further relative to the data set, for example a vendor data subset, a model data subset, an industrial control data subset, a financial data subset and so on. The generated data subsets may have different attribute characteristics. The data subsets are precise and directly reflect the distribution and condition of the data.
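The relationship between the data group, the data sets and the data subsets can be illustrated by a deliberately simplified sketch in which plain key-based grouping stands in for both clustering operations. The record fields (`kind`, `vendor`) and the function name are hypothetical; a real implementation would replace the grouping with the clustering operations described in this embodiment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def two_level_partition(records: Iterable[Mapping[str, str]],
                        coarse_key: str, fine_key: str
                        ) -> Dict[str, Dict[str, List[Mapping[str, str]]]]:
    """Split records into data sets by coarse_key, then into data subsets by fine_key."""
    data_sets: Dict[str, Dict[str, list]] = defaultdict(lambda: defaultdict(list))
    for rec in records:
        data_sets[rec[coarse_key]][rec[fine_key]].append(rec)
    # convert the nested defaultdicts to plain dicts for a stable, printable result
    return {k: dict(v) for k, v in data_sets.items()}
```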
Specifically, in this embodiment, the preprocessing module is further configured to preprocess the data group so that the data group containing multiple types of data has the required standard attributes, wherein the preprocessing comprises: reading data units of the same or similar categories in the data group containing multiple types of data, and aggregating the data units of the same or similar categories according to the density of each data unit to generate a plurality of data sets with different clustering characteristics.
Specifically, in this embodiment, the first clustering operation is performed on the data group based on the clustering characteristics, so that the generated data group is divided into clusters. The first clustering operation preprocesses the generated data group in order to divide it preliminarily. The multi-dimensional data group first undergoes preliminary processing, such as missing value handling, noise removal and merging of redundant feature attributes. The preprocessing gives the data group the required standard attributes, which the user defines according to the usage environment, for example a device category standard, a service category standard, a WEB portal standard and the like. The preprocessing proceeds as follows. First, the data units of the same or similar categories in the data group containing multiple types of data are read; a data unit is a small set of data of the same or similar category within the data group. The preliminarily processed data group is standardized and normalized, and the data units are read in a unified dimension.
Next, the data units of the same or similar categories are aggregated according to the density of each data unit. The multidimensional data in the data units, which include application service data units, location data units, WEB service data units and the like, are read one by one. Different data units contain continuous attributes, such as service release time and location area number, and categorical attributes, such as WEB application category, server type, middleware category and the industry category of the text content. Finally, the aggregation generates a plurality of data sets with different clustering characteristics. For example, when the data group consists of data related to enterprise service information, the first clustering operation may divide it into a device type data set, a service type data set, a region data set, a WEB type data set, an enterprise organization data set and the like. Preprocessing the data group and performing the first clustering operation thus generates a plurality of data sets, each being a set of data with a different clustering characteristic.
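The missing value handling and normalization of a continuous attribute can be sketched as follows. The source names these steps without fixing a method, so mean imputation and min-max scaling are assumed choices here, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence


def fill_and_normalize(column: Sequence[Optional[float]]) -> List[float]:
    """Replace missing values by the column mean, then rescale to [0, 1]."""
    present = [v for v in column if v is not None]
    if not present:
        raise ValueError("column has no observed values")
    mean = sum(present) / len(present)
    filled = [mean if v is None else v for v in column]
    lo, hi = min(filled), max(filled)
    if hi == lo:
        return [0.0 for _ in filled]  # a constant column carries no information
    return [(v - lo) / (hi - lo) for v in filled]
```

Bringing continuous attributes such as service release time and location area number onto a common scale keeps any single attribute from dominating the distance used in the later clustering.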
Specifically, in this embodiment, the clustering module is further configured to: clustering and dividing the plurality of data sets by using a second-order clustering algorithm to generate a plurality of data sets to be divided and outlier data, and further dividing the data sets to be divided based on the attribute characteristics of the data sets to be divided to generate a plurality of required data subsets; wherein the outlier data has an attribute characteristic different from an attribute characteristic of the plurality of data sets to be partitioned.
Specifically, in this embodiment, the data sets are merged step by step using an agglomerative method until the desired number of subsets is obtained. The subsets obtained in this way are the plurality of data subsets having different attribute characteristics.
In this embodiment, the data sets are used as the input of the algorithm module for second-order clustering modeling. The model outputs a plurality of data sets to be divided together with outlier data. The outlier data is extracted, and the data sets to be divided are divided further to obtain a plurality of data subsets as the clustering result; the number of groups and the division granularity are determined by the combined coarse estimate and fine determination in the algorithm. The division result depends on the input data; for example, for acquired data related to enterprises and manufacturing, the group division may output an industrial control group, a light industry group, an energy group and the like, depending on the division granularity. After the second clustering operation, the data sets yield a plurality of data subsets according to their attribute characteristics; a data subset is a set of data with a particular attribute characteristic, and its data have attribute categories refined further relative to the data set, such as a vendor data subset, a model data subset, an industrial control data subset, a financial data subset and so on. The generated data subsets may have different attribute characteristics. The data subsets are precise and directly reflect the distribution and condition of the data. In this embodiment, an emerging security industry group, concentrated mainly among camera manufacturers, was discovered through the outlier data, and new feature attributes were determined accordingly: for example, Hikvision-Webs, DVRDVS-Webs and the like were added to the traffic protocol layer of the application service data set, and Hikvision, Dahua and the like were added to the WEB service data.
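How outlier data might suggest new feature attributes can be sketched with a simple frequency count. The source does not specify how outliers are analyzed, so the recurrence rule, the `min_count` parameter, the record field `server` and the function name are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Set


def new_attribute_values(outliers: Iterable[Mapping[str, str]], attribute: str,
                         known: Set[str], min_count: int = 2) -> List[str]:
    """Return values of `attribute` that recur among outliers but are not yet known."""
    counts = Counter(rec[attribute] for rec in outliers if attribute in rec)
    return [value for value, n in counts.most_common()
            if n >= min_count and value not in known]
```

Values returned by such a check would be candidates for review before being added to the fingerprint library, after which the data sets can be clustered again.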
The technical solution of the present application brings the following beneficial effects:
The present application acquires target data comprehensively and in depth using different acquisition modes, classifies the target data preliminarily, rapidly divides the data group into a plurality of broad data sets, and then divides the data sets finely to generate data subsets, so that the data subsets obtained as the final division result are more accurate. The method uses both active detection and passive detection to collect data from a specific Internet space and also supports server logs and third-party extension data, which further enriches the data group. Protocol identification means that services are no longer limited to typical application scenarios such as traditional WEB services; encrypted traffic classification is supported, and Internet service categories are refined down to the application level, numbering in the thousands. Deep analysis technology fully mines the service and device information in the traffic and extends the application and service category attributes. The multi-dimensional division method adopts OLAP multi-dimensional analysis technology combined with the second-order clustering algorithm, which makes the final division more accurate and reflects the service distribution and application conditions in the network more faithfully, for use in comprehensive evaluation and analysis by enterprises and by network operation and supervision authorities.
By combining OLAP multi-dimensional analysis technology with the second-order clustering algorithm, complex network data can be analyzed and processed effectively; network data with mixed-type attributes can be divided into groups in real time, and new network data types can be discovered from the outlier data. Data analysis is based on fingerprint-based protocol identification and protocol parsing technologies and, combined with IP address location queries and URL and domain name library queries, achieves both greater depth and greater breadth in data mining, so that the division dimensions are wider and the granularity is finer. Extension of the whole device relies on the fingerprint library, IP address library and URL classification library of network services; the service types and portal types are therefore highly extensible, group discovery is no longer limited to a particular class of devices or products, and development and maintenance are facilitated.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of clustering data, the method comprising:
acquiring target data, and carrying out classification processing on the target data to generate a data group containing multiple types of data;
performing a first clustering operation on the data group based on the clustering characteristics to generate a plurality of data sets with different clustering characteristics;
and performing a second clustering operation on the data set based on the attribute characteristics of the data set to generate a plurality of data subsets with different attribute characteristics.
2. The method of claim 1, wherein performing the first clustering operation on the data group based on the clustering characteristics to generate a plurality of data sets with different clustering characteristics comprises:
preprocessing the data group so that the data group containing multiple types of data has the required standard attributes, wherein the preprocessing comprises:
reading data units of the same or similar categories in the data group containing multiple types of data, and aggregating the data units of the same or similar categories according to the density of each data unit to generate the plurality of data sets with different clustering characteristics.
3. The method of claim 1, wherein the second clustering operation on the data set based on the attribute characteristics of the data set to generate a plurality of data subsets having different attribute characteristics comprises:
clustering and dividing the data sets by using a second-order clustering algorithm to generate a plurality of data sets to be divided and outlier data, and further dividing the data sets to be divided based on the attribute characteristics of the data sets to be divided to generate a plurality of required data subsets;
wherein the outlier data has an attribute characteristic different from an attribute characteristic of the plurality of datasets to be partitioned.
4. The method of claim 3, further comprising:
performing source tracing analysis on the outlier data;
updating the attribute characteristics of the outlier data based on the source tracing analysis result of the outlier data;
updating the data set based on the updated attribute characteristics of the outlier data so as to classify the updated data set.
5. The method of claim 1, wherein the target data comprises network data, and wherein obtaining target data comprises:
acquiring network data using active detection and/or passive detection;
wherein the active detection comprises: performing port detection on a network space corresponding to the network data, and capturing the network data according to a detection result.
6. The method of claim 1, wherein classifying the target data to generate a data group containing multiple types of data comprises:
performing traffic packet classification and/or extended data processing on the target data, and integrating the target data into a data group containing multiple types of data based on the category attributes of the target data.
7. The method of claim 1, further comprising:
and performing visualization processing on the generated data subset.
8. An electronic device, comprising:
the classification module is used for acquiring target data, classifying the target data and generating a data group containing multiple types of data;
the preprocessing module is used for carrying out first clustering operation on the data group based on the clustering characteristics to generate a plurality of data sets with different clustering characteristics;
and the clustering module is used for performing a second clustering operation on the data set based on the attribute characteristics of the data set to generate a plurality of data subsets with different attribute characteristics.
9. The electronic device of claim 8, wherein the preprocessing module is further configured to: preprocessing the data group to enable the data group containing multiple types of data to have required standard attributes, wherein the preprocessing comprises the following steps: and reading data units with the same category or similar categories in the data group containing the multi-category data, and performing aggregation operation on the data units with the same category or similar categories according to the density of each data unit to generate a plurality of data sets with different clustering characteristics.
10. The electronic device of claim 8, wherein the clustering module is further configured to: clustering and dividing the data sets by using a second-order clustering algorithm to generate a plurality of data sets to be divided and outlier data, and further dividing the data sets to be divided based on the attribute characteristics of the data sets to be divided to generate a plurality of required data subsets; wherein the outlier data has an attribute characteristic different from an attribute characteristic of the plurality of datasets to be partitioned.
CN201911030402.5A 2019-10-28 2019-10-28 Data clustering method and electronic equipment Active CN110765329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911030402.5A CN110765329B (en) 2019-10-28 2019-10-28 Data clustering method and electronic equipment

Publications (2)

Publication Number Publication Date
CN110765329A true CN110765329A (en) 2020-02-07
CN110765329B CN110765329B (en) 2022-09-23

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111711633A (en) * 2020-06-22 2020-09-25 中国科学技术大学 Multi-stage fused encrypted traffic classification method
CN111786903A (en) * 2020-05-28 2020-10-16 西安电子科技大学 Network traffic classification method based on constrained fuzzy clustering and particle computation
CN112884091A (en) * 2021-04-28 2021-06-01 睿至科技集团有限公司 Intelligent data analysis method based on big data and terminal equipment thereof

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101808339A (en) * 2010-04-06 2010-08-18 哈尔滨工业大学 Telephone traffic subdistrict self-adaptive classification method applying K-MEANS and prior knowledge
CN102314519A (en) * 2011-10-11 2012-01-11 中国软件与技术服务股份有限公司 Information searching method based on public security domain knowledge ontology model
CN103927643A (en) * 2014-04-30 2014-07-16 洪剑 Optimization method for large-scale order processing and distributing route
CN104573050A (en) * 2015-01-20 2015-04-29 安徽科力信息产业有限责任公司 Continuous attribute discretization method based on Canopy clustering and BIRCH hierarchical clustering
CN105471670A (en) * 2014-09-11 2016-04-06 中兴通讯股份有限公司 Flow data classification method and device
CN107609102A (en) * 2017-09-12 2018-01-19 电子科技大学 A kind of short text on-line talking method
CN107633007A (en) * 2017-08-09 2018-01-26 五邑大学 A kind of comment on commodity data label system and method based on stratification AP clusters
CN109508748A (en) * 2018-11-22 2019-03-22 北京奇虎科技有限公司 A kind of clustering method and device
US20190213357A1 (en) * 2016-08-10 2019-07-11 Siemens Aktiengesellschaft Big Data K-Anonymizing by Parallel Semantic Micro-Aggregation
CN110348526A (en) * 2019-07-15 2019-10-18 武汉绿色网络信息服务有限责任公司 A kind of device type recognition methods and device based on semi-supervised clustering algorithm

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101808339A (en) * 2010-04-06 2010-08-18 Harbin Institute of Technology Telephone traffic subdistrict self-adaptive classification method applying K-MEANS and prior knowledge
CN102314519A (en) * 2011-10-11 2012-01-11 China National Software & Service Co., Ltd. Information searching method based on public security domain knowledge ontology model
CN103927643A (en) * 2014-04-30 2014-07-16 Hong Jian Optimization method for large-scale order processing and distributing route
CN105471670A (en) * 2014-09-11 2016-04-06 ZTE Corporation Flow data classification method and device
CN104573050A (en) * 2015-01-20 2015-04-29 Anhui Keli Information Industry Co., Ltd. Continuous attribute discretization method based on Canopy clustering and BIRCH hierarchical clustering
US20190213357A1 (en) * 2016-08-10 2019-07-11 Siemens Aktiengesellschaft Big Data K-Anonymizing by Parallel Semantic Micro-Aggregation
CN107633007A (en) * 2017-08-09 2018-01-26 Wuyi University Product review data labeling system and method based on hierarchical AP clustering
CN107609102A (en) * 2017-09-12 2018-01-19 University of Electronic Science and Technology of China Short text online clustering method
CN109508748A (en) * 2018-11-22 2019-03-22 Beijing Qihoo Technology Co., Ltd. Clustering method and device
CN110348526A (en) * 2019-07-15 2019-10-18 Wuhan Greenet Information Service Co., Ltd. Device type recognition method and device based on semi-supervised clustering algorithm

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111786903A (en) * 2020-05-28 2020-10-16 Xidian University Network traffic classification method based on constrained fuzzy clustering and granular computing
CN111711633A (en) * 2020-06-22 2020-09-25 University of Science and Technology of China Multi-stage fused encrypted traffic classification method
CN111711633B (en) * 2020-06-22 2021-08-13 University of Science and Technology of China Multi-stage fused encrypted traffic classification method
CN112884091A (en) * 2021-04-28 2021-06-01 Ruizhi Technology Group Co., Ltd. Intelligent data analysis method based on big data and terminal equipment thereof

Also Published As

Publication number Publication date
CN110765329B (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN110765329B (en) Data clustering method and electronic equipment
CN111506599B (en) Industrial control equipment identification method and system based on rule matching and deep learning
CN116739389A (en) Smart city management method and system based on cloud computing
US20160253495A1 (en) Cyber security
US10691795B2 (en) Quantitative unified analytic neural networks
CN111565205A (en) Network attack identification method and device, computer equipment and storage medium
CN110046297B (en) Operation and maintenance violation identification method and device and storage medium
CN113328985B (en) Passive Internet of things equipment identification method, system, medium and equipment
CN113612763B (en) Network attack detection device and method based on network security malicious behavior knowledge base
US11449604B2 (en) Computer security
CN107483451B (en) Method and system for processing network security data based on serial-parallel structure and social network
CN111294233A (en) Network alarm statistical analysis method, system and computer readable storage medium
CN114553591B (en) Training method of random forest model, abnormal flow detection method and device
CN117216660A (en) Method and device for detecting abnormal points and abnormal clusters based on time sequence network traffic integration
CN114172688A (en) Encrypted traffic network threat key node automatic extraction method based on GCN-DL
CN113205134A (en) Network security situation prediction method and system
US11477225B2 (en) Pre-emptive computer security
GB2582609A (en) Pre-emptive computer security
US11436320B2 (en) Adaptive computer security
CN111935185A (en) Method and system for constructing large-scale trapping scene based on cloud computing
CN109344913B (en) Network intrusion behavior detection method based on improved MajorCluster clustering
CN110311870B (en) SSL VPN flow identification method based on density data description
Bista et al. DDoS attack detection using heuristics clustering algorithm and naïve bayes classification
CN116502171B (en) Network security information dynamic detection system based on big data analysis algorithm
US20160239264A1 (en) Re-streaming time series data for historical data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant