CN114492613A - Internet of things and non-Internet of things equipment identification method, system, terminal and readable storage medium - Google Patents

Publication number
CN114492613A
Authority
CN
China
Prior art keywords
equipment
internet
things
random forest
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210067478.0A
Other languages
Chinese (zh)
Inventor
杨家海
樊琳娜
韩鹍
李国朋
耿君峰
杨洋
时晨
刘晶
武备
王喆
冉淏丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202210067478.0A
Publication of CN114492613A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Y INFORMATION AND COMMUNICATION TECHNOLOGY SPECIALLY ADAPTED FOR THE INTERNET OF THINGS [IoT]
    • G16Y20/00 Information sensed or collected by the things
    • G16Y20/20 Information sensed or collected by the things relating to the thing itself
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Y INFORMATION AND COMMUNICATION TECHNOLOGY SPECIALLY ADAPTED FOR THE INTERNET OF THINGS [IoT]
    • G16Y40/00 IoT characterised by the purpose of the information processing
    • G16Y40/10 Detection; Monitoring
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Y INFORMATION AND COMMUNICATION TECHNOLOGY SPECIALLY ADAPTED FOR THE INTERNET OF THINGS [IoT]
    • G16Y40/00 IoT characterised by the purpose of the information processing
    • G16Y40/50 Safety; Security of things, users, data or systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method, a system, a terminal and a readable storage medium for identifying Internet of things and non-Internet of things devices, which identify such devices through improvements in feature extraction, representative device selection, model updating and feature reduction. Traffic features and protocol features are extracted from network traffic to build an initial random forest model. Features are then extracted for newly added devices, representative devices are selected from them for label verification and used to update the model, and during model updating the importance of each feature is judged through feature reduction so that unimportant features are deleted, further improving the identification accuracy of the model. The method updates the model automatically and achieves higher identification accuracy for newly added devices.

Description

Internet of things and non-Internet of things equipment identification method, system, terminal and readable storage medium
Technical Field
The invention belongs to the technical field of Internet of things identification, and particularly relates to a method, a system, a terminal and a readable storage medium for identifying Internet of things and non-Internet of things equipment.
Background
The rapid development of the internet of things technology brings various conveniences to daily life and industrial production, and also provides wide market space for equipment manufacturers, internet service providers and application developers. However, the development of internet of things also brings various challenges to network management and network security. From the perspective of network management, a network administrator generally needs to know the number and types of internet of things devices in a network to facilitate management, but a large number of internet of things devices are difficult to manage, and some devices are located in remote locations away from enterprise-centric facilities. On the other hand, because the internet of things devices have limited hardware and software resources, traditional defense measures are difficult to deploy, and therefore the internet of things devices are becoming targets for attack by attackers. Identifying internet of things devices and monitoring their status is of great significance to asset management and security management.
At present, methods for identifying internet of things equipment from passive traffic are mainly divided into three types of methods based on equipment information, static rules and machine learning.
The identification method based on device information identifies the device manufacturer from the MAC OUI (Organizationally Unique Identifier), and identifies device information from the User-Agent field of HTTP requests or from the host name in DHCP messages. However, since the NIC (Network Interface Controller) of an Internet of things device is usually provided by a third party, most MAC OUIs do not reveal the device manufacturer; HTTP requests are typically encrypted, making it difficult to read device information from the User-Agent field; and many devices do not set a host name in their DHCP requests, so it is generally difficult to identify a device from DHCP.
The identification method based on static rules identifies the same device in passive traffic using rules, i.e. from the server IP addresses the device is known to connect to or from the domain names in its DNS requests. This method is difficult to extend, and it is difficult to distinguish different devices of the same manufacturer with it.
The machine-learning-based identification method mainly uses supervised learning. It is an active topic of current research and can achieve classification accuracy of over 99%. However, none of the above three identification methods can effectively determine whether a newly added device is an Internet of things device or a non-Internet of things device.
Disclosure of Invention
The invention aims to provide a method, a system, a terminal and a readable storage medium for identifying Internet of things equipment and non-Internet of things equipment.
In a first aspect, the invention provides an Internet of things and non-Internet of things equipment identification method, which comprises the following steps:
establishing, based on the extracted features and the device labels, a random-forest-based identification model for Internet of things and non-Internet of things devices; the device label is a classification label indicating whether a device is an Internet of things device or a non-Internet of things device, and the features at least comprise traffic features;
extracting features of newly added equipment, and identifying the equipment as Internet of things equipment or non-Internet of things equipment based on the identification model;
and updating the random forest model by using the flow characteristics of the new equipment.
From the above, the identification model based on the random forest model is dynamically updated by using the data of the new device, so that the identification model can be continuously updated, and the internet of things device or the non-internet of things device can be more accurately identified for the new device. Meanwhile, the concept drift problem caused by continuous addition of new equipment can be solved.
Further optionally, the model update process comprises online sampling and tree destruction detection;
online sampling collects the traffic features of the new device within a time window and records them as an instance;
tree destruction detection is as follows: the trees in the random forest model are updated according to the number of updates for the instance; after the random forest has been updated with the instances of the new device, it is judged whether each tree has been destroyed; and for each destroyed tree, a new tree is built from the original data set combined with the new device data, thereby updating the random forest model.
Further optionally, whether each tree is destroyed is judged according to the following Standard 1 and/or Standard 2;
Standard 1: for each internal node of the trees in the random forest, calculate the change in the Gini coefficient before the update, $\Delta Gini$, and after the update, $\Delta Gini'$; if there is an internal node for which $\Delta Gini$ and $\Delta Gini'$ satisfy
Figure BDA0003480716600000021
the corresponding tree needs to be destroyed; $\alpha$ is a preset hyperparameter, $0 < \alpha < 1$;
Standard 2: for the leaf nodes of the trees in the random forest, if there is a leaf node satisfying $p'_{main_c} < \beta$, the corresponding tree needs to be destroyed, where $main_c$ is the device label of the majority class at the leaf node before the tree is updated, and $p'_{main_c}$ is the proportion of the device label $main_c$ at the leaf node after the tree is updated.
Further optionally, the new device used to update the random forest model is a representative device selected according to the following rules;
obtaining the feature set of the new devices $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ denotes the features of the $i$-th new device;
clustering the feature set $X$ of the new devices and calculating the silhouette coefficient $sc_k$ of each round of clustering, where $k$ is the number of cluster centers in that round, $k$ takes the values from 2 to BUDGET in turn, and BUDGET is the preset maximum number of devices to be identified;
determining the optimal number of cluster centers $K$ as the $k$ with the maximum silhouette coefficient $sc_k$;
and selecting a specific new device from each cluster as a representative device, based on the clustering result corresponding to the optimal number of cluster centers $K$.
Selecting representative devices for model updating, on the one hand, picks the more representative new devices so that the updated random forest model has higher identification accuracy; on the other hand, it reduces the labeling work of the administrator and the amount of subsequent computation.
Further optionally, when a specific new device is selected from each cluster as the representative device, the new device with the largest prediction probability variance in each cluster is selected;
wherein the prediction probability variance of the $i$-th device is $variance\_probs_i = \mathrm{Variance}(prob_i)$, $\mathrm{Variance}$ is the variance function, and $prob_i$ denotes the probabilities with which each tree of the random forest predicts the $i$-th device to be an Internet of things device.
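The variance-based selection can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `pick_representatives`, the list-based inputs and the use of the population variance are assumptions, since the text only names a generic variance function.

```python
from statistics import pvariance


def pick_representatives(tree_probs, cluster_labels):
    """Return {cluster_id: device_index} for the device whose per-tree
    IoT probabilities have the largest variance within each cluster.

    tree_probs[i]     -- per-tree probabilities for device i (prob_i)
    cluster_labels[i] -- cluster id assigned to device i
    Population variance is an assumption made for this sketch.
    """
    best = {}  # cluster id -> (device index, variance)
    for idx, (probs, label) in enumerate(zip(tree_probs, cluster_labels)):
        var = pvariance(probs)
        if label not in best or var > best[label][1]:
            best[label] = (idx, var)
    return {label: idx for label, (idx, _) in best.items()}


# Hypothetical per-tree probabilities for four new devices in two clusters.
example_probs = [[0.9, 0.9, 0.9], [0.1, 0.9, 0.5], [0.2, 0.2, 0.3], [0.0, 1.0, 0.0]]
example_labels = [0, 0, 1, 1]
representatives = pick_representatives(example_probs, example_labels)
```

A high variance means the trees disagree about the device, so labeling that device is likely to be more informative for the update than labeling one on which the trees already agree.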
Further optionally, the method further comprises: judging feature importance based on feature reduction, and deleting unimportant features based on the feature importance; wherein the feature importance is characterized according to the following formulas;
the feature importance $Import(t, f, T_i)$ of feature $f$ in tree $t$ of the random forest is expressed as:
$Import(t, f, T_i) = \sum_{n \in t} \mathbb{1}(split(n) = f) \cdot \Delta Gini_n$
where $split(n)$ denotes the split feature of node $n$, $T_i$ denotes the time, $\Delta Gini_n$ denotes the change in the Gini coefficient at node $n$, and $\mathbb{1}$ is the indicator function;
the feature importance $Import(f, T_i)$ of feature $f$ in the random forest is expressed as:
Figure BDA0003480716600000031
where $|trees|$ denotes the number of trees in the random forest;
the feature importance $score(f, T_i)$ of feature $f$ over the whole process is expressed as:
$score(f, T_i) = score(f, T_{i-1}) \cdot \gamma + Import(f, T_i) \cdot (1 - \gamma)$
where $score(f, T_1) = Import(f, T_1)$, and $\gamma$ is a hyperparameter representing the discount factor for the historical score, $0 < \gamma < 1$.
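To see what the discount factor does, the recurrence can be unrolled. The expansion below assumes the initial condition is $score(f, T_1) = Import(f, T_1)$; everything else follows from repeated substitution.

```latex
\begin{aligned}
score(f, T_2) &= \gamma \, Import(f, T_1) + (1-\gamma)\, Import(f, T_2) \\
score(f, T_3) &= \gamma^{2} \, Import(f, T_1) + \gamma (1-\gamma)\, Import(f, T_2) + (1-\gamma)\, Import(f, T_3) \\
score(f, T_i) &= \gamma^{\,i-1} \, Import(f, T_1) + (1-\gamma) \sum_{j=2}^{i} \gamma^{\,i-j} \, Import(f, T_j)
\end{aligned}
```

The score is therefore an exponentially weighted average of the per-update importances: a value of $\gamma$ close to 1 keeps a long memory of past importance, while a value close to 0 makes the score track the most recent update.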
Further optionally, the features of the device comprise traffic features and protocol features;
wherein the traffic features are some or all of: the number of packets, the number of bytes transmitted, the number of distinct packet lengths, and the mean, variance and entropy of the packet length, each for the inbound flow, the outbound flow and the bidirectional flow;
the protocol features are some or all of: the number of distinct destination IP addresses; the numbers of IPv4, IPv6, TCP and UDP packets; the numbers of local and remote TCP/UDP port numbers falling within the intervals [0, 1024), [1024, 49152) and [49152, 65535]; the number of distinct port numbers, their count and entropy; the maximum, minimum, number of distinct values and entropy of the TCP window size; the number of distinct domain names, their access count and entropy; and the number of TLS handshakes.
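As one illustration of a protocol feature, the port-interval counts could be computed as sketched below. The range names, the function name `port_range_counts` and the choice to count distinct port numbers are assumptions for this sketch; the intervals are read as half-open except the last.

```python
# Port intervals named in the text; upper bounds are exclusive here,
# so (49152, 65536) covers the closed interval [49152, 65535].
PORT_RANGES = {
    "well_known": (0, 1024),
    "registered": (1024, 49152),
    "dynamic": (49152, 65536),
}


def port_range_counts(ports):
    """Count how many distinct port numbers fall into each interval.

    Counting distinct ports (rather than packets) is an assumption.
    """
    counts = {name: 0 for name in PORT_RANGES}
    for port in set(ports):
        for name, (low, high) in PORT_RANGES.items():
            if low <= port < high:
                counts[name] += 1
                break
    return counts


# Hypothetical remote ports observed for one device.
example_counts = port_range_counts([80, 443, 443, 8080, 50000])
```

The same function would be applied separately to local and remote ports, giving six counts per device.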
In a second aspect, the present invention provides a system based on the above method, which includes:
the characteristic extraction module is used for extracting the characteristics of the equipment;
the identification model construction module is used for constructing identification models of the Internet of things equipment and the non-Internet of things equipment based on the random forest model based on the extracted features and the equipment labels;
and the model updating module is used for updating the random forest model by utilizing the flow characteristics of the new equipment.
In a third aspect, the present invention provides an electronic terminal comprising at least:
one or more processors; and one or more memories;
wherein the memory stores a computer program that the processor invokes to implement:
the steps of the method for identifying Internet of things and non-Internet of things devices.
In a fourth aspect, the present invention provides a readable storage medium storing a computer program to be invoked by a processor to implement:
the steps of the method for identifying Internet of things and non-Internet of things devices.
Advantageous effects
1. The invention provides a method for identifying Internet of things and non-Internet of things devices. First, it builds an identification model on the basis of a random forest model to identify a device as an Internet of things device or a non-Internet of things device. Second, it updates the random forest model with new device data so that the identification model maintains good identification accuracy. Compared with existing models, the random forest model can be updated automatically, which effectively improves the identification accuracy for new devices and addresses the concept drift caused by the continuous addition of new devices.
2. In a further optimized scheme of the invention, representative new devices are selected from a series of new devices to update the random forest model. On the one hand, the selected new devices are representative, so the updated model is more reliable; on the other hand, the amount of computation is reduced and the labeling effort of the administrator is lightened.
3. In a further preferred scheme of the invention, the method also performs a feature reduction operation that deletes unimportant features and retains the more important ones, so that the random forest model achieves higher identification accuracy and more reliable results.
Drawings
Fig. 1 is a schematic diagram of an identification method for devices in the internet of things and devices in the non-internet of things in an open environment according to an embodiment of the present invention.
Fig. 2 is a flowchart of a method for identifying devices in the internet of things and devices in the non-internet of things in an open environment according to embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of model update provided in the embodiment of the present invention.
Fig. 4 is a schematic diagram of the significance of features provided by an embodiment of the invention.
FIG. 5 is a schematic diagram of the generalization performance of the model provided by the embodiment of the present invention.
FIG. 6 is a diagram illustrating the accuracy of a model for deleting 30% of unimportant features according to an embodiment of the invention.
FIG. 7 is a diagram illustrating the accuracy of a model for deleting 50% of insignificant features according to an embodiment of the present invention.
Detailed Description
The invention provides a method for identifying Internet of things and non-Internet of things devices, which identifies a device as an Internet of things device or a non-Internet of things device. To improve the identification accuracy of the model and handle the identification of new devices, the invention updates the model with the data of new devices, which effectively improves the identification accuracy for new devices and mitigates the concept drift caused by their continuous addition. Without departing from the concept of the invention, several further optimizations are introduced, including feature reduction and model updating with selected representative new devices. Embodiment 1 below is described with several of these improvements combined, including representative device selection, the model features and feature reduction. It should be understood that other optimizations may be made on the basic scheme, or only some of the improvements in Embodiment 1 may be adopted, without departing from the concept of the invention, and the invention is not specifically limited in this respect.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. The present invention will be further described with reference to the following examples.
Example 1:
The method for identifying Internet of things and non-Internet of things devices provided by this embodiment comprises the following steps:
S101: Extract features from the network traffic, and build an initial random forest model from the extracted traffic features, protocol features and label data.
In this embodiment, the features of a device are defined as traffic features and protocol features, both extracted from the network traffic of the device. The traffic features in this embodiment include, for the inbound flow, the outbound flow and the bidirectional flow, the number of packets, the number of bytes transmitted, the number of distinct packet lengths, and the mean, variance and entropy of the packet length. The protocol features include the number of distinct destination IP addresses; the numbers of IPv4, IPv6, TCP and UDP packets; the numbers of local and remote TCP/UDP port numbers falling within the intervals [0, 1024), [1024, 49152) and [49152, 65535]; the number of distinct port numbers, their count and entropy; the maximum, minimum, number of distinct values and entropy of the TCP window size; the number of distinct domain names, their access count and entropy; and the number of TLS handshakes. It should be understood that in other possible embodiments the composition of the traffic features and protocol features may be adjusted according to actual needs and effects, for example by using only a subset of the traffic features or a subset of the protocol features listed above.
For the initial random forest model, devices with known device labels (i.e. devices known to be Internet of things devices or non-Internet of things devices) are used as the original samples, and the random forest model is trained on their features and labels to obtain the identification model. Given the features of a device, the identification model outputs its device label, i.e. the classification of the device as an Internet of things device or a non-Internet of things device.
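The packet-length statistics among the traffic features can be sketched as follows. This is a minimal illustration under assumptions: the function names and dictionary keys are hypothetical, the input is a plain list of packet lengths in bytes for one flow direction, the variance is the population variance, and the entropy is Shannon entropy in bits.

```python
import math
from collections import Counter
from statistics import mean, pvariance


def length_entropy(lengths):
    """Shannon entropy (bits) of the packet-length distribution."""
    total = len(lengths)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(lengths).values())


def flow_features(lengths):
    """Traffic features for one direction (inbound, outbound or both)."""
    if not lengths:
        # No packets observed in the window: all features default to zero.
        return {"packets": 0, "bytes": 0, "distinct_lengths": 0,
                "mean_length": 0.0, "length_variance": 0.0,
                "length_entropy": 0.0}
    return {
        "packets": len(lengths),
        "bytes": sum(lengths),
        "distinct_lengths": len(set(lengths)),
        "mean_length": mean(lengths),
        "length_variance": pvariance(lengths),
        "length_entropy": length_entropy(lengths),
    }


# Hypothetical packet lengths captured for one device in one window.
example_features = flow_features([60, 60, 1500, 1500])
```

Calling `flow_features` once each on the inbound, outbound and combined packet lists would yield the three groups of traffic features described above.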
In this embodiment, step S101 also adopts the same feature extraction means to extract features of the new device. It should be understood that in other possible embodiments, the feature extraction of the new device may be performed in the subsequent step S102.
S102: Select some representative devices from the new devices for label verification, collect the traffic of the representative devices, and update the model.
The feature of the new device is input into the random forest model to obtain a recognition result, and whether the recognition result is correct or not can be further verified. In order to improve the model updating efficiency and reduce the marking work of the administrator in the embodiment, a representative device is selected from a series of new devices to update the model, and the representative device is selected in the following manner. It is to be understood that this embodiment is the best mode of realisation, but the invention is not restricted to having to select a representative device or to how it can be selected without departing from the basic concept.
The representative devices are selected according to the following rules:
(1) Obtain the feature set of the new devices $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ denotes the features of the $i$-th new device, and obtain the prediction probabilities $pre\_probs = \{prob_1, prob_2, \ldots, prob_N\}$ from the random forest model. Here $N$ is the number of new devices and $prob_i = (prob_{i,t',1}, prob_{i,t',2}, \ldots, prob_{i,t',T})$, where $T$ is the number of trees in the random forest and $t'$ denotes the time. $prob_i$ collects the probabilities with which each tree of the random forest predicts the $i$-th device to be an Internet of things device; for example, $prob_{i,t',2}$ is the probability that the 2nd tree at time $t'$ predicts the $i$-th device to be an Internet of things device.
(2) Cluster the feature set $X$ of the new devices and calculate the silhouette coefficient $sc_k$ of each round of clustering, where $k$ is the number of cluster centers in that round, $k$ takes the values from 2 to BUDGET in turn, and BUDGET is the preset maximum number of devices to be identified. The silhouette coefficient is $sc_k = (b - a)/\max(a, b)$, where $a$ is the average distance between $x_i$ and the other samples in its own cluster (cohesion) and $b$ is the average distance between $x_i$ and all samples in the nearest other cluster (separation).
(3) Determine the optimal number of cluster centers $K$ as the $k$ with the maximum silhouette coefficient $sc_k$.
(4) Select a specific new device from each cluster as a representative device, based on the clustering result corresponding to the optimal number of cluster centers $K$. In this embodiment, the device with the largest $variance\_probs$ in each cluster is taken as the representative device, where the prediction probability variance of the $i$-th device is $variance\_probs_i = \mathrm{Variance}(prob_i)$.
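Steps (2) and (3) can be sketched as below. The text does not name the clustering algorithm, so this sketch takes the cluster assignments for each candidate k as given; the function names, the Euclidean distance, the averaging of per-sample coefficients into one value per k, and the score of 0 for singleton clusters are all assumptions.

```python
import math


def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all samples.

    Requires at least two clusters. Singleton clusters score 0 by the
    usual convention (an assumption; the text does not cover this case).
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other samples of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the samples of the nearest other cluster
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def best_k(points, clusterings):
    """Pick the k whose labels give the largest silhouette coefficient.

    clusterings maps each candidate k (2..BUDGET) to a list of labels.
    """
    return max(clusterings, key=lambda k: silhouette(points, clusterings[k]))


# Hypothetical 2-D feature vectors and two candidate clusterings.
example_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
example_clusterings = {2: [0, 0, 1, 1], 3: [0, 1, 2, 2]}
chosen_k = best_k(example_points, example_clusterings)
```

In the example the two tight groups are best described by k = 2; splitting one group into singletons for k = 3 lowers the mean coefficient.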
The model updating process comprises online sampling and tree destruction detection. Online sampling collects the traffic of the new device within a time window and extracts its features; the result is recorded as an instance. Tree destruction detection proceeds as follows: after the number of updates Q for the instance is determined, the trees in the random forest are updated according to Q; after the random forest has been updated with the new device instances, whether each tree has been destroyed is judged against the criteria below; and for each destroyed tree, a new tree is built from the original data set combined with the new device traffic data, thereby updating the random forest.
Updating the trees in the random forest according to the number of updates Q comprises the following:
according to the split features and thresholds in the tree, find the path of the instance from the root node to a leaf node, and add the instance to every node along that path. That is, the new device instance falls from the root node to a leaf node in each tree of the random forest, so that the nodes it passes through in each existing tree now include the new device instance.
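Routing an instance down one tree and recording it at every node on the path might look like the sketch below. The nested-dictionary node layout (`name`, `counts`, `feature`, `threshold`, `left`, `right`), the `<=` split convention and the function name are hypothetical choices made for illustration only.

```python
def route_and_update(node, features, label):
    """Walk from the root to a leaf, adding the labelled instance to the
    class counts of every node on the path. Returns the node names visited.
    """
    path = []
    while True:
        node["counts"][label] = node["counts"].get(label, 0) + 1
        path.append(node["name"])
        if "feature" not in node:  # a node without a split is a leaf
            return path
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]


# Hypothetical one-split tree over a single protocol feature.
example_tree = {
    "name": "root", "counts": {},
    "feature": "tls_handshakes", "threshold": 3,
    "left": {"name": "leaf_a", "counts": {}},
    "right": {"name": "leaf_b", "counts": {}},
}
example_path = route_and_update(example_tree, {"tls_handshakes": 5}, "iot")
```

Keeping per-node class counts is what later allows the Gini-based and leaf-proportion checks to compare each node before and after the update.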
After the random forest has been updated with the new device instances, whether each tree has been destroyed is judged against two criteria.
Standard 1: for each internal node of the trees in the random forest, calculate the change in the Gini coefficient before the update, $\Delta Gini$, and after the update, $\Delta Gini'$; if there is an internal node for which they satisfy
Figure BDA0003480716600000071
the tree needs to be destroyed; $\alpha$ is a preset hyperparameter, $0 < \alpha < 1$;
Standard 2: for the leaf nodes of the trees in the random forest, if there is a leaf node satisfying $p'_{main_c} < \beta$, the tree should be destroyed. Here $main_c$ is the device label of the majority class at the leaf node before the tree is updated, and $p'_{main_c}$ is the proportion of that label at the leaf node after the tree is updated; if this proportion is less than $\beta$, the tree is destroyed. $\beta$ is a preset hyperparameter, $0.5 \le \beta < 1$.
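Standard 2 can be sketched for a single leaf as follows. This reading takes $main_c$ as the majority label before the update and $p'_{main_c}$ as that label's share after the update, as in the summary of the invention; the function name and the dictionary representation of class counts are assumptions.

```python
def leaf_violates_standard2(counts_before, counts_after, beta):
    """True if the pre-update majority label now holds less than `beta`
    of the instances at this leaf (0.5 <= beta < 1)."""
    main_c = max(counts_before, key=counts_before.get)
    total_after = sum(counts_after.values())
    p_after = counts_after.get(main_c, 0) / total_after
    return p_after < beta


# Hypothetical leaf: "iot" was the majority (8 of 10) before the update;
# four new "non_iot" instances arrive, so its share drops to 8 / 14.
before = {"iot": 8, "non_iot": 2}
after = {"iot": 8, "non_iot": 6}
destroy_strict = leaf_violates_standard2(before, after, beta=0.6)
destroy_loose = leaf_violates_standard2(before, after, beta=0.5)
```

A tree would be marked for destruction as soon as any one of its leaves returns `True`; a larger `beta` makes the check more sensitive to new devices that disagree with the existing leaf label.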
S103: During model updating, judge the importance of the features based on feature reduction and delete the unimportant features. Specifically, the invention characterizes feature importance according to the following formulas:
(1) The feature importance of feature $f$ in tree $t$ at time $T_i$ is calculated as:
$Import(t, f, T_i) = \sum_{n \in t} \mathbb{1}(split(n) = f) \cdot \Delta Gini_n$
where $split(n)$ denotes the split feature of node $n$, $\Delta Gini_n$ denotes the change in the Gini coefficient at node $n$, and $\mathbb{1}$ is the indicator function;
(2) The feature importance of feature $f$ in the random forest at time $T_i$ is calculated as:
Figure BDA0003480716600000072
where $|trees|$ denotes the number of trees in the random forest;
(3) The feature importance of feature $f$ over the whole process is calculated as:
$score(f, T_i) = score(f, T_{i-1}) \cdot \gamma + Import(f, T_i) \cdot (1 - \gamma)$;
where $score(f, T_1) = Import(f, T_1)$, and $\gamma$ is a hyperparameter representing the discount factor for the historical score, $0 < \gamma < 1$. In this embodiment, the feature importance calculated in step (3) is used as the index for deciding whether a feature is important; as a specific criterion, a threshold may be set according to the accuracy requirement and experiments. In other possible embodiments, the results of step (1) and/or step (2) may be used, in other calculation manners, to determine the final index. Note also that $T_i$ is a generic time index, and the invention is not restricted to specific times.
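The three steps can be sketched together as below. Step (1) and step (3) follow the formulas in the text. Step (2) is only available as an image in this copy, so averaging the per-tree importances over $|trees|$ is an assumption of this sketch, as are the function names and the list-of-nodes tree representation.

```python
def tree_importance(nodes, feature):
    """Step (1): sum of Gini changes over the nodes that split on `feature`."""
    return sum(n["delta_gini"] for n in nodes if n["split"] == feature)


def forest_importance(trees, feature):
    """Step (2), ASSUMED form: mean of the per-tree importances."""
    return sum(tree_importance(t, feature) for t in trees) / len(trees)


def update_score(prev_score, importance, gamma):
    """Step (3): discounted running score with discount factor gamma."""
    return prev_score * gamma + importance * (1 - gamma)


# Hypothetical forest of two trees, each given as a list of internal nodes.
example_forest = [
    [{"split": "pkt_len_entropy", "delta_gini": 0.2},
     {"split": "tls_handshakes", "delta_gini": 0.1}],
    [{"split": "pkt_len_entropy", "delta_gini": 0.4}],
]
importance_now = forest_importance(example_forest, "pkt_len_entropy")
score_now = update_score(prev_score=0.3, importance=0.5, gamma=0.9)
```

Features whose running score stays below a chosen threshold would then be candidates for deletion; the threshold itself is left to the accuracy requirement and experiments, as the text notes.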
It should be further noted that step S103 is the feature reduction process. Although the above embodiment is described in the order of steps S101 to S103, applying the invention to device identification is not limited to this execution order. For example, the feature reduction of step S103 may be performed after the model has been running for a period of time. Likewise, model updating may be performed after a certain number of new devices have accumulated, and the update period may be adjusted according to actual requirements.
The technical effects of the present invention will be described in detail with reference to simulation experiments.
To verify the effect of the model, it is evaluated on a public data set and on an IoT environment built in a laboratory. The evaluation comprises 6 groups of experiments; in each group, 8 batches of new devices are randomly drawn, each batch containing between 2 and 6 new devices, and the remaining devices are used as known devices.
First, the change in generalization performance after each model update is evaluated. As shown in FIG. 5, the prediction error rate trends downward as the model is updated, because the number of prediction errors of $RF_i$ minus the number of prediction errors of $RF_{i-1}$ is well below 0.
Next, the present invention evaluates the device identification performance after feature importance changes and removal of insignificant features, as shown in fig. 6.
As can be seen from FIG. 6, after a certain proportion of the unimportant features are deleted, the model identification accuracy is improved.
Finally, the model is compared with the existing work "IoT or NoT: Identifying IoT Devices in a Short Time Scale", which uses two models, UnifiedClassifier and CombClassifier; the results are shown in Table 1.
TABLE 1 comparison of recognition results by different methods
Figure BDA0003480716600000081
As can be seen from Table 1, the recognition accuracy of the method of the present invention exceeds that of both UnifiedClassifier and CombClassifier, which demonstrates its effectiveness.
Example 2:
This embodiment provides a system based on the above method for identifying Internet of things and non-Internet of things equipment, which includes:
A feature extraction module, used for extracting the features of the equipment, whether original equipment or new equipment.
An identification model building module, used for building a random-forest-based identification model for Internet of things equipment and non-Internet of things equipment from the extracted features and the equipment labels. As in Example 1, devices with known device labels (i.e., devices known to be Internet of things devices or non-Internet of things devices) are used as original samples, and a random forest model is trained with the features and device labels of these devices to obtain the identification model. A device can then be identified as Internet of things equipment or non-Internet of things equipment based on the identification model.
A model updating module, used for updating the random forest model with the flow characteristics of the new equipment.
A feature reduction module, used for judging the importance of the features and deleting the unimportant features.
The implementation process of each functional module unit may refer to the description of the corresponding method.
The division of the functional module units is only one division of logical functions, and other division manners may be available in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. Meanwhile, the integrated unit can be realized in a hardware form, and can also be realized in a software functional unit form.
Example 3:
The present invention provides an electronic terminal comprising one or more processors and one or more memories, wherein the memory stores a computer program that the processor invokes to implement the method for identifying Internet of things and non-Internet of things equipment.
The specific implementation is as follows:
establishing a random-forest-based identification model for Internet of things equipment and non-Internet of things equipment from the extracted features and the equipment labels; the equipment label is a classification label indicating whether a device is Internet of things equipment or non-Internet of things equipment, and the features at least comprise flow characteristics;
extracting features of newly added equipment, and identifying the equipment as Internet of things equipment or non-Internet of things equipment based on the identification model;
and updating the random forest model by using the flow characteristics of the new equipment.
Or further performing: feature importance is determined based on feature reduction and unimportant features are deleted based on feature importance.
The electronic terminal further comprises a communication interface, used for communicating with external equipment and exchanging data.
The memory may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
If the memory, the processor, and the communication interface are implemented independently, they may be connected to one another via a bus and communicate over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc.
Optionally, in a specific implementation, if the memory, the processor, and the communication interface are integrated on one chip, the memory, the processor, and the communication interface may communicate with one another through an internal interface.
The specific implementation process of each step refers to the explanation of the foregoing method.
It should be understood that in the embodiments of the present invention, the Processor may be a Central Processing Unit (CPU), and the Processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
Example 4:
The present invention provides a readable storage medium storing a computer program that is invoked by a processor to implement the method for identifying Internet of things and non-Internet of things equipment.
The specific implementation is as follows:
establishing a random-forest-based identification model for Internet of things equipment and non-Internet of things equipment from the extracted features and the equipment labels; the equipment label is a classification label indicating whether a device is Internet of things equipment or non-Internet of things equipment, and the features at least comprise flow characteristics;
extracting features of newly added equipment, and identifying the equipment as Internet of things equipment or non-Internet of things equipment based on the identification model;
and updating the random forest model by using the flow characteristics of the new equipment.
Or further performing: feature importance is determined based on feature reduction and unimportant features are deleted based on feature importance.
The specific implementation process of each step refers to the explanation of the foregoing method.
The readable storage medium is a computer readable storage medium, which may be an internal storage unit of the controller according to any of the foregoing embodiments, for example, a hard disk or a memory of the controller. The readable storage medium may also be an external storage device of the controller, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the controller. Further, the readable storage medium may also include both an internal storage unit of the controller and an external storage device. The readable storage medium is used for storing the computer program and other programs and data required by the controller. The readable storage medium may also be used to temporarily store data that has been output or is to be output.
Based on such understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned readable storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
It should be emphasized that the examples described herein are illustrative and not restrictive. The invention is therefore not limited to the examples described herein; other embodiments devised by those skilled in the art based on the teachings herein, including modifications, alterations, and substitutions that do not depart from the spirit and scope of the present invention, are likewise covered.

Claims (10)

1. A method for identifying Internet of things and non-Internet of things equipment, characterized by comprising the following steps:
establishing a random-forest-based identification model for Internet of things equipment and non-Internet of things equipment from the extracted features and the equipment labels; the equipment label is a classification label indicating whether a device is Internet of things equipment or non-Internet of things equipment, and the features at least comprise flow characteristics;
extracting features of newly added equipment, and identifying the equipment as Internet of things equipment or non-Internet of things equipment based on the identification model;
and updating the random forest model by using the flow characteristics of the new equipment.
2. The method of claim 1, wherein the model updating process comprises online sampling and tree destruction detection;
the online sampling collects the flow characteristics of the new equipment within a time window and labels them as an instance;
the tree destruction detection updates the trees in the random forest model according to the number of updates of each instance, judges whether each tree has been destroyed after the random forest is updated with the instances of the new equipment, and, for each destroyed tree, builds a new tree from the original data set combined with the new equipment data, thereby updating the random forest model.
3. The method of claim 2, wherein whether each tree is destroyed is judged according to the following standard 1 and/or standard 2;
standard 1: for each internal node of the trees in the random forest, the change values ΔGini and ΔGini' of the Gini coefficient before and after the update are calculated; if there exists an internal node for which ΔGini and ΔGini' satisfy
Figure FDA0003480716590000011
the corresponding tree needs to be destroyed, where α denotes a preset hyperparameter with 0 < α < 1;
standard 2: for the leaf nodes of the trees in the random forest, if there exists a leaf node satisfying
Figure FDA0003480716590000012
the corresponding tree needs to be destroyed, where main_c denotes the device label of the majority class at the leaf node before the tree is updated, and
Figure FDA0003480716590000013
denotes the proportion of samples at the leaf node that carry the device label main_c after the tree is updated.
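The quantities that standards 1 and 2 operate on can be sketched as follows. The threshold conditions themselves are given only as figures in the original publication and are not reproduced here, so this sketch computes the inputs only. It assumes that ΔGini is the usual impurity decrease of a split (parent impurity minus the size-weighted child impurity) and that labels are encoded as 1 for Internet of things equipment and 0 otherwise; all function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum(p_c^2) of a collection of labels (0.0 if empty)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def delta_gini(parent: Sequence[int], left: Sequence[int],
               right: Sequence[int]) -> float:
    """Impurity decrease of a split: parent impurity minus the
    size-weighted impurity of the two children (assumed definition)."""
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
    return gini(parent) - weighted


def majority_label_ratio(before: Sequence[int],
                         after: Sequence[int]) -> Tuple[int, float]:
    """Return main_c (majority label at a leaf before the update) and the
    share of samples carrying main_c at that leaf after the update."""
    main_c = Counter(before).most_common(1)[0][0]
    ratio = sum(1 for y in after if y == main_c) / len(after)
    return main_c, ratio


# Hypothetical node contents before and after new-device instances arrive.
dg_before = delta_gini([1, 1, 0, 0], [1, 1], [0, 0])
dg_after = delta_gini([1, 1, 0, 0, 0, 1], [1, 1, 0], [0, 0, 1])
main_c, ratio = majority_label_ratio([1, 1, 1, 0], [1, 1, 1, 0, 0, 0])
```

In this toy case the split becomes much less informative after the update and the majority label at the leaf loses its dominance, which is the kind of change the two standards are meant to detect.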
4. The method of claim 1, wherein the new equipment used when updating the random forest model is representative equipment selected according to the following rules;
obtaining the feature set of the new equipment, $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i$ denotes the features of the i-th new device;
clustering the feature set X of the new equipment and calculating the silhouette coefficient $sc_k$ of each round of clustering, where k denotes the number of cluster centers in that round, k takes the values from 2 to BUDGET in turn, and BUDGET is the preset maximum number of devices to be identified;
determining the optimal number of cluster centers K as the k with the maximum silhouette coefficient $sc_k$;
and selecting a specific new device from each cluster as representative equipment, based on the clustering result corresponding to the optimal number of cluster centers K.
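The selection of K in claim 4 can be sketched as below. The claim does not name a clustering algorithm, so plain k-means with a deterministic farthest-point initialization is used here as an assumption; the function names and the sample points are hypothetical. The loop is also capped at N - 1 clusters, because the silhouette coefficient is undefined when every point is its own cluster.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def kmeans(points: List[Point], k: int, iters: int = 50) -> List[int]:
    """Lloyd's k-means with a deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [-1] * len(points)
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points: List[Point], labels: List[int]) -> float:
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(math.dist(points[i], points[j]) for j in other) / len(other)
                for m, other in clusters.items() if m != lab)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def choose_k(points: List[Point], budget: int) -> Tuple[int, List[int]]:
    """Try k = 2..BUDGET and keep the clustering with the largest sc_k."""
    best_sc, best_k, best_labels = -2.0, 0, []
    for k in range(2, min(budget, len(points) - 1) + 1):
        labels = kmeans(points, k)
        sc = silhouette(points, labels)
        if sc > best_sc:
            best_sc, best_k, best_labels = sc, k, labels
    return best_k, best_labels


# Hypothetical 2-D feature vectors forming two well-separated groups.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best_k, cluster_labels = choose_k(X, budget=4)
```

In practice a library implementation of clustering and the silhouette score would normally replace the hand-written helpers; they are spelled out here only to keep the sketch self-contained.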
5. The method of claim 4, wherein selecting a specific new device from each cluster as representative equipment comprises selecting, from each cluster, the new device with the largest prediction probability variance as the representative equipment;
wherein the prediction probability variance of the i-th device is $variable\_probs_i = \mathrm{Variance}(prob_i)$, Variance is the variance function, and $prob_i$ denotes the probabilities, obtained from each tree after the i-th device passes through the random forest, that the i-th device is predicted to be Internet of things equipment.
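A minimal sketch of the variance-based selection in claim 5 follows. It assumes that the per-tree probabilities are already available as one list per device and uses the population variance; the claim does not state which variance estimator is meant, and the function name `pick_representatives` is hypothetical.

```python
from statistics import pvariance
from typing import Dict, List, Sequence, Tuple


def pick_representatives(tree_probs: Sequence[Sequence[float]],
                         cluster_labels: Sequence[int]) -> Dict[int, int]:
    """For each cluster, return the index of the device whose per-tree
    IoT probabilities have the largest variance."""
    best: Dict[int, Tuple[float, int]] = {}
    for i, (probs, cluster) in enumerate(zip(tree_probs, cluster_labels)):
        var = pvariance(probs)  # assumed estimator: population variance
        if cluster not in best or var > best[cluster][0]:
            best[cluster] = (var, i)
    return {cluster: idx for cluster, (_, idx) in best.items()}


# Hypothetical per-tree probabilities for four devices in two clusters.
probs: List[List[float]] = [
    [1.0, 1.0, 1.0, 1.0],  # trees agree: variance 0
    [1.0, 0.0, 1.0, 0.0],  # trees disagree: variance 0.25
    [0.0, 0.0, 0.0, 1.0],  # variance 0.1875
    [0.5, 0.5, 0.5, 0.5],  # variance 0
]
representatives = pick_representatives(probs, [0, 0, 1, 1])
```

A large variance means the trees disagree about the device, so labeling that device is likely to be the most informative for the update.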
6. The method of claim 1, further comprising: judging feature importance based on feature reduction, and deleting unimportant features based on the feature importance; wherein the feature importance is characterized according to the following formulas;
the feature importance $Import(t, f, T_i)$ of feature f in tree t of the random forest is expressed as:
$Import(t, f, T_i) = \sum_{n \in t} \mathbb{1}(split(n) = f) \cdot \Delta Gini_n$
where split(n) denotes the splitting feature of node n, $T_i$ denotes the time, $\Delta Gini_n$ denotes the change in the Gini coefficient at node n, and $\mathbb{1}(\cdot)$ denotes the indicator function;
the feature importance $Import(f, T_i)$ of feature f in the random forest is expressed as:
Figure FDA0003480716590000021
where |trees| denotes the number of trees in the random forest;
the feature importance score $score(f, T_i)$ of feature f over the whole process is expressed as:
$score(f, T_i) = score(f, T_{i-1}) \cdot \gamma + Import(f, T_i) \cdot (1 - \gamma)$
where $score(f, T_1) = Import(f, T_1)$, and γ is a hyperparameter representing the discount factor for the historical score, with 0 < γ < 1.
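The scoring scheme of claim 6 can be sketched as follows. The forest-level formula is available only as a figure in the original publication; averaging the per-tree importance over |trees| is an assumption made here because the claim defines |trees| as the number of trees. The tree representation (a list of `(split feature, ΔGini)` pairs), the feature names, and the function names are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[str, float]  # (splitting feature of the node, delta Gini at the node)
Tree = List[Node]


def tree_importance(tree: Tree, feature: str) -> float:
    """Import(t, f, T_i): sum of delta Gini over nodes that split on `feature`."""
    return sum(dg for split, dg in tree if split == feature)


def forest_importance(forest: List[Tree], feature: str) -> float:
    """Import(f, T_i). ASSUMPTION: mean of the per-tree importance."""
    return sum(tree_importance(t, feature) for t in forest) / len(forest)


def update_score(prev_score: float, current_import: float, gamma: float) -> float:
    """score(f, T_i) = score(f, T_{i-1}) * gamma + Import(f, T_i) * (1 - gamma)."""
    return prev_score * gamma + current_import * (1.0 - gamma)


# Hypothetical forests at two successive times T_1 and T_2.
forest_t1 = [[("pkt_count", 0.3), ("tls", 0.1)],
             [("pkt_count", 0.2), ("pkt_count", 0.1)]]
forest_t2 = [[("tls", 0.4)], [("pkt_count", 0.1)]]

score_t1 = forest_importance(forest_t1, "pkt_count")  # initial score
score_t2 = update_score(score_t1, forest_importance(forest_t2, "pkt_count"), 0.5)
```

A γ close to 1 makes the score change slowly and favors the history, while a γ close to 0 makes it track the most recent forest.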
7. The method of claim 1, wherein the features of the equipment include flow characteristics and protocol characteristics;
wherein the flow characteristics are some or all of the following, for the ingress flow, the egress flow, and the bidirectional flow: the number of packets, the number of transmitted bytes, the number of packet length types, and the mean, variance, and entropy of the packet length;
the protocol characteristics are some or all of the following: the number of destination IP addresses, the number of IPv4 packets, the number of IPv6 packets, the number of TCP packets, the number of UDP packets, the numbers of local and remote TCP/UDP port numbers falling within the intervals [0, 1024), [1024, 49152), and [49152, 65535], the number of port numbers, the entropy, the maximum value, the minimum value, the number of types, the entropy of the TCP window size, the number of domain names, the number of accesses, the entropy, and the number of TLS handshakes.
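Some of the features listed in claim 7 can be computed as in the sketch below. It assumes a non-empty list of packet lengths for one flow direction, uses the population variance and the Shannon entropy in bits, and treats "packet length types" as the number of distinct lengths; these choices, the function names, and the dictionary keys are assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import mean, pvariance
from typing import Dict, Sequence


def entropy(values: Sequence[int]) -> float:
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def flow_features(packet_lengths: Sequence[int]) -> Dict[str, float]:
    """Flow characteristics for one direction (ingress, egress, or both)."""
    return {
        "packet_count": len(packet_lengths),
        "byte_count": sum(packet_lengths),
        "length_types": len(set(packet_lengths)),
        "length_mean": mean(packet_lengths),
        "length_variance": pvariance(packet_lengths),
        "length_entropy": entropy(packet_lengths),
    }


def port_range_counts(ports: Sequence[int]) -> Dict[str, int]:
    """Count port numbers per interval named in the claim."""
    counts = {"[0,1024)": 0, "[1024,49152)": 0, "[49152,65535]": 0}
    for port in ports:
        if port < 1024:
            counts["[0,1024)"] += 1
        elif port < 49152:
            counts["[1024,49152)"] += 1
        else:
            counts["[49152,65535]"] += 1
    return counts


# Hypothetical observations within one time window.
features = flow_features([60, 60, 1500, 1500])
ranges = port_range_counts([53, 443, 8080, 50000, 65535])
```

The same `flow_features` call would be applied separately to the ingress, egress, and bidirectional packet sequences to obtain the three groups of flow characteristics.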
8. A system based on the method of any one of claims 1-7, characterized by comprising:
a feature extraction module, used for extracting the features of the equipment;
an identification model construction module, used for constructing a random-forest-based identification model for Internet of things equipment and non-Internet of things equipment from the extracted features and the equipment labels;
and a model updating module, used for updating the random forest model with the flow characteristics of the new equipment.
9. An electronic terminal, characterized by comprising:
one or more processors; and
one or more memories;
wherein the memory stores a computer program that the processor invokes to implement:
the steps of the method of any one of claims 1 to 7.
10. A readable storage medium, characterized in that a computer program is stored thereon, which is invoked by a processor to implement:
the steps of the method of any one of claims 1 to 7.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210067478.0A CN114492613A (en) 2022-01-20 2022-01-20 Internet of things and non-Internet of things equipment identification method, system, terminal and readable storage medium

Publications (1)

Publication Number Publication Date
CN114492613A true CN114492613A (en) 2022-05-13

Family

ID=81472407

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination