WO2023179014A1 - Traffic identification method and apparatus, electronic device, and storage medium

Info

Publication number: WO2023179014A1
Application number: PCT/CN2022/127468
Authority: WO
Inventors: 何鸿业
Original assignee: 中兴通讯股份有限公司
Priority date: 2022-03-23
Filing date: 2022-10-25
Publication date: 2023-09-28
Also published as: CN116846837A

Abstract

Disclosed in embodiments of the present application are a traffic identification method and apparatus, an electronic device, and a storage medium. The method comprises: on the basis of a verification result of an identification model, obtaining a target service that the identification model fails to identify, the verification result comprising a real service to which a plurality of pieces of traffic belongs and an identification result of the identification model on the traffic; processing a conflict relation graph by using a community discovery algorithm to obtain a conflict service cluster set, and obtaining an identification failure reason for the target service according to the conflict service cluster set, wherein the conflict relation graph is used for representing a conflict relation of services, and a conflict service cluster is used for representing a service having a conflict degree greater than a preset threshold; correcting the identification model according to the identification failure reason for the target service to obtain a corrected identification model; and when the corrected identification model reaches preset identification accuracy, performing traffic identification on the basis of the corrected identification model.

Description

Traffic identification method and apparatus, electronic device, and storage medium

Related applications

This application claims priority to Chinese patent application No. 202210294504.3, filed on March 23, 2022, the entire content of which is incorporated herein by reference.

Technical field

The present invention relates to the field of network traffic identification and monitoring, and in particular to a traffic identification method and apparatus, an electronic device, and a storage medium.

Background

With the popularization and rapid development of the Internet, the surge in network traffic has made it much harder for network operators to monitor and analyze their networks, and network traffic identification methods have emerged in response. The current mainstream approach is Deep Packet Inspection (DPI), which focuses on obtaining fixed patterns in service traffic, such as keywords and plaintext fields, builds a feature rule base from this fixed information, and identifies traffic by regular-expression matching.

Constructing DPI rules usually requires a great deal of manual effort, and features often have to be extracted again whenever traffic patterns change. Traffic identification methods based on machine learning (ML) and deep learning (DL) have therefore attracted wide attention in recent years. The former extracts statistical features of network traffic, such as packet length and packet inter-arrival time, to train a machine learning model that classifies traffic. The latter uses the representation learning (RL) capability of neural networks to learn features of the traffic byte sequence automatically, making traffic identification end-to-end. Such methods can effectively reduce the labor cost of traffic identification.

However, neither rule-based identification such as DPI nor automated identification based on statistical learning can identify traffic with complete accuracy. Moreover, existing traffic identification methods can only verify how well the identification model performs; they cannot derive from the verification result the factors that affect identification, nor optimize the identification model in a targeted way. Determining which applications cause identification conflicts during traffic identification, and why those conflicts arise, and then feeding this back for correction is therefore the key to improving identification accuracy.

Summary of the invention

The purpose of the present invention is to solve the above problem by providing a traffic identification method and apparatus, an electronic device, and a storage medium. They address the inability to obtain, from the verification result of the identification model, the factors that affect identification and to optimize the identification model according to those factors, and thereby improve the accuracy of traffic identification.

To solve the above problem, embodiments of the present application provide a traffic identification method and apparatus, an electronic device, and a storage medium. The method includes: obtaining, based on a verification result of an identification model, a target service that the identification model fails to identify, the verification result including the real service to which each of a plurality of traffic flows belongs and the identification result of the identification model for that traffic; processing a conflict relation graph with a community discovery algorithm to obtain a set of conflict service clusters, and obtaining the cause of the identification error for the target service according to the set of conflict service clusters, where the conflict relation graph represents the conflict relations among services and a conflict service cluster represents services whose degree of conflict is greater than a preset threshold; correcting the identification model according to the cause of the identification error for the target service to obtain a corrected identification model; and, when the corrected identification model reaches a preset identification accuracy, performing traffic identification based on the corrected identification model.

To solve the above problem, embodiments of the present application provide a traffic identification apparatus, including: an acquisition module, configured to obtain, based on a verification result of an identification model, a target service that the identification model fails to identify, the verification result including the real service to which each of a plurality of traffic flows belongs and the identification result of the identification model for that traffic; a processing module, configured to process a conflict relation graph with a community discovery algorithm to obtain a set of conflict service clusters, and to obtain the cause of the identification error for the target service according to the set of conflict service clusters, where the conflict relation graph represents the conflict relations among services and a conflict service cluster represents services whose degree of conflict is greater than a preset threshold; a correction module, configured to correct the identification model according to the cause of the identification error for the target service to obtain a corrected identification model; and an identification module, configured to perform traffic identification based on the corrected identification model when the corrected identification model reaches a preset identification accuracy.

To solve the above problem, embodiments of the present application further provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the above traffic identification method.

To solve the above problem, embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above traffic identification method.

In the embodiments of the present application, the target services that the identification model fails to identify are obtained from the verification result of the identification model, which determines the services whose identification failures need to be analyzed. A community discovery algorithm then processes the conflict relation graph, which represents the conflict relations among services, to obtain conflict service clusters, that is, services whose degree of conflict is greater than a preset threshold. The cause of the identification error for each target service can thus be determined from the conflict service clusters. The identification model is corrected in a targeted way according to that cause, and the identification model that reaches the preset identification accuracy is put into use for traffic identification. This effectively solves the problem that the factors affecting identification cannot be obtained from the verification result and used to optimize the identification model, and improves the accuracy of traffic identification.

Brief description of the drawings

One or more embodiments are illustrated by way of example in the corresponding drawings. These illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.

Figure 1 is a flow chart of a traffic identification method in the related art;

Figure 2 is a flow chart of a traffic identification method provided by an embodiment of the present application;

Figure 3 is a flow chart of a traffic identification method applied to a DPI-based identification scenario, provided by an embodiment of the present application;

Figure 4 is a flow chart of a traffic identification method applied to a statistical learning identification environment, provided by an embodiment of the present application;

Figure 5 is a schematic diagram of a traffic identification system provided by an embodiment of the present application;

Figure 6 is a schematic structural diagram of a traffic identification apparatus provided by an embodiment of the present application;

Figure 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description

The traffic identification method provided by the embodiments of the present application is a general means of strengthening a traffic identification scheme and improving its accuracy. It must be used together with an existing traffic identification system. The method can be applied to a DPI-based identification scenario and can also be adapted to an end-to-end intelligent traffic identification environment based on statistical learning. The basic flow of rule-based and statistical-learning-based traffic identification is shown in Figure 1 and mainly includes the following steps: capturing and collecting service traffic data, constructing the identification model (a rule base or a classification model), and verifying the identification model (the rule base or the classification model).

To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the embodiments are described in detail below with reference to the drawings. Those of ordinary skill in the art will understand that many technical details are given in the embodiments to help the reader better understand the present application. The technical solutions claimed in the present application can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments.

An embodiment of the present application relates to a traffic identification method. The method includes: obtaining, based on a verification result of an identification model, a target service that the identification model fails to identify, the verification result including the real service to which each of a plurality of traffic flows belongs and the identification result of the identification model for that traffic; processing a conflict relation graph with a community discovery algorithm to obtain a set of conflict service clusters, and obtaining the cause of the identification error for the target service according to the set of conflict service clusters, where the conflict relation graph represents the conflict relations among services and a conflict service cluster represents services whose degree of conflict is greater than a preset threshold; correcting the identification model according to the cause of the identification error for the target service to obtain a corrected identification model; and, when the corrected identification model reaches a preset identification accuracy, performing traffic identification based on the corrected identification model.

The implementation details of the traffic identification method in this embodiment are described below. They are given only to aid understanding and are not required for implementing the solution. The specific flow is shown in Figure 2 and may include the following steps:

In step S1, a target service that the identification model fails to identify is obtained based on the verification result of the identification model, the verification result including the real service to which each of a plurality of traffic flows belongs and the identification result of the identification model for that traffic.

The identification model is either a rule base constructed by a rule-based traffic identification method or a statistical learning classification model constructed by a statistical-learning-based identification method.

In this embodiment of the present application, an indicator value characterizing the identification accuracy is calculated for each service from the verification result of the identification model. The indicator value of each service is compared with a preset range, and any service whose indicator value falls outside the preset range is taken as a target service. The indicator may be precision, recall, or the F1 score.

In one example, where the indicator characterizing identification accuracy is precision, the precision of each service is calculated from the verification result of the identification model, and a preset precision threshold is used to screen all services. A service whose precision is lower than the preset precision threshold is determined to be a target service, that is, a service that the identification model fails to identify.

In one example, where the indicator characterizing identification accuracy is recall, the recall of each service is calculated from the verification result of the identification model, and a preset recall threshold is used to screen all services. A service whose recall is lower than the preset recall threshold is determined to be a target service, that is, a service that the identification model fails to identify.

By setting a precision threshold to screen all services, the poorly identified services can be found, so that the causes of their identification errors can then be analyzed.
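The screening in step S1 can be illustrated with a minimal Python sketch. It assumes the verification result is available as a list of (real service, identified service) pairs; the function names, the default threshold of 0.8, and the rule that a value below the threshold marks a target service are illustrative assumptions, not values given in the application.

```python
from collections import Counter

def per_service_metrics(results):
    """results: iterable of (true_service, predicted_service) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in results:
        if true == pred:
            tp[true] += 1
        else:
            fn[true] += 1  # a flow of `true` was missed
            fp[pred] += 1  # and wrongly attributed to `pred`
    metrics = {}
    for s in set(tp) | set(fp) | set(fn):
        p = tp[s] / (tp[s] + fp[s]) if tp[s] + fp[s] else 0.0
        r = tp[s] / (tp[s] + fn[s]) if tp[s] + fn[s] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics[s] = {"precision": p, "recall": r, "f1": f1}
    return metrics

def select_target_services(metrics, indicator="precision", threshold=0.8):
    """Assumption: an indicator value below the threshold marks a target service."""
    return sorted(s for s, m in metrics.items() if m[indicator] < threshold)

results = [("A", "A"), ("A", "B"), ("B", "B"), ("B", "B"), ("C", "C")]
metrics = per_service_metrics(results)
print(select_target_services(metrics, "precision"))
print(select_target_services(metrics, "recall"))
```

In this toy data, service B absorbs a flow that belongs to A, so B is flagged under the precision indicator while A is flagged under the recall indicator.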

In step S2, the conflict relation graph is processed with a community discovery algorithm to obtain a set of conflict service clusters, and the cause of the identification error for the target service is obtained according to the set of conflict service clusters. The conflict relation graph represents the conflict relations among services, and a conflict service cluster represents services whose degree of conflict is greater than a preset threshold.

Community discovery algorithms are a class of graph mining algorithms used to find the closely connected parts of a graph. They can extract a set of highly connected nodes as a "community", which here is a conflict service cluster. Community discovery fits the goal of identification conflict detection well and helps locate services that are highly associated with one another during identification. The community discovery algorithms in the embodiments of the present application include, but are not limited to, the k-clique algorithm, the Newman fast algorithm, and the Kernighan-Lin algorithm.

The conflict relation graph reflects the conflict relations among services. During traffic identification verification, when a large amount of the traffic of service A is identified as service B, and a large amount of the traffic of service B is likewise identified as service A, then A and B are defined as conflicting services, that is, they have a conflict relation.

In this embodiment of the present application, after the target service that the identification model fails to identify is obtained based on the verification result, and before the conflict relation graph is processed with the community discovery algorithm to obtain the set of conflict service clusters and the cause of the identification error, the method further includes the following. A hit confusion matrix M is generated from the verification result. The rows of the hit confusion matrix represent the real service to which the traffic belongs, and the columns represent the identification result of the identification model for the traffic. The element M[i,j] is the number of times that the i-th real service is identified as the j-th service. Each service is then taken as a node, and the conflict relation graph is generated from the element values of M and a first preset threshold: when M[i,j] and M[j,i] are both greater than the first preset threshold, the two services have a conflict relation.

In one example, assuming there are N services to identify, the confusion matrix is an N×N two-dimensional matrix constructed as follows. An N×N matrix is created with every element set to 0. All verification results are then traversed, with the real service of each flow selecting the row and the identification result of the identification model selecting the column. For example, if the real service A of a flow has index i and the identified service B has index j, then the number of times that traffic of real service A is identified as service B is the value of M[i,j]. Each row M[i,:] therefore records how many times the service with index i is identified as each service, and the larger M[i,i] is relative to the other entries of row i, the more accurately service i is identified.
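The construction described above can be sketched in a few lines of Python. The input format (a list of (real service, identified service) pairs) and the function name are assumptions made for illustration.

```python
def build_confusion_matrix(results, services):
    """Return M where M[i][j] counts flows of services[i] identified as services[j]."""
    index = {name: i for i, name in enumerate(services)}
    n = len(services)
    M = [[0] * n for _ in range(n)]  # N x N matrix initialised to 0
    for true, pred in results:
        M[index[true]][index[pred]] += 1  # row: real service, column: identified service
    return M

services = ["A", "B"]
results = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "A")]
M = build_confusion_matrix(results, services)
print(M)
```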

In one example, the conflict relation graph is constructed as follows. Each service is taken as a node, the matrix M is traversed pair by pair, and every two services that have a conflict relation are connected by an undirected edge. For example, suppose the preset hit threshold, that is, the first preset threshold, is 6, M[i,j] = 7 (the service corresponding to i is identified as the service corresponding to j 7 times), and M[j,i] = 8 (the service corresponding to j is identified as the service corresponding to i 8 times). Because M[i,j] and M[j,i] are both greater than the first preset threshold, the two services corresponding to i and j have a conflict relation.
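A minimal sketch of this graph construction, assuming the confusion matrix is a list of lists and representing the undirected graph as an adjacency dictionary keyed by service index (both representation choices are assumptions):

```python
def build_conflict_graph(M, threshold):
    """Connect services i and j when both M[i][j] and M[j][i] exceed the threshold."""
    n = len(M)
    graph = {i: set() for i in range(n)}  # node index -> set of conflicting nodes
    for i in range(n):
        for j in range(i + 1, n):  # visit each unordered pair once
            if M[i][j] > threshold and M[j][i] > threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph

M = [[20, 7, 0],
     [8, 30, 9],
     [1, 2, 40]]
graph = build_conflict_graph(M, threshold=6)
print(graph)
```

Services 0 and 1 are linked because 7 and 8 both exceed 6. Services 1 and 2 are not linked, since the confusion is one-directional (9 in one direction but only 2 in the other).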

In one example, the community discovery algorithm used is the k-clique algorithm, and the set of conflict service clusters is extracted as follows:

1. Extract all cliques in the conflict relation graph. A clique is a subgraph of the conflict relation graph in which every pair of nodes is connected and to which no further node can be added without breaking full connectivity.

2. Set a value k and keep every clique whose number of nodes is greater than or equal to k.

3. Traverse the kept cliques pair by pair, and merge two cliques if the number of nodes they share is greater than or equal to k-1.

4. Extract the final merged results as the conflict service clusters.

In the k-clique algorithm, a larger k yields conflict service clusters whose services are more closely related, but fewer clusters are extracted. For the k-clique algorithm, the embodiments of the present application therefore introduce an extraction strategy with decreasing k: a relatively high k is set first to extract highly associated communities, and k is then lowered for the remaining nodes to extract communities with relatively low association. This obtains the sets of services with a high degree of conflict first without ignoring the remaining conflict service clusters. For other algorithms, the embodiments of the present application use a similar "from complex to simple" strategy.
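The four steps and the decreasing-k strategy above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: maximal cliques are enumerated with the Bron-Kerbosch algorithm (the application does not name a clique enumeration method), merging is done transitively with a small union-find, and the names `k_start` and `k_min` are hypothetical parameters.

```python
def maximal_cliques(graph):
    """Step 1: enumerate maximal cliques (Bron-Kerbosch, no pivoting)."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(graph), set())
    return cliques

def k_clique_communities(graph, k):
    """Steps 2-4: keep cliques of size >= k, merge those sharing >= k-1 nodes."""
    cliques = [c for c in maximal_cliques(graph) if len(c) >= k]
    parent = list(range(len(cliques)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a in range(len(cliques)):
        for b in range(a + 1, len(cliques)):
            if len(cliques[a] & cliques[b]) >= k - 1:
                parent[find(a)] = find(b)
    merged = {}
    for idx, clique in enumerate(cliques):
        merged.setdefault(find(idx), set()).update(clique)
    return list(merged.values())

def extract_conflict_clusters(graph, k_start, k_min=2):
    """Decreasing-k strategy: extract tight clusters first, then relax k."""
    remaining = {v: set(nb) for v, nb in graph.items()}
    clusters = []
    for k in range(k_start, k_min - 1, -1):
        found = k_clique_communities(remaining, k)
        clusters.extend(found)
        covered = set().union(*found)
        # Drop clustered nodes so lower k values only see the leftover nodes.
        remaining = {v: nb - covered for v, nb in remaining.items() if v not in covered}
    return clusters

graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2},
         4: {5}, 5: {4}, 6: set()}
print(extract_conflict_clusters(graph, k_start=4))
```

With k starting at 4, the fully connected services 0 to 3 are extracted first. The weaker pair 4 and 5 is only picked up once k has dropped to 2, and the isolated service 6 belongs to no cluster.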

It is worth noting that the traffic identification method provided by the embodiments of the present application introduces, for the first time, a graph-mining community discovery algorithm to detect conflicts in traffic identification. After the confusion matrix of identification hits is constructed, the conflict relations among services are organized into a network graph, and the community discovery algorithm is used to locate highly associated application services.

In this embodiment of the present application, when the target service belongs to the set of conflict service clusters, the cause of its identification error is determined to be a service conflict; when the target service does not belong to the set of conflict service clusters, the cause of its identification error is attributed to the traffic features input to the identification model.

In one example, all target services are traversed. If a target service belongs to some conflict service cluster, it is determined to have a "service conflict". If a target service does not belong to any conflict service cluster, its poor identification is not caused by similar services but by the data or the features themselves, and such services are determined to have a "feature problem", that is, the cause lies in the traffic features input to the identification model.
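The traversal described above reduces to a membership test. A minimal sketch, in which the function name and the two label strings are illustrative choices:

```python
def classify_failure_causes(target_services, conflict_clusters):
    """Label each target service by whether it appears in any conflict cluster."""
    causes = {}
    for service in target_services:
        in_cluster = any(service in cluster for cluster in conflict_clusters)
        causes[service] = "service conflict" if in_cluster else "feature problem"
    return causes

causes = classify_failure_causes(["A", "D"], [{"A", "B", "C"}])
print(causes)
```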

In step S3, the identification model is corrected according to the cause of the identification error for the target service to obtain a corrected identification model.

In this embodiment of the present application, when the cause of the identification error lies in the traffic features input to the identification model, the target service is analyzed manually and the identification model is corrected. When the cause is a service conflict and a public traffic feature exists, the public traffic feature is removed when the identification model is constructed, or is treated as the feature of a separate service category; a public traffic feature is a feature that causes multiple services within one conflict service cluster to conflict. When the cause is a service conflict and no public traffic feature exists, the services in the conflict service cluster are merged into one service. Public traffic features may be, for example, IP addresses and domain names.

In one example, conflicts caused by the traffic features input to the identification model are hard to correct through automated feedback, because the root cause can be any of many things, such as irregular sample data or defects in the design of the rule base or model. Such target services are mainly "monitored" manually: they are collected and passed to the traffic identification stage for manual intervention, for example by re-collecting the traffic of the target service, handing it back to service personnel for re-analysis, and re-extracting the regular-expression features (for traditional DPI identification) or revising the feature engineering (for ML-based identification).

在一个例子中，对于业务冲突造成的识别失误，可以实现自动化的反馈修正，相关修正操作包括且不限于：In one example, automatic feedback correction can be implemented for identification errors caused by business conflicts. Relevant correction operations include but are not limited to:

在没有获取到公共流量特征的情况下，或公共流量特征无效的情况下，将冲突业务簇内的各业务合并为一个业务，下一轮模型构造时规避这些应用的识别冲突；If the public traffic characteristics are not obtained, or if the public traffic characteristics are invalid, merge the services in the conflicting business cluster into one business to avoid the identification conflicts of these applications in the next round of model construction;

假设公共流量特征为公共域名，在获取到有效的公共流量特征的情况下，执行的修正操作为：将冲突业务簇提取的公共域名分离为单独的一个业务类别的特征，如“阿里公共流量”，“腾讯公共流量”等，识别模型进行重构造时将这些公共域名直接归类到分离的类别中，防止与其他业务发生识别冲突；Assume that the public traffic characteristics are public domain names. When valid public traffic characteristics are obtained, the corrective operation is to separate the public domain names extracted from conflicting business clusters into features of a separate business category, such as "Alibaba Public Traffic" , "Tencent Public Traffic", etc. When the recognition model is reconstructed, these public domain names are directly classified into separate categories to prevent recognition conflicts with other businesses;

Assuming the public traffic feature is a public domain name, when a valid public traffic feature is obtained, the correction operation may alternatively be: the extracted public domain names are removed in the next round of model construction, for example by dropping the training samples of these public domain names when training ML and DL classification models.
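The choice among the three correction operations above can be sketched as a small selection function. This is only an illustrative sketch: the function name `choose_correction`, the `prefer` parameter, and the dictionary it returns are hypothetical and do not come from the application.

```python
def choose_correction(cluster, public_domains, prefer="separate"):
    """Pick one feedback operation for a conflict service cluster.

    cluster: set of service names that conflict with one another.
    public_domains: valid public domain names mined for the cluster (may be empty).
    prefer: "separate" (new public category) or "remove" (drop the samples);
            this switch is an assumption made for the sketch.
    """
    if not public_domains:
        # No usable public feature: merge the whole cluster into one service.
        return {"op": "merge", "services": sorted(cluster)}
    if prefer == "separate":
        # Give the public domain names a service category of their own.
        return {"op": "separate", "domains": sorted(public_domains)}
    # Otherwise drop the public domain names in the next round of construction.
    return {"op": "remove", "domains": sorted(public_domains)}
```

Which of the two public-domain strategies to prefer is left open by the text, so the sketch exposes it as a parameter rather than fixing it.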

During the above correction process, problem services in the identification process can be monitored, and the reason each problem service causes an identification conflict can be determined. This effectively guides business analysts in making targeted checks and improvements to data collection and identification model construction, and makes the localization of traffic identification problems systematic.

Besides providing an analysis reference for business personnel, the method can correct identification conflicts caused by similar services through fully automatic feedback. Several selectable feedback strategies are provided, such as service merging and separating public traffic into its own service, which reduces the manual intervention needed to improve the traffic identification process.

In the embodiments of the present application, the public traffic features are extracted as follows: for each conflict service cluster, extracting the traffic features that appear in it; calculating the term frequency with which each traffic feature appears in each service in the conflict service cluster; within the conflict service cluster, calculating the inverse document frequency of each traffic feature, where the inverse document frequency represents the ratio of a first service quantity to the total number of services in the conflict service cluster, and the first service quantity is the number of services in which the traffic feature appears; and, when the inverse document frequency of a traffic feature in the conflict service cluster is greater than a second preset threshold and its term frequency in every service in the conflict service cluster is less than a third preset threshold, taking the traffic feature as a public traffic feature of the conflict service cluster if its identification accuracy is less than a fourth preset threshold.

In one example, the public traffic feature is a domain name, and the public domain names are extracted as follows:

1. For each conflict service cluster, extract all domain names that appear in it.

2. Calculate the term frequency (TF) of each domain name in each service: assuming service A has N verification samples, of which N' are traffic samples with domain name x, the term frequency of x in service A is TF_A = N'/N. A higher TF value indicates that domain name x is more important to the service.

3. Within the conflict service cluster, calculate the inverse document frequency (IDF) of each domain name: assuming a conflict service cluster B has M services, and domain name x appears in the traffic samples of M' of them, the inverse document frequency of x in conflict cluster B is IDF_B = M'/M. A higher IDF value indicates that domain name x appears in the traffic of more services in the conflict service cluster, and is therefore likely to be the cause of the identification conflict.

4. Set a threshold T_3 for the IDF, namely the second preset threshold. If the IDF value of a domain name in a given conflict service cluster is greater than T_3, the domain name is passed to the subsequent steps as a candidate public traffic feature; otherwise it is ignored.

5. Set a threshold T_4 for the TF, namely the third preset threshold. If the TF of a domain name in any service in the conflict service cluster is higher than T_4, the domain name carries the main traffic of that service. Therefore, only domain names whose TF is lower than T_4 in every service in the conflict service cluster are extracted and passed to the subsequent steps as candidate public traffic features.

6. Obtain the identification accuracy of each domain name from the verification result and set a threshold T_5, namely the fourth preset threshold. If the identification accuracy of a domain name during verification is less than T_5, the traffic of that domain name cannot be identified effectively, and the domain name is taken as a public traffic feature.

Mining the public traffic in conflict service clusters with TF-IDF provides strong support for feedback correction.
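Steps 1 to 6 above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `mine_public_domains`, the `(service, domain)` sample layout, and the `domain_accuracy` dictionary are hypothetical. Note that the "IDF" used here follows the definition in the text (M'/M, the share of services in the cluster whose traffic contains the domain name), not the logarithmic form common in text retrieval.

```python
from collections import Counter

def mine_public_domains(samples, cluster, domain_accuracy, t3, t4, t5):
    """Return the public domain names of one conflict service cluster.

    samples: iterable of (service, domain) pairs from the verification set.
    cluster: set of services in the conflict service cluster.
    domain_accuracy: dict mapping every domain to its identification accuracy.
    t3, t4, t5: thresholds for IDF, TF and identification accuracy.
    """
    # Step 1: count every domain name seen in each service of the cluster.
    per_service = {s: Counter() for s in cluster}
    for service, domain in samples:
        if service in per_service:
            per_service[service][domain] += 1
    totals = {s: sum(c.values()) for s, c in per_service.items()}
    domains = set().union(*per_service.values())

    public = []
    for d in sorted(domains):
        # Step 3: IDF_B = M'/M, the share of services whose traffic contains d.
        idf = sum(1 for s in cluster if per_service[s][d] > 0) / len(cluster)
        # Step 2: TF = N'/N of d in each service that has samples.
        tfs = [per_service[s][d] / totals[s] for s in cluster if totals[s]]
        # Steps 4 to 6: apply the three thresholds in turn.
        if idf > t3 and all(tf < t4 for tf in tfs) and domain_accuracy[d] < t5:
            public.append(d)
    return public
```

For instance, a domain name that appears in both services of a two-service cluster, accounts for only 20% to 30% of each service's samples, and is identified correctly 40% of the time would be returned with `t3=0.5`, `t4=0.5`, `t5=0.6`.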

In the embodiments of the present application, the verification result further includes the traffic features input to the identification model, where the traffic features include domain name and/or IP information. Before the public traffic features are extracted, the method further includes: counting the traffic features that appear in the verification result and calculating the identification accuracy of each traffic feature.

In one example, the domain names that appear in the verification result are counted and their identification accuracy is calculated. This information supports the extraction of public traffic features and prevents domain names that can be identified effectively from being selected as public traffic features during screening.
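The per-domain identification accuracy described above can be computed directly from the verification records. The sketch below assumes each record is a `(domain, true_service, predicted_service)` triple; that layout and the function name `domain_accuracy` are illustrative assumptions.

```python
from collections import defaultdict

def domain_accuracy(records):
    """Return {domain: share of its samples that were identified correctly}.

    records: iterable of (domain, true_service, predicted_service) triples.
    """
    hits = defaultdict(int)   # correctly identified samples per domain
    seen = defaultdict(int)   # all samples per domain
    for domain, truth, predicted in records:
        seen[domain] += 1
        if truth == predicted:
            hits[domain] += 1
    return {d: hits[d] / seen[d] for d in seen}
```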

In the embodiments of the present application, when the corrected identification model does not reach the preset identification accuracy, S1 to S3 are repeated.

In one example, the identification model is strengthened continuously through the cycle of detecting the causes of its errors and correcting it by feedback, making traffic identification more accurate. "Traffic identification" and "conflict detection" are connected into a closed loop: the traffic identification model corrected by feedback can undergo conflict mining and correction again, so that the model improves iteratively.
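The closed loop can be pictured as a simple driver that alternates verification and correction until the accuracy target is met. In this sketch, `evaluate` and `correct` are hypothetical callables standing in for S1 to S3, and `max_rounds` is a safety cap added for the example; the application itself does not specify a limit on the number of rounds.

```python
def refine_until_accurate(model, evaluate, correct, target_accuracy, max_rounds=10):
    """Repeat verification and feedback correction until the target is reached.

    evaluate(model) -> (accuracy, verification_result)
    correct(model, verification_result) -> corrected model
    Returns (model, accuracy, rounds_of_correction_performed).
    """
    for round_no in range(max_rounds):
        accuracy, verification = evaluate(model)
        if accuracy >= target_accuracy:
            # Accuracy target met: the model can be used for identification (S4).
            return model, accuracy, round_no
        # Otherwise mine conflicts and correct the model, then verify again.
        model = correct(model, verification)
    accuracy, _ = evaluate(model)
    return model, accuracy, max_rounds
```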

The traffic identification method proposed in the embodiments of the present application adds an identification-conflict localization process and a feedback correction process on top of a rule-based or statistical-learning-based traffic identification method. This solves the problem that, in existing traffic identification techniques, the process stops at the identification evaluation stage, so the factors that affect identification accuracy cannot be obtained from the evaluation result and corrected in a targeted way. It also turns the traffic identification process into a closed loop of iterative improvement.

In step S4, when the corrected identification model reaches the preset identification accuracy, traffic identification is performed based on the corrected identification model. Analyzing the verification result of the identification model yields a relationship graph of the services, from which a graph mining algorithm extracts the potential conflict relationships among the services. The identification model is then corrected by the relevant strategy to eliminate the conflicts, and the corrected identification model that meets the identification accuracy requirement is put into the traffic identification process, improving the accuracy of traffic identification.

To make the traffic identification method provided by the embodiments of the present application clearer, the following describes the method as introduced into a DPI traffic identification process. The flowchart is shown in Figure 3, and the details are as follows:

In step 301, traffic data is accessed and the service traffic to be classified is obtained.

In step 302, a business analyst unpacks the service traffic, summarizes its fixed-pattern features, writes these features as regular-expression features, and enters them into the DPI feature library to build the DPI feature rule base.

In step 303, after feature construction is complete for all services, the DPI feature library is loaded into the engine, and the verification samples are used for inference to obtain the verification result.

In step 304, conflict mining is performed on the verification result: the services that were identified incorrectly are extracted as target services, and the conflict service clusters whose services conflict heavily with one another are extracted together with the public domain names of their traffic.

In step 305, it is determined whether the target service belongs to the conflict service cluster set.

In step 306, when the target service does not belong to any conflict service cluster, the cause of the identification error is a problem with the quality of the service's features. The relevant information about the service and its existing DPI features are fed back to the business analyst, who locates the feature problem and re-extracts the features. If the business personnel cannot find the feature problem from the samples, the process goes back to the traffic collection stage to re-acquire the service traffic for comparison.

In step 307, when the target service belongs to a conflict service cluster, the cause of the identification error is an identification conflict between services, and the automatic correction process is triggered. First, all conflict service clusters are obtained, and the common part of the regular-expression features of the services in each conflict service cluster is extracted.

In step 308, it is determined whether a common feature has been obtained.

In step 309, when common features are obtained, they are removed from the regular-expression features, because such common DPI features prevent the input traffic to be identified from being matched correctly to a specific service.

In step 310, the public domain names of the conflict service cluster extracted in the conflict detection stage are obtained.

In step 311, it is checked whether the keyword features of the conflicting services contain a public domain name.

In step 312, when a public domain name appears in multiple services of the conflict service cluster, the public domain name is removed from the rule base.

In step 313, when no common feature can be obtained for the conflicting services and none of their keyword features contains a public domain name, the feature library cannot be corrected by adjusting features. The merging strategy is therefore triggered: the services in the conflict service cluster are merged into one service whose features are the union of the original features of the services in the cluster.

In step 314, the modified feature library is reviewed manually and, if it passes the review, synchronized to the DPI feature library. Manual review avoids the identification risk that an automatically modified feature library might introduce and improves safety.

The corrected feature library then undergoes identification verification and the next round of identification conflict detection, forming an iterative closed loop.

Introducing the traffic identification method provided by the embodiments of the present application into the DPI identification process allows the features to be improved after each round of rule base construction, either automatically or with the assistance of business personnel, improving the accuracy of traffic identification.
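The branching in steps 305 to 313 can be summarized as a planning function that returns the corrections to propose for one misidentified service. This is a hedged sketch, not the application's implementation: the function name, the action labels, and the input dictionaries are all hypothetical, and a keyword feature is assumed to "contain" a public domain name when the domain name is a member of the service's keyword set.

```python
def plan_dpi_correction(target, clusters, common_regex, public_domains, keywords):
    """Return the proposed handling for one misidentified target service.

    clusters: list of sets of service names (conflict service clusters).
    common_regex: dict cluster index -> regex features shared within that cluster.
    public_domains: dict cluster index -> public domain names mined for it.
    keywords: dict service -> set of keyword features in the rule base.
    """
    idx = next((i for i, c in enumerate(clusters) if target in c), None)
    if idx is None:
        # Step 306: not in any cluster, so the features themselves are suspect.
        return ("manual_feature_review", None)

    actions = []
    shared = common_regex.get(idx, set())
    if shared:
        # Step 309: drop regex features common to the conflicting services.
        actions.append(("remove_common_features", sorted(shared)))
    hit = sorted(
        d for d in public_domains.get(idx, set())
        if sum(d in keywords.get(s, set()) for s in clusters[idx]) > 1
    )
    if hit:
        # Step 312: drop public domain names used by several services.
        actions.append(("remove_public_domains", hit))
    if not actions:
        # Step 313: nothing to adjust, so merge the cluster into one service.
        actions.append(("merge_services", sorted(clusters[idx])))
    # Step 314: every automatic change still goes through manual review.
    return ("auto_correction_pending_review", actions)
```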

To make the traffic identification method in the embodiments of the present application clearer, the following takes mobile phone application (APP) traffic classification as an example and describes the method as introduced into a statistical-learning classification process. The flowchart is shown in Figure 4, and the details are as follows:

In step 401, after the mobile phone is connected, a data dial test is performed on the APP to be identified; that is, the service traffic generated by the APP is captured while the APP is being operated.

In step 402, the dial-test traffic is cleaned to remove traffic unrelated to the APP's actual service, such as advertising traffic and background traffic.

In step 403, the training and verification data sets are constructed and the sample labels are determined; in the initial state, the label of a sample is the APP that was being captured when the sample was recorded.

In step 404, the traffic data is vectorized. When the identification process is based on machine learning, statistical features of the network traffic are extracted, such as packet length and inter-packet interval. When it is based on deep learning, the raw traffic bytes are encoded and converted; for example, 1D-CNN-related methods convert the raw byte stream into grayscale images.

In step 405, the structure of the classification model is designed, the model is trained on the training data, and the verification data is input into the trained model to obtain the verification result for the verification samples.

In step 406, conflict mining is performed on the verification result of the classification model to extract the APPs whose classification metrics are unqualified, the clusters of APPs that conflict heavily with one another, and the public IPs and public domain names of the traffic of those APPs.

In step 407, it is determined whether an unqualified APP belongs to any APP cluster.

In step 408, when the unqualified APP does not belong to any conflict APP cluster, the classification model cannot identify it effectively. The unqualified APP is fed back to business personnel, who check whether there is a problem in the design of the classification model, the data acquisition, or other stages.

In step 409, when the unqualified APP belongs to a conflict APP cluster, a service conflict exists, and the public IPs and/or domain names of the unqualified APP are extracted.

In step 410, it is determined whether a public IP and/or domain name has been extracted.

In step 411, when a public IP and/or domain name has been extracted, processing is performed based on it. There are two processing methods: removing the traffic of the public domain name directly from the training samples, or adding, for each conflict cluster, a new public category label for the public domain name, so that in the next round of training those traffic samples are marked with the new label.

In step 412, for service conflicts that cannot be resolved by processing public IPs and/or domain names, the category merging strategy is triggered: the APPs in the conflict APP cluster are merged under one label, and during model training the traffic of the unqualified APPs is given the same category label.

In step 413, the operations of removing public IPs and/or domain names or adding category labels for them are reviewed manually, to prevent the automated process from making high-risk adjustments.

After the feedback correction, the data set is rebuilt, the model is retrained, and the next round of conflict detection is performed, forming an iterative closed loop.

In the above process, conflict detection on the statistical-learning identification results is essentially the same as in Case 1, but the automatic conflict correction strategy differs considerably: the corrections mainly concern cleaning the training data and determining the data labels. The essence of the approach is still to eliminate the influence of conflicting traffic on model identification.
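Because the corrections in this flow act on the training data, steps 411 and 412 amount to rewriting sample labels before retraining. The sketch below assumes samples are `(label, domain)` pairs; the function name, the `mode` switch, the default `"public"` label, and the `"|"`-joined merged label are all hypothetical choices made for illustration.

```python
def relabel_samples(samples, cluster, public_domains, mode="separate", public_label="public"):
    """Rewrite training samples for one conflict APP cluster.

    samples: list of (label, domain) pairs.
    cluster: set of APP labels that conflict with one another.
    public_domains: public domain names mined for the cluster (may be empty).
    mode: "separate" relabels public-domain samples, "remove" drops them.
    """
    merged = "|".join(sorted(cluster))
    out = []
    for label, domain in samples:
        if label not in cluster:
            out.append((label, domain))            # untouched: outside the cluster
        elif not public_domains:
            out.append((merged, domain))           # step 412: merge the labels
        elif domain in public_domains:
            if mode == "separate":
                out.append((public_label, domain))  # step 411: new public category
            # mode == "remove": the sample is dropped (step 411, first method)
        else:
            out.append((label, domain))
    return out
```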

The traffic identification method provided by the embodiments of the present application analyzes, calculates on, and screens the verification result of the identification model to obtain the target services whose identification errors need to be analyzed. A hit confusion matrix is constructed to record how each service was identified, and a conflict relationship graph is built from the mutual-hit results, making the conflict relationships among services clearer. A community discovery algorithm extracts the heavily conflicting services in the conflict relationship graph as conflict service clusters, from which the cause of each target service's identification error can be determined. Targeted correction according to that cause then raises the accuracy of the identification model. The method is an optimization of existing traffic identification techniques and can be adapted generally to both rule-based and statistical-learning-based traffic identification. It provides a systematic update-and-iterate process for traffic identification in which the cycle of problem detection and feedback correction continuously strengthens the identification model and makes traffic identification more accurate. In addition, examining the information output for conflicting services helps business personnel quickly locate problems in the current traffic identification system and respond promptly. Because the method provided by the embodiments of the present application is a general improvement to traffic identification, it can also be adapted effectively to new traffic identification systems in the future.

The embodiments of the present application further provide a traffic identification system, including a data collection module, a model construction module, an identification verification module, an identification statistics module, a relationship graph construction module, a community discovery module, a conflict determination module, a public traffic mining module, and a feedback strategy module. The data collection module, the model construction module, and the identification verification module belong to the traffic identification process. The modules are described below with reference to Figure 5, which shows the overall modules of the traffic identification method and their interaction with the model:

The data collection module collects and manages traffic data for the services to be identified, and updates the traffic data regularly to support identification of the latest versions of the services.

The model construction module constructs the traffic identification model, for example by generating a DPI feature rule base or training a statistical classification model. "Traffic identification model" is a broad term, but the underlying goal is essentially the same in every case: to run inference on the traffic to be identified and label it with a specific service.

The identification verification module verifies the identification model before release. The basic procedure is to prepare a standard verification data set in which the service label of each traffic sample is already known, input the data into the identification model to obtain the inference result, and evaluate the result against the original labels.

The above modules of the traffic identification process form the upstream half of the closed loop. The verification result they produce is input to the downstream modules for conflict detection and feedback.

The identification statistics module connects directly to the upstream identification verification module and is the basis for the subsequent conflict analysis and feedback. It has three main goals:

Computing the identification metrics of the services: the metric values of each service are calculated from the inference results on the verification samples. Common metrics are precision, recall, and the F1 score.

Computing the identification accuracy of the traffic features: the domain names and/or IP information appearing in the verification traffic samples are counted, and the identification metrics of these domain names and IPs are calculated. This information supports the public traffic mining module and prevents later stages from mistakenly removing traffic samples whose identification accuracy meets the standard when public traffic is screened.

Outputting the hit confusion matrix: this is the main function of the identification statistics module. A confusion matrix is a common tool for evaluating identification accuracy.
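The per-service metrics named in the first goal follow the standard definitions of precision, recall, and F1. The sketch below computes them from `(true_service, predicted_service)` pairs; the input layout and the function name `per_service_metrics` are assumptions made for illustration.

```python
from collections import Counter

def per_service_metrics(pairs):
    """Return {service: (precision, recall, f1)} from (truth, prediction) pairs."""
    tp, predicted, actual = Counter(), Counter(), Counter()
    for truth, pred in pairs:
        actual[truth] += 1
        predicted[pred] += 1
        if truth == pred:
            tp[truth] += 1

    metrics = {}
    for s in set(actual) | set(predicted):
        # Precision: share of samples identified as s that really are s.
        precision = tp[s] / predicted[s] if predicted[s] else 0.0
        # Recall: share of samples of s that were identified as s.
        recall = tp[s] / actual[s] if actual[s] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[s] = (precision, recall, f1)
    return metrics
```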

The relationship graph construction module constructs a relationship graph that reflects the conflict relationships among the services. During traffic identification verification, if a large share of the traffic of service A is identified as service B, and a large share of the traffic of service B is also identified as service A, then A and B are defined as conflicting services. The confusion matrix M output by the identification statistics module and a hit threshold T_1 are used to decide whether two applications conflict, and an undirected graph is constructed to reflect the overall identification conflict relationships among the services.
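A minimal sketch of the graph construction follows. It assumes that "a large share" means the row-normalized mis-hit rate exceeds T_1 in both directions; how exactly T_1 is applied to M is not spelled out in this passage, so that rule, like the function name `build_conflict_graph`, is an assumption.

```python
def build_conflict_graph(services, matrix, t1):
    """Return the undirected conflict edges as a set of frozensets.

    services: list of service names; matrix[i][j] is the number of verification
    samples of services[i] that were identified as services[j].
    t1: hit threshold applied to the row-normalized rate (assumed rule).
    """
    edges = set()
    n = len(services)
    for i in range(n):
        for j in range(i + 1, n):
            row_i, row_j = sum(matrix[i]), sum(matrix[j])
            if not row_i or not row_j:
                continue  # no verification samples for one of the services
            # Conflict only when the confusion is mutual.
            if matrix[i][j] / row_i > t1 and matrix[j][i] / row_j > t1:
                edges.add(frozenset((services[i], services[j])))
    return edges
```

Requiring the rate to exceed the threshold in both directions mirrors the mutual-hit definition above; one-way confusion does not create an edge.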

The community discovery module analyzes the conflict relationship graph with a community detection algorithm and outputs the conflict service cluster set, that is, it locates the services that are highly associated during identification. The module is not tied to one fixed community discovery algorithm; usable algorithms include, but are not limited to, the k-clique algorithm, the Newman fast algorithm, and the Kernighan-Lin algorithm.
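To show the shape of this module's input and output, the sketch below groups services by connected components of the conflict graph. This is a deliberately simple stand-in chosen to keep the example dependency-free; it is not the k-clique, Newman, or Kernighan-Lin algorithm, and a real deployment would substitute one of those. The function name `conflict_clusters` is hypothetical.

```python
from collections import defaultdict, deque

def conflict_clusters(edges):
    """Group services into clusters using connected components (a stand-in).

    edges: iterable of (service_a, service_b) pairs from the conflict graph.
    Returns a list of sets, one per cluster, ordered by smallest member name.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(component)
    return clusters
```

Connected components tend to produce larger clusters than true community detection, since a single edge is enough to join two groups.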

The conflict determination module determines the conflict status of each traffic service from the results extracted by the identification statistics module and the community discovery module and, according to the verification result of each traffic service, decides how each service is handled.

The public traffic feature mining module: if a conflict is determined to be caused by similar services, the cause is very likely the presence of public traffic features, so this module mines the public traffic features.

The feedback strategy module: all information obtained by the preceding conflict mining modules, including the conflict service clusters, the conflict causes, and the public traffic features, is passed to the feedback strategy module. This module is the key link between the "traffic identification process" and the "conflict mining process", and is the node that actually closes the feedback loop. The feedback module handles a service differently depending on the result of the conflict determination module.

It is worth noting that, with the "manual intervention + automatic correction" strategy, the identification model constructed in the next round of traffic identification training is improved for the target services, and conflict mining is performed again in the next round of evaluation, forming an iterative closed loop. The strategies of the feedback module are flexible and configurable. The feedback methods above are only examples; as long as the mined conflict information can be used, more rules and strategies can later be added to the configuration of the feedback module.

The steps of the above methods are divided only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into several; as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to, or introducing insignificant designs into, an algorithm or process without changing its core design also falls within the protection scope of this patent.

The embodiments of the present application further relate to a traffic identification apparatus which, as shown in Figure 6, includes an acquisition module 601, a processing module 602, a correction module 603, and an identification module 604.

Specifically, the acquisition module 601 is configured to obtain, based on the verification result of the identification model, the target service that the identification model failed to identify, where the verification result includes the real services to which a plurality of traffic flows belong and the identification results of the identification model for that traffic. The processing module 602 is configured to process the conflict relationship graph with a community discovery algorithm to obtain the conflict service cluster set and, according to the conflict service cluster set, obtain the cause of the identification error for the target service, where the conflict relationship graph represents the conflict relationships among the services and a conflict service cluster represents services whose degree of conflict is greater than a preset threshold. The correction module 603 is configured to correct the identification model according to the cause of the identification error for the target service, obtaining a corrected identification model. The identification module 604 is configured to perform traffic identification based on the corrected identification model when the corrected identification model reaches the preset identification accuracy.

In one example, when the indicator used to characterize identification accuracy is precision, the acquisition module 601 calculates the precision of each service from the verification result of the identification model and screens all services against a preset precision threshold. A service whose precision is below the preset precision threshold is determined to be a target service, that is, a service that the identification model fails to identify.
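The precision-based screening described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `find_target_services`, the representation of the verification result as `(true_service, predicted_service)` pairs, the threshold value, and the choice to treat a never-predicted service as having precision 0.0 are all assumptions made for the example.

```python
from collections import Counter


def find_target_services(results, precision_threshold):
    """Return {service: precision} for services whose precision is below the threshold.

    results: iterable of (true_service, predicted_service) pairs (assumed format).
    """
    predicted = Counter()  # how often the model output each service
    correct = Counter()    # how often that output matched the real service
    services = set()
    for true_service, predicted_service in results:
        services.update((true_service, predicted_service))
        predicted[predicted_service] += 1
        if true_service == predicted_service:
            correct[predicted_service] += 1

    targets = {}
    for service in services:
        # Assumption: a service the model never predicted gets precision 0.0.
        if predicted[service]:
            precision = correct[service] / predicted[service]
        else:
            precision = 0.0
        if precision < precision_threshold:
            targets[service] = precision
    return targets


# Hypothetical verification result with three services.
validation = [
    ("video", "video"),
    ("video", "chat"),   # a video flow misidentified as chat
    ("chat", "chat"),
    ("mail", "mail"),
]
targets = find_target_services(validation, precision_threshold=0.9)
print(targets)
```

In this toy data the "chat" label is output twice but is correct only once, so "chat" is the only service that falls below the threshold.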

In one example, the processing module 602 traverses all target services. If a target service belongs to a conflicting service cluster, it is determined to have a "service conflict". If a target service does not belong to any conflicting service cluster, its poor identification result is not caused by similar services but by the data or the features themselves, and such a service is determined to have a "feature problem", that is, a problem with the traffic features input to the identification model.
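The traversal described above amounts to a membership test against the conflicting service clusters. The sketch below assumes the clusters are available as Python sets of service names; the function name and the two label strings are illustrative choices.

```python
def classify_failure_causes(target_services, conflict_clusters):
    """Map each target service to "service conflict" or "feature problem".

    target_services: iterable of service names that failed screening.
    conflict_clusters: list of sets of service names (assumed representation).
    """
    # Union of all clusters: any service in here conflicts with at least one other.
    in_some_cluster = set().union(*conflict_clusters)
    causes = {}
    for service in target_services:
        if service in in_some_cluster:
            causes[service] = "service conflict"
        else:
            causes[service] = "feature problem"
    return causes


# Hypothetical input: "chat" and "video" conflict with each other, "mail" does not.
causes = classify_failure_causes(["chat", "video", "mail"], [{"chat", "video"}])
print(causes)
```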

In one example, failures caused by the traffic features input to the identification model are difficult to correct through automated feedback, because the root cause may be any of several things, such as irregular collected sample data or defects in the design of the rule base or model. Target services of this kind are mostly handled by manual "monitoring": the correction module 603 collects these target services and passes them to the traffic identification stage for manual intervention, for example re-collecting the traffic of the target service, feeding it back to service personnel for re-analysis, re-extracting regular-expression features (for traditional DPI-based identification), or modifying the feature engineering (for ML-based identification).

When no common traffic feature is obtained, or the common traffic feature is invalid, the correction module 603 merges the services in the conflicting service cluster into one service, so that identification conflicts among these applications are avoided in the next round of model construction.

Assuming the common traffic features are common domain names, when valid common traffic features are obtained, the correction operation is as follows: the correction module 603 separates the common domain names extracted from the conflicting service cluster into the features of a separate service category, such as "Alibaba common traffic" or "Tencent common traffic". When the identification model is reconstructed, these common domain names are classified directly into the separated category, which prevents identification conflicts with other services.

Assuming the common traffic features are common domain names, when valid common traffic features are obtained, the correction operation may alternatively be as follows: the correction module 603 removes the extracted common domain names in the next round of model construction, for example by dropping the training samples of these common domain names when training an ML or DL classification model.
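The three correction operations for conflicting service clusters (merging the cluster, splitting common domain names into their own category, or dropping them from training) can be sketched as simple relabeling and filtering steps on the training data. The sample format `(domain, service)`, the function names, the label strings, and the `example.com` domains are all hypothetical and chosen only for illustration.

```python
def merge_cluster(samples, cluster, merged_label):
    """No valid common feature: relabel every service in the cluster as one service."""
    return [(d, merged_label if s in cluster else s) for d, s in samples]


def split_common_domains(samples, common_domains, common_label):
    """Valid common domains found: move them into a separate service category."""
    return [(d, common_label if d in common_domains else s) for d, s in samples]


def drop_common_domains(samples, common_domains):
    """Alternative: remove samples of the common domains before retraining."""
    return [(d, s) for d, s in samples if d not in common_domains]


# Hypothetical training samples as (domain, service) pairs.
samples = [
    ("cdn.example.com", "video"),
    ("cdn.example.com", "chat"),
    ("video.example.com", "video"),
    ("chat.example.com", "chat"),
]
cluster = {"video", "chat"}
common = {"cdn.example.com"}

merged = merge_cluster(samples, cluster, "video+chat")
split = split_common_domains(samples, common, "common traffic")
dropped = drop_common_domains(samples, common)
```

Which operation applies depends on whether a valid common traffic feature was extracted for the cluster; the choice between splitting and dropping is left open by the text.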

The traffic identification apparatus provided by the embodiments of the present application analyzes, calculates on, and screens the verification result of the identification model to obtain the target services whose identification failures need to be analyzed. It constructs a hit confusion matrix to record how each service was identified, and constructs a conflict relationship graph from the mutual hits, which makes the conflict relationships among services clearer. A community discovery algorithm then extracts the services with a high degree of conflict from the conflict relationship graph as conflicting service clusters, from which the cause of each target service's identification failure can be determined. Finally, targeted corrections are made according to those causes, which raises the accuracy of the identification model.

The traffic identification method provided by the embodiments of the present application is an optimization of existing traffic identification techniques and can be adapted generally to both rule-based and statistical-learning-based traffic identification. It provides a systematic update and iteration process for traffic identification in which the identification model is strengthened continually through a cycle of problem detection and feedback correction, making traffic identification more precise. In addition, inspecting the information output by the conflicting-service feedback helps service personnel quickly locate problems in the current traffic identification system and respond in time. Because the method provided by the embodiments of the present application is a general method for improving traffic identification, it can also be adapted effectively to new traffic identification systems in the future.
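The chain from verification result to conflicting service clusters can be illustrated with a small sketch. It follows the definitions given in the claims: M[i,j] counts how often real service i was identified as service j, and two services conflict when both M[i,j] and M[j,i] exceed a first preset threshold. The text here does not name a specific community discovery algorithm, so the sketch swaps in plain connected components as the simplest stand-in; a real system would likely use a proper community detection method. All function names and the sample data are assumptions.

```python
from collections import defaultdict


def hit_confusion_matrix(results):
    """M[(i, j)] = number of times real service i was identified as service j."""
    matrix = defaultdict(int)
    for true_service, predicted_service in results:
        matrix[(true_service, predicted_service)] += 1
    return dict(matrix)


def conflict_graph(matrix, first_threshold):
    """Connect i and j when both M[i,j] and M[j,i] exceed the threshold."""
    graph = defaultdict(set)
    for (i, j), count in matrix.items():
        if i != j and count > first_threshold and matrix.get((j, i), 0) > first_threshold:
            graph[i].add(j)
            graph[j].add(i)
    return dict(graph)


def conflict_clusters(graph):
    """Stand-in for community discovery: connected components of the graph."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical verification result: video and chat are confused in both directions.
results = (
    [("video", "chat")] * 3
    + [("chat", "video")] * 2
    + [("video", "video")] * 5
    + [("mail", "chat")] * 1
    + [("mail", "mail")] * 4
)
matrix = hit_confusion_matrix(results)
clusters = conflict_clusters(conflict_graph(matrix, first_threshold=1))
print(clusters)
```

Because "mail" is mistaken for "chat" only once and never the other way round, it stays outside the cluster formed by "video" and "chat".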

It is readily apparent that this embodiment is the apparatus embodiment corresponding to the traffic identification method embodiments above, and it can be implemented in cooperation with them. The technical details mentioned in the traffic identification method embodiments remain valid in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the technical details mentioned in this embodiment can also be applied to the traffic identification method embodiments.

It is worth noting that the modules involved in the above embodiments of the present application are all logical modules. In practice, a logical unit may be a physical unit, part of a physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present application, units that are not closely related to solving the technical problem raised by the present application are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.

An embodiment of the present application also provides an electronic device which, as shown in Figure 7, includes at least one processor 701 and a memory 702 communicatively connected to the at least one processor 701. The memory 702 stores instructions executable by the at least one processor 701, and the instructions are executed by the at least one processor 701 to enable the at least one processor to perform the traffic identification method described above.

The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, and it connects the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and are therefore not described further here. A bus interface provides the interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, and provides a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium through an antenna, and the antenna also receives data and passes it to the processor.

The processor is responsible for managing the bus and for general processing, and may also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor when performing operations.

The above product can perform the method provided by the embodiments of the present application and has the functional modules and beneficial effects corresponding to performing the method. For technical details not described exhaustively in this embodiment, refer to the method provided by the embodiments of the present application.

An embodiment of the present application also provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the method embodiments described above.

Those skilled in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes a number of instructions that cause a device (which may be a microcontroller, a chip, or the like) or a processor to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The above embodiments are provided so that those of ordinary skill in the art can implement and use the present application. Those of ordinary skill in the art may make various modifications or changes to the above embodiments without departing from the inventive concept of the present application. The scope of protection of the present application is therefore not limited by the above embodiments, but should accord with the broadest scope of the innovative features recited in the claims.

Claims

A traffic identification method, including:

S1, based on a verification result of an identification model, obtaining a target service that the identification model fails to identify, wherein the verification result includes: the real services to which multiple pieces of traffic belong and the identification results of the identification model for the traffic;

S2, processing a conflict relationship graph using a community discovery algorithm to obtain a set of conflicting service clusters, and obtaining the cause of the identification failure of the target service according to the set of conflicting service clusters; wherein the conflict relationship graph is used to represent the conflict relationships among the services, and a conflicting service cluster is used to represent services whose degree of conflict is greater than a preset threshold;

S3, correcting the identification model according to the cause of the identification failure of the target service, to obtain a corrected identification model;

S4, when the corrected identification model reaches a preset identification accuracy, performing traffic identification based on the corrected identification model.
The traffic identification method according to claim 1, wherein, after correcting the identification model according to the cause of the identification failure of the target service to obtain the corrected identification model, the method further includes:

when the corrected identification model does not reach the preset identification accuracy, repeating S1 to S3.
The traffic identification method according to claim 1, wherein obtaining, based on the verification result of the identification model, the target service that the identification model fails to identify includes:

calculating, from the verification result of the identification model, an indicator value for each service that characterizes identification accuracy;

comparing the indicator value of each service with a preset range, and taking a service whose indicator value is not within the preset range as the target service.
The traffic identification method according to claim 1, wherein, after S1 and before S2, the method further includes:

generating a hit confusion matrix M according to the verification result; wherein the row elements of the hit confusion matrix represent the real service to which traffic belongs; the column elements of the hit confusion matrix represent the identification result of the identification model for the traffic; and the element value M[i,j] of M represents the number of times the i-th real service was identified as the j-th service;

taking each service as a node, and generating the conflict relationship graph according to the element values of M and a first preset threshold; wherein two services have a conflict relationship when M[i,j] and M[j,i] are both greater than the first preset threshold.
The traffic identification method according to claim 1, wherein obtaining the cause of the identification failure of the target service according to the set of conflicting service clusters includes:

when the target service belongs to the set of conflicting service clusters, determining that the cause of the identification failure of the target service is a service conflict;

when the target service does not belong to the set of conflicting service clusters, determining that the cause of the identification failure of the target service lies in the traffic features input to the identification model.
The traffic identification method according to claim 5, wherein correcting the identification model according to the cause of the identification failure of the target service includes:

when the cause of the identification failure lies in the traffic features input to the identification model, performing manual analysis on the target service and correcting the identification model;

when the cause of the identification failure is a service conflict and a common traffic feature exists, removing the common traffic feature when constructing the identification model, or taking the common traffic feature as a feature of a separate service category; wherein the common traffic feature is a feature that causes multiple services within one conflicting service cluster to conflict;

when the cause of the identification failure is a service conflict and no common traffic feature exists, merging the services in the conflicting service cluster into one service.
The traffic identification method according to claim 6, wherein extracting the common traffic feature includes:

for each conflicting service cluster, extracting the traffic features that appear;

calculating the term frequency with which each traffic feature appears in each of the services within the conflicting service cluster;

within the conflicting service cluster, calculating the inverse document frequency of each traffic feature; wherein the inverse document frequency is used to characterize the ratio of a first service quantity to the total number of services in the conflicting service cluster, and the first service quantity represents the number of services in which the traffic feature appears;

when the inverse document frequency of a traffic feature within the conflicting service cluster is greater than a second preset threshold, and the term frequency of the traffic feature in every service within the conflicting service cluster is less than a third preset threshold, taking the traffic features within the conflicting service cluster whose identification accuracy is less than a fourth preset threshold as the common traffic features of the conflicting service cluster.
The traffic identification method according to claim 7, wherein the verification result further includes:

the traffic features input to the identification model, the traffic features including domain name and/or IP information; and wherein, before extracting the common traffic feature, the method further includes:

counting the traffic features that appear in the verification result, and calculating the identification accuracy of the traffic features.
A traffic identification apparatus, including:

an acquisition module (601), configured to obtain, based on a verification result of an identification model, a target service that the identification model fails to identify, wherein the verification result includes: the real services to which multiple pieces of traffic belong and the identification results of the identification model for the traffic;

a processing module (602), configured to process a conflict relationship graph using a community discovery algorithm to obtain a set of conflicting service clusters, and to obtain the cause of the identification failure of the target service according to the set of conflicting service clusters; wherein the conflict relationship graph is used to represent the conflict relationships among the services, and a conflicting service cluster is used to represent services whose degree of conflict is greater than a preset threshold;

a correction module (603), configured to correct the identification model according to the cause of the identification failure of the target service, to obtain a corrected identification model;

an identification module (604), configured to perform traffic identification based on the corrected identification model when the corrected identification model reaches a preset identification accuracy.
An electronic device, including:

at least one processor (701); and

a memory (702) communicatively connected to the at least one processor (701); wherein

the memory (702) stores instructions executable by the at least one processor (701), and the instructions are executed by the at least one processor (701) to enable the at least one processor (701) to perform the traffic identification method according to any one of claims 1 to 8.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the traffic identification method according to any one of claims 1 to 8.