CN108664607A - A transfer-learning-based method for improving data quality in a power telecommunication network - Google Patents

A transfer-learning-based method for improving data quality in a power telecommunication network Download PDF

Info

Publication number
CN108664607A
CN108664607A (application CN201810445948.6A)
Authority
CN
China
Prior art keywords
sample
kernel
space
cluster
target domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810445948.6A
Other languages
Chinese (zh)
Inventor
杨济海
李仁华
彭汐单
巢玉坚
邓永康
伍小生
田晖
郑富永
王�华
付萍萍
胡游君
邱玉祥
吕顺利
周鹏
邓伟
刘皓
蔡新忠
查凡
王宏
丁传文
刘洋
李石君
余伟
余放
李宇轩
李敏
彭亮
彭超
陈雪莲
陈艳华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information And Communication Branch Of Jiangxi Electric Power Co Ltd
Wuhan University WHU
NARI Group Corp
Information and Telecommunication Branch of State Grid Jiangxi Electric Power Co Ltd
Original Assignee
Information And Communication Branch Of Jiangxi Electric Power Co Ltd
Wuhan University WHU
NARI Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information And Communication Branch Of Jiangxi Electric Power Co Ltd, Wuhan University WHU, NARI Group Corp filed Critical Information And Communication Branch Of Jiangxi Electric Power Co Ltd
Priority to CN201810445948.6A priority Critical patent/CN108664607A/en
Publication of CN108664607A publication Critical patent/CN108664607A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The present invention relates to a transfer-learning-based method for improving data quality in a power telecommunication network. First, kernel discriminant analysis is applied to the set L to find a suitable kernel mapping space, and all samples in L, U and O are mapped into that kernel space so that the marginal distributions of the source-domain and target-domain samples become very close. Then, bisecting k-means is used to select, from the source domain, samples whose conditional probability distribution is similar to that of the target domain. In the kernel space obtained in step 1, the samples selected in step 2 and the labeled target-domain samples jointly train a model, which predicts labels for the unlabeled samples in the target domain. This finally yields N predictions for the set U, and the final label of each sample in U is determined by majority voting. Through transfer learning, the present invention effectively resolves the mismatch between training-set and test-set sample distributions, solves the problem that labeled samples are too few to train on, and saves considerable manpower and expense.

Description

A transfer-learning-based method for improving data quality in a power telecommunication network
Technical field
The invention belongs to the technical field of power telecommunication network data quality improvement, and specifically relates to a transfer-learning-based method for improving data quality in a power telecommunication network.
Background technology
With the deepening of State Grid Corporation of China's "three centralizations, five majors" reform and the rapid construction of the strong smart grid, enterprise informatization has advanced comprehensively. The dedicated power communication network supporting the smart grid entered the information-system management stage after three years of rapid development, and a communications management system, "SG-TMS", has been built with "two-level deployment" at headquarters and provincial companies and "four-level application" across headquarters, branches, provincial companies, and city and county companies. Through standardized, regulated project construction and continuous improvement of system functions, "SG-TMS" has become deeply embedded in the daily work of tens of thousands of power communication professionals, comprehensively collecting the construction, operation and management data of tens of thousands of devices over recent years. The accumulated mass of power communication data, together with data from numerous external systems, forms the basis for big data analysis.
To find the required information efficiently and accurately within this accumulated mass of data, information classification is an indispensable first step. Through classification, information can be organized effectively and located quickly and accurately. Classification learning is an important branch of machine learning and has been extensively researched and developed.
Traditional classification learning rests on two basic assumptions that ensure the trained classification model is accurate and reliable: (1) the training samples used for learning and the new test samples are independent and identically distributed; (2) enough training samples are available to learn a good classification model. In practice, however, these two conditions often cannot be satisfied. First, as time passes, the originally available labeled sample data may become outdated, and its distribution diverges from that of the new test samples both semantically and statistically. In addition, labeled sample data is often very scarce and hard to obtain, while discarding outdated data entirely is excessively wasteful.
In recent years, in-depth research on transfer learning has addressed these problems. Transfer learning is a new machine learning paradigm that uses knowledge from a source domain to solve problems in a target domain; its research areas mainly include text classification, text clustering, sentiment classification, image classification, collaborative filtering, sensor-based location estimation, and AI planning.
In the text-processing field, Dai et al. proposed a co-clustering method that clusters documents and word features simultaneously and transfers knowledge through word features shared across domains. They also proposed a transfer naive Bayes classifier, which first estimates the data distribution of the source domain and then continuously corrects it to adapt it to the target-domain data. Zhuang et al. processed text at the concept level, proposing a transfer learning method that mines document concepts and word-feature concepts. Building on this, Long et al. proposed a dual transfer model that further partitions concepts and improves classification accuracy. Gu et al. proposed a multi-task clustering method with a shared subspace and applied it to transfer classification.
In image processing, Dai et al. proposed a translated transfer learning method that uses text data to assist image clustering. Raina et al. proposed a new self-taught learning method that learns from unlabeled data, using sparse coding to construct high-level features from large amounts of unlabeled data to improve image classification performance. Zhu et al. studied a heterogeneous transfer learning method that uses the tag annotations on images as a bridge for knowledge transfer between text and images, improving classification on image data.
In collaborative filtering, Wang et al. proposed a feature-subspace transfer learning method to overcome the sparsity problem in collaborative filtering: a user feature subspace is learned from auxiliary data and transferred to the target domain. Pan et al. studied a transfer learning algorithm for collaborative filtering with uncertain ratings, using auxiliary data with uncertain ratings as constraints in the matrix factorization objective. Cao et al. proposed a link prediction model based on a shared latent feature strategy across projects, whose performance improves on single-task learning.
Summary of the invention
As power telecommunication network informatization continuously deepens, massive data on communication management, equipment operation, network construction and the like gradually accumulates, containing great value waiting to be mined. However, over time the originally available labeled sample data becomes outdated, and its distribution diverges from that of new test samples both semantically and statistically. In addition, labeled sample data is often very scarce and hard to obtain, while discarding outdated data entirely is excessively wasteful. Because the data is strongly time-sensitive, mining hidden information from it may otherwise produce biased results.
Before presenting the solution, the present invention defines the following terms:
The outdated data is the source domain, and the new data is the target domain. Let L = {X_L, Y_L} denote the labeled samples in the target domain, where X_L = {x_1, ..., x_γ} and Y_L = {y_1, ..., y_γ}, containing γ samples; let U = {X_U} denote the unlabeled samples in the target domain, where X_U = {x_{γ+1}, ..., x_{γ+u}}, containing u samples. Similarly, let O = {X_O, Y_O} denote the source-domain samples, containing o samples.
The present invention uses transfer-learning domain knowledge: selected source-domain samples are trained jointly with the target-domain samples. The basic principle of the transfer is that the selected source-domain samples possess marginal and conditional distributions that are the same as, or similar to, those of the target domain.
To achieve the above goal, the scheme proposed by the present invention is as follows:
A transfer-learning-based method for improving data quality in a power telecommunication network, characterized by the definitions: L = {X_L, Y_L} denotes the labeled samples in the target domain, where X_L = {x_1, ..., x_γ} and Y_L = {y_1, ..., y_γ}, containing γ samples; U = {X_U} denotes the unlabeled samples in the target domain, where X_U = {x_{γ+1}, ..., x_{γ+u}}, containing u samples; O = {X_O, Y_O} denotes the source-domain samples, containing o samples. The method specifically comprises:
Step 1: apply kernel discriminant analysis to the set L to find a suitable kernel mapping space, and map all samples in L, U and O into the kernel space, so that the marginal distribution of the source-domain samples in the kernel space is close to that of the target-domain samples;
Step 2: in the kernel space obtained in step 1, use bisecting k-means to select source-domain samples whose conditional probability distribution is similar to that of the target domain, and record the selected samples in the original space as the sample set S;
Step 3: in the kernel space obtained in step 1, train a model jointly on the samples selected in step 2 and the labeled target-domain samples, and predict labels for the unlabeled samples in the target domain;
Step 4: execute steps 1-3 N times; in step 1, the kernel mapping space is found from the samples in L on the first iteration and from the samples in L ∪ S on subsequent iterations; finally, N predictions for the set U are obtained, and the final label of each sample in U is determined by majority voting.
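The four-step loop above can be sketched in Python as follows. The function names `kda_fit`, `select_source` and `train_predict` are illustrative stand-ins, not part of the patent, for the KDA mapping, bisecting-k-means selection and classifier components that the method describes:

```python
from collections import Counter

def improve_data_quality(L, U, O, kda_fit, select_source, train_predict, n_rounds=3):
    """Hedged sketch of steps 1-4: repeat the KDA mapping, source-sample
    selection and joint training n_rounds times, then majority-vote."""
    votes = [[] for _ in U]   # one vote list per unlabeled sample in U
    S = None                  # selected source samples in the original space
    for i in range(n_rounds):
        learn = L if S is None else L + S          # L on round 1, L ∪ S after
        mapping = kda_fit(learn)                   # step 1: kernel mapping space
        NL, NU, NO = mapping(L), mapping(U), mapping(O)
        SO, S = select_source(NO, NL)              # step 2: select source samples
        preds = train_predict(SO + NL, NU)         # step 3: joint training + prediction
        for v, p in zip(votes, preds):
            v.append(p)
    # step 4: majority vote over the N predictions for each sample in U
    return [Counter(v).most_common(1)[0][0] for v in votes]
```

With trivial stubs (identity mapping, pass-through selection, majority-class classifier) the driver returns one label per sample in U.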
In the above transfer-learning-based method for improving power telecommunication network data quality, step 1 specifically comprises:
Step 1.1: compute the matrix W. W = (W_i)_{i=1,...,NC} is a block-diagonal matrix in which W_i is an l_i × l_i matrix whose every element is 1/l_i, l_i is the number of samples of class i, and NC is the total number of classes, i.e.:
W = diag(W_1, ..., W_{NC}), with (W_i)_{jk} = 1/l_i for j, k = 1, ..., l_i;
Step 1.2: compute the kernel matrix K. The kernel function κ(x_i, x_j) defines the dot product in the feature space F, i.e. κ(x_i, x_j) = φ(x_i)·φ(x_j), and each element of the kernel matrix K is κ_ij = κ(x_i, x_j). The present invention selects the Gaussian kernel as the kernel function, i.e.:
κ(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)), where σ > 0 is the bandwidth of the Gaussian kernel;
Step 1.3: simplify the objective function. The eigenvector decomposition of the kernel matrix K gives K = P Λ P^T, where Λ is the diagonal matrix formed by the nonzero eigenvalues and the columns of P are mutually orthogonal unit eigenvectors corresponding to the eigenvalues in Λ. The objective function then reduces to:
λβ = P^T W P β
where β = Λ P^T α; finding the β that maximizes λ, the corresponding α can be computed;
Step 1.4: map samples into the kernel space. The projection of a sample z onto v is:
v · φ(z) = Σ_{i=1}^{l} α_i κ(x_i, z).
In the above transfer-learning-based method for improving power telecommunication network data quality, step 2 is based on the definitions:
Definition 2.1: given a cluster C and its two sub-clusters C_1 and C_2 with C_1 ∪ C_2 = C and C_1 ∩ C_2 = ∅, then:
Par(C, C_1, C_2) = [SSE(C) - SSE(C_1) - SSE(C_2)]
where SSE(C) is the sum of the distances from the non-centroid points in C to the centroid, and Par(C, C_1, C_2) indicates whether C can be decomposed into the two sub-clusters C_1 and C_2, taking value 1 (it can, i.e. the split reduces the total SSE) or 0 (it cannot);
Definition 2.2: given a cluster C_i whose samples are all marked "+" or "-", the purity of C_i is:
Purity(C_i) = max(n_+, n_-) / (n_+ + n_-)
where Purity(C_i), the purity of C_i, is the largest proportion among the positive and negative samples;
Step 2 specifically comprises:
Step 2.1: randomly select 2 samples from C_i as the initial mean vectors μ_1 and μ_2, serving as the centroids of the sub-clusters C_i1 and C_i2 respectively;
Step 2.2: compute the Euclidean distance from each sample in cluster C_i to μ_1 and μ_2; if the sample is closest to μ_1, assign it to cluster C_i1, otherwise to cluster C_i2;
Step 2.3: compute the new mean vector μ'_1 = (1/|C_i1|) Σ_{x∈C_i1} x for cluster C_i1; if μ_1 ≠ μ'_1, update μ_1 to μ'_1; do the same for cluster C_i2;
Step 2.4: if the current mean vectors are no longer updated, cluster C_i is finally divided into the two sub-clusters C_i1 and C_i2; otherwise, repeat steps 2.2 to 2.3.
In the above transfer-learning-based method for improving power telecommunication network data quality, step 4 specifically comprises:
Step 4.1: given L, U and O, set i = 1 and the number of iterations to N;
Step 4.2: LearnKDA = L; if i > 1, LearnKDA = L ∪ S_{i-1};
Step 4.3: apply kernel discriminant analysis to the samples in the set LearnKDA to find the kernel mapping space;
Step 4.4: map the sets L, U and O into the kernel space as NL_i, NU_i and NO_i respectively;
Step 4.5: using bisecting k-means clustering, select samples from NO_i; the selected sample set is SO_i, and S_i denotes that sample set in the original space;
Step 4.6: train a model C_i on SO_i and NL_i, and predict a label for each sample in NU_i;
Step 4.7: set i = i + 1 and repeat steps 4.2 to 4.6 until i = N;
Step 4.8: the N predictions for the set U thus obtained are combined by majority voting to determine the final label of each sample in U.
Therefore, the invention has the following advantages:
Through transfer learning, it effectively resolves the mismatch between training-set and test-set sample distributions, solves the problem that labeled samples are too few to train on, and saves considerable manpower and expense.
Description of the drawings
Fig. 1a: sample distribution of the target domain in the original space.
Fig. 1b: sample distribution of the source domain in the original space.
Fig. 1c: sample distribution of the target domain in the kernel space.
Fig. 1d: sample distribution of the source domain in the kernel space.
Figs. 1a-1d together compare the sample distributions of the source domain and target domain in the original space and the kernel space.
Fig. 2 is the operational flowchart.
Detailed description of the embodiments
Step 1: feature mapping based on kernel functions
Kernel discriminant analysis (KDA), like SVM and kernel PCA, uses the "kernel trick": the data is first mapped nonlinearly into some feature space F, and linear discriminant analysis (LDA) is then performed in that feature space, thereby implicitly performing a nonlinear discriminant analysis of the original input space.
Let φ be the nonlinear mapping from the input space to some feature space F. The linear discriminant to be found in F maximizes
J(v) = (v^T S_B^φ v) / (v^T S_W^φ v)
where v ∈ F, and S_B^φ and S_W^φ are the between-class and within-class scatter matrices in F, i.e.:
S_B^φ = Σ_{i=1}^{NC} l_i (m_i^φ - m^φ)(m_i^φ - m^φ)^T, S_W^φ = Σ_{i=1}^{NC} Σ_{x∈X_i} (φ(x) - m_i^φ)(φ(x) - m_i^φ)^T
where m_i^φ = (1/l_i) Σ_{x∈X_i} φ(x) is the mean of the class-i samples in F, m^φ = (1/l) Σ_x φ(x) is the overall mean in F, l is the total number of samples, l_i is the number of samples of class i, and NC is the total number of classes.
According to reproducing-kernel theory, any v ∈ F must lie in the span of the training samples in F, so v admits an expansion of the form
v = Σ_{i=1}^{l} α_i φ(x_i).
Replacing dot products with the kernel function then yields the objective function of KDA:
λ(α) = (α^T K W K α) / (α^T K K α).
1.1 The matrix W
W = (W_i)_{i=1,...,NC} is a block-diagonal matrix, where W_i is an l_i × l_i matrix whose every element is 1/l_i, i.e.:
W = diag(W_1, ..., W_{NC}), with (W_i)_{jk} = 1/l_i for j, k = 1, ..., l_i.
1.2 The kernel matrix K
The kernel function κ(x_i, x_j) defines the dot product in the feature space F, i.e. κ(x_i, x_j) = φ(x_i)·φ(x_j), and each element of the kernel matrix K is κ_ij = κ(x_i, x_j). The present invention selects the Gaussian kernel as the kernel function, i.e.:
κ(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)), where σ > 0 is the bandwidth of the Gaussian kernel.
1.3 Simplifying the objective function
The eigenvector decomposition of the kernel matrix K gives K = P Λ P^T, where Λ is the diagonal matrix formed by the nonzero eigenvalues and the columns of P are mutually orthogonal unit eigenvectors corresponding to the eigenvalues in Λ. The objective function then reduces to:
λβ = P^T W P β
where β = Λ P^T α; finding the β that maximizes λ, the corresponding α can be computed.
1.4 Mapping samples into the kernel space
The projection of a sample z onto v is:
v · φ(z) = Σ_{i=1}^{l} α_i κ(x_i, z).
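Assuming the expansion coefficients α have already been obtained from the eigenproblem of section 1.3, the Gaussian kernel of section 1.2 and the projection of section 1.4 can be sketched in plain Python (the variable names are illustrative):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel κ(x, y) = exp(-||x - y||² / (2σ²)), σ > 0."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def project(z, train_samples, alpha, sigma=1.0):
    """Projection of a sample z onto v = Σ α_i φ(x_i):
    v · φ(z) = Σ_i α_i κ(x_i, z)."""
    return sum(a * gaussian_kernel(x, z, sigma)
               for a, x in zip(alpha, train_samples))
```

Mapping a whole set into the one-dimensional kernel space is then just `[project(z, X, alpha) for z in X]`; in practice several discriminant directions (several α vectors) would be kept.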
Step 2: sample selection based on clustering
The bisecting k-means clustering algorithm is a variant of the k-means clustering algorithm, designed mainly to remedy the uncertainty that k-means introduces into the clustering result through its random choice of initial centroids; bisecting k-means is much less sensitive to that random choice.
In Euclidean space, the quality of a cluster C_i is usually measured by the sum of squared errors (SSE): after clustering, an error is computed for each point, namely the distance from each non-centroid point to the centroid u_i, i.e.:
SSE(C_i) = Σ_{x∈C_i} dist(x, u_i)²
Before the sample selection operation, two definitions are given.
Definition 2.1: given a cluster C and its two sub-clusters C_1 and C_2 with C_1 ∪ C_2 = C and C_1 ∩ C_2 = ∅, then:
Par(C, C_1, C_2) = [SSE(C) - SSE(C_1) - SSE(C_2)]
Definition 2.2: given a cluster C_i whose samples are all marked "+" or "-", the purity of C_i is:
Purity(C_i) = max(n_+, n_-) / (n_+ + n_-)
Using the bisecting k-means clustering algorithm, the sample selection procedure is as follows:
(1) Initially, the source-domain samples and the labeled target-domain samples together form the data set to be clustered, initialized as a single cluster C_0, i.e. C = {C_0}.
(2) Take a cluster C_i from C and perform a k-means clustering operation with k = 2, obtaining two sub-clusters C_i1 and C_i2.
(3) If Purity(C_i) ≤ 0.9 or Par(C_i, C_i1, C_i2) = 1, replace C_i with C_i1 and C_i2 in the set C.
(4) Repeat steps (2) and (3) until every element of the set C has been traversed.
(5) Finally C = {C_1, ..., C_k} is obtained. The label of cluster C_i is the most frequent label among the labeled target-domain samples in C_i, i.e. CL_i = argmax_{j∈[1,NC]} nc_ij, where nc_ij is the number of labeled target-domain samples of class j in cluster C_i. From cluster C_i, the source-domain samples whose labels agree with the cluster label are selected.
The concrete operations of step (2) are as follows:
(a) Randomly select 2 samples from C_i as the initial mean vectors μ_1 and μ_2, serving as the centroids of the sub-clusters C_i1 and C_i2 respectively.
(b) Compute the Euclidean distance from each sample in cluster C_i to μ_1 and μ_2; if the sample is closest to μ_1, assign it to cluster C_i1, otherwise to cluster C_i2.
(c) Compute the new mean vector μ'_1 = (1/|C_i1|) Σ_{x∈C_i1} x for cluster C_i1; if μ_1 ≠ μ'_1, update μ_1 to μ'_1; do the same for cluster C_i2.
(d) If the current mean vectors no longer update, cluster C_i is finally divided into the two sub-clusters C_i1 and C_i2; otherwise, repeat steps (b) and (c).
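A minimal pure-Python sketch of the 2-means split in steps (a)-(d), together with the SSE measure used by the Par criterion (the initial centroids are chosen at random, so the split itself is randomized, as the text notes):

```python
import random

def sse(cluster, mean):
    """Sum of squared distances from the points of a cluster to its centroid."""
    return sum(sum((a - m) ** 2 for a, m in zip(x, mean)) for x in cluster)

def mean_vector(cluster):
    """Componentwise mean of the points in a cluster."""
    dim = len(cluster[0])
    return [sum(x[d] for x in cluster) / len(cluster) for d in range(dim)]

def bisect(cluster):
    """Steps (a)-(d): split one cluster into two by k-means with k = 2."""
    mu1, mu2 = random.sample(cluster, 2)           # (a) random initial means
    while True:
        c1, c2 = [], []
        for x in cluster:                          # (b) assign to nearest mean
            d1 = sum((a - b) ** 2 for a, b in zip(x, mu1))
            d2 = sum((a - b) ** 2 for a, b in zip(x, mu2))
            (c1 if d1 <= d2 else c2).append(x)
        n1, n2 = mean_vector(c1), mean_vector(c2)  # (c) recompute the means
        if n1 == mu1 and n2 == mu2:                # (d) means stable: done
            return c1, c2
        mu1, mu2 = n1, n2
```

On two well-separated groups the split recovers them regardless of the initialization, and Par(C, C_1, C_2) can then be evaluated as 1 when sse(C) - sse(C_1) - sse(C_2) > 0.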
Step 3: training the classifier
A classifier is trained on the samples selected in step 2 together with the labeled target-domain samples, and is used to predict labels for the unlabeled samples in the target domain. The classifier model can be chosen from support vector machines (SVM), logistic regression, decision trees, naive Bayes and similar models, with cross-validation used to measure model quality.
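The cross-validation check of model quality can be sketched generically; `train_fn`, which fits any one of the candidate models (SVM, logistic regression, decision tree, naive Bayes) and returns a predict function, is an illustrative interface assumed here, not one defined by the patent:

```python
def k_fold_accuracy(samples, train_fn, k=5):
    """Estimate classifier quality by k-fold cross-validation: train on
    k-1 folds, measure accuracy on the held-out fold, and average."""
    folds = [samples[i::k] for i in range(k)]       # simple round-robin folds
    accs = []
    for i in range(k):
        held_out = folds[i]
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_fn(train)                     # fit one candidate model
        hits = sum(model(x) == y for x, y in held_out)
        accs.append(hits / len(held_out))
    return sum(accs) / k
```

Running this once per candidate model on the step-2 training set and keeping the model with the highest average accuracy matches the "measure model quality with cross-validation" instruction.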
Step 4: repeat steps 1-3 N times
The concrete operations are as follows:
(1) Given L, U and O, set i = 1 and the number of iterations to N.
(2) LearnKDA = L; if i > 1, LearnKDA = L ∪ S_{i-1}.
(3) Apply the KDA method of step 1 to the samples in the set LearnKDA to find the kernel mapping space.
(4) Map the sets L, U and O into the kernel space as NL_i, NU_i and NO_i respectively.
(5) Using the clustering method of step 2, select samples from NO_i; the selected sample set is SO_i, and S_i denotes that sample set in the original space.
(6) Train a model C_i on SO_i and NL_i, and predict a label for each sample in NU_i.
(7) Set i = i + 1 and repeat steps (2)-(6) until i = N.
(8) The N predictions for the set U thus obtained are combined by majority voting to determine the final label of each sample in U.
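The final majority vote of step (8) reduces to taking the per-sample mode over the N rounds of predictions; a minimal sketch, where the list-of-lists layout of the predictions is an assumed representation:

```python
from collections import Counter

def majority_vote(rounds):
    """rounds[i][j] is round i's predicted label for sample j of U;
    return the most common label per sample across the N rounds."""
    per_sample = zip(*rounds)   # transpose: one tuple of N votes per sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]
```

Ties are broken by first occurrence in `Counter.most_common`, which is acceptable for a sketch; a production version might prefer the prediction from the latest round.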
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiments, or substitute them in a similar way, without departing from the spirit of the invention or exceeding the scope of the appended claims.

Claims (4)

1. A transfer-learning-based method for improving data quality in a power telecommunication network, characterized by the definitions: L = {X_L, Y_L} denotes the labeled samples in the target domain, where X_L = {x_1, ..., x_γ} and Y_L = {y_1, ..., y_γ}, containing γ samples; U = {X_U} denotes the unlabeled samples in the target domain, where X_U = {x_{γ+1}, ..., x_{γ+u}}, containing u samples; O = {X_O, Y_O} denotes the source-domain samples, containing o samples; the method specifically comprising:
Step 1: applying kernel discriminant analysis to the set L to find a suitable kernel mapping space, and mapping all samples in L, U and O into the kernel space, so that the marginal distribution of the source-domain samples in the kernel space is close to that of the target-domain samples;
Step 2: in the kernel space obtained in step 1, using bisecting k-means to select source-domain samples whose conditional probability distribution is similar to that of the target domain, and recording the selected samples in the original space as the sample set S;
Step 3: in the kernel space obtained in step 1, jointly training a model on the samples selected in step 2 and the labeled target-domain samples, and predicting labels for the unlabeled samples in the target domain;
Step 4: executing steps 1-3 N times, where in step 1 the kernel mapping space is found from the samples in L on the first iteration and from the samples in L ∪ S on subsequent iterations; finally obtaining N predictions for the set U and determining the final label of each sample in U by majority voting.
2. The transfer-learning-based method for improving power telecommunication network data quality according to claim 1, characterized in that step 1 specifically comprises:
Step 1.1: computing the matrix W, where W = (W_i)_{i=1,...,NC} is a block-diagonal matrix, W_i is an l_i × l_i matrix whose every element is 1/l_i, l_i is the number of samples of class i, and NC is the total number of classes, i.e.:
W = diag(W_1, ..., W_{NC}), with (W_i)_{jk} = 1/l_i for j, k = 1, ..., l_i;
Step 1.2: computing the kernel matrix K, where the kernel function κ(x_i, x_j) defines the dot product in the feature space F, i.e. κ(x_i, x_j) = φ(x_i)·φ(x_j), and each element of K is κ_ij = κ(x_i, x_j); the Gaussian kernel is selected as the kernel function, i.e.:
κ(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²)), where σ > 0 is the bandwidth of the Gaussian kernel;
Step 1.3: simplifying the objective function: the eigenvector decomposition of the kernel matrix K gives K = P Λ P^T, where Λ is the diagonal matrix formed by the nonzero eigenvalues and the columns of P are mutually orthogonal unit eigenvectors corresponding to the eigenvalues in Λ; the objective function then reduces to:
λβ = P^T W P β
where β = Λ P^T α; finding the β that maximizes λ, the corresponding α can be computed;
Step 1.4: mapping samples into the kernel space, the projection of a sample z onto v being:
v · φ(z) = Σ_{i=1}^{l} α_i κ(x_i, z).
3. The transfer-learning-based method for improving power telecommunication network data quality according to claim 1, characterized in that step 2 is based on the definitions:
Definition 2.1: given a cluster C and its two sub-clusters C_1 and C_2 with C_1 ∪ C_2 = C and C_1 ∩ C_2 = ∅, then:
Par(C, C_1, C_2) = [SSE(C) - SSE(C_1) - SSE(C_2)]
where SSE(C) is the sum of the distances from the non-centroid points in C to the centroid, and Par(C, C_1, C_2) indicates whether C can be decomposed into the two sub-clusters C_1 and C_2, taking value 1 (it can) or 0 (it cannot);
Definition 2.2: given a cluster C_i whose samples are all marked "+" or "-", the purity of C_i is:
Purity(C_i) = max(n_+, n_-) / (n_+ + n_-)
where Purity(C_i), the purity of C_i, is the largest proportion among the positive and negative samples;
step 2 specifically comprising:
Step 2.1: randomly selecting 2 samples from C_i as the initial mean vectors μ_1 and μ_2, serving as the centroids of the sub-clusters C_i1 and C_i2 respectively;
Step 2.2: computing the Euclidean distance from each sample in cluster C_i to μ_1 and μ_2; if the sample is closest to μ_1, assigning it to cluster C_i1, otherwise to cluster C_i2;
Step 2.3: computing the new mean vector μ'_1 = (1/|C_i1|) Σ_{x∈C_i1} x for cluster C_i1; if μ_1 ≠ μ'_1, updating μ_1 to μ'_1; doing the same for cluster C_i2;
Step 2.4: if the current mean vectors are no longer updated, cluster C_i is finally divided into the two sub-clusters C_i1 and C_i2; otherwise, repeating steps 2.2 to 2.3.
4. The transfer-learning-based method for improving power telecommunication network data quality according to claim 1, characterized in that step 4 specifically comprises:
Step 4.1: given L, U and O, setting i = 1 and the number of iterations to N;
Step 4.2: LearnKDA = L; if i > 1, LearnKDA = L ∪ S_{i-1};
Step 4.3: applying kernel discriminant analysis to the samples in the set LearnKDA to find the kernel mapping space;
Step 4.4: mapping the sets L, U and O into the kernel space as NL_i, NU_i and NO_i respectively;
Step 4.5: using bisecting k-means clustering, selecting samples from NO_i; the selected sample set is SO_i, and S_i denotes that sample set in the original space;
Step 4.6: training a model C_i on SO_i and NL_i, and predicting a label for each sample in NU_i;
Step 4.7: setting i = i + 1 and repeating steps 4.2 to 4.6 until i = N;
Step 4.8: finally obtaining N predictions for the set U and determining the final label of each sample in U by majority voting.
CN201810445948.6A 2018-05-11 2018-05-11 Transfer-learning-based method for improving data quality in a power telecommunication network Pending CN108664607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810445948.6A CN108664607A (en) 2018-05-11 2018-05-11 Transfer-learning-based method for improving data quality in a power telecommunication network


Publications (1)

Publication Number Publication Date
CN108664607A true CN108664607A (en) 2018-10-16

Family

ID=63779040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810445948.6A Pending CN108664607A (en) Transfer-learning-based method for improving data quality in a power telecommunication network

Country Status (1)

Country Link
CN (1) CN108664607A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210018A (en) * 2019-05-14 2019-09-06 北京百度网讯科技有限公司 Matching method and device for registration departments
CN110210018B (en) * 2019-05-14 2023-07-11 北京百度网讯科技有限公司 Matching method and device for registration departments
CN110490275A (en) * 2019-06-28 2019-11-22 北京理工大学 Driving behavior prediction method based on transfer learning
CN110490275B (en) * 2019-06-28 2020-07-07 北京理工大学 Driving behavior prediction method based on transfer learning
CN110766212A (en) * 2019-10-15 2020-02-07 哈尔滨工程大学 Ultra-short-term photovoltaic power prediction method for electric fields with missing historical data

Similar Documents

Publication Publication Date Title
CN113822494B (en) Risk prediction method, device, equipment and storage medium
US10599623B2 (en) Matching multidimensional projections of functional space
US9990380B2 (en) Proximity search and navigation for functional information systems
Chong et al. Simultaneous image classification and annotation
CN108875816A (en) Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
Mazzetto et al. Adversarial multi class learning under weak supervision with performance guarantees
CN116644755B (en) Multi-task learning-based few-sample named entity recognition method, device and medium
Athani et al. Student academic performance and social behavior predictor using data mining techniques
CN108664607A (en) A kind of power telecom network quality of data method for improving based on transfer learning
Li et al. Beyond confusion matrix: learning from multiple annotators with awareness of instance features
Sun et al. Hierarchical multilabel classification with optimal path prediction
Huang et al. Learning consistent region features for lifelong person re-identification
CN114093445B (en) Patient screening marking method based on partial multi-marking learning
CN117171413B (en) Data processing system and method for digital collection management
CN109857892A (en) Semi-supervised cross-module state Hash search method based on category transmitting
Fadhil Hybrid of K-means clustering and naive Bayes classifier for predicting performance of an employee
Shrivastava et al. Selection of efficient and accurate prediction algorithm for employing real time 5G data load prediction
Chefrour et al. A Novel Incremental Learning Algorithm Based on Incremental Vector Support Machina and Incremental Neural Network Learn++.
Zhou et al. MetaMove: On improving human mobility classification and prediction via metalearning
Wu et al. Multi-graph-view learning for complicated object classification
Lai et al. A new method for stock price prediction based on MRFs and SSVM
Li et al. CRNN: Integrating classification rules into neural network
US11875250B1 (en) Deep neural networks with semantically weighted loss functions
Han et al. BALQUE: Batch active learning by querying unstable examples with calibrated confidence
Rastogi et al. Unsupervised Classification of Mixed Data Type of Attributes Using Genetic Algorithm (Numeric, Categorical, Ordinal, Binary, Ratio-Scaled)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181016