CN104636449A - Distributed big data system risk identification method based on LSA-GCC


Info

Publication number: CN104636449A
Application number: CN201510038331.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: LSA, GCC, classification, model, risk
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Inventors: 林凡, 王备战, 吴鹏程, 夏侯建兵
Current and original assignee: Xiamen University (the listed assignees may be inaccurate)
Application filed by Xiamen University
Priority to CN201510038331.9A; published as CN104636449A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/069Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications


Abstract

The invention discloses a distributed big data system risk identification method based on LSA-GCC. In the first step, an LSA-GCC model is established: the data set is mapped into a semantic space and partitioned by a clustering algorithm; a prototype vector is extracted for each resulting class, each class is assigned a weight, and an initial prototype-vector model is built. In the second step, risk is identified in a feed-forward manner through an LSA-SAM security recognition model, which performs information-system risk evaluation on top of the LSA-GCC model: after the data to be assessed are mapped into the same semantic space, the similarity to each class prototype vector is computed, the similarities are accumulated with the weights of the corresponding classes, and the average yields the risk value of the data to be assessed, i.e. the risk value at the moment the data arrive.

Description

Distributed big data system risk identification method based on LSA-GCC
Technical field
The invention belongs to the field of cloud computing technology and relates to risk evaluation research for service-oriented cloud computing systems, specifically a risk identification method based on LSA-GCC (Latent Semantic Analysis - Generalized Cluster-based Classifier).
Background technology
In recent years cloud computing has developed rapidly and has become a focus of industry, academia and government alike. In essence, cloud computing is a dynamic combination of resources and service technologies: a large number of virtualized components form a resource pool over which computing tasks are distributed, and users obtain cloud services on demand. Cloud computing integrates parallel computing, utility computing, grid computing and virtualization. By service type it is usually divided into three layers, SaaS, PaaS and IaaS; the resource types and forms served by each layer differ, but all of them expose resources to users through Web services, which places high demands on the Web service layer of a cloud computing system.
With the wide adoption of cloud computing and the general shift of network computing resources toward the Web and toward services, the security of cloud computing services, and its importance, become ever more evident. Whether a system is safe and reliable depends chiefly on whether abnormal behaviour can be identified, whether service behaviour can be predicted, and whether behavioural outcomes can be assessed. These measurements directly reflect the risk level of a cloud computing system. Accurately assessing the risk of the service layer of a cloud computing system is therefore one of the key factors in judging whether the system is safe and reliable and in selecting services.
At present, the WebService security of cloud computing systems mostly relies on WSDL security policies. These are static Web security measures: they are not optimized for the virtualized, large-scale, distributed characteristics of cloud computing, and a considerable technology gap exists between the two. Under a cloud computing environment, WebService therefore faces entirely new security challenges. A WebService in a cloud environment changes dynamically, and the security measures it requires differ with the application background and the service. Even though existing WebService security techniques are mature and can solve part of the problem, their effect in a cloud computing environment remains unsatisfactory. The security architecture and security policies of cloud computing place higher demands on the design of the WebService service layer. Research on risk evaluation for service-oriented cloud computing systems is therefore very necessary.
Summary of the invention
Therefore, addressing the problem of analysing the security event logs of cloud computing virtual nodes, the present invention proposes a risk identification method based on LSA-GCC (Latent Semantic Analysis - Generalized Cluster-based Classifier). The method performs feed-forward risk identification on the logs collected from the operating system and the Web service processes, achieving fine-grained, bottom-layer risk identification for virtual nodes.
The risk identification indices are generally as follows:
1. Alert purity
This index measures the model's ability to detect anomalous events. An alert is the warning issued when the risk value computed by the model falls below a certain threshold.
(1) Detection precision (Detection Precision Rate)
The percentage of correctly detected alerts among all alerts:
DPR = #RA / #A × 100%
where #RA (Right Alerts) is the number of correctly detected alerts and #A is the total number of alerts.
(2) False alarm rate (False Alarm Rate)
The percentage of erroneous alerts relative to true alerts:
FAR = #EA / #TA × 100%
where #EA (Error Alerts) is the number of erroneous alerts and #TA (True Alerts) is the number of true alerts.
2. Model fitting accuracy
This index measures the model's decision ability over all security events. A judgement classifies a security event according to whether the risk value computed by the model falls below a certain threshold.
(1) Determine accuracy (DA)
The percentage of correct judgements among all judgements:
DA = #RD / #AD × 100%
where #RD (Right Determine) is the number of correct judgements and #AD (All Determine) is the total number of judgements.
(2) Determine recall (DR)
The percentage of correct judgements relative to true judgements:
DR = #RD / #TD × 100%
where #RD (Right Determine) is the number of correct judgements and #TD (True Determine) is the number of true judgements.
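The four indices above reduce to simple ratios of event counts; a minimal sketch in Python (function and argument names are ours, not from the patent):

```python
def detection_precision(right_alerts: int, all_alerts: int) -> float:
    """DPR = #RA / #A: fraction of raised alerts that were correct detections."""
    return right_alerts / all_alerts

def false_alarm_rate(error_alerts: int, true_alerts: int) -> float:
    """FAR = #EA / #TA: erroneous alerts relative to the true alerts."""
    return error_alerts / true_alerts

def determine_accuracy(right_determines: int, all_determines: int) -> float:
    """DA = #RD / #AD: fraction of all judgements that were correct."""
    return right_determines / all_determines

def determine_recall(right_determines: int, true_determines: int) -> float:
    """DR = #RD / #TD: correct judgements relative to the true judgements."""
    return right_determines / true_determines
```

Note that DPR and DA share a numerator-style definition but divide by different totals (all alerts vs. all judgements), which is why both are needed to characterize the model.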
A cloud computing system contains a large number of detection devices, network components and virtual host nodes. The detection devices monitor host and network operation from different perspectives, and strong correlations exist among the large numbers of alerts and logs they produce. Traditional security situation assessment analyses the log of a single detection device; the single data source and the uncertainty of the device make the analysis result inaccurate. Traditional methods also ignore the synonymy, polysemy and word-association relations that pervade natural language. The present invention therefore proposes LSA-SAM (LSA-based Security Assessment Model), a network security situation assessment model based on latent semantics, which takes the logs of multiple associated detection devices as its data source, applies the LSA-GCC method, and analyses and forecasts the security situation trend. Within the CSOMA framework, the LSA-GCC method performs deep log analysis on virtual machines through a filtering mechanism and identifies risk in a feed-forward manner through an alerting system. Latent semantic analysis extracts security event information from virtual host logs; the data set is mapped into a semantic space by singular value decomposition (SVD), and an improved Rocchio algorithm combined with a clustering algorithm builds the LSA-GCC model: the prototype vector extracted for each specific class is assigned a weight, yielding an initial prototype-vector model. Statistically analysing a large body of text in this way uncovers the latent semantic structure in the text. The improved LSA-SAM model substantially raises the accuracy of identifying anomalous events, and both LSA-GCC and LSA-SAM are parallelized and accelerated with the MapReduce framework.
Specifically, the distributed big data system risk identification method based on LSA-GCC of the present invention performs, within the CSOMA framework, deep log analysis on virtual machines through a filtering mechanism using the LSA-GCC model, and feed-forward risk identification through the LSA-SAM security recognition model. It comprises the following steps:
Step 1: establish the LSA-GCC model. The model, an improved Rocchio model, is built by combining the Rocchio algorithm with a clustering algorithm: the data set is mapped into a semantic space and partitioned by the clustering algorithm; the prototype vector of each specific class is extracted from the clustering result and each class is assigned a weight, producing the initial prototype-vector model. This comprises:
Step 11: cluster the large-scale text with a clustering algorithm; the algorithm used should preferably be scalable and broadly applicable;
Step 12: after clustering, obtain a class set and build the LSA-GCC model from it;
Step 2: identify risk in a feed-forward manner through the LSA-SAM security recognition model, which performs information-system risk evaluation on top of the LSA-GCC model. After the data to be assessed are mapped into the same semantic space, the similarity to each class prototype vector is computed, the similarities are accumulated with the weights of the corresponding classes, and the average yields the risk value of the data to be assessed, i.e. the risk value at the moment the data arrive:
Step 21: using the LSA-GCC model built in step 1, compute the similarity sim(p, c_i^0) with a given security event;
Step 22: compute the security event risk degree S_i;
Step 23: compute the risk value S_risk = sim(p, c_i^0) × S_i;
Step 24: match the computed risk value against the preset security grades to obtain its security grade.
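Steps 21-24 amount to one multiplication followed by a table lookup; a minimal sketch (the grade boundaries below are illustrative assumptions, not values from the patent):

```python
def risk_value(similarity: float, event_risk_degree: float) -> float:
    """Step 23: S_risk = sim(p, c_i^0) * S_i."""
    return similarity * event_risk_degree

def risk_grade(s_risk: float, grade_thresholds) -> str:
    """Step 24: match the risk value against preset grades.
    grade_thresholds: (lower_bound, grade) pairs sorted descending."""
    for lower, grade in grade_thresholds:
        if s_risk >= lower:
            return grade
    return "safe"

# Illustrative grade table (assumed; the patent does not specify boundaries):
GRADES = [(0.8, "critical"), (0.5, "high"), (0.2, "medium")]
```

For example, an event with similarity 0.9 to a class whose risk degree is 0.7 yields S_risk = 0.63, which the table above maps to "high".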
Further, as the scalable and applicable clustering algorithm of step 11, the present invention adopts CSPC, a single-pass algorithm from the family of incremental clustering algorithms, whose time complexity is close to linear: it scans the text once and merges each text into the class most similar to it (corresponding to step (6) below). Concretely, CSPC takes as input the training set WS-DREAM and a cluster threshold r, and outputs the prototype-vector set GCCs (one GCC per class). The algorithm is as follows:
Step (1): let m_c be the class set and m_gcc the set of prototype vectors GCCs;
Step (2): initialize m_c and m_gcc to empty;
Step (3): input a new text p;
Step (4): compute the similarity between text p and every class in the class set m_c:
sim(D_q, D_d) = Σ_{i=1}^{k} (D_{d,i} · D_{q,i}) / ( sqrt(Σ_{i=1}^{k} D_{d,i}^2) · sqrt(Σ_{i=1}^{k} D_{q,i}^2) )
Step (5): find the class c_i^0 with the highest similarity to text p;
Step (6): if sim(p, c_i^0) ≥ r, merge p into class c_i^0;
Step (7): otherwise, create a new class c_i^0 and add it to the class set m_c;
Step (8): repeat the above steps until the training set WS-DREAM is exhausted;
Step (9): if |c_i^0| = 1, exclude class c_i^0 from the class set m_c;
Step (10): apply the Rocchio formula to every class in m_c to obtain the prototype vector GCC of each class, yielding the prototype-vector set m_gcc;
Step (11): return m_gcc.
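Steps (1)-(9) can be sketched in pure Python; this is our reading of the single-pass loop (cosine similarity against class centroids, singletons dropped at the end), not the patent's exact implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors, as in step (4)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(cluster):
    """Mean vector of a cluster, used to stand in for the class."""
    k = len(cluster[0])
    return [sum(v[i] for v in cluster) / len(cluster) for i in range(k)]

def single_pass_cluster(vectors, r):
    """One-pass CSPC-style clustering: each vector joins the most
    similar existing class if the similarity reaches r, otherwise it
    starts a new class; singleton classes are dropped (step (9))."""
    clusters = []
    for p in vectors:
        best_i, best_sim = -1, -1.0
        for i, c in enumerate(clusters):
            s = cosine(p, centroid(c))
            if s > best_sim:
                best_i, best_sim = i, s
        if best_i >= 0 and best_sim >= r:
            clusters[best_i].append(p)   # step (6)
        else:
            clusters.append([p])         # step (7)
    return [c for c in clusters if len(c) > 1]
```

Because each text is compared only against existing classes (not against all other texts), the running time stays close to linear when the number of classes is small relative to the corpus.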
Prototype vectors improve the robustness of the classification model and extract, to a certain extent, the relevant properties of a class. Further, the present invention uses the Rocchio algorithm to build the prototype-vector set GCCs of the classes (written [gcc_1, gcc_2, ..., gcc_i, ...]). In this step, each class builds its prototype vector by the following formula:
gcc_i = α · (1/|C_i^0|) · Σ_{d_m ∈ C_i^0} d_m - β · (1/|D - C_i^0|) · Σ_{d_n ∈ D - C_i^0} d_n
In the formula, gcc_i denotes the prototype vector of the i-th class C_i^0 in LSA-GCC, and |C_i^0| and |D - C_i^0| denote the numbers of documents inside and outside the class. The meaning is clear: the documents of the class are treated as positive samples and the rest as negative samples; the documents are accumulated and normalized, and the negative-sample vector is subtracted from the positive-sample vector so as to extract features with discriminative power, constructing the class prototype vector gcc_i. The parameters α and β weigh the positive and negative samples to obtain the best gcc_i (in the present invention α = 5β).
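Under the formula above, computing gcc_i for one class is a weighted difference of two mean vectors; a sketch with NumPy (the toy vectors in the usage example are ours):

```python
import numpy as np

def rocchio_prototype(positive, negative, alpha=5.0, beta=1.0):
    """gcc_i = alpha * mean(docs in C_i^0) - beta * mean(docs in D - C_i^0).
    Defaults follow the patent's alpha = 5 * beta weighting of positives."""
    pos_mean = np.asarray(positive, dtype=float).mean(axis=0)
    neg_mean = np.asarray(negative, dtype=float).mean(axis=0)
    return alpha * pos_mean - beta * neg_mean
```

For instance, positive documents [1, 0] and [3, 0] against the negative document [0, 2] give the prototype [10, -2]: the subtraction pushes the prototype away from features that characterize documents outside the class.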
In addition, the clustering process produces some small classes, i.e. classes that contain only a few texts. These texts may be outliers, may carry information important for classification, or may simply be noise. To speed up the classification model, the present invention therefore sets a threshold and merges or filters these small classes.
The value of the cluster threshold r in step (6) affects the quality and efficiency of the whole clustering process: as r increases, both the number of subclasses and the time cost increase. To obtain a stable threshold r, the present invention determines it by a sampling technique:
Step 61: randomly select N_0 text pairs from the text set;
Step 62: compute the similarity sim of each text pair;
Step 63: average all similarities obtained in step 62 to get the mean similarity avgsim;
Step 64: set the threshold r = ε · avgsim, where ε ≥ 1.
In the above steps, N_0 is the number of selected text pairs, avgsim is the mean similarity of the N_0 pairs, and ε is a parameter for adjusting the threshold r to different application cases. When N_0 takes a large enough value, avgsim remains stable. Experiments show that clustering results of better quality are obtained when ε lies in the range 5-13.
In addition, the volume of log file data is enormous, usually a staggering quantity. With the rapid development of computer and information technology, enterprises deploy ever more hosts, servers, firewalls, switches, PAA and wireless routing devices, security equipment and application systems; the massive log information these devices produce has become an important part of the rapidly growing data of the big data era, and the resulting log management and security auditing work keeps growing more complex. Faced with such mass data, purely manual management has become an almost impossible task. The need to collect, process and analyse mass data effectively continually drives the study of new technologies to satisfy the pursuit of computing power. Distributed computing platforms keep evolving under this pressure: high-performance computing, grid computing, pervasive computing and cloud computing have appeared in succession, among which MapReduce, the parallel cloud computing method proposed by Google, has become an emerging research hotspot in recent years. It can parallelize large and complex problems and realize distributed computation quickly, and is especially suited to data mining and machine learning applications. The present invention verifies MapReduce as the parallel acceleration means for LSA-GCC and LSA-SAM.
The MapReduce-based parallel acceleration of LSA-GCC comprises the following:
In the LSA-GCC modelling process, the computational complexity and time cost of the model grow with the training set D. The MapReduce framework can greatly shorten the computation and is well suited to large data processing problems with high time cost. The present invention therefore implements the LSA-GCC algorithm on the MapReduce framework as MR-LSA-GCC, with the following steps:
Step 1: divide the LSA-processed training set D into m parts and assign each part D_i to a MAP task;
Step 2: build the MAP function: read the assigned training set D_i, cluster it with the CSPC algorithm to obtain the class set m_c contained in D_i, then build the prototype vector of each class in m_c with the Rocchio algorithm, obtaining the prototype-vector set m_gcc of all classes in D_i;
Step 3: build the REDUCE function: collect the prototype-vector sets m_gcc produced by the MAP tasks and merge their prototype vectors with the CSPC algorithm to obtain the final prototype-vector set m_final-gcc.
The MapReduce framework then calls the MAP function of step 2 and the REDUCE function of step 3 to realize the LSA-GCC algorithm.
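The map/reduce split of steps 1-3 can be mimicked in plain Python; the sketch below is ours and only imitates the data flow (a real deployment would run each MAP call on a separate MapReduce worker, and the reduce step would re-cluster the collected prototypes):

```python
def mr_lsa_gcc(training_set, r, n_splits, cluster_fn, prototype_fn):
    """MR-LSA-GCC data flow: split D, cluster each split and build
    its prototypes in MAP, then merge all prototypes in REDUCE."""
    splits = [training_set[i::n_splits] for i in range(n_splits)]

    def map_fn(d_i):
        # step 2: CSPC clustering + Rocchio prototypes on one split
        return [prototype_fn(c) for c in cluster_fn(d_i, r)]

    mapped = [map_fn(d_i) for d_i in splits]  # each call is an independent MAP task

    def reduce_fn(per_split_gccs):
        # step 3: collect prototypes; a second CSPC pass would merge
        # near-duplicate prototypes into m_final-gcc here
        return [g for gccs in per_split_gccs for g in gccs]

    return reduce_fn(mapped)
```

The splits are independent, which is exactly what lets MapReduce scale this stage linearly with the number of workers.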
The MapReduce-based parallel acceleration of the LSA-SAM security recognition model comprises the following:
The LSA-SAM security recognition model computes the similarity sim(p, c_i^0) between a test sample T and the established LSA-GCC model, multiplies it by the security event risk S_i to obtain the sample's risk value, and matches that value against the security grades to obtain the sample's grade. Because the test samples are mutually independent, and given the advantages of the MapReduce framework, the process can be accelerated with MapReduce when the data set is large. The present invention therefore proposes the MR-LSA-SAM algorithm, whose steps are as follows:
Step 1: map the test samples T into the latent semantic space built from the training set D, obtaining the vector representation set T' with semantic structure; divide T' into m parts and assign each test subset T'_i to a MAP task;
Step 2 (MAP): read the assigned test subset T'_i and obtain the security grade of each test sample with the LSA-SAM security recognition model.
With the MapReduce framework the user only needs to implement the Map and Reduce functions; everything else is completed automatically by the framework, which greatly simplifies the implementation of the algorithm. Because MapReduce is inherently designed for large-scale computation, the algorithm itself scales easily to many computing nodes: an algorithm implemented on MapReduce can be extended to a large cluster without modifying any code, which is favourable for scalability, particularly when large-scale data must be processed. The log record analysis studied here is exactly such large-scale data, so the advantages of the proposed method become even clearer once it is reinforced with MapReduce.
For the massive logs produced by a cloud computing system, the present invention uses the generalized cluster-based classifier based on latent semantic analysis (LSA-GCC) for data mining and achieves semantic-level event-log risk identification. Unlike the event-log types defined by the system itself, this method can find latent risks in a large number of generic logs and identify their risk grades. The invention then compares the recognition effect of LSA-GCC improved with machine learning methods, and accelerates the identification process with MapReduce; experiments verify that the combination improves both recognition accuracy and speed, and that, compared with ordinary event-log statistics, it also judges the single-point risk of cloud computing virtual machines more accurately.
The distributed big data system risk identification method based on LSA-GCC of the present invention performs feed-forward risk identification on the logs collected from the operating system and the Web service processes, achieving fine-grained, bottom-layer risk identification for virtual nodes. On this basis, an improved LSA-SAM (Latent Semantic Analysis - Security Assessment Model) further raises the accuracy of detecting anomalous events. Finally, LSA-GCC and LSA-SAM are each accelerated with MapReduce, achieving faster semantic-level risk identification.
Description of the drawings
Fig. 1 is a three-dimensional latent semantic space diagram;
Fig. 2 is the LSA solution flow diagram;
Fig. 3 is an SVD matrix decomposition diagram;
Fig. 4 is an example of prototype vectors based on subclasses;
Fig. 5 compares the recognition accuracy of the LSA-GCC and Rocchio methods;
Fig. 6 is the security recognition model based on latent semantics (LSA-SAM);
Fig. 7 is the simulated risk situation;
Fig. 8 is the risk situation after improvement;
Fig. 9 contrasts the LSA-SAM model effect before and after improvement;
Fig. 10 is the process model of MR-LSA-GCC;
Fig. 11 is the process model of MR-LSA-SAM;
Fig. 12 shows the serial and parallel algorithm experimental results;
Fig. 13 shows the MR-LSA-GCC experimental results;
Fig. 14 shows the MR-LSA-SAM and LSA-SAM experimental results.
Embodiment
The present invention is now further described with embodiments and with reference to the accompanying drawings.
Cloud computing is the latest product of the development of PC clusters, parallel computing and grid computing, and merges many concepts and technologies of distributed computing. A cloud computing environment typically exhibits large scale, distribution, complex structure, diversified frameworks, dynamic computation and virtualized services, and virtualization is one of the key technologies of cloud computing. Virtualization confronts the main cloud computing providers with higher safety and quality requirements for service orientation (WebService), while traditional research mostly focuses on fields such as information-system risk assessment and network intrusion detection and lacks deep study of cloud computing; research on risk evaluation for service-oriented cloud computing systems is therefore very necessary.
The invention provides a distributed big data system risk identification method based on LSA-GCC that performs semantic analysis and deep recognition of risk through the SOCTPA early-warning system. Latent semantic analysis obtains security risk information from virtual host logs and builds the initial prototype-vector model. Statistically analysing a large body of text uncovers the latent semantic structure in the text. Experiments prove that this risk identification method can extract security events from massive logs and obtain a security risk measure.
The present invention applies data mining methods to cloud computing risk assessment and proposes a generalized cluster risk evaluation model in a latent semantic space. The model maps the data set into a semantic space by singular value decomposition (SVD), partitions it with a clustering algorithm, extracts the prototype vector of each specific class from the clustering result, assigns each class a weight, and builds the initial prototype-vector model. Information-system risk evaluation is then based on this model: after the data to be assessed are mapped into the same semantic space, the similarity to each class prototype vector is computed, the similarities are accumulated with the weights of the corresponding classes, and the average yields the risk value of the data to be assessed, i.e. the risk value at the moment the data arrive. Unless specified otherwise, the present invention obtains security risk information from the operating-system logs and the Web application-server logs of the virtual hosts.
The invention describes latent semantic analysis (LSA), including its principle, solution flow and applications; the generalized cluster-based classifier based on latent semantic analysis (LSA-GCC); and the parallel acceleration experiments of LSA-GCC and LSA-SAM carried out on the MapReduce framework.
1. Latent semantic analysis (LSA)
1.1 Overview of latent semantic analysis
The traditional vector space model (VSM) based on text keywords represents unstructured text in the form of vectors, so that various mathematical models can be applied. Its advantage is fast, simple processing logic. But the words appearing in a text are often correlated, whereas the vector space model treats the text and its feature words as mutually independent, which affects the computed result. Natural language contains a great many polysemous words and synonyms, and semantic expression also depends on context; the content of a text cannot be represented by isolated keywords alone. Since the text similarity of the vector space model depends only on word-frequency statistics and ignores these features of natural language, the accuracy and completeness of the result inevitably suffer.
The essential difference between latent semantic analysis (LSA) and the vector space model (VSM) is that LSA assumes certain association relations between feature words and texts. LSA is a new theory of knowledge induction and representation, a method for automatically realizing knowledge representation and extraction: it statistically analyses a large body of text, finds the association relations between feature words, and extracts the contextual meaning among them. Like the VSM, LSA represents text with space vectors, but through singular value decomposition (SVD) it maps the original text vector space into a low-dimensional latent semantic space, which to a certain extent eliminates the influence of synonyms and polysemous words and improves the precision of subsequent processing.
1.2 The latent semantic analysis method
The starting point of LSA is that association relations exist between feature words and texts; by statistically analysing a large body of text, it finds the latent semantic structure in the text. LSA regards each word as a point in a space whose coordinates are documents, and each document as a point in a space whose coordinates are words: a feature word is understood within its documents, and a document is composed of feature words, embodying a "document - feature word" semantic space relation.
Unlike the high-dimensional representation of documents in the vector space model (VSM), the key idea of latent semantic analysis (LSA) is to map the original text vector space, by singular value decomposition (SVD), into a low-dimensional vector space with semantic structure, the latent semantic space, as shown in Fig. 1.
1.3 Implementation of latent semantic analysis
Implementing LSA requires building a mathematical model of the latent semantic space, and this choice directly affects LSA's performance. Since the LSA approach was proposed, it has been continually tried and refined; when selecting a model, one must weigh, against the characteristics of the data to be processed and the specific usage requirements, factors such as memory consumption during computation, storage cost, the expressive power of the data representation, computational complexity, the optimization criterion of the semantic model, the model's generalization ability, and the complexity of its update algorithm, and choose the model best suited to the problem at hand. The LSA solution flow is shown in Figure 2.
(1) Matrix construction
Given a document set (where m is the number of documents) and a feature word set (where n is the number of feature words), construct the n × m "feature word - document" matrix A:

A = (a_ij)_{n×m}   (4-1)

The feature words of the documents form the rows of A and the documents form the columns; n is the number of feature words and m the number of documents. The initial value of a_ij is the number of times the i-th feature word occurs in the j-th document; the weight of the element is then obtained by further computation on this value, with the weighting method introduced in a later subsection. Two documents belonging to the same class are semantically related, while documents of different classes are semantically unrelated; in general, A is a high-order sparse matrix.
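As an illustration of step (1), a minimal sketch of building the raw count matrix A of (4-1); the toy documents, vocabulary, and use of plain counts instead of a weighting scheme are assumptions for the example (a real matrix of this kind would be stored sparsely):

```python
from collections import Counter

def term_document_matrix(docs, vocab):
    """Build the n x m "feature word - document" matrix A of (4-1).

    Rows are the n feature words, columns are the m documents; entry
    a_ij is the raw count of word i in document j (tf-idf or another
    weighting could be applied to these values afterwards).
    """
    index = {w: i for i, w in enumerate(vocab)}
    A = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for w, c in Counter(doc.split()).items():
            if w in index:
                A[index[w]][j] = c
    return A

docs = ["risk alert risk", "alert log", "log log risk"]
vocab = ["risk", "alert", "log"]
A = term_document_matrix(docs, vocab)
# A[0] is the row for "risk": counts 2, 0, 1 across the three documents
```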
(2) Singular value decomposition (SVD)
The SVD of A is defined as:

A = UΣV^T   (4-2)

where U is an n × n orthogonal matrix whose columns are the orthonormal eigenvectors of AA^T, called the left singular vectors of A; V is an m × m orthogonal matrix whose columns are the orthonormal eigenvectors of A^T A, called the right singular vectors of A; U^T U = V^T V = I. Σ = diag(l_1, l_2, l_3, …, l_r), where l_1, l_2, l_3, …, l_r are the singular values of A and l_1 ≥ l_2 ≥ l_3 ≥ … ≥ l_r.
Because there are large numbers of texts and feature words, the "feature word - document" matrix is inevitably huge, so in actual computation the decomposition result must be further simplified.
Dimensionality reduction of the "feature word - document" matrix means dropping from high dimension to low dimension. Take a positive integer k with k ≤ r and k << min(m, n); then A_k is the rank-k approximation of A:

A_k = U_k Σ_k V_k^T   (4-3)

Σ_k is the k × k diagonal matrix whose diagonal elements are the k largest singular values of A; U_k is the n × k matrix formed by the first k columns of U; V_k is the m × k matrix formed by the first k columns of V. From the SVD of A, every document and every feature word corresponds to a fixed point in one geometric space. For a given k, the approximate matrix A_k found via SVD preserves the intrinsic structure of the associations between feature words and documents reflected in A (word-word, word-text, text-text), i.e. the latent semantics, while eliminating the "noise" produced by the use of synonyms and polysemous words. In a sense, SVD mines the principal semantic information of the text set and attenuates the influence of noise.
Figure 3 is a schematic diagram of the SVD matrix decomposition, which helps in understanding the latent semantic structure. Any matrix can be decomposed by SVD into the corresponding matrices U, Σ, and V; simplifying these three matrices with a suitable k yields the decomposition of matrix A that is the best rank-k approximation of the original matrix in the 2-norm sense. The rows of the singular vector matrices can be regarded as the representations of documents and feature words in the k-dimensional space, i.e. their spatial coordinates. Feature words and documents thus belong to the same space — the latent semantic structure space — which makes it easy to obtain the similarity relations between them.
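The rank-k approximation of (4-3) can be sketched with an off-the-shelf SVD routine; the tiny matrix below is illustrative only, not part of the patent's data:

```python
import numpy as np

def rank_k_approximation(A, k):
    """Compute A_k = U_k Sigma_k V_k^T of (4-3) via SVD.

    U_k holds the first k left singular vectors, V_k^T the first k
    right singular vectors, and Sigma_k the k largest singular values
    of A; A_k is the best rank-k approximation of A in the 2-norm.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0]])
A1 = rank_k_approximation(A, 1)   # best rank-1 approximation of A
A3 = rank_k_approximation(A, 3)   # k = r reproduces A exactly
```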
The process by which SVD decomposition generates the latent semantic space is essentially a dimensionality reduction, and its key is choosing a suitable k. If k is too large, the new semantic space is too close to the original vector space and the computation cost grows, defeating the purpose of LSA; if k is too small, the ability to discriminate feature words or documents is insufficient — the newly generated semantic space carries too little feature information and the experimental results suffer. To balance accuracy and efficiency in a specific experiment, k can, following the theory of factor analysis, also be adjusted manually: keep the k largest principal factors and, given a threshold, require k to satisfy the following contribution-rate inequality:
Σ_{i=1}^{k} a_i / Σ_{i=1}^{r} a_i ≥ θ,  r = min(n, m)   (4-4)

where θ, some value between 0.4 and 1, is the threshold on the proportion of original information retained. The contribution-rate inequality reflects the degree of similarity between the original space and the k-dimensional subspace. In experiments, the value of k generally lies between 300 and 1000.
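Under the assumption that the a_i in (4-4) are the singular values, the smallest k meeting the contribution-rate threshold θ can be found as follows (the values are illustrative):

```python
import numpy as np

def choose_k(singular_values, theta=0.8):
    """Smallest k satisfying inequality (4-4):
    sum_{i<=k} a_i / sum_{i<=r} a_i >= theta,
    taking the a_i to be the singular values in descending order."""
    s = np.asarray(singular_values, dtype=float)
    ratios = np.cumsum(s) / s.sum()          # contribution rate per k
    return int(np.searchsorted(ratios, theta) + 1)

k = choose_k([10.0, 5.0, 3.0, 1.0, 1.0], theta=0.75)
# cumulative contribution rates are 0.5, 0.75, 0.9, 0.95, 1.0 -> k = 2
```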
1.4 A generalized cluster-based classifier built on latent semantics
Text data often grows at an exponential rate, and how to organize and manage it efficiently has become an important and urgent problem. Text classification — the process of assigning texts to predefined categories — is an important means of organizing and managing large volumes of text data. It is a major application of data mining; much research has been done in this area and many classification methods have been proposed. Common classification algorithms include decision trees, Rocchio, naive Bayes, neural networks, support vector machines, Floquet model expansion, KNN, genetic algorithms, maximum entropy, and generalized instance sets.
With the rapid development of computer and information technology, the volume of data produced every day is enormous. Processing and analyzing such data requires weighing factors such as time cost, computation cost, and the semantic structure of natural language. We therefore propose a generalized cluster-based classifier built on latent semantic analysis (LSA-GCC, LSA-based Generalized Cluster based Classifier), which makes full use of the speed and efficiency of LSA and of the Rocchio algorithm.
This classifier first maps the training set into a low-dimensional semantic space; it then uses a constrained single-pass clustering algorithm (CSPC, constrained single pass clustering algorithm) together with the Rocchio algorithm to build prototype-vector models of the subclasses under every category of the training set; finally, it computes the similarity between a sample to be classified and each category in the model, and assigns the sample to the category with the highest similarity. Because the CSPC clustering algorithm covers the latent subclasses of each category, the model constructed by LSA-GCC is better than the Rocchio model; at the same time, the LSA-GCC model is built in the semantic space, meeting the requirements of text description.
The basic idea of the Rocchio algorithm is to build one prototype vector for each category of the training text set, using the formula:

C_j = α · (1/|D_j|) Σ_{d_m ∈ D_j} d_m − β · (1/|D − D_j|) Σ_{d_n ∈ D − D_j} d_n   (4-5)

where C_j is the prototype vector built for category C_j, D is the whole training text set, D_j and |D_j| are the document set and document count of category C_j, and α and β weigh the importance of the positive and negative sample sets. Producing the prototype vectors can be regarded as a learning process, and the Rocchio model is composed of these prototype vectors. Given a text T, the similarity between T and each prototype vector is computed, and T is assigned to the category with the highest similarity.
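Formula (4-5) is a weighted difference of the positive-class mean and the negative-class mean; a minimal sketch (α = β = 1 and the toy vectors are assumptions for the example):

```python
import numpy as np

def rocchio_prototype(pos, neg, alpha=1.0, beta=1.0):
    """Prototype vector of formula (4-5):
    C_j = alpha * mean(docs in D_j) - beta * mean(docs in D - D_j)."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    return alpha * pos.mean(axis=0) - beta * neg.mean(axis=0)

pos = [[1.0, 0.0], [1.0, 1.0]]      # documents belonging to class C_j
neg = [[0.0, 1.0], [0.0, 1.0]]      # documents outside C_j
c_j = rocchio_prototype(pos, neg)   # -> [1.0, -0.5]
```

A text is then assigned to whichever class prototype it is most similar to (e.g. by cosine similarity).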
1.5 The LSA-GCC algorithm
When the training text set is large, text classification faces a problem: high computation cost, which often makes it unsuitable for applications with real-time requirements. An effective remedy for this defect is to build a generalized cluster-based classification model, which can replace the original training sample set when classifying texts. Moreover, because this model is a summary of the original training sample set (it can be called the "centroids" of the original training samples), its classification performance can actually be better.
Rocchio is an effective linear classifier that builds a prototype-vector model of the training sample set through efficient clustering; the vector model it builds is therefore one effective way to improve computational performance. But the Rocchio algorithm has two obvious defects: first, it assumes the data space consists of linearly separable hyperplane regions, while the distribution of much real-world data is nonlinear; second, it builds only one prototype vector per category. Clustering is an unsupervised machine-learning method, and mining the latent fine-grained relations within the training samples by clustering is feasible. We therefore propose a constrained single-pass clustering method, CSPC, which trains on the sample set to obtain a fine-grained category set.
The prototype vectors built from these category sets are called the generalized prototype vector set (GCCs); replacing the original training sample set with this set yields the generalized cluster-based classification model. As shown in Figure 4, suppose the triangles represent one large category containing three subclasses and the squares represent another large category containing four subclasses. Under the Rocchio algorithm, the triangle and square categories are each represented by a single generalized prototype vector (i.e. only two "centroids"). Under the ideal LSA-GCC model, the CSPC algorithm produces three subclasses for the triangle category and four for the square category; the Rocchio algorithm then produces a generalized prototype vector for each subclass of the two categories (i.e. seven "centroids" in all). Finally, these generalized prototype vectors, rather than the whole training set, are used for further computation.
Different categories are represented by different feature words, and the prototype vectors built by LSA-GCC are distributed nonlinearly at different positions in the data space. LSA-GCC therefore compensates well for the defects of the Rocchio algorithm and strengthens the semantic expressiveness of the constructed model. Because LSA-GCC is derived from prototype vectors, and prototype vectors are the "centroids" of the training sample set, the model is insensitive to individual training samples. Furthermore, by reducing the number of training texts involved in computation, the prototype-vector-based LSA-GCC greatly accelerates the later classification decision process. The proposed method thus achieves good results in terms of both efficiency and effectiveness.
2 LSA-GCC modeling
We combine the Rocchio algorithm with a clustering algorithm to build the LSA-GCC model (an improved Rocchio model) in preparation for later classification. To guarantee the scalability and applicability of the proposed method, a clustering algorithm that itself has scalability and applicability must be used to cluster large-scale text.
Large-scale text clustering is a high-dimensional data clustering problem, and high dimensionality defeats most traditional clustering algorithms. Various clustering algorithms have been proposed to handle large-scale, high-dimensional data, such as subspace clustering and co-clustering. Incremental clustering algorithms consume little time, are non-iterative, and scan the text only once; given these characteristics, they can solve this problem.
Single-pass clustering is an incremental clustering algorithm with near-linear time complexity. The present invention therefore adopts a constrained single-pass variant, CSPC (the whole clustering process is described in steps (1)-(12) of Algorithm 1): the algorithm scans the texts once and merges each text into the category most similar to it (steps (7)-(8) of Algorithm 1).
Algorithm 1 (Procedure GCC(D)): pseudocode for building the prototype vector set:
Input: training set WS-DREAM, cluster threshold r;
Output: prototype vector set GCCs (one GCC per category);
Procedure GCC(D)
(1) let m_c be the category set and m_gcc the set of prototype vectors GCCs;
(2) initialize m_c and m_gcc to empty;
(3) repeat;
(4) read in a new text p;
(5) compute the similarity between text p and every category in m_c:
sim(D_q, D_di) = Σ_{i=1..k} (D_dij · D_qi) / ( sqrt(Σ_{i=1..k} (D_dij)²) · sqrt(Σ_{i=1..k} (D_qi)²) );
(6) find the category c_i0 with the highest similarity to text p;
(7) if sim(p, c_i0) ≥ r,
(8) merge p into category c_i0;
(9) otherwise,
(10) create a new category c_i0,
(11) and add the new category c_i0 to the category set m_c;
(12) until the training set is empty;
(13) if |c_i0| = 1,
(14) exclude category c_i0 from the category set m_c;
(15) apply the Rocchio formula to every category in m_c to compute the prototype vector GCC of the corresponding category, obtaining the prototype vector set m_gcc;
(16) return m_gcc.
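Steps (1)-(12) of Algorithm 1 can be sketched as follows; representing each category by its mean vector is an assumption made here, since the algorithm leaves the category representation open, and the toy vectors are illustrative:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, as in step (5) of Algorithm 1."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cspc(docs, r):
    """Constrained single-pass clustering, steps (1)-(12) of Algorithm 1.

    Each text is scanned once: it joins the most similar existing
    category when the similarity reaches the cluster threshold r,
    otherwise it seeds a new category. Singleton filtering and the
    Rocchio prototypes (steps (13)-(15)) would follow afterwards.
    """
    categories = []                              # m_c
    for p in (np.asarray(d, dtype=float) for d in docs):
        if categories:
            sims = [cosine(p, np.mean(c, axis=0)) for c in categories]
            i0 = int(np.argmax(sims))            # step (6)
            if sims[i0] >= r:                    # step (7)
                categories[i0].append(p)         # step (8)
                continue
        categories.append([p])                   # steps (10)-(11)
    return categories

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
cats = cspc(docs, r=0.8)
# two categories: one holding the first two texts, one holding the third
```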
The CSPC clustering yields a category set, from which the LSA-GCC model is then built.
Prototype vectors improve the robustness of the classification model and, to a certain extent, extract correlated characteristics. The present invention uses the Rocchio algorithm to build the prototype vectors GCCs of the categories (step (15) of Algorithm 1). In this step, the prototype vector of each category is built by the following formula:
gcc_i = α · (1/|C_i0|) Σ_{d_m ∈ C_i0} d_m − β · (1/|D − C_i0|) Σ_{d_n ∈ D − C_i0} d_n   (4-6)

where gcc_i is the prototype vector of the i-th category C_i0 in LSA-GCC, and |C_i0| and |D − C_i0| are the numbers of documents inside and outside the category. The meaning of the formula is clear: the documents in the category are treated as positive samples and the rest as negative samples; the documents are accumulated and normalized, and the negative-sample vector is subtracted from the positive-sample vector to extract the features with discriminative power, constructing the category's prototype vector model gcc_i. The parameters α and β weigh the positive and negative samples to obtain the best gcc_i; in the present invention α = 5β.
The clustering process produces some small categories, i.e. categories containing only a few texts. These texts may be outliers, may contain information important for classification, or may simply be "noise". Therefore, to speed up the classification model, the present invention sets a threshold to integrate or filter out these small categories (steps (13) and (14) of Algorithm 1).
In Algorithm 1, the value of the cluster threshold r in step (7) affects the quality and efficiency of the whole clustering process: as r increases, both the number of subclasses and the time cost increase. To obtain a stable threshold r, the present invention adopts a sampling technique to determine it, with the following concrete steps:
Step 1: randomly select N_0 text pairs from the text set.
Step 2: compute the similarity sim of each text pair.
Step 3: average the similarities obtained in step 2 to get the mean similarity value avgsim.
Step 4: set the threshold r = ε · avgsim, where ε ≥ 1.
Here N_0 is the number of selected text pairs, avgsim is the mean similarity of the N_0 text pairs, and ε is a parameter for adjusting the threshold r to different application situations. When N_0 takes a large enough value, avgsim remains stable. In this study N_0 = 8000, and the results show that the experiment achieves higher-quality clustering when ε ranges over 5-13.
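The four sampling steps can be sketched as below; the document vectors, N_0, and ε values are illustrative only (the patent uses N_0 = 8000 and ε in 5-13):

```python
import random
import numpy as np

def estimate_threshold(docs, n0=1000, eps=1.5, seed=0):
    """Sampling procedure for the cluster threshold r (Steps 1-4):
    pick n0 random document pairs, average their cosine similarity
    to get avgsim, and return r = eps * avgsim."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n0):
        a, b = (np.asarray(d, dtype=float) for d in rng.sample(docs, 2))
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    avgsim = sum(sims) / len(sims)
    return eps * avgsim

docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
r = estimate_threshold(docs, n0=200, eps=1.2)
```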
The LSA-GCC and Rocchio models are compared; the experimental results are shown in Tables 1 and 2 below.
Table 1 LSA-GCC
Table 2 Rocchio
Among the parameters in Tables 1 and 2, r is the classification threshold; Right is the number of correctly classified records; Error is the number of misclassified records; Uncertain is the number of records that, under the given classification threshold, cannot be assigned to any category built into the model; the Error, Information, Failure Audit, Warning, Success Audit, and Success events are judgments of particular security events; Recall is the recall rate of these judgments; and Accuracy is the accuracy of the model.
Fig. 1 is a schematic comparison of the recognition accuracy of the LSA-GCC and Rocchio methods.
3 A risk identification model based on latent semantics
3.1 Risk identification indices
1. Alarm purity
This index mainly measures the model's ability to detect anomalous events. An alarm is the warning information issued when the risk value computed by the model falls below a certain threshold.
(1) Detection precision (Detection Precision Rate)
The percentage of correctly detected alarms among all alarms, expressed as:

DPR = #RA / #A × 100%   (4-7)

where #RA (Right Alert) is the number of correctly detected alarms and #A is the total number of alarms.
(2) False alarm rate (False Alarm Rate)
The percentage of erroneous alarms relative to true alarms, expressed as:

FAR = #EA / #TA × 100%   (4-8)

where #EA (Error Alert) is the number of erroneous alarms and #TA (True Alert) is the total number of true alarms.
2. Model fitting accuracy
This index mainly measures the model's ability to make determinations about all security events. A determination is the judgment of a security event's category according to whether the risk value computed by the model falls below a certain threshold.
(1) Determination accuracy (Determine Accuracy)
The percentage of correct determinations among all determinations, expressed as:

DA = #RD / #AD × 100%   (4-9)

where #RD (Right Determine) is the number of correct determinations and #AD (All Determine) is the total number of determinations.
(2) Determination recall (Determine Recall)
The percentage of correct determinations relative to true determinations, expressed as:

DR = #RD / #TD × 100%   (4-10)

where #RD (Right Determine) is the number of correct determinations and #TD (True Determine) is the total number of true determinations.
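The four indices (4-7)-(4-10) are straightforward ratios; a sketch with invented counts (the numbers are illustrative, not experimental results):

```python
def detection_metrics(n_right_alert, n_alert, n_error_alert, n_true_alert,
                      n_right_det, n_all_det, n_true_det):
    """Indices (4-7)-(4-10), all as percentages:
    DPR = #RA/#A, FAR = #EA/#TA, DA = #RD/#AD, DR = #RD/#TD."""
    return {
        "DPR": 100.0 * n_right_alert / n_alert,
        "FAR": 100.0 * n_error_alert / n_true_alert,
        "DA":  100.0 * n_right_det / n_all_det,
        "DR":  100.0 * n_right_det / n_true_det,
    }

m = detection_metrics(98, 100, 2, 100, 90, 100, 95)
# DPR 98.0, FAR 2.0, DA 90.0, DR about 94.7
```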
3.2 The risk identification method
A cloud computing system contains a large number and variety of detection devices, network components, and virtual host nodes; these detection devices monitor host and network operation from different angles, and associations exist among the large numbers of alarms and logs they produce.
Analyzing the log information of a single detection device is the traditional security situation assessment method; the single data source and the uncertainty of the detection device make the analysis results inaccurate. At the same time, traditional security situation assessment methods do not account for the synonyms and polysemy of natural language or the association relations that often exist between words.
The present invention therefore proposes a network security situation assessment model based on latent semantics, LSA-SAM (LSA-based Security Assessment Model), also called the LSA-SAM security recognition model. It takes the logs of multiple associated detection devices as its data source, applies the LSA-GCC method, and predicts and analyzes the trend of the security situation.
Fig. 6 shows the assessment flow of the LSA-SAM security recognition model. As shown in the figure, the LSA-GCC model built in step 1 is first used to compute the similarity sim(p, c_i0) with a given security event; the security event's risk degree S_i is then computed; the risk value is S_risk = sim(p, c_i0) · S_i; finally, the computed risk value is matched against the preset security grades to obtain its security grade.
The security event set of the present invention considers the event log set of the Windows operating system in the virtual machine: SES = {SUCCESS EVENT, SUCCESS AUDIT EVENT, INFORMATION EVENT, WARNING EVENT, FAILURE EVENT, ERROR EVENT}; and, on the fairly representative Tomcat web service server, the event log set collected with Log4j: SESt = {AUDIT, DEBUG, INFO, WARN, ERROR, FATAL}. A safety value is assigned to each security event (as shown in Table 3). The security grades are Gs = {safe, safer, general, unsafe}, each covering a certain numerical range; a risk value obtained from the LSA-SAM security recognition model belongs to the security grade whose numerical range it falls into. The security grade ranges are shown in Table 4.
OS security event       WebService event    Weight
SUCCESS EVENT           AUDIT               1.0
SUCCESS AUDIT EVENT     DEBUG               0.8
INFORMATION EVENT       INFO                0.6
WARNING EVENT           WARN                0.4
FAILURE EVENT           ERROR               0.2
ERROR EVENT             FATAL               0.0
Table 3 Security event weight assignment
Grade       Range
Safe        8.5~10
Safer       6.0~8.5
General     3.5~6.0
Unsafe      0~3.5
Table 4 Risk grade ranges
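The LSA-SAM decision step — combining sim(p, c_i0), the security-event weight S_i of Table 3, and the grade ranges of Table 4 — can be sketched as follows. Scaling the [0, 1] product by 10 so that it is commensurable with the 0-10 grade ranges is an assumption made here, not stated in the source:

```python
WEIGHTS = {   # Table 3: security-event weights S_i
    "SUCCESS EVENT": 1.0, "SUCCESS AUDIT EVENT": 0.8,
    "INFORMATION EVENT": 0.6, "WARNING EVENT": 0.4,
    "FAILURE EVENT": 0.2, "ERROR EVENT": 0.0,
}
GRADES = [    # Table 4: (lower bound of range, grade label)
    (8.5, "safe"), (6.0, "safer"), (3.5, "general"), (0.0, "unsafe"),
]

def risk_grade(sim, event, scale=10.0):
    """Compute S_risk = sim(p, c_i0) * S_i and match it against the
    preset grade ranges; `scale` maps the [0, 1] product onto the
    0-10 scale of Table 4 (an assumption)."""
    s_risk = scale * sim * WEIGHTS[event]
    for lower, label in GRADES:
        if s_risk >= lower:
            return s_risk, label
    return s_risk, "unsafe"

value, grade = risk_grade(0.9, "SUCCESS EVENT")    # 9.0 -> "safe"
value2, grade2 = risk_grade(0.8, "WARNING EVENT")  # 3.2 -> "unsafe"
```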
This experiment verifies the overall judgment of security risk by the LSA-SAM algorithm. The experiment assumes that only one security event occurs in any one time slice; the given security events are predicted and assigned risk values. Part of the test results are shown in Figure 7.
As can be seen from Figure 7, the experimental results largely agree with the original risk situation, but there are also some mismatches, such as the four points marked 1, 2, 3, and 4 in the figure. The risk situation was set above: the risk-value range of unsafe events is 0~3.5, and the risk values of these four points are all 0, so all four are judged abnormal and raise alarms — yet their corresponding original situations lie within the safe range, where no alarm should be raised. What is the reason? These four points are in fact the Uncertain cases of experiment one, i.e. cases not covered by the model: because they cannot be assigned to a corresponding category, they are treated as anomalous events and handed to the administrator. In the experiment, marks 1 and 3, and marks 2 and 4, are respectively identical security events; if, after the first alarm, the administrator judges the event to be an ordinary security event, yet an alarm is still raised when the same event recurs, that is a problem to be solved. This is where the advantage of the LSA-SAM model comes into play: it learns from the administrator's judgment, takes the point as a new small category, and computes that category's "centroid". The risk situation after learning is shown in Figure 7; clearly, the false-alarm situation is improved after learning. We also compare each metric of the model before and after the improvement, as shown in Figure 8.
As can be seen from Figure 9, before the improvement the correct detection rate DPR of anomalous events of the LSA-SAM model is only about 70%, while after the improvement it reaches about 98%; in terms of the false alarm rate FAR, the improved LSA-SAM model sits at only about 2%, against as much as 29% before the improvement. Learning therefore improves the performance of the LSA-SAM model significantly. Moreover, the model's DPR and FAR do not fluctuate much as the data set grows, further confirming the correctness of the LSA-SAM model.
Meanwhile, Figure 7 also shows that marks 5 and 6 are each runs of consecutive points with identical risk values. Since the experiment assumes only one event occurs at a time, if these consecutive points are identical events, then such frequently repeated behavior is itself an abnormal situation — an attack. For this situation an alarm should be raised promptly and the matter handed to the administrator.
The experimental assumption that only one event occurs per time slice is unreasonable in practice; it is made here only for convenience of experimental simulation. When the LSA-SAM model is applied in reality, the minimum risk value among all security events should be taken as the risk value of that moment. Two situations must also be considered: (1) multiple identical security events occur consecutively in the same time slice — the risk value should then be reduced into the unsafe range and an alarm raised; (2) multiple security events in the same time slice have risk values within the unsafe range — alarms should then be raised for each of them, so that the administrator can see clearly where the problem lies.
4 Parallel speed-up of the LSA-GCC model and LSA-SAM
Log file data volumes are huge — usually astonishingly so. With the rapid development of computer and information technology, enterprises run ever more hosts, servers, firewalls, switches, access devices, wireless routers, and other network equipment, security equipment, and application systems; the massive log information these devices produce has become an important component of the rapidly growing data of the big-data era, and the resulting log management and security auditing work grows ever more complex. Faced with such mass data, purely manual management has become an almost impossible task. The need to collect, process, and analyze mass data effectively keeps driving the study of new technologies to satisfy people's pursuit of computing power. Distributed computing platforms keep evolving under this demand — high-performance computing, grid computing, pervasive computing, cloud computing, and so on have appeared in succession — and within the cloud computing field, the parallel computation method MapReduce proposed by Google has become an emerging research hotspot in recent years. It can parallelize complex large-scale problems and realize distributed computation quickly, and is especially suitable for data mining and machine learning applications. The present invention adopts MapReduce as the means of parallel speed-up for LSA-GCC and LSA-SAM and verifies it.
4.1 Parallel speed-up of LSA-GCC based on the MapReduce framework
In the LSA-GCC modeling process, the model's computational complexity and time cost grow with the training set D. The MapReduce framework can greatly shorten computation and has the advantage of solving large-data problems and their high time cost. The present invention therefore implements the LSA-GCC algorithm on the MapReduce framework as MR-LSA-GCC. The steps of algorithm MR-LSA-GCC are as follows:
Step 1: split the LSA-processed training set D into m parts and assign each part D_i to a MAP task;
Step 2 (MAP): read the assigned training set D_i; cluster D_i with the CSPC algorithm to obtain the category set m_c contained in D_i; then build a prototype vector for each category in m_c with the Rocchio algorithm, obtaining the prototype vector set m_gcc of all categories in D_i;
Step 3 (REDUCE): collect the prototype vector sets m_gcc from the MAP tasks and merge their prototype vectors with the CSPC algorithm to obtain the final prototype vector set m_final-gcc.
The whole flow of the MR-LSA-GCC algorithm is shown in Figure 10.
Algorithm 2 below is the pseudocode implementation of the MR-LSA-GCC algorithm.
Algorithm 2 MR-LSA-GCC pseudocode:
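Since the pseudocode body is not reproduced above, the three MR-LSA-GCC steps can be sketched sequentially in Python as follows; the minimal single-pass clusterer and the plain-mean prototypes are stand-ins for CSPC and the Rocchio formula (4-6), and the toy partitions are illustrative:

```python
import numpy as np

def _cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def _cluster(docs, r):
    """Minimal single-pass clusterer standing in for CSPC."""
    cats = []
    for p in (np.asarray(d, dtype=float) for d in docs):
        sims = [_cosine(p, np.mean(c, axis=0)) for c in cats]
        if sims and max(sims) >= r:
            cats[int(np.argmax(sims))].append(p)
        else:
            cats.append([p])
    return cats

def lsa_gcc_map(partition, r):
    """MAP (Step 2): cluster one training partition D_i and emit one
    prototype per discovered category (a plain mean stands in for
    the Rocchio prototype here)."""
    return [np.mean(cat, axis=0) for cat in _cluster(partition, r)]

def lsa_gcc_reduce(prototypes, r):
    """REDUCE (Step 3): merge the per-partition prototypes with one
    more clustering pass, yielding the final set m_final-gcc."""
    return [np.mean(cat, axis=0) for cat in _cluster(prototypes, r)]

# sequential stand-in for a MapReduce run over m = 2 partitions
partitions = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
mapped = [p for part in partitions for p in lsa_gcc_map(part, r=0.8)]
final_gccs = lsa_gcc_reduce(mapped, r=0.8)   # one prototype per class
```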
4.2 Parallel speed-up of the LSA-SAM security recognition model based on MapReduce
The LSA-SAM security recognition model computes the similarity sim(p, c_i0) between a test sample T and the already-built LSA-GCC model, takes the product of sim(p, c_i0) and the security event risk S_i as the sample's safety value, and then matches this safety value against the security grades to obtain the sample's security grade. Since the test samples are mutually independent, and given the advantages of the MapReduce framework, MapReduce can be used for acceleration when the data set is large. The present invention therefore proposes the MR-LSA-SAM algorithm, whose steps are as follows:
Step 1: map the test samples T into the latent semantic space built from the training set D, obtaining the vector representation set T' with semantic structure; then split T' into m parts and assign each test set T'_i to a MAP task;
Step 2 (MAP): read the assigned test set T'_i and use the LSA-SAM security recognition model to obtain the security grade of each test sample.
The whole flow of the MR-LSA-SAM algorithm is shown in Figure 11. Its pseudocode is as follows:
Algorithm 3: pseudocode of the MR-LSA-SAM algorithm
As the pseudocode analysis shows, with the MapReduce framework the user only needs to implement the two functions Map and Reduce; everything else is completed automatically by the framework, which greatly simplifies the implementation of the algorithm. And because MapReduce is designed from the outset for large-scale computation, the algorithm itself is easy to extend across multiple compute nodes: an algorithm implemented on the MapReduce framework can be extended to a large cluster without modifying any code, which is advantageous for scalability, particularly when large-scale data must be processed. The log record analysis studied by the present invention is exactly such large-scale data, so the proposed method shows its advantages all the more once strengthened by the MapReduce framework.
1. Speedup analysis based on the MapReduce framework
A primary goal of parallelizing an algorithm is to reduce its solution time; the present invention therefore predicts, theoretically, the speedup obtained by the two proposed algorithms.
Algorithm speedup S: for a given problem, the ratio of the time spent by the serial algorithm to the time spent by the parallel algorithm:
S = T_s / T_p    (4-11)
where T_s and T_p denote the time spent by the serial and the parallel algorithm, respectively, in solving the problem.
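Formula (4-11) as a one-line helper (illustrative only):

```python
def speedup(t_serial, t_parallel):
    # S = T_s / T_p, formula (4-11)
    return t_serial / t_parallel
```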
2. Theoretical speedup of the MR-LSA-GCC algorithm
According to the MR-LSA-GCC theoretical model of Figure 10, T_p comprises five parts:
(1) T_fork: the time to copy the program from the master node to the slave nodes;
(2) T_map: the time to execute the LSA-GCC algorithm, i.e. the running time of LSA-GCC;
(3) T_out: the time for MAP to transmit its results to REDUCE, i.e. the transmission time;
(4) T_reducer: the time to execute the merge program;
(5) T_result: the time to write the REDUCE results to disk.
Therefore T_p = T_fork + T_map + T_out + T_reducer + T_result, and we have:
S_MR-LSA-GCC = T_s / T_p = T_s / (T_fork + T_map + T_out + T_reducer + T_result)    (4-12)
From the MR-LSA-GCC model we know that the program copy and the result output each occur exactly once, so the time they consume is small and can be neglected. Thus:
S_MR-LSA-GCC = T_s / T_p ≈ T_s / (T_map + T_out + T_reducer)    (4-13)
Suppose MR-LSA-GCC uses m virtual machines as slave nodes to execute the LSA-GCC algorithm, so that T_s = m * T_map. Then:
S_MR-LSA-GCC = m / (1 + T_out / T_map + T_reducer / T_map)    (4-14)
As the formula shows, the speedup of MR-LSA-GCC depends on the time spent executing the LSA-GCC algorithm. MR-LSA-GCC is therefore best suited to complex problems whose solution requires a long time.
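Formula (4-14) can be evaluated directly to see how transmission and merge overhead erode the ideal speedup of m; the timing values below are made-up examples, not measurements from the patent:

```python
def speedup_mr_lsa_gcc(m, t_map, t_out, t_reducer):
    # Formula (4-14): with T_s = m * T_map and T_fork, T_result neglected,
    # S = m / (1 + T_out / T_map + T_reducer / T_map).
    return m / (1.0 + t_out / t_map + t_reducer / t_map)
```

With zero overhead the speedup equals m; when T_out + T_reducer together equal T_map, it halves.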
The running times of the LSA-GCC and MR-LSA-GCC models on data sets of different sizes are shown in Figure 13, where the abscissa is the size of the data set and the ordinate is the corresponding running time of each model. As Figure 13 shows, as the data volume grows the running time of the LSA-GCC model increases quickly, while that of the MR-LSA-GCC model increases much more slowly. Experiments show that the MR-LSA-GCC model accelerates the LSA-GCC model by up to 12.16 times, which demonstrates the correctness and effectiveness of the MR-LSA-GCC model.
3. Theoretical speedup of the MR-LSA-SAM algorithm
According to the MR-LSA-SAM theoretical model of Figure 12, T_p comprises three parts:
T_fork: the time to copy the program from the master node to the slave nodes;
T_map: the time to execute the LSA-SAM algorithm, i.e. the running time of LSA-SAM;
T_result: the time to write the REDUCE results to disk.
Therefore T_p = T_fork + T_map + T_result, and we have:
S_MR-LSA-SAM = T_s / T_p = T_s / (T_fork + T_map + T_result)    (4-15)
From the MR-LSA-SAM model we know that the program copy and the result output each occur exactly once, so the time they consume is small and can be neglected. Thus:
S_MR-LSA-SAM = T_s / T_p ≈ T_s / T_map    (4-16)
Suppose MR-LSA-SAM uses m PCs as slave nodes to execute the LSA-SAM algorithm, so that T_s = m * T_map. Then:
S_MR-LSA-SAM = T_s / T_p ≈ m    (4-17)
As the formula shows, the speedup of MR-LSA-SAM is m. MR-LSA-SAM is therefore well suited to problems that can be divided into multiple independent subproblems and solved separately.
The running times of the LSA-SAM and MR-LSA-SAM models are shown in Figure 14, where the abscissa is the size of the data set and the ordinate is the corresponding running time of each model. As Figure 14 shows, as the data volume grows the running time of the LSA-SAM model increases almost linearly and quickly, while that of the MR-LSA-SAM model also increases close to linearly but at a much slower rate. Experiments show that the MR-LSA-SAM model accelerates the LSA-SAM model by up to 15.53 times, which demonstrates the correctness and effectiveness of the MR-LSA-SAM model.
The LSA-GCC-based distributed big-data system risk recognition method of the present invention performs feed-forward risk recognition from operating-system and Web-service process logs, realizing fine-grained risk identification at the underlying layer of virtual nodes. On this basis, an improved LSA-SAM (Latent Semantic Analysis - Security Assessment Model) is adopted to further improve the accuracy of detecting anomalous events. Finally, MapReduce is used to accelerate LSA-GCC and LSA-SAM respectively. Experimental results show that the risk-recognition speedups reach 12.56 and 15.3 respectively, with good recognition performance; the method can serve both as a reference for virtual-node security protection and as a leading indicator parameter for risk prediction.
For the massive logs produced by a cloud computing system, the present invention applies a generalized cluster classifier based on latent semantic analysis (LSA-GCC) for data mining, realizing semantic-level risk identification in event logs. Unlike the event-log types defined by the system itself, this method can discover potential risks in large volumes of generic logs and identify their risk grades. The invention further compares machine-learning methods for improving the recognition performance of LSA-GCC and accelerates the recognition process with MapReduce. Experiments verify that the combination of these methods improves both recognition accuracy and speed, and, compared with ordinary event-log statistics, gives more accurate single-point risk judgment for the virtual machines of a cloud computing system.
Although the present invention has been shown and described in detail with reference to preferred embodiments, those skilled in the art will understand that various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims, and such changes fall within the protection scope of the present invention.

Claims (8)

1. A distributed big-data system risk recognition method based on LSA-GCC, comprising the steps of:
Step 1: establishing an LSA-GCC model, which maps a data set into a semantic space, classifies it with a clustering algorithm, extracts the prototype vector of each specific class from the clustering result, and assigns a weight to each class, thereby building an initial prototype-vector model; this comprises:
Step 11: clustering the large-scale text with a clustering algorithm;
Step 12: obtaining a class set from the clustering result and building the LSA-GCC model from this class set;
Step 2: performing feed-forward risk recognition with the LSA-SAM security recognition model, which evaluates information-system risk on the basis of the LSA-GCC model: the data to be assessed are mapped into the same semantic space, the similarity to the prototype vector of each class is computed, the similarities are accumulated with the corresponding class weights, and the average gives the risk value of the data to be assessed, i.e. the risk value at the moment the data arrive; specifically:
Step 21: computing the similarity sim(p, c_i^0) to a given security event according to the LSA-GCC model built in Step 1;
Step 22: computing the security-event risk degree S_i;
Step 23: computing the risk value S_risk = sim(p, c_i^0) * S_i;
Step 24: matching the computed risk value against the preset security grades to obtain the security grade.
2. The LSA-GCC-based distributed big-data system risk recognition method according to claim 1, characterized in that in Step 11 the single-pass clustering algorithm CSPC is used to cluster the large-scale text; specifically, the CSPC algorithm takes as input the training set WS-DREAM and a clustering threshold r, and outputs the prototype-vector set GCCs. The algorithm is described as follows:
Step (1): let m_c be the class set and m_gcc the set of prototype vectors GCCs;
Step (2): initialize m_c and m_gcc to empty;
Step (3): input a new text p;
Step (4): compute the similarity between text p and every class in the class set m_c;
Step (5): find the class c_i^0 with the highest similarity to text p;
Step (6): if sim(p, c_i^0) >= r, merge p into class c_i^0;
Step (7): otherwise, create a new class c_i^0 and add it to the class set m_c;
Step (8): repeat the above steps until the training set WS-DREAM is exhausted;
Step (9): if |c_i^0| = 1, exclude class c_i^0 from the class set m_c;
Step (10): for every class in m_c, compute the prototype vector GCC of the class with the Rocchio formula, obtaining the prototype-vector set m_gcc;
Step (11): return m_gcc.
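Steps (1)-(11) can be sketched as a single-pass clustering loop over dense vectors. Cosine similarity and simple centroids stand in for the text representations; all names are illustrative, not from the patent:

```python
import math

def cos(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(cluster):
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def cspc(texts, r):
    clusters = []                      # class set m_c
    for p in texts:                    # one pass over the training set
        if clusters:
            best = max(range(len(clusters)),
                       key=lambda i: cos(p, centroid(clusters[i])))
            if cos(p, centroid(clusters[best])) >= r:
                clusters[best].append(p)   # step (6): merge p into c_i0
                continue
        clusters.append([p])           # step (7): open a new class
    # Step (9): discard singleton classes
    return [c for c in clusters if len(c) > 1]
```

The prototype-vector step (10) is omitted here; the function returns the surviving clusters, from which Rocchio prototypes would then be built.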
3. The LSA-GCC-based distributed big-data system risk recognition method according to claim 2, characterized in that in the prototype-vector set GCCs each class builds its prototype vector with the following Rocchio formula:
gcc_i = α * (1/|C_i|) * Σ_{d∈C_i} d − β * (1/(|D| − |C_i|)) * Σ_{d∈D∖C_i} d
where gcc_i denotes the prototype vector of the i-th class C_i^0 in LSA-GCC, C_i and |C_i| denote the documents in the class and their number respectively, and the parameters α and β are weights balancing the positive and negative samples so as to obtain the best gcc_i.
4. The LSA-GCC-based distributed big-data system risk recognition method according to claim 3, characterized in that α = 5β.
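A minimal sketch of the Rocchio prototype computation, assuming the standard Rocchio centroid form (the exact formula does not survive in this text, so this reconstruction is an assumption), with α = 5β as in claim 4:

```python
def rocchio_prototype(positive, negative, alpha=5.0, beta=1.0):
    # gcc_i = alpha * mean(d in C_i) - beta * mean(d outside C_i)
    # alpha = 5 * beta per claim 4; vectors are equal-length lists.
    dim = len(positive[0])
    pos = [sum(d[k] for d in positive) / len(positive) for k in range(dim)]
    if negative:
        neg = [sum(d[k] for d in negative) / len(negative) for k in range(dim)]
    else:
        neg = [0.0] * dim
    return [alpha * p - beta * n for p, n in zip(pos, neg)]
```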
5. The LSA-GCC-based distributed big-data system risk recognition method according to claim 2, characterized in that the clustering threshold r of step (6) is sampled as follows:
Step 61: randomly select N_0 text pairs from the text set;
Step 62: compute the similarity sim of each text pair;
Step 63: average the similarities obtained in Step 62 to obtain the mean similarity avgsim;
Step 64: set the threshold r = ∈ * avgsim, where ∈ >= 1;
where N_0 is the number of selected text pairs, avgsim is the mean similarity of the N_0 text pairs, and ∈ is a parameter for adjusting the threshold r to different application scenarios.
6. The LSA-GCC-based distributed big-data system risk recognition method according to claim 5, characterized in that ∈ ranges from 5 to 13.
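Steps 61-64 can be sketched as follows; the similarity function `sim` and the sampling details are illustrative, since the claims fix nothing beyond what the steps state:

```python
import random

def sample_threshold(vectors, n0, eps, sim, seed=0):
    # Steps 61-64: draw N_0 random text pairs, average their similarity,
    # and set r = eps * avgsim (eps >= 1, typically 5-13 per claim 6).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n0):
        a, b = rng.sample(vectors, 2)   # one random text pair
        total += sim(a, b)
    avgsim = total / n0
    return eps * avgsim
```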
7. The LSA-GCC-based distributed big-data system risk recognition method according to claim 2, characterized in that in establishing the LSA-GCC model the MapReduce framework is used to implement the LSA-GCC algorithm:
Step 1: divide the LSA-processed training set D into m parts and assign each part D_i to a MAP task;
Step 2: build the MAP function: read the assigned training subset D_i, cluster D_i with the CSPC algorithm to obtain the class set m_c contained in D_i, then build a prototype vector for each class in m_c with the Rocchio algorithm, obtaining the prototype-vector set m_gcc of all classes in D_i;
Step 3: build the REDUCE function: collect the prototype-vector sets m_gcc obtained from the MAP tasks and merge their prototype vectors with the CSPC algorithm to obtain the final prototype-vector set m_final-gcc;
Step 4: the MapReduce framework calls the MAP and REDUCE functions to realize the LSA-GCC algorithm.
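The MAP/REDUCE split of claim 7 has the following shape; `cluster_fn` is a hypothetical stand-in for the CSPC + Rocchio pipeline, required here only to turn a list of items into a list of representatives:

```python
def lsa_gcc_map(partition_i, cluster_fn):
    # MAP: cluster the assigned partition D_i, emit its prototype set m_gcc.
    return cluster_fn(partition_i)

def lsa_gcc_reduce(all_prototypes, cluster_fn):
    # REDUCE: merge the prototypes from every MAP into m_final_gcc.
    merged = [p for protos in all_prototypes for p in protos]
    return cluster_fn(merged)

def mr_lsa_gcc(dataset, m, cluster_fn):
    # Split D into m parts; in MapReduce the MAP calls run in parallel.
    parts = [dataset[i::m] for i in range(m)]
    mapped = [lsa_gcc_map(p, cluster_fn) for p in parts]
    return lsa_gcc_reduce(mapped, cluster_fn)
```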
8. The LSA-GCC-based distributed big-data system risk recognition method according to claim 1, characterized in that the LSA-SAM security recognition model is realized with the MapReduce framework, comprising:
Step 1: map the test sample set T into the latent semantic space built from the training set D to obtain a vector representation set T' with semantic structure, then divide T' into m parts and assign each test subset T'_i to a MAP task;
Step 2: build the MAP function: read the assigned test subset T'_i and use the LSA-SAM security recognition model to obtain the security grade of each test sample.
CN201510038331.9A 2015-01-27 2015-01-27 Distributed type big data system risk recognition method based on LSA-GCC Pending CN104636449A (en)

Publications (1)

Publication Number Publication Date
CN104636449A true CN104636449A (en) 2015-05-20





Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
