CN104809114A

CN104809114A - Video big data oriented parallel data mining method

Info

Publication number: CN104809114A
Application number: CN201410035192.XA
Authority: CN
Inventors: 宫夏屹; 柴旭东; 王恒; 谢晓丹; 曲慧杨; 谷牧
Original assignee: Beijing Simulation Center
Current assignee: Beijing Simulation Center
Priority date: 2014-01-24
Filing date: 2014-01-24
Publication date: 2015-07-29

Abstract

The invention relates to a parallel data mining method for video big data. The method comprises the steps of: 1) building a video big data mining system; 2) building an index of the video big data with a big data indexing and description module; 3) accelerating the extraction of key information from the video big data with a feature extraction and video summarization acceleration module; and 4) mining the key video information data with a data mining algorithm and strategy module. The method optimizes the video data mining process, improves the applicability of the algorithm, and allows video big data to be mined quickly and efficiently.

Description

A parallel data mining method for video big data

Technical field

The present invention relates to a data mining method, and in particular to a parallel data mining method for video big data.

Background art

Big data refers to data sets whose content cannot be captured, managed, and processed within a given time using traditional database software tools. Big data has four characteristic features: large volume, variety, low value density, and high velocity. Massive public-safety video, as a kind of unstructured data, has these characteristic features, and data mining for video big data is both an important direction of big data mining research and a technical difficulty. An analysis of domestic research on big data shows that the research is still rather scattered: big data processing platform architectures are mostly based on Hadoop, much of the research concentrates on mining and analysis methods for big data, and no body of technology supporting the development of big data processing platforms has yet formed. Moreover, the research and application of data mining technology in public security work is still at an early stage. Many public security business information systems remain at a primary processing level and lack comprehensive application development, intelligent analysis and assessment, and scientific early warning. At the same time, a standards system for public security business has not yet been fully established.

In actual public-safety applications, big data mining systems usually involve massive volumes of video data, and describing and indexing video big data is difficult. Common data mining algorithms do not take the multi-class nature of the data into account, which makes them hard to apply to the mining of unstructured data, while traditional parallel mining methods have high running overhead and poor adaptability. A method is therefore needed that can effectively build indexes for video big data and mine it in parallel, so as to ensure efficient analysis and processing of video big data and thereby support business applications in the public-safety field.

Summary of the invention

In view of the variety, low value density, and fast-processing requirements of video data in applications such as public security, the parallel mining technology for big data is studied, and an integrated solution is proposed that covers big data description, feature extraction, data mining, and intelligent association analysis. In general, a parallel data mining method for video big data is provided that solves the problems of mining association relationships in video big data and of analyzing it efficiently and intelligently.

The object of the present invention is achieved through the following technical solution:

A parallel data mining method for video big data, the method comprising:

1) building a video big data mining system;

2) building an index of the video big data with the big data indexing and description module;

3) accelerating the extraction of key information from the video big data with the feature extraction and video summarization acceleration module;

4) mining the key video information data with the parallel data mining algorithm and strategy module.

The video big data mining system comprises:

a big data indexing and description module, for building the index of the video big data;

a feature extraction and video summarization acceleration module, for supporting intelligent analysis of the video big data and for accelerating, based on CUDA, the extraction of key video features and the video summarization process;

a parallel data mining algorithm and strategy module, for classification and association analysis of the video data.

The index of the video big data comprises a hierarchical index, an R-tree index, and a category index, which together support access to all kinds of video data.

The parallel data mining algorithm and strategy module mines the video big data with an improved Apriori algorithm based on the MapReduce programming model. The specific steps are as follows:

401) the MapReduce library divides the transaction database horizontally into n data subsets of comparable size and sends the n data subsets to m nodes that execute Map tasks;

402) the n data subsets are formatted to produce <key1, value1> pairs, specifically <Tid, list> pairs, where Tid is the identifier of a transaction in the transaction database and list is the list of items of that transaction;

403) the Map task scans each record <Tid, list> of its input data subset and produces a set of local candidate itemsets, denoted Cp, in which each candidate itemset has a support count of 1;

404) an optional Combiner function is added on every machine that executes Map tasks; the Combiner function first merges the output of the Map function locally and outputs <itemsets, sup> pairs, where sup is the support count of itemsets within the data subset; the partition function hash(key) mod R then divides the intermediate key-value pairs produced by the Combiner function into R different partitions, and each partition is assigned to a designated Reduce function;

405) each node assigned a Reduce task reads the <itemsets, sup> data submitted by the Combiner functions; because many different candidate itemsets are mapped to the same Reduce function, the data are sorted by the key itemsets so that entries with the same candidate itemset are grouped together, forming <itemsets, list(sup)>;

406) the outputs Lp of the r Reduce functions are merged to obtain the final set of frequent itemsets, denoted L.

The advantages of the invention are as follows:

The method establishes a unified index for video big data and thereby supports fast retrieval of and access to video data. By introducing the CUDA architecture, parallel techniques further accelerate video feature extraction and video summarization. By introducing the improved Apriori algorithm based on the MapReduce programming model, the video data mining process is optimized and the applicability of the algorithm is improved, so that video big data can be mined quickly and efficiently. The method is suited to data mining of video big data in large-scale systems with large volumes of video data stored in a distributed manner, and is applicable to the public-safety field.

Brief description of the drawings

Fig. 1: flow chart of the method of the invention.

Detailed description of the embodiments

The parallel data mining method for video big data of the present invention is described in detail below with reference to Fig. 1. The specific steps of the method are as follows.

First step: build the video big data mining system

The video big data mining system comprises a big data indexing and description module, a feature extraction and video summarization acceleration module, and a parallel data mining algorithm and strategy module. The big data indexing and description module builds the index of the video big data, including a hierarchical index, an R-tree index, and a category index, to support access to all kinds of video data. The feature extraction and video summarization acceleration module supports intelligent analysis of the video big data and, based on CUDA, accelerates the extraction of key video features and the video summarization process. The parallel data mining algorithm and strategy module performs classification and association analysis on the video data.

Second step: the big data indexing and description module builds the index of the video big data

The big data indexing and description module adopts a storage index model: a hierarchical index tree, an R-tree index, and a category index are built and together form a unified interface. That is, a single access interface is constructed for interaction with the user, and the user accesses the big data through this interface.

Video big data is multi-class in nature. To reflect this, a category index is built whose content is the category, and the required thematic data are queried comprehensively through the category index. The R-tree index is a dynamic indexing method based on a hierarchical data structure. It approximates complex spatial objects by their minimum bounding rectangles (MBRs) and does not require the index extent of the whole study area to be known in advance, so it is suited to regional spatial data; spatial data can therefore be indexed with an R-tree, which provides a simple and fast query interface. A relationship must then be established between the contents of the two indexes. Because MBRs and the category index cannot be related directly, a third index, the storage index model, is built separately to link the two, and it provides the user with an interface for accessing public-safety data that can access both kinds of data at once. The storage index model contains the MBR content corresponding to the R-tree index and also the content that links to the category index.
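The linking role of the storage index model can be illustrated with a minimal Python sketch. Everything named here (`MBR`, `mbr_of`, `StorageIndex`, the camera ids and categories) is hypothetical and not taken from the patent, and a linear scan over MBRs stands in for a real R-tree, which would organize the rectangles hierarchically.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MBR:
    """Minimum bounding rectangle (min_x, min_y, max_x, max_y)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "MBR") -> bool:
        # Two rectangles overlap when they overlap on both axes.
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


def mbr_of(points):
    """Approximate a spatial object, given as (x, y) points, by its MBR."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return MBR(min(xs), min(ys), max(xs), max(ys))


@dataclass
class StorageIndex:
    """Assumed third index: ties each video id to an MBR and to a category."""
    mbrs: dict = field(default_factory=dict)        # video_id -> MBR
    categories: dict = field(default_factory=dict)  # category -> set of video ids

    def add(self, video_id, footprint, category):
        self.mbrs[video_id] = mbr_of(footprint)
        self.categories.setdefault(category, set()).add(video_id)

    def query(self, window, category):
        """Ids in `category` whose MBR intersects `window` (linear scan, not an R-tree)."""
        candidates = self.categories.get(category, set())
        return sorted(v for v in candidates if self.mbrs[v].intersects(window))


index = StorageIndex()
index.add("cam01", [(0, 0), (2, 3)], "traffic")
index.add("cam02", [(10, 10), (12, 11)], "traffic")
index.add("cam03", [(1, 1), (2, 2)], "entrance")
result = index.query(MBR(0, 0, 5, 5), "traffic")
print(result)
```

A single `query` call answers a combined spatial and category question, which is the kind of unified access the paragraph above describes.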

Third step: the feature extraction and video summarization acceleration module accelerates the extraction of key information from the video big data

After the index of the video big data has been built in the second step, the feature extraction and video summarization system can extract information from the video big data. The feature extraction and video summarization acceleration module is based on the CUDA architecture and uses parallel processing to accelerate feature extraction and video summarization. CUDA provides a convenient and very powerful GPU processing platform that can deliver speedups of several times up to hundreds of times in video processing. Under the CUDA architecture, the processing for feature extraction and video summarization is divided into two parts, a host side and a device side. The host side is the part executed on the CPU, while the device side is the part executed on the graphics chip, which can process video data in parallel. A device-side program is also called a "kernel". Typically the host-side program prepares the data and copies it into the graphics card's memory, the graphics chip then executes the device-side program, and when it has finished the host-side program retrieves the result from the graphics card's memory.

Under the CUDA architecture, the smallest unit of execution on the graphics chip is the thread. Several threads form a block. The threads in a block can access the same shared memory and can synchronize with one another quickly. The number of threads a block may contain is limited, but blocks that execute the same program can be grouped into a grid. Threads in different blocks cannot access the same shared memory, so they cannot communicate or synchronize with one another directly, and the degree to which they can cooperate is therefore lower. With this model, however, a program need not be concerned with how many threads the graphics chip can actually execute at the same time. For example, a graphics chip with only a few execution units may execute the threads of the blocks sequentially rather than simultaneously. Different grids can execute different programs (that is, different kernels). The relationship among grid, block, and thread is therefore hierarchical: threads make up blocks, and blocks make up a grid.
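The thread, block, and grid hierarchy can be illustrated with a pure-Python emulation. This is not GPU code: the blocks and threads run one after another in ordinary loops, and the names `launch` and `frame_diff_kernel` as well as the per-pixel frame-difference workload are assumptions chosen for illustration. The index arithmetic mirrors the usual CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```python
import math


def launch(kernel, n_elements, threads_per_block, *args):
    """Emulate a 1-D kernel launch: one grid, enough blocks to cover n_elements."""
    blocks_per_grid = math.ceil(n_elements / threads_per_block)
    for block_idx in range(blocks_per_grid):          # blocks of the grid
        for thread_idx in range(threads_per_block):   # threads of one block
            kernel(block_idx, threads_per_block, thread_idx, *args)
    return blocks_per_grid


def frame_diff_kernel(block_idx, block_dim, thread_idx, prev, curr, out):
    """Each emulated thread handles one pixel of two consecutive frames."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # the last block may be partly idle
        out[i] = abs(curr[i] - prev[i])


prev_frame = [10, 20, 30, 40, 50, 60, 70]
curr_frame = [12, 20, 25, 40, 58, 60, 71]
diff = [0] * len(prev_frame)
n_blocks = launch(frame_diff_kernel, len(diff), 4, prev_frame, curr_frame, diff)
print(n_blocks, diff)
```

Because the kernel never assumes how many threads run at once, the same code is valid whether the blocks execute sequentially, as here, or concurrently on a device with many execution units.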

Each thread has its own registers and local memory. All threads in the same block share one shared memory. In addition, all threads (including threads in different blocks) share the same global memory, constant memory, and texture memory. Different grids each have their own global memory, constant memory, and texture memory. Organizing the computation in this way can greatly increase the processing speed of feature extraction and video summarization on video data.
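The division of labour between per-block shared memory and global memory can be sketched with an emulated block-level sum. Again this is plain Python rather than CUDA: `block_sum` and `grid_sum` are hypothetical names, the list `shared` stands in for a block's shared memory, and each pass of the `while` loop stands in for one phase between synchronization barriers. The sketch assumes the block size is a power of two.

```python
def block_sum(values, block_idx, block_dim):
    """Emulate one block: load a tile into shared memory, then tree-reduce it."""
    start = block_idx * block_dim
    # "Shared memory": visible only to the threads of this block.
    shared = [values[start + t] if start + t < len(values) else 0
              for t in range(block_dim)]
    stride = block_dim // 2
    while stride > 0:
        # One phase: thread t adds the element `stride` positions away.
        for t in range(stride):
            shared[t] += shared[t + stride]
        stride //= 2  # a barrier would separate consecutive phases on a GPU
    return shared[0]


def grid_sum(values, block_dim=4):
    """Blocks cannot share memory, so each writes one partial sum to 'global memory'."""
    n_blocks = (len(values) + block_dim - 1) // block_dim
    partial = [block_sum(values, b, block_dim) for b in range(n_blocks)]
    return partial, sum(partial)  # the final combine runs on the host side


partial_sums, total = grid_sum([2, 0, 5, 0, 8, 0, 1], block_dim=4)
print(partial_sums, total)
```

Threads cooperate closely inside a block through the shared tile, while cooperation across blocks is limited to the partial results each block leaves in global memory.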

Fourth step: the parallel data mining algorithm and strategy module mines the key video information data

The parallel data mining algorithm and strategy module mines the video big data with an improved Apriori algorithm based on the MapReduce programming model. The MapReduce programming model abstracts the complex business logic of parallel programming: the computation is expressed through a simple interface, and the complexity of parallelization, fault tolerance, data distribution, and load balancing is hidden.
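The "simple interface" of the MapReduce model can be shown with a single-process Python sketch. The driver `run_mapreduce` is a hypothetical stand-in for a real framework: it only groups values by key and performs none of the distribution, fault tolerance, or load balancing that a real MapReduce library hides. The label-counting workload is likewise an assumption for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: apply map_fn, group values by key, apply reduce_fn per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def map_labels(record):
    """User-supplied map: emit <label, 1> for each label attached to a video."""
    _video_id, labels = record
    for label in labels:
        yield label, 1


def reduce_count(_key, values):
    """User-supplied reduce: total the counts of one label."""
    return sum(values)


records = [("v1", ["car", "person"]), ("v2", ["car"]), ("v3", ["person", "bike", "car"])]
counts = run_mapreduce(records, map_labels, reduce_count)
print(counts)
```

The user writes only `map_labels` and `reduce_count`; everything else belongs to the driver, which is the separation of concerns the paragraph above attributes to the programming model.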

The improved Apriori algorithm based on the MapReduce programming model executes as follows:

Step 1: the MapReduce library divides the transaction database that stores the big data studied here horizontally into n data subsets of comparable size and sends the n data subsets to m nodes that execute Map tasks.

Step 2: the n data subsets are formatted to produce <key1, value1> pairs, specifically <Tid, list> pairs, where Tid is the identifier of a transaction in the transaction database and list is the list of items of that transaction.
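Steps 1 and 2 can be sketched in Python as follows. The function names, the contiguous splitting rule, and the round-robin assignment of subsets to nodes are all assumptions; the source only requires n subsets of comparable size sent to m Map nodes, and a real MapReduce library would handle the splitting itself.

```python
def horizontal_partition(transactions, n):
    """Split the transaction list into n contiguous subsets of near-equal size."""
    size, extra = divmod(len(transactions), n)
    subsets, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        subsets.append(transactions[start:end])
        start = end
    return subsets


def format_records(subset):
    """Turn each transaction into a <Tid, list> pair with a sorted item list."""
    return [(tid, sorted(items)) for tid, items in subset]


def assign_to_nodes(subsets, m):
    """Assumed policy: hand subsets to the m Map nodes in round-robin order."""
    nodes = [[] for _ in range(m)]
    for i, subset in enumerate(subsets):
        nodes[i % m].append(subset)
    return nodes


transactions = [
    (1, {"car", "person"}),
    (2, {"car"}),
    (3, {"bike", "car"}),
    (4, {"person"}),
    (5, {"car", "person"}),
]
subsets = horizontal_partition(transactions, n=3)
records = format_records(subsets[0])
nodes = assign_to_nodes(subsets, m=2)
print([len(s) for s in subsets], records)
```

Horizontal division keeps every transaction whole within one subset, so each Map task can count itemsets in its own records without consulting other nodes.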

Step 3: the Map task scans each record <Tid, list> of its input data subset and produces a set of local candidate itemsets, denoted Cp, in which each candidate itemset has a support count of 1. The Map function generates and outputs intermediate <key2, value2> pairs, defined here as <itemsets, 1> pairs, where itemsets is a candidate itemset in Cp. The pseudocode of the Map function is as follows:
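The pseudocode referred to above does not survive in this text. As a stand-in, the following minimal Python sketch shows one plausible Map function. The name `apriori_map`, the parameter `k`, and the choice to enumerate every k-item subset of a transaction as a candidate are assumptions; the source does not state how Cp is generated or what the improvement to Apriori consists of.

```python
from itertools import combinations


def apriori_map(tid, items, k):
    """Emit <itemset, 1> for every k-item candidate contained in one transaction.

    `tid` is carried along only to mirror the <Tid, list> record format.
    """
    for itemset in combinations(sorted(items), k):
        yield itemset, 1


pairs = list(apriori_map(1, ["person", "car", "bike"], k=2))
print(pairs)
```

Sorting the items first gives every itemset a canonical tuple form, so the same candidate found in different transactions produces an identical key for the later grouping steps.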

Step 4: an optional Combiner function is added on every machine that executes Map tasks. The Combiner function first merges the output of the Map function locally and outputs <itemsets, sup> pairs, where sup is the support count of itemsets within the data subset. The partition function hash(key) mod R then divides the intermediate key-value pairs produced by the Combiner function into R different partitions, and each partition is assigned to a designated Reduce function.
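A minimal Python sketch of the local merge and of hash(key) mod R partitioning follows. The names `combine`, `partition`, and `shuffle` are hypothetical. CRC32 is used as the hash only because Python's built-in `hash` of strings is salted per process; the source does not name a particular hash function.

```python
import zlib
from collections import Counter


def combine(map_output):
    """Merge <itemset, 1> pairs from one Map task into <itemset, sup> pairs."""
    local = Counter()
    for itemset, count in map_output:
        local[itemset] += count
    return dict(local)


def partition(itemset, R):
    """hash(key) mod R, with a stable hash so every node agrees on the result."""
    key = ",".join(itemset).encode("utf-8")
    return zlib.crc32(key) % R


def shuffle(combined, R):
    """Place each <itemset, sup> pair in the partition of its Reduce function."""
    parts = [dict() for _ in range(R)]
    for itemset, sup in combined.items():
        parts[partition(itemset, R)][itemset] = sup
    return parts


map_output = [(("car",), 1), (("person",), 1), (("car",), 1)]
combined = combine(map_output)
parts = shuffle(combined, R=2)
print(combined, parts)
```

Combining before the shuffle reduces the number of pairs sent over the network, and a deterministic partition function guarantees that all local counts of one itemset reach the same Reduce function.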

Step 5: each node assigned a Reduce task reads the <itemsets, sup> data submitted by the Combiner functions. Because many different candidate itemsets are mapped to the same Reduce function, the data are sorted by the key itemsets so that entries with the same candidate itemset are grouped together, forming <itemsets, list(sup)>. The worker node traverses the sorted intermediate data and passes <itemsets, list(sup)> to the Reduce function. The Reduce function sums the support counts of the same candidate itemset to obtain its actual support count over the whole transaction database, then compares this with the minimum support count min_sup to determine the set of local frequent itemsets, denoted Lp.
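The sort, group, and threshold logic of this step can be sketched in Python. The names `group_by_itemset` and `apriori_reduce` and the sample data are assumptions; the comparison `total >= min_sup` reflects the usual definition of a frequent itemset, which the source does not spell out.

```python
from collections import defaultdict


def group_by_itemset(submissions):
    """Sort combiner outputs by key and build <itemset, list(sup)> groups."""
    grouped = defaultdict(list)
    for itemset, sup in sorted(submissions):
        grouped[itemset].append(sup)
    return dict(grouped)


def apriori_reduce(itemset, sups, min_sup):
    """Sum the partial support counts; keep the itemset only if it is frequent."""
    total = sum(sups)
    return (itemset, total) if total >= min_sup else None


# <itemset, sup> pairs as they might arrive from several Combiner functions.
submissions = [(("car",), 2), (("person",), 1), (("car",), 1),
               (("bike",), 1), (("person",), 1)]
grouped = group_by_itemset(submissions)

Lp = {}
for itemset, sups in grouped.items():
    result = apriori_reduce(itemset, sups, min_sup=2)
    if result is not None:
        Lp[result[0]] = result[1]
print(grouped, Lp)
```

Because partitioning sends every partial count of an itemset to one reducer, the summed value is the itemset's support over the whole database, not merely over one subset.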

Step 6: the outputs Lp of the r Reduce functions are merged to obtain the final set of frequent itemsets, denoted L.

The algorithm then terminates.
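The final merge of Step 6 amounts to a union of the per-reducer results. In the Python sketch below, `merge_frequent` is a hypothetical name, and the duplicate check encodes an assumption that follows from hash partitioning: each itemset is routed to exactly one Reduce function, so no itemset should appear in two outputs.

```python
def merge_frequent(local_sets):
    """Union the per-reducer frequent itemsets Lp into the final set L."""
    L = {}
    for Lp in local_sets:
        for itemset, sup in Lp.items():
            if itemset in L:
                # Would indicate a partitioning bug: one key reached two reducers.
                raise ValueError(f"itemset {itemset} reported by two reducers")
            L[itemset] = sup
    return L


L = merge_frequent([{("car",): 3}, {("person",): 2, ("car", "person"): 2}])
print(L)
```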

It should be understood that the above detailed description of the technical solution of the invention by way of preferred embodiments is illustrative and not restrictive. On the basis of reading this specification, a person of ordinary skill in the art may modify the technical solutions described in the embodiments or make equivalent replacements of some of their technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the invention.

Claims

1. A parallel data mining method for video big data, characterized in that the method comprises:

1) building a video big data mining system;

2. The parallel data mining method for video big data according to claim 1, characterized in that the video big data mining system comprises:

3. The parallel data mining method for video big data according to claim 2, characterized in that the index of the video big data comprises a hierarchical index, an R-tree index, and a category index that support access to all kinds of video data.

4. The parallel data mining method for video big data according to claim 1, characterized in that the parallel data mining algorithm and strategy module mines the video big data with an improved Apriori algorithm based on the MapReduce programming model, the specific steps being as follows: