CN106445988A

CN106445988A - Intelligent big data processing method and system

Info

Publication number: CN106445988A
Application number: CN201610382955.7A
Authority: CN
Inventors: 程明强; 蒋朦; 曹国梁; 耿志贤
Original assignee: COEUSYS Inc
Current assignee: Silver Li'an financial information services (Beijing) Co., Ltd.
Priority date: 2016-06-01
Filing date: 2016-06-01
Publication date: 2017-02-22

Abstract

Embodiments of the invention provide an intelligent big data processing method and system. The system comprises a data structuring module, a representation learning module and an application algorithm module. The data structuring module preprocesses original big data and converts the preprocessed data into a relational network comprising nodes and edges. The representation learning module applies an embedding-based representation learning algorithm to the relational network to obtain high-dimensional vectors for its nodes. The application algorithm module obtains an application service request from a user, determines the processing algorithm corresponding to that request, and determines the result of the request by using that processing algorithm together with the high-dimensional vectors of the nodes obtained by the representation learning module. The system provided by the embodiments can effectively extract the feature information in big data and express it uniformly as high-dimensional vectors; it is computationally efficient, accurate and responsive to user requests, and can provide a unified and effective processing method for multiple application services.

Description

Intelligent big data processing method and system

Technical field

The embodiments of the present invention relate to the field of computer technology, and more particularly to an intelligent big data processing method and system.

Background

The insurance industry is undergoing enormous change driven by technological progress, and the widespread application of big data has changed the way insurance companies deliver their services. Existing insurance websites and software have generally collected massive amounts of data containing a great deal of useful information, including users' personal information, consumption habits and so on. Only by making full use of insurance big data can insurers meet the demands of the big data era in areas such as risk pricing, product design, marketing strategy, customer service, and risk management and control.

At present, the insurance industry generally uses database systems to store and manage insurance data. A database system usually stores data in tables; the tables may hold large amounts of relational data and text, and the stored data may take many forms. For example, a user's personal profile and a product's description are usually stored in the database as text strings, whereas a user's age and a product's price are usually stored as non-negative numbers. Although current data processing techniques can extract and match formatted values such as numbers and categories, they cannot extract useful feature information from unstructured data such as text.

Common insurance services built on insurance data include precise product recommendation, classification of insurance-buying users, and insurance fraud detection. In insurance marketing, either users find insurance products by searching and then buy them, or products are actively recommended to users by methods such as popularity-based recommendation, association-rule recommendation and collaborative filtering. Popularity-based recommendation shows users the currently most popular insurance products; its drawback is that it lacks personalization and has low accuracy. Association-rule recommendation uses data analysis to learn rules that link a user's purchase interest to the user's own characteristics and to product features, for example that women over 40 are more likely to buy health insurance; its recommendation accuracy is also not high. Collaborative filtering rests on one basic assumption: a user who has been interested in certain insurance products will later buy similar products, and a product bought by certain users will later be bought by similar users. When an individual user has few recorded behaviors, the data are highly sparse and recommendations cannot be computed effectively.

When insurance-buying users are classified, the user category may describe the user's living habits, social habits, consumption habits and so on, and different categories require different user features. The usual approach is to extract features from the user's consumption records, such as monthly income, monthly spending, the standard deviation of annual income and the standard deviation of annual spending, and then to train a supervised learning model on a large number of labeled user categories in order to classify test users. This approach relies on experience to extract a large number of features and, moreover, requires collecting a large amount of labeled data, which leads to problems such as high cost and poor accuracy.

Insurance fraud detection means judging whether a user's insurance application is fraudulent; its core task is to collect features of the user's application behavior. Existing insurance fraud detection systems mainly extract a large number of numerical statistics from the user's personal information, information on the insurance product applied for, information on the application process and so on. At the same time, a portion of the users are labeled, with people judging manually whether each is a fraudulent user, and a supervised learning model is then trained to classify application behavior. However, such a system relies on experience to extract features and to collect labeled data, which makes it difficult to implement effectively.

It can be seen that existing intelligent processing systems for insurance big data have at least the following drawbacks: 1) existing insurance data techniques lack analysis of unstructured data, so a large amount of useful information is lost and the analysis results of insurance services suffer; 2) existing insurance recommendation systems, insurance-buying user classification systems, insurance fraud detection systems and the like depend too heavily on manual feature extraction, so accuracy is low, computational efficiency is poor and responses to user requests are slow, which harms the user experience; 3) different insurance services usually adopt different data processing and feature extraction methods, which causes a large amount of redundant data processing and makes the features of data units incompatible across services.

Summary of the invention

The purpose of the embodiments of the present invention is to provide an intelligent big data processing method and system that can efficiently extract feature information from multiple big data sources without manual participation, with high computational efficiency, high accuracy and responsive handling of user requests, and that can provide a unified and effective processing method for multiple application services.

The technical solutions adopted by the embodiments of the present invention are as follows:

An embodiment of the present invention provides an intelligent big data processing system. The system includes a data structuring module, a representation learning module and an application algorithm module.

The data structuring module is configured to preprocess original big data and to convert the preprocessed original big data into a network, obtaining a relational network comprising nodes and edges.

The representation learning module is configured to apply a representation learning algorithm based on embedding mapping to the relational network, obtaining high-dimensional vectors of the nodes of the relational network.

The application algorithm module is configured to obtain an application service request of a user, to determine the processing algorithm corresponding to the application service request, and to determine the result of the application service request by using that processing algorithm and the high-dimensional vectors of the nodes of the relational network obtained by the representation learning module.

Optionally, the relational network comprises a multidimensional relational network, and the representation learning module is specifically configured to perform embedding mapping on the multidimensional relational network to obtain high-dimensional vectors of the nodes of the multidimensional relational network.

Optionally, the relational network comprises a semantic network, and the representation learning module is specifically configured to perform embedding mapping on the semantic network to obtain high-dimensional vectors of the nodes of the semantic network.

Optionally, the data structuring module is specifically configured to convert the behavior data in the preprocessed original big data into a network, obtaining a behavior network comprising nodes and edges;

to convert the attribute data in the preprocessed original big data into a network, obtaining an attribute network comprising nodes and edges; and

to convert the text data in the preprocessed original big data into a network, obtaining a semantic network comprising nodes and edges;

wherein the behavior network, the attribute network and the semantic network together constitute the relational network.

An embodiment of the present invention further provides an intelligent big data processing method. The method includes:

preprocessing original big data;

converting the preprocessed original big data into a network to obtain a relational network comprising nodes and edges;

applying a representation learning algorithm based on embedding mapping to the relational network to obtain high-dimensional vectors of the nodes of the relational network;

obtaining an application service request of a user;

determining the processing algorithm corresponding to the application service request;

determining the result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network.

Optionally, the relational network comprises a multidimensional relational network, and applying the representation learning algorithm based on embedding mapping to the relational network to obtain the high-dimensional vectors of its nodes includes: performing embedding mapping on the multidimensional relational network to obtain high-dimensional vectors of the nodes of the multidimensional relational network.

Optionally, the relational network comprises a semantic network, and applying the representation learning algorithm based on embedding mapping to the relational network to obtain the high-dimensional vectors of its nodes includes: performing embedding mapping on the semantic network to obtain high-dimensional vectors of the nodes of the semantic network.

Optionally, converting the preprocessed original big data into a network to obtain a relational network comprising nodes and edges includes: converting the behavior data in the preprocessed original big data into a network, obtaining a behavior network comprising nodes and edges;

the behavior network, the attribute network and the semantic network together constitute the relational network.

An embodiment of the present invention further provides an intelligent big data processing method, including:

obtaining an application service request of a user and the high-dimensional vectors of the nodes of a relational network converted from original big data;

Optionally, the relational network converted from original big data is the relational network obtained by preprocessing the original big data and then converting it into a network.

The technical solutions of the embodiments of the present invention have the following advantages. The data structuring module preprocesses the original big data and converts it into a network, so that the original big data becomes network data, that is, structured data. The representation learning module can therefore use a representation learning algorithm for network data to extract features from the data quickly and uniformly and to express them as high-dimensional vectors. The application algorithm module determines the processing algorithm corresponding to the user's application service request and computes the processing result from the vector-form features extracted by the representation learning module. Unlike the prior art, the whole feature extraction process in the embodiments requires no human participation; it is completed automatically by the representation learning algorithm based on embedding mapping, and computational efficiency is high. Feature extraction also largely preserves the structural information (that is, the useful information) in the original big data, which improves the accuracy of tasks such as classification and prediction. Moreover, because the representation learning algorithm based on embedding mapping allows the data features mined from the original big data to be expressed uniformly as high-dimensional vectors, the system in the embodiments is not limited to one specific application service and can provide a unified and effective processing method for multiple application services.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of an intelligent big data processing method provided by an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of a behavior network;

Fig. 3 is a flow chart of another intelligent big data processing method provided by an embodiment of the present invention;

Fig. 4 is a structural schematic diagram of an intelligent big data processing system provided by an embodiment of the present invention;

Fig. 5 is a structural schematic diagram of another intelligent big data processing system provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

To better explain the embodiments of the present invention, the related concepts are explained before the embodiments are described.

A data unit is an indivisible basic unit used when representing relational data, such as a particular "customer or user", a particular "age bracket", a particular "product" or a particular "product category". These basic units correspond to entities in real life. The opposite of a data unit is a non-data unit, which is a structure composed of data units, such as a relationship between customers, a customer's behavior toward a product, or a product's membership in a certain group.

Behavior data is the data produced when a user acts on a product, for example the data generated when a user buys, cancels or reviews an insurance product. Behavior data describes a relation between two or more data units, usually the relation between a "user" and a "product".

Attribute data describes the relation between a data unit, such as a user or a product, and its attributes, for example a user's age or a product's type. It usually describes the relation between a "user" and its attributes, or between a "product" and its attributes.

Text data is text that contains words or phrases. The words or phrases can be treated as data units.

Structured data is data that can be represented by numbers or by a unified structure, such as numerals or symbols. It is stored in a database and can be expressed logically with a two-dimensional table structure.

Unstructured data, in contrast to structured data, is data that cannot be represented by numbers or by a unified structure and is not easily expressed with a two-dimensional logical table in a database, such as text, images, sound, web pages and forms of various kinds.

A multidimensional relation is a relation that involves multiple data units (that is, multiple nodes in the network); it is an interaction among many data units. A two-dimensional relation is an interaction between only two data units. When information is rich, a purchase is a behavior with a multidimensional relation, which may involve the user, the product, the place of purchase, the purchase method and so on; when information is incomplete, it may be a behavior with only a two-dimensional relation, for example one that involves only the user and the product. Traditional data processing systems consider only behaviors with two-dimensional relations and cannot process behaviors with multidimensional relations, yet the multidimensional-relation data that such behaviors produce is now common in every field.
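The distinction between a two-dimensional and a multidimensional relation can be illustrated with a short Python sketch. This is only an illustration under assumed conventions: the field names (user, product, place, method), the "role:value" node naming and both helper functions are hypothetical and do not come from the embodiment.

```python
def relation_dimension(record):
    """Count the data units that actually take part in one behavior record."""
    return sum(1 for value in record.values() if value is not None)


def as_hyperedge(record):
    """Represent the record as a single edge joining all participating nodes."""
    return frozenset(
        f"{role}:{value}" for role, value in record.items() if value is not None
    )


# A purchase recorded with rich information: four data units interact.
rich_purchase = {"user": "u1", "product": "p1", "place": "online", "method": "card"}
# The same purchase with incomplete information: only user and product remain.
sparse_purchase = {"user": "u1", "product": "p1", "place": None, "method": None}

print(relation_dimension(rich_purchase), sorted(as_hyperedge(rich_purchase)))
print(relation_dimension(sparse_purchase), sorted(as_hyperedge(sparse_purchase)))
```

Keeping the whole record as one edge, rather than splitting it into pairs, is one way to avoid losing the information that all of its data units took part in the same behavior.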

In addition, with the development of network technology, the quantity of unstructured data keeps growing. The limitations of data processing systems that can manage and analyze only structured data are therefore becoming more and more apparent. Moreover, in many industries, not only in insurance, feature extraction from big data still requires experts and cannot be completed by computers alone. Big data processing systems also commonly suffer from a series of problems such as low accuracy, poor computational efficiency and slow response to user requests.

To solve the above problems, an embodiment of the present invention provides an intelligent big data processing method. As shown in Fig. 1, the method includes:

S101: Preprocess the original big data.

The original big data may be collected from various websites or application programs (apps). It may therefore include structured data such as behavior data and attribute data as well as unstructured data such as text data, and the data formats may also be diverse. Therefore, before features are extracted from the data or the data is used to provide services, the original big data can first be preprocessed. Data preprocessing methods include data cleaning, data integration, data transformation, data analysis and data reduction.

Optionally, in the embodiments of the present invention, preprocessing the original big data may consist of analyzing and cleaning it: statistical analysis is performed on the original big data, and data content that is malformed or erroneous is removed. This may mean filtering out invalid data formats, for example removing values such as prices that ought to be floating-point numbers but were filled in as character strings. It may also mean unifying times or units, filling in missing values, smoothing noisy data and so on. In this way the format of the big data can be standardized, abnormal data removed, errors corrected and duplicate data removed.
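The cleaning operations listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the embodiment's implementation: the record fields (user, price, age, date), the accepted date formats, filling a missing age with the mean, and the duplicate key are all hypothetical choices made for illustration.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")  # assumed input formats


def normalize_date(text):
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(text), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_records(records):
    """Filter invalid prices, fill missing ages, unify dates, drop duplicates."""
    ages = [r["age"] for r in records if isinstance(r.get("age"), (int, float))]
    mean_age = sum(ages) / len(ages) if ages else None
    cleaned, seen = [], set()
    for r in records:
        # A price must be numeric; rows whose price cannot be parsed are removed.
        try:
            price = float(r["price"])
        except (KeyError, TypeError, ValueError):
            continue
        age = r.get("age")
        if not isinstance(age, (int, float)):
            age = mean_age  # fill a missing value (assumed rule: use the mean)
        row = {
            "user": r.get("user"),
            "price": price,
            "age": age,
            "date": normalize_date(r.get("date")),  # unify the time format
        }
        key = (row["user"], row["price"], row["date"])
        if key in seen:  # remove repeated data
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

A real pipeline would choose its fill and deduplication rules per field; the point here is only that each operation named in the paragraph maps to a small, mechanical step.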

S102: Convert the preprocessed original big data into a network to obtain a relational network comprising nodes and edges.

The nodes in the relational network are converted from the data units in the preprocessed original big data, and the edges in the relational network represent the relations between the nodes.

Big data is usually stored in tables, but this traditional storage method cannot store and manage data uniformly at large scale, and it loses the semantic information contained in large amounts of text data (this semantic information is useful and is crucial for providing users with accurate application services). Most importantly, fragmented table storage cannot be accessed and used conveniently and efficiently by subsequent application services, and cannot meet the needs of application services that are invoked frequently and must respond quickly.

In the embodiments of the present invention, converting the original big data into a network turns the tabular or massive data into a relational network, which effectively solves the above problems. First, once the preprocessed original big data has been converted into a network, it can be handled uniformly in terms of nodes and edges, which greatly reduces the cost of storing and managing the data. Second, text data such as the words and phrases in the preprocessed original big data is converted into a semantic network, which preserves the semantic information in the text so that it can later be used effectively to improve the accuracy of application services. In addition, once the preprocessed original big data is expressed as a relational network comprising nodes and edges, representation learning algorithms for network data can be used to extract features from the data quickly and uniformly, so that different application service requests can be answered quickly.

Optionally, the preprocessed original big data may include behavior data, attribute data and text data. Converting the preprocessed original big data into a network may then include converting the behavior data into a network, for example converting behavior data such as purchases and reviews into a behavior network. It may also include converting the attribute data into a network, for example converting attribute information such as age and price into an attribute network. It may further include converting the text data into a network, for example converting text such as product descriptions or review content into a semantic network whose nodes are words and phrases. The behavior network, the attribute network and the semantic network together constitute the relational network.
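The construction of the three sub-networks can be sketched in Python as follows. The sketch assumes particular input shapes and linking rules that the embodiment does not specify: behaviors are (user, action, product) triples, attributes are (node, name, value) triples, texts are (node, text) pairs, an owning node is linked to each of its words, and adjacent words are linked to each other. The function and the node naming scheme are hypothetical.

```python
from collections import defaultdict


def build_relational_network(behaviors, attributes, texts):
    """Merge behavior, attribute and semantic edges into one undirected network."""
    edges = defaultdict(float)

    def add_edge(a, b, weight=1.0):
        edges[tuple(sorted((a, b)))] += weight  # undirected, weights accumulate

    # Behavior network: each purchase, review, etc. links a user and a product.
    for user, _action, product in behaviors:
        add_edge(f"user:{user}", f"product:{product}")

    # Attribute network: each data unit is linked to its attribute value.
    for node, name, value in attributes:
        add_edge(node, f"{name}:{value}")

    # Semantic network: words become nodes (assumed linking rules, see lead-in).
    for node, text in texts:
        words = [f"word:{w}" for w in text.lower().split()]
        for word in words:
            add_edge(node, word)
        for left, right in zip(words, words[1:]):
            add_edge(left, right)

    nodes = {n for edge in edges for n in edge}
    return nodes, dict(edges)


nodes, edges = build_relational_network(
    behaviors=[("u1", "buy", "p1"), ("u1", "review", "p1")],
    attributes=[("user:u1", "age", "30-39"), ("product:p1", "price", "low")],
    texts=[("product:p1", "family health insurance")],
)
print(len(nodes), len(edges))
```

Because all three kinds of data end up as entries in the same edge table, a single representation learning step can later be run over the merged network.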

S103: Apply a representation learning algorithm based on embedding mapping to the relational network to obtain high-dimensional vectors of the nodes of the relational network.

Representation learning is one of the core research problems in machine learning and data mining. In the embodiments of the present invention, a representation learning algorithm based on embedding mapping is applied to the relational network so that its nodes, such as users, products and phrases, are all expressed uniformly as vectors of relatively high dimension while the structural information in the original big data is preserved. Each vector represents one node of the relational network, and each dimension of the vector expresses one feature of that node. The relations between nodes in the relational network (in other words, the edges) are converted into similarities between the nodes' high-dimensional vectors: if node 1 and node 2 are related (that is, connected by an edge in the relational network), the similarity between the high-dimensional vector of node 1 and that of node 2 is high; otherwise the similarity is low.
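The embodiment does not name a specific similarity measure. As one common choice, cosine similarity can be used, as in the following Python sketch; the three vectors are made-up toy values chosen so that the connected pair points in nearly the same direction.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {
    "user:u1": [0.9, 0.1, 0.3],
    "product:p1": [0.8, 0.2, 0.4],    # connected to u1 by an edge
    "product:p2": [-0.7, 0.6, -0.1],  # not connected to u1
}

connected = cosine_similarity(vectors["user:u1"], vectors["product:p1"])
unconnected = cosine_similarity(vectors["user:u1"], vectors["product:p2"])
print(connected > unconnected)
```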

The representation learning described above avoids the manual feature extraction of the prior art, which depends on expert experience, and obtains features that conform to the regularities of the data in a manner driven by the big data itself. Once the features are expressed as vectors, they can be applied directly to a variety of tasks, including classification, clustering and prediction.

Further, a representation learning algorithm based on embedding mapping can preserve as much of the structural information in the relational network as possible, and can preserve different structural information for different networks. For example, for a "user-product" behavior network, purchase behavior information can be preserved, so that users whose vectors show similar features have similar purchase habits and products whose vectors show similar features have similar groups of buyers. For instance, 50 dimensions of the high-dimensional vector may be chosen to store the structural information "purchase relation", so that two nodes that stand in this relation (a user and a product) have a high similarity between the corresponding parts of their high-dimensional vectors; another 50 dimensions may be chosen to store the structural information "similar purchase intent", so that two nodes that share this structure (a user and another user) likewise have a high similarity between the corresponding parts of their high-dimensional vectors. This greatly improves the accuracy of the classification, prediction and other tasks required by later application services, and solves the prior-art problem that structural information in the data cannot be extracted effectively and a large amount of useful information is lost.
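One way to read the 50-dimension example is that each block of dimensions is compared separately. The Python sketch below illustrates that reading with a block size of 2 instead of 50 so that the toy vectors stay readable; the function names, the block layout and the vector values are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def block_similarity(u, v, block, block_size=50):
    """Compare only the dimensions reserved for one kind of structural information."""
    start = block * block_size
    return cosine(u[start:start + block_size], v[start:start + block_size])


# Assumed layout: block 0 = "purchase relation", block 1 = "similar purchase intent".
user_a = [1.0, 0.0, 0.5, 0.5]
user_b = [0.0, 1.0, 0.5, 0.6]
product = [1.0, 0.1, -1.0, 0.0]

print(block_similarity(user_a, product, block=0, block_size=2))  # user_a bought product
print(block_similarity(user_a, user_b, block=1, block_size=2))   # similar intent
```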

In addition, a common learning method obtains the high-dimensional representation of nodes by matrix or tensor decomposition, but such methods often suffer from excessively high complexity (cubic order), cannot be widely applied in industrial scenarios with massive data, and are not computationally efficient. In the embodiments of the present invention, a learning method based on embedding mapping is used. This method employs negative sampling, which samples the massive data in reasonable proportions for learning, so that the learning process can reach a good result in less time. Moreover, once the relational network is represented by high-dimensional vectors, not only is the learning time shortened, but computational efficiency is also greatly improved and user requests can be answered quickly.
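The embodiment does not give the training procedure in detail. The following Python sketch shows one generic way negative sampling is used to learn node vectors from edges: each observed edge is treated as a positive example, a few randomly drawn non-neighbors are treated as negative examples, and a logistic-loss gradient step is applied to each pair. The function names, hyperparameters and uniform sampling rule are assumptions, and the sketch is not claimed to be the algorithm of the embodiment.

```python
import math
import random


def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_embeddings(edges, dim=8, negatives=2, epochs=300, lr=0.05, seed=0):
    """Learn one vector per node so that connected nodes get a high dot product."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    vec = {n: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for n in nodes}

    def update(a, b, label):
        # Gradient step on the logistic loss: label 1.0 pulls the two vectors
        # together, label 0.0 pushes them apart.
        g = lr * (label - sigmoid(dot(vec[a], vec[b])))
        for i in range(dim):
            va, vb = vec[a][i], vec[b][i]
            vec[a][i] += g * vb
            vec[b][i] += g * va

    for _epoch in range(epochs):
        for a, b in edges:
            update(a, b, 1.0)  # observed edge: positive example
            for _k in range(negatives):
                c = rng.choice(nodes)  # sampled node: negative if not a neighbor
                if c != a and c not in neighbors[a]:
                    update(a, c, 0.0)
    return vec


edges = [("u1", "p1"), ("u2", "p1"), ("u3", "p2"), ("u4", "p2")]
vectors = train_embeddings(edges)
print(dot(vectors["u1"], vectors["p1"]) > dot(vectors["u1"], vectors["p2"]))
```

Each step touches only one edge and a handful of sampled nodes, so the cost per pass grows with the number of edges rather than with the cube of the number of nodes, which is the efficiency argument made in the paragraph above.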

Besides embedding mapping, representation learning algorithms can be implemented in other ways, such as singular value decomposition and non-negative matrix factorization, but these methods are limited to two-dimensional relational networks and are also slow. The embodiments of the present invention take into account that in current application scenarios, whether insurance, finance, shopping or e-commerce, the collected big data is increasingly diverse, and the relational network obtained after processing with the technique of the embodiments is often not limited to a two-dimensional relational network but is in most cases a multidimensional relational network. The scale of the data is also often quite large. Choosing a representation learning algorithm based on embedding mapping therefore not only applies to both two-dimensional and multidimensional relational networks but also speeds up computation, greatly shortening computation time and responding quickly to application demands.

Specifically, a representation learning algorithm based on "embedding mapping" can realize representation learning by using a "morphism" from category theory to implement a structure-preserving, dimension-reducing "embedding". For the data in the relational network, a learning algorithm that preserves the structural information in the relational network yields the high-dimensional vector representations of the nodes.

S104: Obtain the application service request of the user.

When a user browses a web page, uses an app, or clicks a function button in an operation interface, an application service request may be triggered. This request can therefore be obtained in order to determine the algorithm that should subsequently be used.

S105: Determine the processing algorithm corresponding to the application service request.

S106: Determine the result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network.

Services at the application layer can be defined as tasks such as ranking, classification, clustering, prediction, association analysis and anomaly detection, each of which can be completed by a specific processing algorithm. Based on the high-dimensional vectors obtained by representation learning, the processing algorithm corresponding to the task (that is, the processing algorithm corresponding to the application service request) yields an accurate and efficient solution, which is returned to the user.
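The mapping from a request to its processing algorithm can be sketched in Python as a lookup table. Everything specific here is an assumption for illustration: the request format (a dictionary with service, query and args keys), the two example algorithms (similarity ranking and nearest-labeled-neighbor classification) and all names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(vectors, query, candidates):
    """Ranking task: order candidate nodes by similarity to the query node."""
    return sorted(candidates, key=lambda n: cosine(vectors[query], vectors[n]), reverse=True)


def classify(vectors, query, labeled):
    """Classification task: copy the label of the most similar labeled node."""
    nearest = max(labeled, key=lambda n: cosine(vectors[query], vectors[n]))
    return labeled[nearest]


# Pre-specified correspondence between service types and processing algorithms.
ALGORITHMS = {"ranking": rank, "classification": classify}


def handle_request(request, vectors):
    algorithm = ALGORITHMS.get(request["service"])
    if algorithm is None:
        raise ValueError(f"no processing algorithm for {request['service']!r}")
    return algorithm(vectors, request["query"], request["args"])
```

All algorithms read the same vector table, which is what allows one representation learning step to serve several application services.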

Specifically, the correspondence between application service requests and processing algorithms can be specified or obtained in advance. For example, when the application service request is a product recommendation, recommending products amounts to making a prediction: predicting the series of products that the user is most likely to buy. The processing algorithm computes the degree of similarity between the high-dimensional vector of the user node and the high-dimensional vectors of the product nodes. If the correspondence between this application service request and this processing algorithm has been specified or obtained in advance, then on receiving the request it can be determined that the corresponding processing algorithm is to compute the similarity between the high-dimensional vector of the user node and the high-dimensional vectors of the product nodes. Finally, the similarity is computed from these vectors, and the series of products with the highest similarity to the user is obtained; this is the result of the application service request.
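The product recommendation example can be sketched in Python as follows. Cosine similarity is used as an assumed similarity measure, and the "product:" name prefix used to pick out product nodes is a hypothetical naming convention; only the user vector and the product vectors take part in the computation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(user, vectors, top_k=3):
    """Return the top_k product nodes most similar to the given user node."""
    products = [n for n in vectors if n.startswith("product:")]
    products.sort(key=lambda n: cosine(vectors[user], vectors[n]), reverse=True)
    return products[:top_k]


vectors = {
    "user:u1": [1.0, 0.0],
    "user:u2": [1.0, 0.0],
    "product:p1": [0.9, 0.1],
    "product:p2": [0.0, 1.0],
    "product:p3": [0.5, 0.5],
}
print(recommend("user:u1", vectors, top_k=2))
```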

In the embodiment of the present invention, the pre-processed original big data is networked to obtain a relationship network comprising nodes and edges, and a representative learning algorithm based on embedded mapping is applied to the relationship network to obtain the high-dimensional vectors of its nodes. This achieves feature extraction from the original big data. The whole process does not rely on expert experience or human participation; it is completed automatically by the representative learning algorithm based on embedded mapping, so the computational efficiency is high. Unlike the prior art, the embodiment of the present invention also retains much of the effective information during feature extraction, which improves the accuracy of subsequent tasks such as classification or prediction. Further, because the features of the data are uniformly represented as high-dimensional vectors, a processing algorithm can be determined according to the application service request and applied to these features to determine the result of the request. The intelligent big data processing method of the embodiment of the present invention is therefore not limited to one specific application service and can provide a unified, effective processing method for multiple application services.

It should be noted that the intelligent big data processing method of the embodiment of the present invention can be applied not only to the insurance field but also to other fields, such as finance and shopping and consumption. It is particularly suitable for processing data that comprises both structured and unstructured data and for occasions that require processing data with multi-dimensional relations, where it has an obvious advantage over the prior art.

It should be noted that in S106, when the high-dimensional vectors of the nodes of the relationship network are used to determine the result of the application service request, the high-dimensional vectors of all nodes of the relationship network may be used, or only those of some of the nodes. Specifically, only the nodes related to the application service request may be used. For example, when the application service request is a product recommendation, the computation may use only the high-dimensional vectors of the product nodes and of the user nodes.

Optionally, in step S102, the networking of behavior data, text data or attribute data may be carried out in the following ways.

1. Networking the behavior data in the pre-processed original big data

Specifically, behavior data describes the relation between two or more data units. Networking behavior data means expressing this relation as an edge of the network and the data units as nodes of the network; for example, behaviors such as purchasing, unsubscribing or evaluating are expressed as edges. The network may be a two-dimensional relationship network or a multi-dimensional relationship network; correspondingly, a relation may be expressed as a two-dimensional edge or a high-dimensional edge when the behavior data is networked. A two-dimensional edge contains two nodes, and a high-dimensional edge contains more than two nodes.

For example, simple user behavior data can be expressed in the two-dimensional relation form "user-product". In addition, user behavior may carry rich context information; after the context information is converted into nodes, an n-ary relation graph can be formed, such as the three-dimensional relation graph "user-product-evaluation". Take Mr. Zhang's purchase of insurance product A as an example. After the purchase, Mr. Zhang evaluated insurance product A as follows: "Although the price is expensive, it is worth it." Networking this behavior data yields the behavior network shown in Fig. 2. In Fig. 2, "Mr. Zhang" and "insurance product A" are expressed as nodes of the behavior network, and the purchasing behavior forms the edge between these two nodes. In addition, the phrases or words "expensive" and "worth it" from the evaluation are expressed as nodes of the network; this part in fact belongs to the networking of text data and is explained in detail below. The result is the "user-product-evaluation" behavior network, that is, a three-dimensional relationship network.
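The Mr. Zhang example can be written down as a small data structure. The following Python sketch assumes one possible in-memory representation (a set of node ids plus a list of labelled edges whose member tuples may hold two or more nodes); the `add_edge` helper and the `user:`/`product:`/`phrase:` id prefixes are hypothetical and are not taken from the embodiment.

```python
nodes = set()
edges = []  # each edge: (relation label, tuple of member node ids)

def add_edge(label, *members):
    """Register the member nodes and record one edge connecting them."""
    nodes.update(members)
    edges.append((label, tuple(members)))

# Two-dimensional edge: the purchasing behavior.
add_edge("purchase", "user:Mr. Zhang", "product:insurance product A")

# High-dimensional edges: user-product-evaluation, one per evaluation phrase.
add_edge("evaluation", "user:Mr. Zhang", "product:insurance product A", "phrase:expensive")
add_edge("evaluation", "user:Mr. Zhang", "product:insurance product A", "phrase:worth it")

two_dimensional = [e for e in edges if len(e[1]) == 2]
high_dimensional = [e for e in edges if len(e[1]) > 2]
print(len(nodes), len(two_dimensional), len(high_dimensional))
```

Storing the members of an edge as a tuple lets the same structure hold both two-dimensional and high-dimensional edges.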

2. Networking the text data in the pre-processed original big data

Networking text data means expressing data units consisting of words or phrases as nodes of the network, so that the text is built into a relationship network with words or phrases as nodes. An edge between two such nodes describes how often the two words or phrases occur together in a sentence or document. For example, if the phrases "expensive" and "worth it" occur together in 3 sentences, they can serve as two nodes of the relationship network connected by an edge whose weight can be set to 3. If "expensive" and "very cheap" never occur together in a sentence, there is no edge between these two nodes. In addition, the edges that word or phrase nodes form with other nodes (such as users and products) belong to behavior data, which describes the relation between two or more data units.

Taking Mr. Zhang's purchase and evaluation of insurance product A as an example, text data such as the evaluation content can be structured, that is, subjected to word segmentation, phrase extraction, category labeling, sentiment analysis and the like, so that the natural language is expressed as a data structure that can be processed. Specifically, from "Although the price is expensive, it is worth it", "expensive" and "worth it" can be identified as the core words: "expensive" describes the product in terms of "price", and "worth it" reflects the user's positive purchasing attitude and emotion. When this text data is networked, "expensive" and "worth it" are therefore expressed as nodes of the network, and the edges these two nodes form with other nodes, such as the user and the product, belong to behavior data.

It follows that networking text data not only enables the analysis of unstructured data but also associates words and phrases with behavior data, so that useful information is retained.
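The co-occurrence weighting described for text data can be sketched in Python as follows. The sketch assumes that word segmentation and phrase extraction have already produced one list of phrases per sentence; the function name `cooccurrence_edges` and the sample sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences):
    """Count, for every pair of phrases, the number of sentences containing both.

    sentences: list of lists of already extracted words or phrases.
    Returns a Counter keyed by alphabetically ordered phrase pairs.
    """
    weights = Counter()
    for phrases in sentences:
        # set() so that a phrase repeated within one sentence counts once
        for a, b in combinations(sorted(set(phrases)), 2):
            weights[(a, b)] += 1
    return weights

sentences = [
    ["expensive", "worth it"],
    ["expensive", "worth it", "price"],
    ["worth it", "expensive"],
    ["very cheap", "price"],
]
edges = cooccurrence_edges(sentences)
print(edges[("expensive", "worth it")])
```

Pairs that never share a sentence, such as "expensive" and "very cheap" here, simply have no key in the result, which matches the absence of an edge.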

3. Networking the attribute data in the pre-processed original big data

Attribute data describes the relation between a data unit and its attributes. Networking attribute data means expressing this relation as an edge of the network and the data unit as a node of the network. Attribute data may be category information, such as health insurance or travel accident insurance, or numerical information, such as age or price. When attribute data is networked, category information can be expressed directly as nodes, while numerical information such as age and price is first divided into intervals and each interval is then expressed as a node.

For example, Mr. Zhang, aged 25, purchased an insurance product priced at 2000. In this example, an age interval containing 25 can be expressed as a node; for instance, people aged 24 to 30 can be expressed as the node "young adult". Likewise, a price interval containing 2000 can be expressed as a node; for instance, prices between 1000 and 5000 can be expressed as the node "entry-level insurance product". After this processing, the data is converted into the attribute networks "user-age level" and "product-price range".
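The interval-based conversion of numerical attributes into nodes can be sketched in Python. The interval bounds and labels follow the example above; treating both bounds as inclusive, and the names `bucket`, `AGE_BUCKETS` and `PRICE_BUCKETS`, are assumptions made for illustration.

```python
# (low, high, node label); bounds are treated as inclusive in this sketch.
AGE_BUCKETS = [(24, 30, "young adult")]
PRICE_BUCKETS = [(1000, 5000, "entry-level insurance product")]

def bucket(value, buckets):
    """Return the label of the first interval containing value, or None."""
    for low, high, label in buckets:
        if low <= value <= high:
            return label
    return None

# Attribute edges for the example: user-age level and product-price range.
attribute_edges = [
    ("user:Mr. Zhang", "age_level:" + bucket(25, AGE_BUCKETS)),
    ("product:insurance product A", "price_range:" + bucket(2000, PRICE_BUCKETS)),
]
print(attribute_edges)
```

A real deployment would need intervals covering the whole value range; the single interval per attribute here only mirrors the example.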

Optionally, after the pre-processed original big data has been networked, the nodes and edges of the relationship network can be stored and managed at scale in a regular format to facilitate subsequent feature extraction and use. Therefore, after S102, the method may further include:

S102’: Save the nodes and edges of the relationship network to a database.

For example, two tables can be stored in the database, one for the nodes and one for the edges of the relationship network. Each row of the node table holds the ID, name, query frequency and so on of a node. Each row of the edge table holds the ID of an edge, the IDs of the related nodes, the generation time and so on. After the pre-processed original big data has been networked, all of the data that existed before networking has in effect been turned into structured data. In practice, several data management technologies are available for structured data management (Structured Data Management), such as distributed storage, cloud databases, NoSQL databases (non-relational databases) and mobile databases. For example, BaseX, MongoDB and No2DB are three popular NoSQL databases developed in Java, C++ and C# respectively; MySQL and HBase are commonly used database software; and AllegroGraph, DEX, Neo4j and FlockDB are graph databases for storing network relations that rely on SPARQL, Java and Scala.
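The two-table layout can be sketched with Python's built-in `sqlite3` module. SQLite is used here only because it needs no server; the embodiment does not name it. The column names (`query_count`, `src`, `dst`, `created_at`) and the sample timestamp are hypothetical, and the edge table in this sketch holds only two-node edges; edges with more than two nodes would need, for example, an additional edge-membership table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, query_count INTEGER);
CREATE TABLE edges (id INTEGER PRIMARY KEY, src INTEGER, dst INTEGER, created_at TEXT);
""")

# One row per node: ID, name, query frequency.
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [(1, "Mr. Zhang", 0), (2, "insurance product A", 0)],
)
# One row per edge: edge ID, IDs of the related nodes, generation time (made up).
conn.execute("INSERT INTO edges VALUES (?, ?, ?, ?)", (1, 1, 2, "2016-01-01"))
conn.commit()

# Look up the names of the nodes connected to node 1.
neighbours = conn.execute(
    "SELECT n.name FROM edges e JOIN nodes n ON n.id = e.dst WHERE e.src = ?",
    (1,),
).fetchall()
print(neighbours)
```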

Optionally, when step S103 is implemented, the relationship network may include a semantic network as well as attribute networks and behavior networks, and these may be homogeneous networks, two-dimensional relationship networks or multi-dimensional relationship networks. Therefore, applying the representative learning algorithm based on embedded mapping to the relationship network to obtain the high-dimensional vectors of its nodes may include: performing embedded mapping on the multi-dimensional relationship network in the relationship network to obtain the high-dimensional vectors of the nodes of the multi-dimensional relationship network; or performing embedded mapping on the two-dimensional relationship network in the relationship network to obtain the high-dimensional vectors of the nodes of the two-dimensional relationship network; or performing embedded mapping on the semantic network in the relationship network to obtain the high-dimensional vectors of the nodes of the semantic network; or performing embedded mapping on the homogeneous network in the relationship network to obtain the high-dimensional vectors of the nodes of the homogeneous network.

1. Embedded mapping of the semantic network (Text Embedding)

Using embedded mapping, the nodes of the semantic network, which take the form of words and phrases, are expressed as high-dimensional vectors. After the mapping, the high-dimensional vectors of nodes representing closely related words or phrases are highly similar, so that closely related words and phrases have similar semantics.

Specifically, a word embedding method based on the Skip-gram model can be used, which learns the vector representations of words so as to predict neighboring words accurately. The most effective learning objective (that is, the objective function to be maximized) is: after a word in a sentence is hidden, the vector of the hidden word can be obtained as accurately as possible from the other neighboring words in the given sentence. In natural language, the words that can fill the vacancy left by the hidden word have similar semantics, so the embedded mapping makes the similarity of their vectors very high.

In brief, the objective function of the embedded mapping of the semantic network maximizes a conditional probability: given the vectors of the neighbor nodes (the connected nodes), the vector of the target node is predicted, so that nodes connected to the same given nodes have similar vectors. The method can be extended further to incorporate multiple elements such as words, phrases and phrase categories, achieving representative learning at the semantic level.

Select the scale c of the context of the training text, that is, the window size. With the current word w_t as input and the neighboring words as the output layer, the objective function maximized by the training model is:

where w_i denotes the i-th word in the text.

By maximizing this objective function, the vector representation w_i of each word is learned, so that, given the vector w_t and the position t, the vector predicted for position (t+j) is highly similar to the vector of the word actually at that position in the document (the probability is maximized). Words and phrases that occur in similar contexts thus acquire similar semantics, and the semantics of the words are retained.
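The formula for this objective function is not reproduced in the present text. As a reference point only, and on the assumption that the embodiment follows the standard Skip-gram formulation, which matches the description of window size c, input word w_t and neighboring words w_(t+j), the objective would read as follows, where T (the number of words in the training text) is a symbol introduced here for illustration:

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right)
```

In this form, maximizing the sum raises the probability of each word that actually occurs within c positions of w_t.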

For example, the neighboring words "today", "noon" and "ate" appear in the semantic network; they may come from the text messages "ate rice at noon today" and "ate plain rice at noon today" in the original big data. With the method of the embodiment of the present invention, the vectors of "plain rice" and "rice" are w_t, and the vectors of "today", "noon" and "ate" are w_(t+j), that is, w_(t-3), w_(t-2) and w_(t-1). The representative learning algorithm based on embedded mapping yields highly similar vectors for "plain rice" and "rice", meaning that these two words or phrases have similar semantics. In the prior art, by contrast, "plain rice" and "rice" are treated as different because they are two different words, so the semantic information cannot be retained.

2. Embedded mapping of the two-dimensional relationship network (Bipartite Network Embedding)

A two-dimensional relationship network is a network in which every edge corresponds to two nodes and the nodes fall into only two classes; "user-product" is one such two-dimensional relationship network.

Embedded mapping of the two-dimensional relationship network means using embedded mapping to express the nodes (such as user, product, age level and price level nodes) of behavior networks and attribute networks that have two-dimensional relations (such as user-product, user-age and product-price) as high-dimensional vectors.

As with the embedded mapping of the semantic network, the objective function of the embedded mapping of the two-dimensional relationship network maximizes a conditional probability: given the vectors of the neighbor nodes (the connected nodes), the vector of the target node is predicted, so that the nodes v_i connected to the same given node v_j have similar vectors.

Assume that the two-dimensional relationship network contains class A nodes and class B nodes. By maximizing this objective function, given a class B node v_j, the vector derived for the nodes connected to v_j is similar to the vector of the class A node v_i; that is, the conditional probability is maximized.

The conditional probability that the class B node v_j generates the class A node v_i can be defined as:

where u_i is the high-dimensional vector of v_i and u_j is the high-dimensional vector of v_j.
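The formula for this conditional probability is not reproduced in the present text. A softmax over scalar products is the form commonly used for bipartite network embedding and is consistent with the definitions of u_i and u_j above; it is given here as an assumption rather than as the formula of the embodiment, with k ranging over all class A nodes:

```latex
p\left(v_i \mid v_j\right) = \frac{\exp\left(u_i^{\top} u_j\right)}{\sum_{k \in A} \exp\left(u_k^{\top} u_j\right)}
```

Under this form, a class A node whose vector has a large scalar product with u_j receives a high probability of being generated by v_j.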

Take the two-dimensional relationship network "user-product" as an example, and assume that class A nodes represent users and class B nodes represent products. In this way, given a certain product, it is possible to predict which users may buy it, or in other words to compute the probability that a user buys the product.

For example, suppose that after the data is networked the two-dimensional relationship network is: user A-product C, user A-product D, user B-product C. The objective is then: given the "product C" node, adjust (learn) the vectors of the "user A" and "user B" nodes so that the vector of the nodes connected to "product C" is similar both to the vector of "user A" and to the vector of "user B"; as a result, the vectors of "user A" and "user B" become similar to each other. In this way the structural information of the network is preserved, which greatly improves the accuracy of solving the corresponding problems later.

3. Embedded mapping of the multi-dimensional relationship network (Tensor Network Embedding)

A multi-dimensional relationship network is a network in which some edges correspond to three nodes; the "user-product-evaluation" network shown in Fig. 2 is a multi-dimensional relationship network. Multi-dimensional relations (High-order Relation) are also common in data. For example, an evaluation behavior involves a user, a product and the evaluation text at the same time, so a tensor rather than a matrix, and a ternary relation rather than a simple bipartite graph, is needed to represent such behavior data.

Embedded mapping of the multi-dimensional relationship network means using embedded mapping to express the nodes of behavior networks and attribute networks that have multi-dimensional relations (such as user-product-evaluation) as high-dimensional vectors.

As with the embedded mapping of the semantic network, the objective function of the embedded mapping of the multi-dimensional relationship network maximizes a conditional probability: given the vectors of the neighbor nodes (the connected nodes), the vector of the target node is predicted, so that nodes connected to the same given nodes have similar vectors.

Implementing the embedded mapping of the multi-dimensional relationship network requires updating the objective function, and there are two ways of doing so. In the first, each time an n-ary relation is sampled, the vector representations of the associated nodes are updated, and the objective function to be maximized is as follows:

where S is the set of nodes, A_j is the set of multi-dimensional relations associated with node j, r_(m/j) is one of these multi-dimensional relations, m is the index of the relation, λ_(m/j) is the weight of the relation, and P₁ is the probability of the associated node given the relation. For each node j, L₁ maximizes the pairwise vector similarity between the nodes of the multi-dimensional relations associated with j.

In the second, each sampled n-ary relation is split into several binary relations and the vector representations of the associated nodes are updated; the objective function to be maximized is as follows:

where the relation set is the set of all two-dimensional relations obtained by splitting the multi-dimensional relations into two-dimensional relations, r_m is the m-th two-dimensional relation, λ_m is the weight of the m-th two-dimensional relation, and P₂ is the probability of the associated node given the relation. For each two-dimensional relation obtained by splitting, L₂ maximizes the vector similarity between the two nodes of the relation.

As an example, assume that the multi-dimensional relationship network obtained by networking the data is: user A-product C-purchase place E, user A-product C-purchase place F, user B-product C-purchase place E.

The objective is then: given the "product C" node and the "purchase place E" node, make the vectors of the nodes associated with them (connected through an edge) similar, so that the vector of the "user A" node and the vector of the "user B" node are similar. Of course, every piece of given information can be traversed in the same way; for example, given the "user A" node and the "product C" node, the vectors of the "purchase place E" node and the "purchase place F" node are made similar.

If the objective function L₁ is maximized, the other nodes of a relation (for example product C and purchase place E) are given, and the hidden node (for example the user node) is learned.

If the objective function L₂ is maximized, the multi-dimensional relations are split into 9 two-dimensional relations, such as A-C, C-E, A-E, A-C, A-F and C-F, and the embedded mapping of the two-dimensional relationship network is then applied.
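The splitting step used with L₂ can be illustrated in Python. The sketch takes the three ternary relations of the example and splits each into all of its node pairs; the function name `split_into_pairs` is hypothetical, and splitting into every pair is an assumption that agrees with the count of 9 given above.

```python
from itertools import combinations

# The three multi-dimensional relations from the example.
relations = [
    ("user A", "product C", "purchase place E"),
    ("user A", "product C", "purchase place F"),
    ("user B", "product C", "purchase place E"),
]

def split_into_pairs(relation):
    """Split one n-ary relation into all of its two-node relations."""
    return list(combinations(relation, 2))

pairs = [pair for relation in relations for pair in split_into_pairs(relation)]
distinct_pairs = sorted(set(pairs))

print(len(pairs))           # two-dimensional relations, counted with repeats
print(len(distinct_pairs))  # distinct two-dimensional relations
```

Each ternary relation contributes three pairs, giving 9 in total; pairs such as user A-product C occur more than once, which could be reflected in the weight λ_m of the corresponding two-dimensional relation.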

From the above embedded mappings of the semantic network, the two-dimensional relationship network and the multi-dimensional relationship network, it can be seen that by applying the representative learning algorithm based on embedded mapping to the relationship network, the nodes of the relationship network are uniformly represented by vectors of high dimension, each dimension of which represents a feature of the node, thereby achieving feature extraction from the original big data. Because the high-dimensional vectors also retain the structural information of the original big data, such as semantic information and purchasing behavior information, the accuracy of tasks such as classification and prediction for later application services is greatly improved. Moreover, the representative learning algorithm based on embedded mapping in the embodiment of the present invention can also be applied to data with multi-dimensional relations, is suitable for a variety of complex application environments, and computes quickly, so it can respond rapidly to application demands.

Optionally, when steps S105-S106 are implemented, the application service request can be converted into a task such as ranking, classification, clustering, prediction, association analysis or anomaly detection. Each of these tasks can be completed with a specific processing algorithm, and the correspondence between the tasks and the processing algorithms (that is, the correspondence between application service requests and processing algorithms) can be specified in advance or obtained, so that when an application service request is received, it is known which processing algorithm to use. For a better understanding of which processing algorithm corresponds to each task and how the task is completed with it, the related content is described in detail below.

1. Ranking task

A ranking task is usually based on a specific similarity measure and generally involves computing the similarity between nodes of the relationship network, for example the Pearson correlation (Pearson Correlation) or the cosine similarity (Cosine Similarity).

For example, when the problem to be solved by the application service request is, given a certain product, to list the products most similar to it in terms of being purchased, the problem can be converted into a ranking task.

Processing algorithm: the high-dimensional vector u_i of the product node can be found among the high-dimensional vectors obtained by executing S101-S103, and the problem becomes finding the series of product nodes most similar to u_i. Because each product node is represented by a high-dimensional vector, usually of K dimensions (K is usually a number between 200 and 500), the similarity between nodes can be obtained by taking the scalar product of their vectors. The problem is thus finally converted into finding the series of vectors with the largest scalar product with u_i. This algorithm accomplishes the ranking task, or in other words yields the result of the application service request.
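Ranking by scalar product can be sketched in a few lines of Python. The product ids and the three-dimensional toy vectors are hypothetical stand-ins for learned K-dimensional vectors, and `rank_similar` is an illustrative name.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_similar(target_id, vectors, top_n=3):
    """Return the ids of the top_n nodes with the largest scalar product with the target."""
    target = vectors[target_id]
    scored = [(dot(target, vec), node)
              for node, vec in vectors.items() if node != target_id]
    scored.sort(reverse=True)  # largest scalar product first
    return [node for _, node in scored[:top_n]]

# Toy product vectors (3 dimensions instead of the usual 200 to 500).
product_vectors = {
    "P1": [1.0, 0.0, 2.0],
    "P2": [0.9, 0.1, 1.8],
    "P3": [0.0, 1.0, 0.0],
    "P4": [0.5, 0.5, 0.5],
}
top = rank_similar("P1", product_vectors, top_n=2)
print(top)
```

With many products, the full sort could be replaced by a partial selection such as `heapq.nlargest`, since only the top few entries are needed.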

2. Classification task

Classification tasks include binary classification and multi-class classification. Supervised learning algorithms such as SVM (Support Vector Machine) and logistic regression (Logistic Regression) can solve classification tasks effectively.

For example, the problem to be solved by the application service request may be: given a large number of users, determine the user category from information such as age level and income range. In practice, however, information is often missing from the data, and classifying users whose age, income or other information is unknown into the correct age level and income range is an important problem. This problem can be converted into a classification task.

Processing algorithm: the high-dimensional vectors of nodes such as users, age levels and income ranges are obtained by representative learning. It is then only necessary to compute the similarity between the high-dimensional vector of the user node and that of each age level node, and between the high-dimensional vector of the user node and that of each income range node, and to choose the age level node and the income range node whose high-dimensional vectors are most similar to that of the user node. The user is thereby classified into the correct age level and income range.
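Choosing the most similar attribute node can be sketched in Python as follows. Cosine similarity is assumed as the similarity measure, and the age level labels and two-dimensional toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(user_vec, class_vecs):
    """Return the class node whose vector is most similar to the user vector."""
    return max(class_vecs, key=lambda c: cosine(user_vec, class_vecs[c]))

# Toy vectors for three age level nodes and one user with unknown age.
age_levels = {"18-23": [1.0, 0.0], "24-30": [0.7, 0.7], "31-40": [0.0, 1.0]}
user_vec = [0.6, 0.8]
predicted_age_level = classify(user_vec, age_levels)
print(predicted_age_level)
```

The same `classify` call could be repeated with the income range nodes to fill in the second missing attribute.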

3. Clustering task

Clustering tasks are usually completed with unsupervised learning algorithms such as nearest neighbor and spectral clustering.

For example, the problem to be solved by the application service request may be: given a large number of users whose categories are unknown, group the users into K classes according to their purchasing habits, so that the same strategy can be formulated for users of the same class. This problem can be converted into a clustering task.

Processing algorithm: clustering can be carried out quickly on the high-dimensional feature representations of the users with an algorithm such as K-means or KNN. The usual difficulty of a clustering problem is reducing the dimension of the structured information, which can be as high as the number of users, that is, the number of nodes N; embedded mapping successfully reduces this dimension to K.
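Clustering the user vectors can be sketched with a plain K-means loop in Python. The evenly spaced initial centroids, the fixed number of iterations and the toy two-dimensional vectors are simplifying assumptions for illustration; a production system would more likely use a library implementation with a better initialization.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain K-means; returns one cluster index per point."""
    # Simplistic deterministic initialization: evenly spaced input points.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k),
                                key=lambda c: squared_distance(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Toy user vectors forming two obvious groups.
users = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans(users, k=2)
print(labels)
```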

4. Prediction task

A prediction task generally uses matrix factorization (Matrix Factorization) or tensor factorization (Tensor Factorization) to fill in a matrix or a high-dimensional tensor and thereby predict the missing values (Missing Value) in the data.

For example, the problem to be solved by the application service request may be: predict whether a certain user will buy a certain product in the future. In fact, a recommendation problem can be converted into a prediction problem, that is, providing the series of products that the user is predicted to be most likely to buy.

Processing algorithm: the high-dimensional vector of the given user node and the high-dimensional vectors of the product nodes can be obtained by the method of the embodiment of the present invention. By computing the similarity between the high-dimensional vector of the given user node and those of the product nodes, the products most similar to the user node can be recommended to the given user.

5. Association analysis (Correlation Analysis) task

The problem to be solved by the application service request may be: judge whether the age level and income range of users are related to the price range of products.

Processing algorithm: the high-dimensional vectors of the age level nodes, income range nodes and price range nodes can be obtained by the method of the embodiment of the present invention. By quickly computing the similarity between them, the association between different user attributes (the age level and income of users) and product attributes (the price range of products), and the strength of that association, can be determined.

6. Anomaly detection (Outlier Detection) task

The problem to be solved by the application service request may be: judge whether a certain user is an abnormal user within the user group to which the user belongs, such as a fraudulent user.

Processing algorithm: the high-dimensional vectors of all user nodes can be obtained by the method of the embodiment of the present invention. The similarity between the high-dimensional vector of the current user node and the high-dimensional vectors of the other user nodes is then computed; if the difference is very large, that is, the vector of the current user deviates strongly from those of the other users, the current user can be considered an abnormal user.
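A similarity-based check for abnormal users can be sketched in Python. The sketch assumes the usual reading of outlier detection, namely that a user whose vector has a low average similarity to the other users' vectors is abnormal; the threshold of 0.5, the function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_similarity(user_id, vectors):
    """Average cosine similarity between one user and all other users."""
    target = vectors[user_id]
    sims = [cosine(target, vec) for other, vec in vectors.items() if other != user_id]
    return sum(sims) / len(sims)

def find_outliers(vectors, threshold=0.5):
    """Flag users whose average similarity to the rest falls below the threshold."""
    return [u for u in vectors if mean_similarity(u, vectors) < threshold]

# Three users with similar vectors and one that points elsewhere.
user_vectors = {
    "u1": [1.0, 0.1],
    "u2": [0.9, 0.2],
    "u3": [1.0, 0.0],
    "u4": [-0.1, 1.0],
}
outliers = find_outliers(user_vectors)
print(outliers)
```

The threshold would in practice have to be tuned on the user group in question, since typical similarity levels depend on the learned vectors.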

Optionally, after steps S101-S103 have been executed, that is, after data mining has been performed on the original big data and the data features uniformly represented by high-dimensional vectors have been obtained, if the original big data is updated, steps S101-S103 may be executed only on the updated data; there is no need to execute S101-S103 again on all of the data.

Optionally, steps S101-S103 may be executed on the new data as soon as the data is updated, so as to mine the new data; or they may be executed only once a certain amount of new data has accumulated; or they may be executed on the new data periodically.

The embodiment of the present invention further provides an intelligent big data processing method. As shown in Fig. 3, the method includes:

S301: Obtain the application service request of a user and the high-dimensional vectors of the nodes of a relationship network transformed from original big data.

In the embodiment of the present invention, the features represented by high-dimensional vectors can be obtained directly, so there is no need to perform feature mining on the original big data. The feature mining of the original big data may be completed on another device, which is not limited by the embodiment of the present invention. For the feature mining process, reference may be made to S101-S103, which is not repeated here.

S302: Determine the processing algorithm corresponding to the application service request.

S303: Determine the result of the application service request by using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relationship network.

For the specific implementation of S302 and S303, reference may be made to S105-S106.

In the embodiment of the present invention, the features represented by high-dimensional vectors are obtained directly, and the result of the application service request is determined by using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relationship network. The intelligent big data processing method of the embodiment of the present invention is not limited to one specific application service and can provide a unified, effective processing method for multiple application services.

Corresponding to the method embodiment shown in Fig. 1, the present invention further provides an intelligent big data processing system which, as shown in Fig. 4, includes a data structured module 401, a representative learning module 402 and an application algorithm module 403.

The data structured module 401 is configured to pre-process original big data and to network the pre-processed original big data to obtain a relationship network comprising nodes and edges. The nodes of the relationship network are transformed from the data units in the pre-processed original big data, and the edges of the relationship network represent the relations between nodes. By networking the original big data, the large volumes of data held in tables can be converted into a relationship network, so that the data can be processed uniformly in terms of nodes and edges, which greatly reduces the cost of data storage and management. Secondly, text data such as the words and phrases in the pre-processed original big data is networked to construct a semantic network, which retains the semantic information of the text so that it can be used effectively later to improve the accuracy of the application services. In addition, once the pre-processed original big data has been expressed as a relationship network comprising nodes and edges, representative learning algorithms for network data can be used to achieve fast, unified feature extraction from the data and thus a rapid response to different application service requests.

The representation learning module 402 is configured to apply a representation learning algorithm based on embedding mapping to the relational network to obtain the high-dimensional vectors of the nodes of the relational network. In doing so, the representation learning module 402 represents the nodes of the relational network, such as users, products and phrases, uniformly as vectors of relatively high dimension: each vector represents one node of the relational network, and each dimension of the vector expresses one feature of that node. The relationships between nodes of the relational network (in other words, the edges) are converted into similarities between the high-dimensional vectors of the nodes, which retains the structural information in the original big data and greatly improves the accuracy of later tasks, such as classification and prediction, that correspond to application services.

The application algorithm module 403 is configured to obtain a user's application service request, determine the processing algorithm corresponding to the application service request, and determine the result of the application service request using that processing algorithm and the high-dimensional vectors of the nodes of the relational network obtained by the representation learning module 402. That is, after the representation learning module 402 has expressed the features in the big data uniformly as high-dimensional vectors, the application algorithm module 403 can use these uniformly represented features to provide solutions for various application services, in other words to return the results of the problems that the application services need to solve.
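
The dispatch performed by the application algorithm module can be sketched as follows. This is a minimal illustration only, not the implementation of the embodiment: the registry `ALGORITHMS`, the decorator `register`, the function `handle_request`, the request layout and the toy algorithm `count_nodes` are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

Vector = List[float]
Algorithm = Callable[[Dict[str, Vector], dict], object]

# Hypothetical registry: maps a request type to its processing algorithm.
ALGORITHMS: Dict[str, Algorithm] = {}


def register(request_type: str):
    """Register a processing algorithm under a request type."""
    def decorator(func: Algorithm) -> Algorithm:
        ALGORITHMS[request_type] = func
        return func
    return decorator


@register("count_nodes")
def count_nodes(vectors, params):
    """Toy algorithm (assumption): count nodes whose name starts with a prefix."""
    prefix = params.get("prefix", "")
    return sum(1 for name in vectors if name.startswith(prefix))


def handle_request(request: dict, vectors: Dict[str, Vector]):
    """Determine the algorithm for the request and run it on the node vectors."""
    algorithm = ALGORITHMS.get(request["type"])
    if algorithm is None:
        raise ValueError(f"no processing algorithm for {request['type']!r}")
    return algorithm(vectors, request.get("params", {}))


# Node vectors as they might be handed over by the representation learning module.
vectors = {"user:1": [0.1, 0.2], "user:2": [0.3, 0.1], "product:A": [0.5, 0.4]}
result = handle_request({"type": "count_nodes", "params": {"prefix": "user:"}}, vectors)
```

Because every algorithm receives the same dictionary of node vectors, a new application service can be added by registering one more function, without changing the feature extraction.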

In this embodiment, the data structuring module 401 preprocesses and networks the original big data, so that the representation learning module 402 can apply a representation learning algorithm for network data to extract features from the data quickly and uniformly. The application algorithm module 403 can determine the corresponding processing algorithm according to the user's application service request, perform the calculation using the vector-form features extracted by the representation learning module 402, and return the processing result to the user. Unlike the prior art, in this embodiment the whole feature extraction process requires no human participation and is completed automatically by the representation learning algorithm based on embedding mapping, so computational efficiency is high. The structural information (that is, the effective information) in the original big data is also largely retained during feature extraction, which improves the accuracy of tasks such as classification or prediction. Moreover, because a representation learning algorithm based on embedding mapping is used, the data features mined from the original big data can be expressed uniformly as high-dimensional vectors, so the system in this embodiment is not limited to one specific application service and can provide a unified, effective processing method for multiple application services.

Optionally, the relational network may include a semantic network, and may also include an attribute network and a behavior network. Each of these may be a homogeneous relational network, a two-dimensional relational network or a multi-dimensional relational network. Accordingly, the representation learning module 402 may be specifically configured to perform embedding mapping on a multi-dimensional relational network in the relational network to obtain the high-dimensional vectors of the nodes of the multi-dimensional relational network; or to perform embedding mapping on a two-dimensional relational network in the relational network to obtain the high-dimensional vectors of the nodes of the two-dimensional relational network; or to perform embedding mapping on a semantic network in the relational network to obtain the high-dimensional vectors of the nodes of the semantic network; or to perform embedding mapping on a homogeneous network in the relational network to obtain the high-dimensional vectors of the nodes of the homogeneous network.

Optionally, in this embodiment, the original big data may be collected from various websites or apps, and may include structured data such as behavior data and attribute data as well as unstructured data such as text data; this embodiment does not limit this.

The data structuring module 401 preprocesses the original big data by performing data analysis and cleaning on it: it performs statistical analysis on the original big data and removes data content that violates the rules or is erroneous. For example, it may filter out invalid data formats, such as a price or other numeric value that should be a floating-point number but was filled in as a string; it may also unify times or units, fill in missing values, or smooth noisy data. In this way the format of the big data can be standardized, abnormal data removed, errors corrected and duplicate data removed.

Optionally, the preprocessed original big data may include behavior data, attribute data and text data, in which case the networking of the preprocessed original big data by the data structuring module may include: networking the behavior data in the preprocessed original big data, for example converting behavior data such as purchases and reviews into a behavior network; or networking the attribute data in the preprocessed original big data, for example converting attribute information such as age and price into an attribute network; or networking the text data in the preprocessed original big data, for example converting text data such as product introductions or review content into a semantic network whose nodes are words and phrases. The behavior network, the attribute network and the semantic network together make up the relational network.

Optionally, when the application algorithm module 403 uses the high-dimensional vectors of the nodes of the relational network to determine the result of the application service request, it may use the high-dimensional vectors of all nodes of the relational network, or only those of some of the nodes. Specifically, it may use only the nodes related to the application service request. For example, when the application service request is product recommendation, only the high-dimensional vectors of product nodes and of user nodes may be used in the calculation.

It should be noted that, in this embodiment, for the specific implementation of each module, reference may be made to the description of the method embodiment. For example, for how the representation learning algorithm based on embedding mapping is carried out, reference may be made to the description of the method embodiment; details are not repeated here.

The system described in this embodiment may be implemented in the form of software or a program on one or more computers or servers; this embodiment does not limit this.

For a better understanding of the embodiments of the present invention, the intelligent big data processing system described in the embodiments is described in detail below, taking its application to insurance as an example.

When a user, on a personal computer (PC) or mobile terminal, completes personal information, views insurance terms, purchases insurance, cancels insurance or establishes social relationships, the server can collect information about these operations to form the original big data, which can be stored in a database in table form. The system described in this embodiment can obtain this original big data.

For example, by collecting operation information, the database may hold a user personal information table as shown in Table 1, a product information table as shown in Table 2, an insurance purchase behavior table as shown in Table 3 and an insurance cancellation behavior table as shown in Table 4.

Table 1 User personal information table

Table 2 Product information table

Insurance name	Category	Price	Insurance company	Product introduction	……
Insurance A	Vehicle insurance	……	……	Low premium, convenient claims settlement	……
Insurance B	Life insurance	……	……	Whole-life insurance, wide range of issue ages	……
Insurance C	Health insurance	……	……	High payout for major diseases	……
……	……	……	……	……	……

Table 3 Insurance purchase behavior table

ID	Insurance name	Purchase location (GPS)	Purchase amount	User review
User 1	Insurance A	XX company	……	Convenient to buy :)
User 2	Insurance C	XX enterprise	……	Always away from home, so insurance is a must
User 3	Insurance B	1.765	……	……
User 4	Insurance A	XX road	……	Adding insurance for my beloved car!
User 5	Insurance B	XX residential community	……	……
……	……	……	……	……

Table 4 Insurance cancellation behavior table

ID	Insurance name	Cancellation location (GPS)	Cancellation amount	Cancellation reason
User 3	Insurance B	XX street (home)	……	……
……	……	……	……	……

First, the data structuring module in the system may perform data analysis and cleaning on the above data. Take the data in the insurance purchase behavior table shown in Table 3 as an example. Data analysis means obtaining more information through statistics on and association of the data; for example, the data structuring module may supplement the geographic location information with information such as "workplace", "home" or "near a sales outlet". Data cleaning means removing illegal values, or even removing an entire illegal record. For example, when the "purchase location" is a real number, the data structuring module may mask this value; when the "ID" or "insurance name" value of a record in Table 3 is illegal, the data structuring module may remove that purchase record. Table 5 shows the data of Table 3 after data analysis and cleaning by the data structuring module.

Table 5 Insurance purchase behavior table after data analysis and cleaning

ID	Insurance name	Purchase location (GPS)	Purchase amount	User review
User 1	Insurance A	XX company [workplace]	……	Convenient to buy :)
User 2	Insurance C	XX enterprise [workplace]	……	Always away from home, so insurance is a must
User 3	Insurance B	[missing]	……	……
User 4	Insurance A	XX road [near a sales outlet]	……	Adding insurance for my beloved car!
User 5	Insurance B	XX residential community [home]	……	……
……	……	……	……	……
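
The analysis and cleaning step that turns the purchase records into the cleaned table can be sketched as follows. This is an illustrative sketch under stated assumptions: the record layout, the field names, the lookup `LOCATION_TAGS` and the rule that an empty key field counts as illegal are hypothetical choices made for the example, not details given in the embodiment.

```python
import re

# Hypothetical raw purchase records, loosely modelled on the purchase behavior table.
records = [
    {"user_id": "User 1", "product": "Insurance A", "location": "XX company"},
    {"user_id": "User 3", "product": "Insurance B", "location": "1.765"},
    {"user_id": "", "product": "Insurance A", "location": "XX road"},
]

# Assumed lookup that associates a location with extra geographic information.
LOCATION_TAGS = {"XX company": "workplace", "XX road": "near a sales outlet"}

# A location that is nothing but a real number is treated as an illegal value.
NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def clean(records):
    """Drop illegal records, mask illegal values, and annotate locations."""
    cleaned = []
    for rec in records:
        # Remove the whole record when a key field is illegal (here: empty).
        if not rec["user_id"] or not rec["product"]:
            continue
        rec = dict(rec)  # do not modify the caller's data
        if NUMERIC.match(rec["location"]):
            rec["location"] = None       # masked, treated as missing
            rec["location_tag"] = None
        else:
            rec["location_tag"] = LOCATION_TAGS.get(rec["location"])
        cleaned.append(rec)
    return cleaned


cleaned = clean(records)
```

In this toy run the record with the empty ID is removed, the numeric location of User 3 is masked, and the location of User 1 is annotated as a workplace.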

Next, the original big data that has undergone data analysis and cleaning can be networked to obtain a relational network comprising nodes and edges. As the tables above show, the original big data contains a large amount of text information, so the data structuring module can network the text data to obtain nodes made up of phrases or words and the edges between those nodes, that is, a semantic network comprising nodes and edges. The representation learning module can subsequently apply a representation learning method to this semantic network to learn the semantic information in it. For example, a word segmentation tool can be used to extract the text data in Tables 1 to 4 after data analysis and cleaning, yielding text data in "document-phrase" form as shown in Table 6, in which each phrase can be represented as one node of the semantic network. If two phrase nodes occur together in a sentence or document, an edge may connect them, and the weight of the edge is determined by how often they occur together in sentences or documents. For example, an edge connects the "travel" node and the "business trip" node, and an edge connects the "business trip" node and the "overworked" node.

Table 6

usually	work	very busy	business trip
business trip	often	overworked
poor health	divorced	has a son
hobby	travel	business trip
poor health
low premium	claims settlement	convenient
whole-life insurance	issue age	wide range
major disease	payout	high
purchase	convenient
often away from home	should buy	insurance
beloved car	add insurance
price	too high	not suitable
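
The co-occurrence rule for building the semantic network from "document-phrase" data can be sketched as follows. This is a minimal sketch: the toy documents are hypothetical, and counting each unordered phrase pair once per document is one possible reading of "the frequency with which they occur together".

```python
from collections import Counter
from itertools import combinations

# Hypothetical segmented documents: one list of phrases per document.
documents = [
    ["work", "busy", "business trip"],
    ["business trip", "overworked"],
    ["hobby", "travel", "business trip"],
    ["business trip", "overworked", "poor health"],
]


def build_semantic_network(documents):
    """Return {(phrase_a, phrase_b): weight}, weight = number of shared documents."""
    edges = Counter()
    for phrases in documents:
        # Every unordered pair of distinct phrases in a document forms an edge.
        for a, b in combinations(sorted(set(phrases)), 2):
            edges[(a, b)] += 1
    return edges


edges = build_semantic_network(documents)
nodes = {phrase for pair in edges for phrase in pair}
```

Sorting the phrases before pairing gives each edge a single canonical key, so the network is undirected; sentence-level windows or directed edges would be equally valid variations.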

In addition, the contents of the tables can be networked and converted into relational networks. For example, the contents of Table 1 are converted into multiple two-dimensional relational networks such as "user ID-gender", "user ID-age bracket", "user ID-occupation" and "user ID-self-introduction phrase"; the contents of Table 2 are converted into multiple two-dimensional relational networks such as "insurance name-category", "insurance name-price range", "insurance name-insurance company" and "insurance name-product introduction phrase"; the contents of Table 3 are converted into the multi-dimensional relational network "user ID-insurance name-purchase location-amount range-review phrase"; and the contents of Table 4 are converted into the multi-dimensional relational network "user ID-insurance name-cancellation location-amount range-cancellation reason phrase".
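
The conversion of a table into several two-dimensional relational networks can be sketched as follows. This is an illustrative sketch only: the rows, the field names, the helper `age_bracket` and the ten-year bracket width are assumptions made for the example, since the contents of the user table are not reproduced in this document.

```python
# Hypothetical rows modelled on a user information table; field names are assumptions.
users = [
    {"user_id": "User 1", "gender": "F", "age": 34, "occupation": "engineer"},
    {"user_id": "User 2", "gender": "M", "age": 52, "occupation": "driver"},
]


def age_bracket(age: int) -> str:
    """Discretise a numeric attribute so that it can serve as a node."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def to_two_dimensional_networks(rows, key, columns, transforms=None):
    """Build one edge list per attribute column: (key node, attribute node)."""
    transforms = transforms or {}
    networks = {}
    for column in columns:
        convert = transforms.get(column, str)
        networks[f"{key}-{column}"] = [
            (f"{key}:{row[key]}", f"{column}:{convert(row[column])}") for row in rows
        ]
    return networks


networks = to_two_dimensional_networks(
    users, "user_id", ["gender", "age", "occupation"], {"age": age_bracket}
)
```

Prefixing each node with its column name keeps nodes of different categories distinct, while identical key nodes such as `user_id:User 1` recur across the edge lists, which is what later allows the networks to be fused.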

The relational network finally formed contains both the semantic network described above and the multiple multi-dimensional and two-dimensional relational networks converted from the contents of Tables 1 to 4; these multi-dimensional and two-dimensional relational networks include both attribute networks and behavior networks. The nodes of the relational network are user IDs, user attributes, product attributes, locations, phrases and the like, and its edges are the interactions or relationships between them.

It should be noted that nodes are allowed to overlap in the relational network: the two-dimensional and multi-dimensional relational networks described above can be fused, through shared nodes such as "user ID", "insurance name" and "phrase", into a relational network containing nodes of multiple categories, that is, a multi-source heterogeneous network. After the data structuring module has converted the original big data into the relational network, the representation learning module can perform representation learning on the data in the relational network. Assuming that the number of dimensions of a high-dimensional vector is K (K usually takes a value between 200 and 500), the result of representation learning is that the nodes in the relational network (such as phrase nodes, user nodes, user attribute nodes and product nodes) are each represented as a high-dimensional vector, and the high-dimensional vectors retain the association relationships (that is, the edges) of the relational network.

As analyzed above, in this embodiment the relational network includes a semantic network, two-dimensional relational networks and multi-dimensional relational networks. The representation learning module can apply the representation learning algorithm based on embedding mapping to the semantic network. Specifically, nodes of the semantic network such as "travel", "business trip", "overworked", "poor health" and "major disease" are represented as high-dimensional vectors, for example u=[u₁,u₂,…,u_K]. Through the representation learning algorithm it can be found that the vector of the "travel" node is similar to the vector of the "business trip" node, the vector of the "business trip" node is similar to the vector of the "overworked" node, and the vector of the "overworked" node, the vector of the "poor health" node and the vector of the "major disease" node are similar to one another. The structural information of the data in the network is thereby retained.

The representation learning module can apply the representation learning algorithm based on embedding mapping to the two-dimensional relational networks. The representation learning result for the two-dimensional relational networks may be as follows: nodes such as "user ID", "user attribute", "product ID (insurance name)" and "product attribute" are represented as high-dimensional vectors; user nodes with similar attributes are given highly similar vectors, and product nodes with similar attributes are given highly similar vectors, which retains the structural information in the relational network. As a result, the user nodes of users who travel frequently have similar vectors, as do product nodes of the same category.

The representation learning module can apply the representation learning algorithm based on embedding mapping to the multi-dimensional relational networks. The representation learning result for the multi-dimensional relational networks may be as follows: nodes such as "user", "insurance name", "location" and "review phrase" are represented as high-dimensional vectors, so that user nodes with similar purchase and cancellation habits have highly similar vectors, and product nodes whose purchasing and cancelling users are similar have highly similar vectors, which retains the structural information in the relational network.

In this embodiment, the representation learning module can be implemented on the basis of embedding mapping (Embedding) combined with Skip-gram and Negative Sampling, which keeps the computational complexity of the algorithm low and makes the algorithm highly scalable.
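
A toy version of Skip-gram with Negative Sampling applied to node pairs can be sketched as follows. This is a simplified illustration, not the algorithm of the embodiment: the function `train_sgns`, the uniform negative sampler, the hyperparameters and the six-node network are assumptions, and practical implementations typically sample negatives from a frequency-based distribution and take their training pairs from random walks or edge sampling.

```python
import numpy as np


def train_sgns(pairs, num_nodes, dim=8, negatives=2, epochs=300, lr=0.05, seed=0):
    """Learn node vectors from (node, context) pairs with skip-gram + negative sampling."""
    rng = np.random.default_rng(seed)
    node_vecs = rng.normal(scale=0.1, size=(num_nodes, dim))  # vectors that are returned
    ctx_vecs = np.zeros((num_nodes, dim))                     # auxiliary context vectors
    for _ in range(epochs):
        for src, ctx in pairs:
            # One observed pair (label 1) plus uniformly sampled negatives (label 0).
            samples = [(ctx, 1.0)]
            samples += [(int(n), 0.0) for n in rng.integers(0, num_nodes, negatives) if n != ctx]
            grad_src = np.zeros(dim)
            for node, label in samples:
                score = 1.0 / (1.0 + np.exp(-(node_vecs[src] @ ctx_vecs[node])))
                g = score - label  # gradient of the logistic loss w.r.t. the dot product
                grad_src += g * ctx_vecs[node]
                ctx_vecs[node] -= lr * g * node_vecs[src]
            node_vecs[src] -= lr * grad_src
    return node_vecs


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy network: nodes 0 and 1 share neighbour 2; nodes 3 and 4 share neighbour 5.
edges = [(0, 2), (1, 2), (3, 5), (4, 5)]
pairs = edges + [(b, a) for a, b in edges]  # treat edges as undirected
vectors = train_sgns(pairs, num_nodes=6)
same_group = cosine(vectors[0], vectors[1])
other_group = cosine(vectors[0], vectors[3])
```

Each update touches only one observed pair and a handful of sampled negatives rather than all nodes, which is the reason the cost per training step stays low as the network grows. In this toy network, nodes 0 and 1 are expected to end up with more similar vectors than nodes 0 and 3, because they share a neighbour.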

In this embodiment, every node of the relational network is represented uniformly as a high-dimensional vector and the structural information of the relational network is retained. For the different task requirements of subsequent application services, the application algorithm module can call the high-dimensional vectors of only some of the nodes for its calculation, so the computational complexity is low.

For example, suppose that the problem the user's application service request needs to solve is insurance product recommendation; the application algorithm module can be used to implement it. Insurance product recommendation means: given a user, find the products that are most similar to the user in terms of purchase behavior and most different in terms of cancellation behavior. The processing algorithm corresponding to this application service request is then as follows: select the vectors of user nodes and the vectors of product nodes, and compute the similarity between the vector of the user node and the vectors of the product nodes using a vector similarity measure such as cosine similarity. For instance, during representation learning, "insurance purchase behavior" information can be stored in dimensions 1 to 100 of the vectors of user nodes and product nodes; that is, if user A has purchased product A, dimensions 1 to 100 of the vector of the user A node are similar to dimensions 1 to 100 of the vector of the product A node. Likewise, "insurance cancellation behavior" information can be stored in dimensions 101 to 200 of the vectors of user nodes and product nodes; that is, if user A has cancelled product B, dimensions 101 to 200 of the vector of the user A node are similar to dimensions 101 to 200 of the vector of the product B node. Therefore, to recommend products to user A, one looks for product node vectors whose dimensions 1 to 100 are similar to those of the vector of the user A node and whose dimensions 101 to 200 are dissimilar to them.
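
The recommendation rule just described can be sketched as follows. This is a minimal sketch under stated assumptions: the function `recommend`, the four-dimensional toy vectors (two "purchase" dimensions and two "cancellation" dimensions instead of 100 each) and the scoring rule of subtracting the cancellation similarity from the purchase similarity are hypothetical; the embodiment only requires similarity on one range of dimensions and dissimilarity on the other.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def recommend(user_vec, product_vecs, purchase_dims, cancel_dims, top_n=1):
    """Rank products: similar on the purchase dimensions, dissimilar on the cancellation ones."""
    scores = {}
    for name, vec in product_vecs.items():
        buy = cosine(user_vec[purchase_dims], vec[purchase_dims])
        cancel = cosine(user_vec[cancel_dims], vec[cancel_dims])
        scores[name] = buy - cancel  # assumed way of combining the two criteria
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy vectors: dimensions 0-1 stand for purchase behavior, 2-3 for cancellation behavior.
user_a = np.array([1.0, 0.0, 0.0, 1.0])
products = {
    "Insurance A": np.array([1.0, 0.1, 1.0, 0.0]),  # similar purchases, dissimilar cancellations
    "Insurance B": np.array([0.9, 0.0, 0.0, 1.0]),  # similar purchases, similar cancellations
    "Insurance C": np.array([0.0, 1.0, 1.0, 0.0]),  # dissimilar purchases
}
top = recommend(user_a, products, slice(0, 2), slice(2, 4))
```

Only user and product vectors enter the calculation, so the cost of answering a request depends on the number of candidate products rather than on the size of the whole relational network.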

Similarly, the application algorithm module can use the high-dimensional vectors obtained through representation learning to implement user category classification, detection of fraudulent users (insurance fraud) and so on, which are not described further here.

Corresponding to the intelligent big data processing method shown in Fig. 3, an embodiment of the present invention provides an intelligent big data processing system. As shown in Fig. 5, the system may include:

An acquisition module 501, configured to obtain a user's application service request and the high-dimensional vectors of the nodes of a relational network converted from original big data.

A determining module 502, configured to determine the processing algorithm corresponding to the application service request, and to determine the result of the application service request using that processing algorithm and the high-dimensional vectors of the nodes of the relational network.

In this embodiment, the acquisition module 501 can directly obtain the features represented as high-dimensional vectors, so that the determining module determines the result of the application service request using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network. The intelligent big data processing system described in this embodiment is not limited to one specific application service and can provide a unified, effective processing method for multiple application services.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

From the description of the above implementations, a person skilled in the art can clearly understand that each implementation can be realized by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An intelligent big data processing system, characterized by comprising:

a data structuring module, configured to preprocess original big data and to network the preprocessed original big data to obtain a relational network comprising nodes and edges;

a representation learning module, configured to apply a representation learning algorithm based on embedding mapping to the relational network to obtain high-dimensional vectors of the nodes of the relational network; and

an application algorithm module, configured to obtain an application service request of a user, determine a processing algorithm corresponding to the application service request, and determine a result of the application service request using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network obtained by the representation learning module.

2. The system according to claim 1, characterized in that the relational network comprises a multi-dimensional relational network, and the representation learning module is specifically configured to perform embedding mapping on the multi-dimensional relational network to obtain high-dimensional vectors of the nodes of the multi-dimensional relational network.

3. The system according to claim 1, characterized in that the relational network comprises a semantic network, and the representation learning module is specifically configured to perform embedding mapping on the semantic network to obtain high-dimensional vectors of the nodes of the semantic network.

4. The system according to claim 2 or 3, characterized in that the relational network comprises a two-dimensional relational network, and the representation learning module is specifically configured to perform embedding mapping on the two-dimensional relational network to obtain high-dimensional vectors of the nodes of the two-dimensional relational network.

5. The system according to claim 1, characterized in that the original big data comprises behavior data, attribute data and text data.

6. The system according to claim 1 or 5, characterized in that the data structuring module is specifically configured to network the behavior data in the preprocessed original big data to obtain a behavior network comprising nodes and edges;

network the attribute data in the preprocessed original big data to obtain an attribute network comprising nodes and edges; and

network the text data in the preprocessed original big data to obtain a semantic network comprising nodes and edges.

7. The system according to claim 1, characterized in that the data structuring module is specifically configured to perform data analysis and cleaning on the original big data.

8. The system according to claim 1, characterized in that the application algorithm module is specifically configured to determine the result of the application service request using the high-dimensional vectors of some of the nodes in the relational network and the processing algorithm corresponding to the application service request.

9. An intelligent big data processing system, characterized by comprising:

an acquisition module, configured to obtain an application service request of a user and high-dimensional vectors of the nodes of a relational network converted from original big data; and

a determining module, configured to determine a processing algorithm corresponding to the application service request, and to determine a result of the application service request using the processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network.

10. The system according to claim 9, characterized in that the relational network converted from original big data is a relational network obtained by networking the original big data after preprocessing.

11. An intelligent big data processing method, characterized by comprising:

preprocessing original big data;

networking the preprocessed original big data to obtain a relational network comprising nodes and edges;

applying a representation learning algorithm based on embedding mapping to the relational network to obtain high-dimensional vectors of the nodes of the relational network;

obtaining an application service request of a user; and

determining a result of the application service request using a processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network.

12. The method according to claim 11, characterized in that the relational network comprises a multi-dimensional relational network, and the applying a representation learning algorithm based on embedding mapping to the relational network to obtain high-dimensional vectors of the nodes of the relational network comprises:

performing embedding mapping on the multi-dimensional relational network to obtain high-dimensional vectors of the nodes of the multi-dimensional relational network.

13. The method according to claim 11, characterized in that the relational network comprises a semantic network, and the applying a representation learning algorithm based on embedding mapping to the relational network to obtain high-dimensional vectors of the nodes of the relational network comprises:

performing embedding mapping on the semantic network to obtain high-dimensional vectors of the nodes of the semantic network.

14. The method according to claim 12 or 13, characterized in that the relational network comprises a two-dimensional relational network, and the applying a representation learning algorithm based on embedding mapping to the relational network to obtain high-dimensional vectors of the nodes of the relational network comprises:

performing embedding mapping on the two-dimensional relational network to obtain high-dimensional vectors of the nodes of the two-dimensional relational network.

15. The method according to claim 11, characterized in that the original big data comprises behavior data, attribute data and text data.

16. The method according to claim 11 or 15, characterized in that the networking the preprocessed original big data to obtain a relational network comprising nodes and edges comprises:

networking the behavior data in the preprocessed original big data to obtain a behavior network comprising nodes and edges；

17. The method according to claim 11, characterized in that the preprocessing original big data comprises performing data analysis and cleaning on the original big data.

18. The method according to claim 11, characterized in that the determining a result of the application service request using a processing algorithm corresponding to the application service request and the high-dimensional vectors of the nodes of the relational network comprises:

determining the result of the application service request using the high-dimensional vectors of some of the nodes in the relational network and the processing algorithm corresponding to the application service request.

19. An intelligent big data processing method, characterized by comprising:

obtaining an application service request of a user and high-dimensional vectors of the nodes of a relational network converted from original big data；

20. The method according to claim 19, characterized in that the relational network converted from original big data is a relational network obtained by networking the original big data after preprocessing.