CN106033473A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN106033473A
Authority
CN
China
Prior art keywords
data
feature value
missing
data record
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510126382.7A
Other languages
Chinese (zh)
Inventor
王瑜
闵万里
徐季秋
张立诚
车品觉
邵貌貌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201510126382.7A
Publication of CN106033473A

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a data processing method and device. The method comprises the steps of obtaining a plurality of data sources, matching and integrating a plurality of data sources to generate a first feature wide table; extracting a first data record having a missing feature value and a second data record having no missing feature value in the first feature wide table; generating an extended model of the missing feature value based on the second data record; extending the missing feature value in the first data record according to the extended model to generate a second feature wide table. The data processing method of an embodiment of the invention is advantageous in that, the second feature wide table having no missing feature value can be generated, and completeness and consistency of data are ensured.

Description

Data processing method and device
Technical field
The present invention relates to the technical field of information processing, and in particular to a data processing method and device.
Background
At present, with the rapid development of Internet technology, the Internet has accumulated data on a considerable scale, and the types of data have become increasingly diverse. However, internet data in use still suffers from incompleteness, and data from different sources is often inconsistent.
For example, an internet retailer may hold more than a hundred features for profiling a person, and many of these features cannot be determined from a single user action (for example, the feature "has house": an internet retailer can hardly ask users to state directly whether they own property, because this is unrelated to its business). Instead, they are secondary features derived by aggregating a series of user behaviors (for example, purchases of renovation materials). When the data for these features is combined with data from other sources, the other sources show severe missingness along these feature dimensions. As another example, after multiple data sources are matched to generate a feature wide table, large inconsistencies also exist at the user level. The problem is especially prominent when users in fourth- and fifth-tier cities are matched and when an internet retailer combines its data with that of other companies.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a data processing method. The method can generate a second feature wide table without missing feature values, ensuring the completeness and consistency of the data.
A second object of the present invention is to propose a data processing device.
To achieve the above objects, the data processing method of an embodiment of the first aspect of the present invention includes: obtaining multiple data sources, and matching and merging the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values; extracting, from the first feature wide table, first data records that have missing feature values and second data records that have no missing feature values; generating an extended model of the missing feature values according to the second data records; and extending the missing feature values in the first data records according to the extended model to generate a second feature wide table.
The data processing method of the embodiment of the present invention fuses multiple data sources into one first feature wide table and uses the data records without missing feature values in that table to repair the missing feature values in the data records that have them, thereby generating a second feature wide table without missing feature values and ensuring the completeness of the data. Because the repair is based on the complete data records of the first feature wide table, the features can fully cross-influence one another, so the generated second feature wide table also ensures the consistency of the data. Using the second feature wide table yields more pronounced benefits in external recommendation, selection, risk assessment, and similar tasks.
To achieve the above objects, the data processing device of an embodiment of the second aspect of the present invention includes: an acquisition module for obtaining multiple data sources; a matching and merging module for matching and merging the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values; an extraction module for extracting, from the first feature wide table, first data records that have missing feature values and second data records that have no missing feature values; a generation module for generating an extended model of the missing feature values according to the second data records; and an extension module for extending the missing feature values in the first data records according to the extended model to generate a second feature wide table.
The data processing device of the embodiment of the present invention fuses multiple data sources into one first feature wide table and uses the data records without missing feature values in that table to repair the missing feature values in the data records that have them, thereby generating a second feature wide table without missing feature values and ensuring the completeness of the data. Because the repair is based on the complete data records of the first feature wide table, the features can fully cross-influence one another, so the generated second feature wide table also ensures the consistency of the data. Using the second feature wide table yields more pronounced benefits in external recommendation, selection, risk assessment, and similar tasks.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of a data processing method according to an embodiment of the present invention;
Fig. 2 is a flow chart of a data processing method according to another embodiment of the present invention;
Fig. 3 is a flow chart of a data processing method according to yet another embodiment of the present invention;
Fig. 4 is a flow chart of S304 according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a data processing method according to an embodiment of the present invention;
Fig. 6 is a structural block diagram of a data processing device according to an embodiment of the present invention;
Fig. 7 is a structural block diagram of a data processing device according to another embodiment of the present invention;
Fig. 8 is a structural block diagram of the generation module 400 according to an embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and should not be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
In the description of the present invention, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and should not be understood as indicating or implying relative importance. It should also be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connection" should be interpreted broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances. In addition, in the description of the present invention, unless otherwise stated, "multiple" means two or more.
Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
In the related art, internet data in use often suffers from incompleteness and from inconsistency between data from different sources. To overcome these defects, internet data needs to be fused better so that features from different data sources fully cross-influence one another. To this end, embodiments of the present invention provide a data processing method and device capable of fusing data from different data sources. The data processing method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a data processing method according to an embodiment of the present invention, and Fig. 5 is a schematic diagram of the data processing method according to an embodiment of the present invention.
As shown in Fig. 1, the data processing method includes:
S101, obtaining multiple data sources, and matching and merging the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values.
In one embodiment of the present invention, the multiple data sources describe the same subject. That is, the data sources may come from different origins, but they must all describe the same kind of thing, such as people or companies.
Specifically, each data source may be a feature wide table that includes multiple data records, each of which includes a primary key and multiple feature values. Alternatively, each data source may consist of multiple feature narrow tables that describe multiple feature values of the same thing, in which case the narrow tables must first be merged into one feature wide table. The embodiments of the present invention do not limit the representation of the data sources.
With reference to Fig. 5, for example, multiple data sources such as data source 1, data source 2, ..., data source N may be obtained (N is a positive integer greater than 2). Data source 1 may describe persons in industry A, data source 2 may describe persons in enterprise B, and so on. For example, data source 1 may be as shown in Table 1 and data source 2 as shown in Table 2. It should be understood that Tables 1 and 2 are merely examples given to explain the data sources of the embodiments conveniently and do not limit the embodiments of the present invention. In practice, a data source typically has more than 100 feature dimensions, and sometimes tens of thousands, and there are more than two data sources.
ID   Age  Consumption level  Has house  Monthly income  …  Update time
…    …    …                  …          …               …  …
123  30   High               1          40000           …  2014/9/1
124  35   Medium             1          25000           …  2014/9/10
125  20   Medium             0          25000           …  2014/9/10
126  28   Medium             0          15000           …  2014/9/10
127  40   Low                0          10000           …  2014/9/10
128  19   Low                0          8000            …  2014/9/10
129  22   Low                0          5000            …  2014/9/10
…    …    …                  …          …               …  …
Table 1
ID   Age  Has car  …  Update time
…    …    …        …  …
123  30   1        …  2014/7/1
125  20   0        …  2014/8/10
127  35   1        …  2014/8/11
131  35   1        …  2014/8/12
132  35   0        …  2014/8/13
133  35   0        …  2014/8/14
…    …    …        …  …
Table 2
After the multiple data sources are obtained, they are matched and merged. Specifically, with the same subject as the primary key, the multiple data sources are merged into one first feature wide table. That is, the data sources are matched into a single data frame according to a unique primary key (for example, a primary key generated by combining a person's name, identity card number, and so on). For example, the two data sources shown in Tables 1 and 2 are matched and merged to generate the first feature wide table shown in Table 3. As Table 3 shows, only a minority of the data records (rows) are complete. Under a big data framework (parallel computing and distributed storage), missing feature values could be repaired by taking the mean or by picking values at random, but such simple processing cannot achieve the purpose of data fusion.
ID   Age  Consumption level  Has house  Monthly income  Has car    …  Update time
…    …    …                  …          …               …          …  …
123  30   High               1          40000           1          …  2014/9/1
124  35   Medium             1          25000           (missing)  …  2014/9/10
125  20   Medium             0          25000           0          …  2014/9/10
126  28   Medium             0          15000           (missing)  …  2014/9/10
127  40   Low                0          10000           1          …  2014/9/10
128  19   Low                0          8000            (missing)  …  2014/9/10
129  22   Low                0          5000            (missing)  …  2014/9/10
131  35   (missing)          (missing)  (missing)       1          …  2014/8/12
132  35   (missing)          (missing)  (missing)       0          …  2014/8/13
133  45   (missing)          (missing)  (missing)       1          …  2014/8/14
…    …    …                  …          …               …          …  …
Table 3
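The matching and merging of S101 can be illustrated with a minimal sketch. It assumes each data source is a list of dictionaries sharing an "id" primary key; the function name merge_sources, the field names, and the rule that the earlier source wins when two sources disagree are all assumptions made for illustration, since the description does not specify how conflicts are resolved.

```python
from typing import Dict, List

Record = Dict[str, object]


def merge_sources(sources: List[List[Record]], key: str = "id") -> List[Record]:
    """Outer-join several data sources on a primary key into one wide table."""
    # Collect every feature name seen in any source, in first-seen order.
    columns: List[str] = []
    for source in sources:
        for record in source:
            for name in record:
                if name != key and name not in columns:
                    columns.append(name)

    wide: Dict[object, Record] = {}
    for source in sources:
        for record in source:
            # Create a row with every feature set to None (missing) on first sight.
            row = wide.setdefault(
                record[key], {key: record[key], **{c: None for c in columns}}
            )
            for name, value in record.items():
                # Assumption: a value already filled by an earlier source is kept.
                if name != key and row[name] is None:
                    row[name] = value
    return list(wide.values())


# A small excerpt in the spirit of Tables 1 and 2.
source_1 = [
    {"id": 123, "age": 30, "has_house": 1, "monthly_income": 40000},
    {"id": 124, "age": 35, "has_house": 1, "monthly_income": 25000},
]
source_2 = [
    {"id": 123, "age": 30, "has_car": 1},
    {"id": 131, "age": 35, "has_car": 1},
]
first_wide_table = merge_sources([source_1, source_2])
```

Records present in only one source keep None in the features contributed by the other source, which is how the missing cells of Table 3 arise.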
S102, extracting, from the first feature wide table, first data records that have missing feature values and second data records that have no missing feature values.
For example, the first data records with missing feature values extracted from the first feature wide table shown in Table 3 are shown in Table 4, and the second data records without missing feature values extracted from the same table are shown in Table 5.
ID   Age  Consumption level  Has house  Monthly income  Has car    …  Update time
…    …    …                  …          …               …          …  …
124  35   Medium             1          25000           (missing)  …  2014/9/10
126  28   Medium             0          15000           (missing)  …  2014/9/10
128  19   Low                0          8000            (missing)  …  2014/9/10
129  22   Low                0          5000            (missing)  …  2014/9/10
131  35   (missing)          (missing)  (missing)       1          …  2014/8/12
132  35   (missing)          (missing)  (missing)       0          …  2014/8/13
133  45   (missing)          (missing)  (missing)       1          …  2014/8/14
…    …    …                  …          …               …          …  …
Table 4
ID   Age  Consumption level  Has house  Monthly income  Has car  …  Update time
…    …    …                  …          …               …        …  …
123  30   High               1          40000           1        …  2014/9/1
125  20   Medium             0          25000           0        …  2014/9/10
127  40   Low                0          10000           1        …  2014/9/10
…    …    …                  …          …               …        …  …
Table 5
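The extraction in S102 amounts to partitioning the rows of the wide table by whether any feature is missing. The sketch below assumes a missing feature value is represented as None; the function name split_by_completeness and the field names are illustrative, not taken from the description.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_completeness(table: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (first_records, second_records): rows with and without missing values."""
    first_records = [r for r in table if any(v is None for v in r.values())]
    second_records = [r for r in table if all(v is not None for v in r.values())]
    return first_records, second_records


# Three rows modelled on Table 3; None stands for a missing feature value.
table_3_excerpt = [
    {"id": 123, "age": 30, "has_house": 1, "monthly_income": 40000, "has_car": 1},
    {"id": 124, "age": 35, "has_house": 1, "monthly_income": 25000, "has_car": None},
    {"id": 131, "age": 35, "has_house": None, "monthly_income": None, "has_car": 1},
]
first_records, second_records = split_by_completeness(table_3_excerpt)
```

Note that a feature value of 0 is a legitimate value and is not treated as missing, which is why the check compares against None rather than testing truthiness.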
S103, generating an extended model of the missing feature values according to the second data records.
Specifically, a second data record is a data record without missing feature values, that is, a complete data record. The extended model of the missing feature values is generated from the complete data records using data mining techniques, so that the feature values of the complete data records are used to repair the missing feature values in the first data records.
S104, extending the missing feature values in the first data records according to the extended model to generate a second feature wide table.
The data processing method of the embodiment of the present invention fuses multiple data sources into one first feature wide table and uses the data records without missing feature values in that table to repair the missing feature values in the data records that have them, thereby generating a second feature wide table without missing feature values and ensuring the completeness of the data. Because the repair is based on the complete data records of the first feature wide table, the features can fully cross-influence one another, so the generated second feature wide table also ensures the consistency of the data. Using the second feature wide table yields more pronounced benefits in external recommendation, selection, risk assessment, and similar tasks.
Fig. 2 is a flow chart of a data processing method according to another embodiment of the present invention.
As shown in Fig. 2, the data processing method includes:
S201, obtaining multiple data sources, and matching and merging the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values.
S202, extracting, from the first feature wide table, first data records that have missing feature values and second data records that have no missing feature values.
S203, generating an extended model of the missing feature values according to the second data records.
S204, extending the missing feature values in the first data records according to the extended model.
S201 to S204 correspond one-to-one to S101 to S104 of the above embodiment; refer to that embodiment for details, which are not repeated here.
S205, merging the second data records with the extended first data records to generate the second feature wide table.
The method of this embodiment extracts, from the first feature wide table, the first data records that have missing feature values and the second data records that have none, extends the first data records, and then merges them with the second data records to generate the second feature wide table, which simplifies the steps and improves efficiency.
Fig. 3 is a flow chart of a data processing method according to yet another embodiment of the present invention.
As shown in Fig. 3, the data processing method includes:
S301, obtaining multiple data sources, and matching and merging the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values.
S302, extracting, from the first feature wide table, first data records that have missing feature values and second data records that have no missing feature values.
S301 and S302 correspond one-to-one to S101 and S102, or to S201 and S202, above; refer to the above embodiments for details, which are not repeated here.
S303, taking the feature values, in the second data records, of the feature corresponding to the missing feature value as target training items, and taking the remaining feature values in the second data records as variable training items, to generate a training data set.
S304, generating the extended model of the missing feature values according to the training data set.
In one embodiment of the present invention, S304 specifically includes:
S3041, determining the associated features of the feature corresponding to the missing feature value. Specifically, the associated features may be determined according to statistical knowledge, which is not described in detail here.
S3042, extracting from the training data set the variable training items corresponding to the associated features.
For example, four features in Table 4 have missing feature values, namely consumption level, has house, monthly income, and has car. According to the second data records in Table 5, target training items and variable training items are generated for each of these four features, as shown in Tables 6 to 9, in which the shaded column is the target training item.
Table 6
Table 7
Table 8
Table 9
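How a training data set such as those of Tables 6 to 9 might be assembled from the complete records of Table 5 can be sketched as follows. The function build_training_set, its associated parameter (standing in for the associated features of S3041), and the field names are hypothetical; the description leaves the choice of associated features to statistical knowledge.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]


def build_training_set(
    complete: List[Record],
    target: str,
    associated: Optional[List[str]] = None,
    key: str = "id",
) -> List[Tuple[Record, object]]:
    """Pair each complete record's variable training items with its target training item."""
    rows = []
    for record in complete:
        # Without an explicit list of associated features, use every remaining feature.
        names = associated if associated is not None else [
            n for n in record if n not in (key, target)
        ]
        variables = {n: record[n] for n in names}
        rows.append((variables, record[target]))
    return rows


# The complete records of Table 5 (consumption level kept as text).
table_5 = [
    {"id": 123, "age": 30, "consumption_level": "High", "has_house": 1, "monthly_income": 40000, "has_car": 1},
    {"id": 125, "age": 20, "consumption_level": "Medium", "has_house": 0, "monthly_income": 25000, "has_car": 0},
    {"id": 127, "age": 40, "consumption_level": "Low", "has_house": 0, "monthly_income": 10000, "has_car": 1},
]

# One training set per feature that has missing values in Table 4.
training_sets = {
    feature: build_training_set(table_5, feature)
    for feature in ("consumption_level", "has_house", "monthly_income", "has_car")
}
income_from_age_and_house = build_training_set(
    table_5, "monthly_income", associated=["age", "has_house"]
)
```

The primary key is excluded from the variable training items because it identifies a record rather than describing it.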
S3043, determining the type of the feature corresponding to the missing feature value.
In one embodiment of the present invention, the type includes one or more of a multi-class feature, a binary feature, and a continuous feature. For example, consumption level may be a multi-class feature whose feature values include high, medium, and low; has house and has car are binary features whose feature values include 1 and 0; and monthly income is a continuous feature whose feature values range over continuous numbers greater than 0.
S3044, determining a training algorithm according to the type.
For example, if the type is a multi-class feature, a Bayesian network algorithm or a random forest algorithm may be selected; if the type is a binary feature, a decision tree or a logistic regression algorithm may be selected; and if the type is a continuous feature, linear regression or a neural network algorithm may be selected.
S3045, training on the target training items and the variable training items corresponding to the associated features according to the training algorithm to generate the extended model.
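The type determination of S3043 and the algorithm selection of S3044 can be sketched as a simple lookup. The candidate algorithms per type come from the example above; the heuristic used to infer the type from observed values (numeric values other than 0 and 1 are treated as continuous, at most two distinct values as binary) and the names feature_type and choose_algorithms are assumptions for illustration only.

```python
from typing import Iterable, Tuple

# Candidate training algorithms per feature type, as listed in the example above.
CANDIDATE_ALGORITHMS = {
    "multiclass": ("Bayesian network", "random forest"),
    "binary": ("decision tree", "logistic regression"),
    "continuous": ("linear regression", "neural network"),
}


def feature_type(values: Iterable[object]) -> str:
    """Guess the feature type from observed values (heuristic, an assumption)."""
    distinct = {v for v in values if v is not None}
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    # Numeric values beyond the 0/1 flags are treated as a continuous feature.
    if numeric and not distinct <= {0, 1}:
        return "continuous"
    return "binary" if len(distinct) <= 2 else "multiclass"


def choose_algorithms(values: Iterable[object]) -> Tuple[str, str]:
    """Return the candidate training algorithms for a feature's observed values."""
    return CANDIDATE_ALGORITHMS[feature_type(values)]


level_type = feature_type(["High", "Medium", "Low"])
car_type = feature_type([1, 0, 1])
income_type = feature_type([40000, 25000, 10000])
```

In practice the type would more likely be declared in the table's schema than inferred; a numeric code such as 1, 2, 3 for consumption level would be misread as continuous by this heuristic.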
S305, extending the missing feature values in the first data records according to the extended model.
S306, merging the second data records with the extended first data records to generate the second feature wide table.
For example, the second feature wide table finally generated is shown in Table 10.
ID   Age  Consumption level  Has house  Monthly income  Has car  …  Update time
…    …    …                  …          …               …        …  …
123  30   High               1          40000           1        …  2014/9/1
124  35   Medium             1          25000           0.95     …  2014/9/10
125  20   Medium             0          25000           0        …  2014/9/10
126  28   Medium             0          15000           0.45     …  2014/9/10
127  40   Low                0          10000           1        …  2014/9/10
128  19   Low                0          8000            0.13     …  2014/9/10
129  22   Low                0          5000            0.02     …  2014/9/10
131  35   Medium             0.68       23000           1        …  2014/8/12
132  35   Medium             0.27       16000           0        …  2014/8/13
133  45   High               0.93       38000           1        …  2014/8/14
…    …    …                  …          …               …        …  …
Table 10
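The extension and merging steps (S305 and S306) can be sketched end to end for a single continuous feature. As a stand-in for the extended model, the sketch fits a one-variable least-squares line predicting monthly income from age on the complete records; this is only one of the algorithms the description permits, the helper names fit_line and extend are hypothetical, and the predicted values are illustrative and do not reproduce the figures in Table 10.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extend(
    first_records: List[Record], models: Dict[str, Callable[[Record], object]]
) -> List[Record]:
    """Fill each missing (None) feature with the prediction of that feature's model."""
    extended = []
    for record in first_records:
        filled = dict(record)  # leave the input record untouched
        for feature, predict in models.items():
            if filled.get(feature) is None:
                filled[feature] = predict(record)
        extended.append(filled)
    return extended


# Second (complete) records from Table 5 and one first (incomplete) record.
second_records = [
    {"id": 123, "age": 30, "monthly_income": 40000},
    {"id": 125, "age": 20, "monthly_income": 25000},
    {"id": 127, "age": 40, "monthly_income": 10000},
]
first_records = [{"id": 131, "age": 35, "monthly_income": None}]

slope, intercept = fit_line(
    [r["age"] for r in second_records],
    [r["monthly_income"] for r in second_records],
)
models = {"monthly_income": lambda r: slope * r["age"] + intercept}

# S306: merge the complete records with the extended ones, ordered by primary key.
second_wide_table = sorted(
    second_records + extend(first_records, models), key=lambda r: r["id"]
)
```

A production version would hold one model per feature with missing values, each trained with the algorithm matched to that feature's type.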
The data processing method of this embodiment determines the type of the feature corresponding to the missing feature value, determines a training algorithm according to that type, and then trains on the target training items and the variable training items corresponding to the associated features with that algorithm to generate the extended model, so the extended model can be customized to different data types.
It should be understood that Tables 1 to 10 above are examples given to facilitate understanding of the embodiments of the present invention and do not limit the scope of the present invention.
To implement the above embodiments, embodiments of the present invention also propose a data processing device.
Fig. 6 is a structural block diagram of a data processing device according to an embodiment of the present invention.
As shown in Fig. 6, the data processing device 10 includes: an acquisition module 100, a matching and merging module 200, an extraction module 300, a generation module 400, and an extension module 500.
Specifically, the acquisition module 100 is used to obtain multiple data sources. In one embodiment of the present invention, the multiple data sources describe the same subject. That is, the data sources may come from different origins, but they must all describe the same kind of thing, such as people or companies. Each data source may be a feature wide table that includes multiple data records, each of which includes a primary key and multiple feature values. Alternatively, each data source may consist of multiple feature narrow tables that describe multiple feature values of the same thing, in which case the narrow tables must first be merged into one feature wide table. The embodiments of the present invention do not limit the representation of the data sources. For concrete examples of data sources, refer to the description of the method, which is not repeated here.
The matching and merging module 200 is used to match and merge the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values. More specifically, with the same subject as the primary key, the multiple data sources are merged into one first feature wide table. That is, the data sources are matched into a single data frame according to a unique primary key (for example, a primary key generated by combining a person's name, identity card number, and so on).
The extraction module 300 is used to extract, from the first feature wide table, first data records that have missing feature values and second data records that have no missing feature values.
The generation module 400 is used to generate an extended model of the missing feature values according to the second data records. More specifically, a second data record is a data record without missing feature values, that is, a complete data record. The extended model of the missing feature values is generated from the complete data records using data mining techniques, so that the feature values of the complete data records are used to repair the missing feature values in the first data records.
The extension module 500 is used to extend the missing feature values in the first data records according to the extended model to generate a second feature wide table.
The data processing device of the embodiment of the present invention fuses multiple data sources into one first feature wide table and uses the data records without missing feature values in that table to repair the missing feature values in the data records that have them, thereby generating a second feature wide table without missing feature values and ensuring the completeness of the data. Because the repair is based on the complete data records of the first feature wide table, the features can fully cross-influence one another, so the generated second feature wide table also ensures the consistency of the data. Using the second feature wide table yields more pronounced benefits in external recommendation, selection, risk assessment, and similar tasks.
Fig. 7 is a structural block diagram of a data processing device according to another embodiment of the present invention.
As shown in Fig. 7, the data processing device 10 includes: an acquisition module 100, a matching and merging module 200, an extraction module 300, a generation module 400, an extension module 500, and a merging module 600.
Specifically, the merging module 600 is used to merge the second data records with the extended first data records to generate the second feature wide table.
The device of this embodiment extracts, from the first feature wide table, the first data records that have missing feature values and the second data records that have none, extends the first data records, and then merges them with the second data records to generate the second feature wide table, which simplifies the steps and improves efficiency.
Fig. 8 is the structured flowchart of generation module 400 according to an embodiment of the invention.
As shown in Figure 8, generation module 400 includes the first signal generating unit 410 and the second signal generating unit 420.
Specifically, the first signal generating unit 410 will be for lacking eigenvalue characteristic of correspondence in the second data Eigenvalue in record trains item as target, and using remaining eigenvalue in the second data record as Variable training item is to generate training dataset.
Second signal generating unit 420 for generating the extended model of disappearance eigenvalue according to training dataset. In one embodiment of the invention, the second signal generating unit 420 includes that first determines subelement, extraction Subelement, second determine subelement, the 3rd determine subelement and training subelement.
Wherein, first determine subelement for determine disappearance eigenvalue characteristic of correspondence linked character, Wherein can determine linked character according to knowledge of statistics, not repeat them here.
Extract subelement and extract, for concentrating from training data, the variable training item that linked character is corresponding,
Second determines that subelement is for determining the type of disappearance eigenvalue characteristic of correspondence.In the present invention one In individual embodiment, type includes one or more in many characteristic of divisions, two characteristic of divisions and continuous feature, Such as, consumption grade can be characteristic of division more than, and its eigenvalue includes high, medium and low, has room and has Car is two characteristic of divisions, and its eigenvalue includes 1 and 0, and monthly income is continuous feature, the scope of its eigenvalue For the consecutive numbers more than 0.
3rd determines that subelement is for determining training algorithm according to type.Such as, if type is many classification Feature, can select Bayes's (Bayes) network algorithm or random forests algorithm;If type is two points Category feature, can be with trade-off decision tree or Logistic algorithm;If type is continuous feature, can select Linear regression or neural network algorithm.
The training subunit is configured to train, according to the training algorithm, on the target training item and the variable training items corresponding to the associated features, so as to generate the extended model.
The data processing device of the embodiment of the present invention determines the type of the feature corresponding to the missing feature value, determines a training algorithm according to the type, and then trains on the target training item and the variable training items corresponding to the associated features according to the training algorithm to generate the extended model. The extended model can thus be customized for different data types.
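For a continuous feature, the train-then-extend flow can be illustrated with ordinary least squares on a single associated feature. This is a deliberately simplified stand-in for the linear regression option mentioned above, not the embodiment's model: the function names, the use of `None` to mark a missing feature value, the single-predictor restriction, and the sample data are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]  # None marks a missing feature value


def fit_simple_linear_regression(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the associated feature is constant; cannot fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extend_wide_table(table: List[Record], target: str, associated: str) -> List[Record]:
    """Fill missing target values using a model trained on the complete records."""
    complete = [r for r in table if r[target] is not None]    # second data records
    incomplete = [r for r in table if r[target] is None]      # first data records
    slope, intercept = fit_simple_linear_regression(
        [r[associated] for r in complete], [r[target] for r in complete]
    )
    extended = [
        {**r, target: slope * r[associated] + intercept} for r in incomplete
    ]
    # Merge the complete records with the extended records (order is not preserved).
    return complete + extended


first_wide_table = [
    {"id": 1, "age": 30.0, "monthly_income": 7000.0},
    {"id": 2, "age": 40.0, "monthly_income": 9000.0},
    {"id": 3, "age": 35.0, "monthly_income": None},
    {"id": 4, "age": 50.0, "monthly_income": 11000.0},
]
second_wide_table = extend_wide_table(first_wide_table, "monthly_income", "age")
for row in second_wide_table:
    print(row)
```

The sketch assumes that the associated feature itself is present in every record; records missing several feature values would need one model per missing feature.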
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following technologies well known in the art: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), and the like.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.

Claims (12)

1. A data processing method, characterized by comprising:
obtaining a plurality of data sources, and matching and merging the plurality of data sources to generate a first feature wide table, wherein the first feature wide table includes a plurality of data records, and each data record includes a primary key and a plurality of feature values;
extracting, from the first feature wide table, a first data record having a missing feature value and a second data record having no missing feature value;
generating an extended model of the missing feature value according to the second data record; and
extending the missing feature value in the first data record according to the extended model to generate a second feature wide table.
2. The data processing method according to claim 1, characterized by further comprising:
merging the second data record and the extended first data record to generate the second feature wide table.
3. The data processing method according to claim 1 or 2, characterized in that generating the extended model of the missing feature value according to the second data record comprises:
taking the feature value, in the second data record, of the feature corresponding to the missing feature value as a target training item, and taking the remaining feature values in the second data record as variable training items, to generate a training dataset; and
generating the extended model of the missing feature value according to the training dataset.
4. The data processing method according to claim 3, characterized in that generating the extended model of the missing feature value according to the training dataset comprises:
determining associated features of the feature corresponding to the missing feature value;
extracting, from the training dataset, the variable training items corresponding to the associated features;
determining the type of the feature corresponding to the missing feature value;
determining a training algorithm according to the type; and
training on the target training item and the variable training items corresponding to the associated features according to the training algorithm to generate the extended model.
5. The data processing method according to claim 1, characterized in that the plurality of data sources describe the same human subject.
6. The data processing method according to claim 4, characterized in that the type includes one or more of a multi-class feature, a binary feature, and a continuous feature.
7. A data processing device, characterized by comprising:
an acquisition module, configured to obtain a plurality of data sources;
a matching and merging module, configured to match and merge the plurality of data sources to generate a first feature wide table, wherein the first feature wide table includes a plurality of data records, and each data record includes a primary key and a plurality of feature values;
an extraction module, configured to extract, from the first feature wide table, a first data record having a missing feature value and a second data record having no missing feature value;
a generation module, configured to generate an extended model of the missing feature value according to the second data record; and
an extension module, configured to extend the missing feature value in the first data record according to the extended model to generate a second feature wide table.
8. The data processing device according to claim 7, characterized by further comprising:
a merging module, configured to merge the second data record and the extended first data record to generate the second feature wide table.
9. The data processing device according to claim 7 or 8, characterized in that the generation module comprises:
a first generating unit, configured to take the feature value, in the second data record, of the feature corresponding to the missing feature value as a target training item, and to take the remaining feature values in the second data record as variable training items, to generate a training dataset; and
a second generating unit, configured to generate the extended model of the missing feature value according to the training dataset.
10. The data processing device according to claim 9, characterized in that the second generating unit comprises:
a first determining subunit, configured to determine associated features of the feature corresponding to the missing feature value;
an extracting subunit, configured to extract, from the training dataset, the variable training items corresponding to the associated features;
a second determining subunit, configured to determine the type of the feature corresponding to the missing feature value;
a third determining subunit, configured to determine a training algorithm according to the type; and
a training subunit, configured to train on the target training item and the variable training items corresponding to the associated features according to the training algorithm to generate the extended model.
11. The data processing device according to claim 7, characterized in that the plurality of data sources describe the same human subject.
12. The data processing device according to claim 10, characterized in that the type includes one or more of a multi-class feature, a binary feature, and a continuous feature.
CN201510126382.7A 2015-03-20 2015-03-20 Data processing method and device Pending CN106033473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510126382.7A CN106033473A (en) 2015-03-20 2015-03-20 Data processing method and device

Publications (1)

Publication Number Publication Date
CN106033473A true CN106033473A (en) 2016-10-19

Family

ID=57149566

Country Status (1)

Country Link
CN (1) CN106033473A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1372667A (en) * 1999-12-30 2002-10-02 Ge资本商业财务公司 Valuation prediction models in situations with missing inputs
US20070168545A1 (en) * 2006-01-18 2007-07-19 Venkat Venkatsubra Methods and devices for processing incomplete data packets
CN104133866A (en) * 2014-07-18 2014-11-05 国家电网公司 Intelligent-power-grid-oriented missing data filling method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
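Zhang Hongxia: "Missing value filling: a method based on information gain", Computer Engineering and Design *
Lu Junyun: "Research and application of cleaning methods for duplicate and incomplete data", China Master's Theses Full-text Database, Information Science and Technology *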

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241934A (en) * 2016-12-23 2018-07-03 北京京东尚科信息技术有限公司 Data query method and apparatus
CN108241934B (en) * 2016-12-23 2021-02-26 北京京东尚科信息技术有限公司 Data query method and device
CN108962386A (en) * 2017-05-27 2018-12-07 ***通信有限公司研究院 A kind of data processing method, apparatus and system
CN110555070A (en) * 2018-06-01 2019-12-10 百度在线网络技术(北京)有限公司 Method and apparatus for outputting information
CN110555070B (en) * 2018-06-01 2021-10-22 百度在线网络技术(北京)有限公司 Method and apparatus for outputting information
CN110785749A (en) * 2018-06-25 2020-02-11 北京嘀嘀无限科技发展有限公司 System and method for generating wide tables
CN110785749B (en) * 2018-06-25 2020-08-21 北京嘀嘀无限科技发展有限公司 System and method for generating wide tables
US11061882B2 (en) 2018-06-25 2021-07-13 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for generating a wide table
CN108921229A (en) * 2018-07-17 2018-11-30 成都西加云杉科技有限公司 Data reconstruction method and device
CN111178536A (en) * 2019-11-26 2020-05-19 腾讯云计算(北京)有限责任公司 Data information processing method and device, electronic equipment and storage medium
CN113535817A (en) * 2021-07-13 2021-10-22 浙江网商银行股份有限公司 Method and device for generating characteristic broad table and training business processing model
CN113535817B (en) * 2021-07-13 2024-05-14 浙江网商银行股份有限公司 Feature broad table generation and service processing model training method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 1229909; Country of ref document: HK)
RJ01 Rejection of invention patent application after publication (Application publication date: 20161019)
RJ01 Rejection of invention patent application after publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: WD; Ref document number: 1229909; Country of ref document: HK)