CN106033473A - Data processing method and device - Google Patents
Data processing method and device
- Publication number
- CN106033473A CN201510126382.7A
- Authority
- CN
- China
- Prior art keywords
- data
- feature value
- missing
- data record
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a data processing method and device. The method comprises the steps of obtaining a plurality of data sources, matching and integrating a plurality of data sources to generate a first feature wide table; extracting a first data record having a missing feature value and a second data record having no missing feature value in the first feature wide table; generating an extended model of the missing feature value based on the second data record; extending the missing feature value in the first data record according to the extended model to generate a second feature wide table. The data processing method of an embodiment of the invention is advantageous in that, the second feature wide table having no missing feature value can be generated, and completeness and consistency of data are ensured.
Description
Technical field
The present invention relates to the technical field of information processing, and in particular to a data processing method and device.
Background
At present, with the rapid development of Internet technology, the Internet has accumulated a considerable volume of data, and data types have become increasingly diverse. However, Internet data still suffers from incompleteness when it is used, and data from different sources are also inconsistent with one another.
For example, an Internet retailer may hold up to several hundred features for profiling a person, and these features often cannot be determined from a single user action (e.g., the feature "owns a home": in practice an Internet retailer can hardly ask users to state directly whether they own property, because this is unrelated to its business). Instead, they are usually secondary features derived by aggregating a series of user behaviors (e.g., purchasing renovation materials). When the data for these features are combined with data from other sources, the data from the other sources will be severely missing along those feature dimensions. As another example, after multiple data sources are matched to generate a feature wide table, there is also considerable inconsistency at the user level; the inconsistency is especially pronounced when users in fourth- and fifth-tier cities are matched, or when an online retailer combines its data with that of other companies.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a data processing method. The method can generate a second feature wide table without missing feature values, ensuring the completeness and consistency of the data.
A second object of the present invention is to propose a data processing device.
To achieve the above objects, the data processing method of an embodiment of the first aspect of the present invention includes: obtaining multiple data sources, and matching and merging the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values; extracting, from the first feature wide table, first data records that have missing feature values and second data records that have no missing feature values; generating an extended model of the missing feature values according to the second data records; and extending the missing feature values in the first data records according to the extended model to generate a second feature wide table.
The data processing method of the embodiment of the present invention fuses multiple data sources into one first feature wide table, and uses the data records without missing feature values in the first feature wide table to repair the missing feature values in the data records that have them, thereby generating a second feature wide table without missing feature values and ensuring the completeness of the data. Because the missing feature values are repaired from the complete data records of the same table, the features can fully cross-influence one another, so the generated second feature wide table also ensures the consistency of the data. Using the second feature wide table for external recommendation, screening, risk assessment, and the like yields better results.
To achieve the above objects, the data processing device of an embodiment of the second aspect of the present invention includes: an acquisition module, configured to obtain multiple data sources; a matching and merging module, configured to match and merge the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values; an extraction module, configured to extract, from the first feature wide table, first data records that have missing feature values and second data records that have no missing feature values; a generation module, configured to generate an extended model of the missing feature values according to the second data records; and an extension module, configured to extend the missing feature values in the first data records according to the extended model to generate a second feature wide table.
The data processing device of the embodiment of the present invention fuses multiple data sources into one first feature wide table, and uses the data records without missing feature values in the first feature wide table to repair the missing feature values in the data records that have them, thereby generating a second feature wide table without missing feature values and ensuring the completeness of the data. Because the missing feature values are repaired from the complete data records of the same table, the features can fully cross-influence one another, so the generated second feature wide table also ensures the consistency of the data. Using the second feature wide table for external recommendation, screening, risk assessment, and the like yields better results.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of a data processing method according to one embodiment of the present invention;
Fig. 2 is a flowchart of a data processing method according to another embodiment of the present invention;
Fig. 3 is a flowchart of a data processing method according to yet another embodiment of the present invention;
Fig. 4 is a flowchart of S304 according to one embodiment of the present invention;
Fig. 5 is a schematic diagram of a data processing method according to one embodiment of the present invention;
Fig. 6 is a structural block diagram of a data processing device according to one embodiment of the present invention;
Fig. 7 is a structural block diagram of a data processing device according to another embodiment of the present invention;
Fig. 8 is a structural block diagram of the generation module 400 according to one embodiment of the present invention.
Detailed description of the invention
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
In the description of the present invention, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connection" are to be interpreted broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific situation. In addition, in the description of the present invention, unless otherwise stated, "multiple" means two or more.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations, in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention pertain.
In the related art, Internet data is often incomplete when it is used, and data from different sources are often inconsistent. To overcome these defects and problems, Internet data needs to be fused more effectively so that the features of different data sources can fully cross-influence one another. To this end, the embodiments of the present invention provide a data processing method and device that can fuse data across different data sources. The data processing method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a data processing method according to one embodiment of the present invention; Fig. 5 is a schematic diagram of a data processing method according to one embodiment of the present invention.
As shown in Fig. 1, the data processing method includes:
S101, obtain multiple data sources, and match and merge the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values.
In one embodiment of the present invention, the multiple data sources describe the same subject. That is, the origins of the data sources may differ, but all of them must describe the same kind of thing, such as people or companies.
Specifically, each data source may be a feature wide table that includes multiple data records, each of which also includes a primary key and multiple feature values. Alternatively, each data source may consist of multiple feature narrow tables describing multiple feature values of the same thing, in which case the narrow tables must first be merged into one feature wide table. The embodiments of the present invention place no limit on how the data sources are represented.
With reference to Fig. 5, for example, multiple data sources such as data source 1, data source 2, ..., data source N may be obtained (N is a positive integer greater than 2). Data source 1 may describe persons in industry A, data source 2 may describe persons at enterprise B, and so on. For example, data source 1 may be as shown in Table 1 and data source 2 as shown in Table 2. It should be understood that Tables 1 and 2 are merely examples given to explain the data sources of the embodiments conveniently and are not a limitation on the embodiments of the present invention. In practice, a data source generally has more than 100 feature dimensions, possibly tens of thousands, and there are more than two data sources.
ID | Age | Consumption level | Owns home | Monthly income | …… | Update time |
…… | …… | …… | …… | …… | …… | …… |
123 | 30 | High | 1 | 40000 | …… | 2014/9/1 |
124 | 35 | Medium | 1 | 25000 | …… | 2014/9/10 |
125 | 20 | Medium | 0 | 25000 | …… | 2014/9/10 |
126 | 28 | Medium | 0 | 15000 | …… | 2014/9/10 |
127 | 40 | Low | 0 | 10000 | …… | 2014/9/10 |
128 | 19 | Low | 0 | 8000 | …… | 2014/9/10 |
129 | 22 | Low | 0 | 5000 | …… | 2014/9/10 |
…… | …… | …… | …… | …… | …… | …… |
Table 1
ID | Age | Owns car | …… | Update time |
…… | …… | …… | …… | …… |
123 | 30 | 1 | …… | 2014/7/1 |
125 | 20 | 0 | …… | 2014/8/10 |
127 | 35 | 1 | …… | 2014/8/11 |
131 | 35 | 1 | …… | 2014/8/12 |
132 | 35 | 0 | …… | 2014/8/13 |
133 | 35 | 0 | …… | 2014/8/14 |
…… | …… | …… | …… | …… |
Table 2
After the multiple data sources are obtained, they are matched and merged. Specifically, with the same subject as the primary key, the multiple data sources are merged into one first feature wide table. That is, the data sources are matched under the same data framework according to a unique primary key (e.g., a primary key generated by combining a person's name, ID card number, and so on). For example, the two data sources shown in Tables 1 and 2 are matched and merged to generate the first feature wide table shown in Table 3. As Table 3 shows, only a minority of the data records (rows) are complete. Under a big-data framework (parallel computing and distributed storage), missing feature values could be repaired by taking the mean or a random value, but such simple processing cannot achieve the purpose of data fusion.
ID | Age | Consumption level | Owns home | Monthly income | Owns car | …… | Update time |
…… | …… | …… | …… | …… | …… | …… | …… |
123 | 30 | High | 1 | 40000 | 1 | …… | 2014/9/1 |
124 | 35 | Medium | 1 | 25000 |  | …… | 2014/9/10 |
125 | 20 | Medium | 0 | 25000 | 0 | …… | 2014/9/10 |
126 | 28 | Medium | 0 | 15000 |  | …… | 2014/9/10 |
127 | 40 | Low | 0 | 10000 | 1 | …… | 2014/9/10 |
128 | 19 | Low | 0 | 8000 |  | …… | 2014/9/10 |
129 | 22 | Low | 0 | 5000 |  | …… | 2014/9/10 |
131 | 35 |  |  |  | 1 | …… | 2014/8/12 |
132 | 35 |  |  |  | 0 | …… | 2014/8/13 |
133 | 45 |  |  |  | 1 | …… | 2014/8/14 |
…… | …… | …… | …… | …… | …… | …… | …… |
Table 3
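The matching and merging of S101 can be sketched as an outer join on the primary key. The following is a minimal illustration in plain Python, not the patent's implementation: the function name `merge_sources`, the lowercase column names, and the rule that the first source to supply a value wins are all assumptions made for this example.

```python
def merge_sources(sources, key="id"):
    """Outer-join several data sources (lists of dict records) on a primary key."""
    # Collect every feature name across all sources, in first-seen order.
    columns = [key]
    for source in sources:
        for record in source:
            for name in record:
                if name not in columns:
                    columns.append(name)

    merged = {}
    for source in sources:
        for record in source:
            # A new primary key starts as a row with every feature missing (None).
            row = merged.setdefault(record[key], dict.fromkeys(columns))
            for name, value in record.items():
                # Assumption: the first source that supplies a value wins.
                if row[name] is None:
                    row[name] = value
    return [merged[k] for k in sorted(merged)]


# A few rows in the spirit of the two example data sources (column names are hypothetical).
source_1 = [
    {"id": 123, "age": 30, "owns_home": 1},
    {"id": 124, "age": 35, "owns_home": 1},
]
source_2 = [
    {"id": 123, "age": 30, "owns_car": 1},
    {"id": 131, "age": 35, "owns_car": 1},
]
first_wide_table = merge_sources([source_1, source_2])
```

Records that appear in only one source keep `None` for the features the other source would have supplied, which is exactly the kind of gap the later steps are meant to fill.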
S102, extract, from the first feature wide table, the first data records that have missing feature values and the second data records that have no missing feature values.
For example, the first data records with missing feature values extracted from the first feature wide table of Table 3 are shown in Table 4, and the second data records without missing feature values extracted from the first feature wide table of Table 3 are shown in Table 5.
ID | Age | Consumption level | Owns home | Monthly income | Owns car | …… | Update time |
…… | …… | …… | …… | …… | …… | …… | …… |
124 | 35 | Medium | 1 | 25000 |  | …… | 2014/9/10 |
126 | 28 | Medium | 0 | 15000 |  | …… | 2014/9/10 |
128 | 19 | Low | 0 | 8000 |  | …… | 2014/9/10 |
129 | 22 | Low | 0 | 5000 |  | …… | 2014/9/10 |
131 | 35 |  |  |  | 1 | …… | 2014/8/12 |
132 | 35 |  |  |  | 0 | …… | 2014/8/13 |
133 | 45 |  |  |  | 1 | …… | 2014/8/14 |
…… | …… | …… | …… | …… | …… | …… | …… |
Table 4
ID | Age | Consumption level | Owns home | Monthly income | Owns car | …… | Update time |
…… | …… | …… | …… | …… | …… | …… | …… |
123 | 30 | High | 1 | 40000 | 1 | …… | 2014/9/1 |
125 | 20 | Medium | 0 | 25000 | 0 | …… | 2014/9/10 |
127 | 40 | Low | 0 | 10000 | 1 | …… | 2014/9/10 |
…… | …… | …… | …… | …… | …… | …… | …… |
Table 5
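The extraction in S102 amounts to partitioning the rows by whether any feature value is missing. A minimal sketch follows, assuming each record is a Python dict and a missing feature value is represented as `None`; the function and column names are hypothetical.

```python
def split_records(wide_table):
    """Split a wide table into records with and without missing feature values."""
    first_records = []   # at least one feature value is missing (None)
    second_records = []  # complete records
    for record in wide_table:
        if any(value is None for value in record.values()):
            first_records.append(record)
        else:
            second_records.append(record)
    return first_records, second_records


wide_table = [
    {"id": 123, "age": 30, "owns_home": 1, "owns_car": 1},
    {"id": 124, "age": 35, "owns_home": 1, "owns_car": None},
    {"id": 131, "age": 35, "owns_home": None, "owns_car": 1},
]
first_records, second_records = split_records(wide_table)
```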
S103, generate an extended model of the missing feature values according to the second data records.
Specifically, the second data records are the data records without missing feature values, that is, the complete data records. The extended model of the missing feature values is generated from the complete data records using data mining techniques, so that the missing feature values in the first data records can be repaired using the feature values of the complete data records.
S104, extend the missing feature values in the first data records according to the extended model to generate a second feature wide table.
The data processing method of the embodiment of the present invention fuses multiple data sources into one first feature wide table, and uses the data records without missing feature values in the first feature wide table to repair the missing feature values in the data records that have them, thereby generating a second feature wide table without missing feature values and ensuring the completeness of the data. Because the missing feature values are repaired from the complete data records of the same table, the features can fully cross-influence one another, so the generated second feature wide table also ensures the consistency of the data. Using the second feature wide table for external recommendation, screening, risk assessment, and the like yields better results.
Fig. 2 is a flowchart of a data processing method according to another embodiment of the present invention.
As shown in Fig. 2, the data processing method includes:
S201, obtain multiple data sources, and match and merge the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values.
S202, extract, from the first feature wide table, the first data records that have missing feature values and the second data records that have no missing feature values.
S203, generate an extended model of the missing feature values according to the second data records.
S204, extend the missing feature values in the first data records according to the extended model.
Steps S201 to S204 correspond one-to-one to S101 to S104 of the foregoing embodiment; refer to that embodiment for details, which are not repeated here.
S205, merge the second data records with the extended first data records to generate a second feature wide table.
The method of this embodiment extracts, from the first feature wide table, the first data records that have missing feature values and the second data records that have none, and merges the extended first data records with the second data records to generate the second feature wide table, which simplifies the steps and improves efficiency.
Fig. 3 is a flowchart of a data processing method according to yet another embodiment of the present invention.
As shown in Fig. 3, the data processing method includes:
S301, obtain multiple data sources, and match and merge the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values.
S302, extract, from the first feature wide table, the first data records that have missing feature values and the second data records that have no missing feature values.
Steps S301 and S302 correspond one-to-one to S101 and S102, or to S201 and S202, above; refer to the foregoing embodiments for details, which are not repeated here.
S303, take the feature values that the second data records hold for the feature corresponding to a missing feature value as the target training item, and take the remaining feature values in the second data records as the variable training items, to generate a training dataset.
S304, generate an extended model of the missing feature values according to the training dataset.
In one embodiment of the present invention, S304 specifically includes:
S3041, determine the associated features of the feature corresponding to the missing feature value. Specifically, the associated features can be determined using statistical knowledge, which is not described further here.
S3042, extract from the training dataset the variable training items corresponding to the associated features.
For example, four features in Table 4 have missing feature values, namely consumption level, owns home, monthly income, and owns car. From the second data records in Table 5, the target training items and variable training items are generated for these four features, as shown in Tables 6 to 9, in which the shaded column is the target training item.
Table 6
Table 7
Table 8
Table 9
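The construction of a training dataset for one missing feature (S303), optionally restricted to its associated features (S3042), can be sketched as follows. This is only an illustration under assumptions: the function name `build_training_set`, the `associated` parameter, and the lowercase column names are hypothetical, and the three records reuse the complete rows listed in Table 5 with the elided columns left out.

```python
def build_training_set(second_records, target, key="id", associated=None):
    """Return (variable names, variable training items, target training items)."""
    variables = [
        name for name in second_records[0]
        if name not in (key, target) and (associated is None or name in associated)
    ]
    x_items = [[record[name] for name in variables] for record in second_records]
    y_items = [record[target] for record in second_records]
    return variables, x_items, y_items


# Complete records taken from Table 5 (elided columns omitted).
second_records = [
    {"id": 123, "age": 30, "level": "High", "owns_home": 1, "income": 40000, "owns_car": 1},
    {"id": 125, "age": 20, "level": "Medium", "owns_home": 0, "income": 25000, "owns_car": 0},
    {"id": 127, "age": 40, "level": "Low", "owns_home": 0, "income": 10000, "owns_car": 1},
]
variables, x_items, y_items = build_training_set(second_records, target="owns_car")
```

Calling the same function once per feature that has missing values (with a different `target` each time) yields one training dataset per feature, mirroring the one-table-per-feature layout described for Tables 6 to 9.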
S3043, determine the type of the feature corresponding to the missing feature value.
In one embodiment of the present invention, the type includes one or more of multi-class features, binary features, and continuous features. For example, consumption level may be a multi-class feature whose feature values are high, medium, and low; owns home and owns car are binary features whose feature values are 1 and 0; and monthly income is a continuous feature whose feature values range over the continuous numbers greater than 0.
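One way to picture the type determination of S3043 is a simple heuristic over the observed values. The patent does not say how the type is determined, so the rule below (at most two distinct values means binary, otherwise all-numeric means continuous, otherwise multi-class) and the name `infer_feature_type` are assumptions for illustration only; a numerically coded multi-class feature would be misjudged by it.

```python
def infer_feature_type(values):
    """Guess a feature's type from its observed (non-missing) values."""
    observed = {v for v in values if v is not None}
    if len(observed) <= 2:
        return "binary"
    if all(isinstance(v, (int, float)) for v in observed):
        return "continuous"
    return "multiclass"


owns_car_type = infer_feature_type([1, 0, None, 1])
level_type = infer_feature_type(["High", "Medium", "Low", "Low"])
income_type = infer_feature_type([40000, 25000, 10000, None])
```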
S3044, determine the training algorithm according to the type.
For example, if the type is a multi-class feature, a Bayesian network algorithm or a random forest algorithm may be selected; if the type is a binary feature, a decision tree or a logistic regression algorithm may be selected; if the type is a continuous feature, linear regression or a neural network algorithm may be selected.
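The selection in S3044 is essentially a lookup from feature type to candidate algorithms. A minimal sketch follows; the type labels, the algorithm identifiers, and the `preference` index are hypothetical names chosen for this example, and only the pairing of types with algorithms comes from the text above.

```python
# Candidate algorithms per feature type, as listed in the description.
CANDIDATE_ALGORITHMS = {
    "multiclass": ("bayesian_network", "random_forest"),
    "binary": ("decision_tree", "logistic_regression"),
    "continuous": ("linear_regression", "neural_network"),
}


def choose_algorithm(feature_type, preference=0):
    """Pick one of the candidate algorithms for the given feature type."""
    return CANDIDATE_ALGORITHMS[feature_type][preference]


algorithm_for_income = choose_algorithm("continuous")
algorithm_for_car = choose_algorithm("binary", preference=1)
```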
S3045, train on the target training item and the variable training items corresponding to the associated features according to the training algorithm, to generate the extended model.
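As a concrete instance of S3045 for a binary feature, the sketch below trains a logistic regression model by batch gradient descent using only the standard library. It is an illustration under assumptions, not the patent's training procedure: the learning rate, epoch count, scaling of age to `age / 100`, and the choice of age and home ownership as the associated features of car ownership are all hypothetical. The three training rows reuse the complete records of Table 5.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, x):
    """Probability that the binary target feature equals 1."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def log_loss(weights, bias, xs, ys):
    """Mean negative log-likelihood of the labels under the model."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict(weights, bias, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def train_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(xs, ys):
            err = predict(weights, bias, x) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / len(xs)
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad_w)]
    return weights, bias


# Variable training items: [age / 100, owns_home]; target training item: owns_car.
xs = [[0.30, 1], [0.20, 0], [0.40, 0]]
ys = [1, 0, 1]
weights, bias = train_logistic(xs, ys)
p_car = predict(weights, bias, [0.35, 1])  # a record whose owns_car value is missing
```

The model outputs a probability rather than a hard 0 or 1, so a filled-in binary feature value would be a fraction between 0 and 1. With only three training rows the fitted numbers carry no real meaning; the point is the shape of the training step.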
S305, extend the missing feature values in the first data records according to the extended model.
S306, merge the second data records with the extended first data records to generate a second feature wide table.
For example, the second feature wide table finally generated is shown in Table 10.
ID | Age | Consumption level | Owns home | Monthly income | Owns car | …… | Update time |
…… | …… | …… | …… | …… | …… | …… | …… |
123 | 30 | High | 1 | 40000 | 1 | …… | 2014/9/1 |
124 | 35 | Medium | 1 | 25000 | 0.95 | …… | 2014/9/10 |
125 | 20 | Medium | 0 | 25000 | 0 | …… | 2014/9/10 |
126 | 28 | Medium | 0 | 15000 | 0.45 | …… | 2014/9/10 |
127 | 40 | Low | 0 | 10000 | 1 | …… | 2014/9/10 |
128 | 19 | Low | 0 | 8000 | 0.13 | …… | 2014/9/10 |
129 | 22 | Low | 0 | 5000 | 0.02 | …… | 2014/9/10 |
131 | 35 | Medium | 0.68 | 23000 | 1 | …… | 2014/8/12 |
132 | 35 | Medium | 0.27 | 16000 | 0 | …… | 2014/8/13 |
133 | 45 | High | 0.93 | 38000 | 1 | …… | 2014/8/14 |
…… | …… | …… | …… | …… | …… | …… | …… |
Table 10
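Steps S305 and S306 can be sketched as filling each missing value with the prediction of the model for that feature and then merging the result with the complete records. In this minimal illustration the `models` mapping, the function names, and the lambda are all hypothetical; the lambda is only a placeholder standing in for a trained extended model.

```python
def extend_records(first_records, models):
    """Fill each missing (None) value using the model for that feature."""
    extended = []
    for record in first_records:
        filled = dict(record)  # leave the input record untouched
        for name, value in record.items():
            if value is None:
                filled[name] = models[name](record)
        extended.append(filled)
    return extended


def build_second_wide_table(second_records, extended_records, key="id"):
    """Merge complete and extended records into one table ordered by primary key."""
    return sorted(second_records + extended_records, key=lambda r: r[key])


# Placeholder model: a stand-in for a trained extended model, not a real one.
models = {"owns_car": lambda record: 1.0 if record["owns_home"] == 1 else 0.0}
first_records = [{"id": 124, "age": 35, "owns_home": 1, "owns_car": None}]
second_records = [
    {"id": 123, "age": 30, "owns_home": 1, "owns_car": 1},
    {"id": 125, "age": 20, "owns_home": 0, "owns_car": 0},
]
second_wide_table = build_second_wide_table(
    second_records, extend_records(first_records, models)
)
```

Each model receives the original record, so a record with several missing features is filled from its observed values only; feeding already-filled values into later models would be a different design choice.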
The data processing method of this embodiment determines the type of the feature corresponding to the missing feature value, determines the training algorithm according to that type, and then trains on the target training item and the variable training items corresponding to the associated features according to the training algorithm to generate the extended model. The extended model can therefore be customized to different data types.
It should be understood that Tables 1 to 10 above are examples given to facilitate understanding of the embodiments of the present invention and are not to be taken as limiting the scope of the present invention.
To implement the above embodiments, the embodiments of the present invention also propose a data processing device.
Fig. 6 is a structural block diagram of a data processing device according to one embodiment of the present invention.
As shown in Fig. 6, the data processing device 10 includes: an acquisition module 100, a matching and merging module 200, an extraction module 300, a generation module 400, and an extension module 500.
Specifically, the acquisition module 100 is configured to obtain multiple data sources. In one embodiment of the present invention, the multiple data sources describe the same subject. That is, the origins of the data sources may differ, but all of them must describe the same kind of thing, such as people or companies. Specifically, each data source may be a feature wide table that includes multiple data records, each of which also includes a primary key and multiple feature values. Alternatively, each data source may consist of multiple feature narrow tables describing multiple feature values of the same thing, in which case the narrow tables must first be merged into one feature wide table. The embodiments of the present invention place no limit on how the data sources are represented. For specific examples of data sources, refer to the description of the method, which is not repeated here.
The matching and merging module 200 is configured to match and merge the multiple data sources to generate a first feature wide table, wherein the first feature wide table includes multiple data records and each data record includes a primary key and multiple feature values. More specifically, with the same subject as the primary key, the multiple data sources are merged into one first feature wide table. That is, the data sources are matched under the same data framework according to a unique primary key (e.g., a primary key generated by combining a person's name, ID card number, and so on).
The extraction module 300 is configured to extract, from the first feature wide table, the first data records that have missing feature values and the second data records that have no missing feature values.
The generation module 400 is configured to generate an extended model of the missing feature values according to the second data records. More specifically, the second data records are the data records without missing feature values, that is, the complete data records. The extended model of the missing feature values is generated from the complete data records using data mining techniques, so that the missing feature values in the first data records can be repaired using the feature values of the complete data records.
The extension module 500 is configured to extend the missing feature values in the first data records according to the extended model to generate a second feature wide table.
The data processing device of the embodiment of the present invention fuses multiple data sources into one first feature wide table, and uses the data records without missing feature values in the first feature wide table to repair the missing feature values in the data records that have them, thereby generating a second feature wide table without missing feature values and ensuring the completeness of the data. Because the missing feature values are repaired from the complete data records of the same table, the features can fully cross-influence one another, so the generated second feature wide table also ensures the consistency of the data. Using the second feature wide table for external recommendation, screening, risk assessment, and the like yields better results.
Fig. 7 is a structural block diagram of a data processing device according to another embodiment of the present invention.
As shown in Fig. 7, the data processing device 10 includes: an acquisition module 100, a matching and merging module 200, an extraction module 300, a generation module 400, an extension module 500, and a merging module 600.
Specifically, the merging module 600 is configured to merge the second data records with the extended first data records to generate a second feature wide table.
The device of this embodiment extracts, from the first feature wide table, the first data records that have missing feature values and the second data records that have none, and merges the extended first data records with the second data records to generate the second feature wide table, which simplifies the steps and improves efficiency.
Fig. 8 is a structural block diagram of the generation module 400 according to one embodiment of the present invention.
As shown in Fig. 8, the generation module 400 includes a first generation unit 410 and a second generation unit 420.
Specifically, the first generation unit 410 is configured to take the feature values that the second data records hold for the feature corresponding to a missing feature value as the target training item, and to take the remaining feature values in the second data records as the variable training items, to generate a training dataset.
The second generation unit 420 is configured to generate an extended model of the missing feature values according to the training dataset.
In one embodiment of the invention, the second signal generating unit 420 includes that first determines subelement, extraction
Subelement, second determine subelement, the 3rd determine subelement and training subelement.
Wherein, first determine subelement for determine disappearance eigenvalue characteristic of correspondence linked character,
Wherein can determine linked character according to knowledge of statistics, not repeat them here.
Extract subelement and extract, for concentrating from training data, the variable training item that linked character is corresponding,
Second determines that subelement is for determining the type of disappearance eigenvalue characteristic of correspondence.In the present invention one
In individual embodiment, type includes one or more in many characteristic of divisions, two characteristic of divisions and continuous feature,
Such as, consumption grade can be characteristic of division more than, and its eigenvalue includes high, medium and low, has room and has
Car is two characteristic of divisions, and its eigenvalue includes 1 and 0, and monthly income is continuous feature, the scope of its eigenvalue
For the consecutive numbers more than 0.
3rd determines that subelement is for determining training algorithm according to type.Such as, if type is many classification
Feature, can select Bayes's (Bayes) network algorithm or random forests algorithm;If type is two points
Category feature, can be with trade-off decision tree or Logistic algorithm;If type is continuous feature, can select
Linear regression or neural network algorithm.
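The selection made by the third determining subunit amounts to a lookup from feature type to candidate algorithms. The sketch below encodes the examples given above. The type labels, the algorithm names as strings, and the rule of defaulting to the first listed candidate are assumptions for illustration.

```python
# Candidate training algorithms per feature type, following the examples in the text.
ALGORITHMS_BY_TYPE = {
    "multi_class": ("bayesian_network", "random_forest"),
    "binary": ("decision_tree", "logistic_regression"),
    "continuous": ("linear_regression", "neural_network"),
}


def choose_training_algorithm(feature_type, preferred=None):
    """Return a training algorithm name for the given feature type.

    Defaulting to the first candidate is an arbitrary choice of this sketch.
    """
    options = ALGORITHMS_BY_TYPE[feature_type]
    if preferred is None:
        return options[0]
    if preferred not in options:
        raise ValueError(f"{preferred!r} is not listed for type {feature_type!r}")
    return preferred


default_for_binary = choose_training_algorithm("binary")
chosen_for_continuous = choose_training_algorithm("continuous", "neural_network")
```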
The training subunit is configured to train, according to the training algorithm, on the target training item and the variable training items corresponding to the associated features, so as to generate the extended model.
The data processing device of the embodiment of the present invention determines the type of the feature corresponding to the missing feature value, determines the training algorithm according to the type, and then trains on the target training item and the variable training items corresponding to the associated features according to that algorithm to generate the extended model. The extended model can thus be customized for different data types.
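For a continuous feature, the training step can be sketched with ordinary least-squares linear regression on a single associated feature. This is a deliberately small instance under stated assumptions: one associated feature (`age`), one target (`monthly_income`), and made-up sample values. The embodiment itself does not limit the number of associated features or fix the fitting procedure.

```python
def fit_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares; return a predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("associated feature is constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(x):
        return slope * x + intercept

    return predict


# Variable training items (age) and target training items (monthly income)
# taken from complete records; the values are illustrative only.
ages = [25, 35, 45]
incomes = [4000.0, 6000.0, 8000.0]
extended_model = fit_linear_regression(ages, incomes)

# Extend the missing monthly income of a record whose age is 40.
predicted_income = extended_model(40)
```

The returned predictor stands in for the extended model: applying it to the associated feature value of a first data record yields the value used to fill the gap.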
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and the like.
In the description of this specification, a description referring to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the present invention is defined by the claims and their equivalents.
Claims (12)
1. A data processing method, characterized by comprising:
obtaining a plurality of data sources, and matching and merging the plurality of data sources to generate a first feature wide table, wherein the first feature wide table includes a plurality of data records, and each data record includes a primary key and a plurality of feature values;
extracting, from the first feature wide table, a first data record having a missing feature value and a second data record having no missing feature value;
generating an extended model of the missing feature value according to the second data record; and
extending the missing feature value in the first data record according to the extended model to generate a second feature wide table.
2. The data processing method according to claim 1, characterized by further comprising:
merging the second data record and the extended first data record to generate the second feature wide table.
3. The data processing method according to claim 1 or 2, characterized in that generating the extended model of the missing feature value according to the second data record comprises:
taking the feature value, in the second data record, of the feature corresponding to the missing feature value as a target training item, and taking the remaining feature values in the second data record as variable training items, to generate a training dataset; and
generating the extended model of the missing feature value according to the training dataset.
4. The data processing method according to claim 3, characterized in that generating the extended model of the missing feature value according to the training dataset comprises:
determining associated features of the feature corresponding to the missing feature value;
extracting, from the training dataset, the variable training items corresponding to the associated features;
determining the type of the feature corresponding to the missing feature value;
determining a training algorithm according to the type; and
training, according to the training algorithm, on the target training item and the variable training items corresponding to the associated features to generate the extended model.
5. The data processing method according to claim 1, characterized in that the plurality of data sources describe the same human subject.
6. The data processing method according to claim 4, characterized in that the type includes one or more of a multi-class feature, a binary feature, and a continuous feature.
7. A data processing device, characterized by comprising:
an acquisition module, configured to obtain a plurality of data sources;
a matching and merging module, configured to match and merge the plurality of data sources to generate a first feature wide table, wherein the first feature wide table includes a plurality of data records, and each data record includes a primary key and a plurality of feature values;
an extraction module, configured to extract, from the first feature wide table, a first data record having a missing feature value and a second data record having no missing feature value;
a generation module, configured to generate an extended model of the missing feature value according to the second data record; and
an extension module, configured to extend the missing feature value in the first data record according to the extended model to generate a second feature wide table.
8. The data processing device according to claim 7, characterized by further comprising:
a merging module, configured to merge the second data record and the extended first data record to generate the second feature wide table.
9. The data processing device according to claim 7 or 8, characterized in that the generation module comprises:
a first generating unit, configured to take the feature value, in the second data record, of the feature corresponding to the missing feature value as a target training item, and to take the remaining feature values in the second data record as variable training items, to generate a training dataset; and
a second generating unit, configured to generate the extended model of the missing feature value according to the training dataset.
10. The data processing device according to claim 9, characterized in that the second generating unit comprises:
a first determining subunit, configured to determine associated features of the feature corresponding to the missing feature value;
an extracting subunit, configured to extract, from the training dataset, the variable training items corresponding to the associated features;
a second determining subunit, configured to determine the type of the feature corresponding to the missing feature value;
a third determining subunit, configured to determine a training algorithm according to the type; and
a training subunit, configured to train, according to the training algorithm, on the target training item and the variable training items corresponding to the associated features to generate the extended model.
11. The data processing device according to claim 7, characterized in that the plurality of data sources describe the same human subject.
12. The data processing device according to claim 10, characterized in that the type includes one or more of a multi-class feature, a binary feature, and a continuous feature.
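The overall flow of claims 1 and 2 (merge the data sources on the primary key, split the wide table into first and second data records, generate a model from the second data records, extend the first data records, and merge the result) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: records are dicts, the primary key field `id` is hypothetical, only one feature is checked for missing values, and the extended model is a deliberately trivial stand-in that predicts the most common value seen in the second data records.

```python
from collections import Counter


def merge_sources(sources, primary_key="id"):
    """Match rows from all sources on the primary key into one wide table."""
    wide = {}
    for source in sources:
        for row in source:
            key = row[primary_key]
            wide.setdefault(key, {primary_key: key}).update(row)
    return list(wide.values())


def split_records(wide_table, feature):
    """Return (first records: feature missing, second records: feature present)."""
    first = [r for r in wide_table if r.get(feature) is None]
    second = [r for r in wide_table if r.get(feature) is not None]
    return first, second


def train_mode_model(second_records, feature):
    """Stand-in extended model: always predicts the most common observed value."""
    if not second_records:
        raise ValueError("no complete records to train on")
    mode = Counter(r[feature] for r in second_records).most_common(1)[0][0]
    return lambda record: mode


def build_second_wide_table(wide_table, feature):
    """Extend missing values in the first records and merge with the second records."""
    first, second = split_records(wide_table, feature)
    model = train_mode_model(second, feature)
    extended = [{**r, feature: model(r)} for r in first]
    return second + extended


source_a = [{"id": 1, "age": 30}, {"id": 2, "age": 45}, {"id": 3, "age": 28}]
source_b = [{"id": 1, "consumption_level": "high"},
            {"id": 2, "consumption_level": "high"}]
first_wide_table = merge_sources([source_a, source_b])
second_wide_table = build_second_wide_table(first_wide_table, "consumption_level")
```

In practice the stand-in model would be replaced by one trained with an algorithm chosen according to the feature type, as recited in claims 4 and 10.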
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510126382.7A CN106033473A (en) | 2015-03-20 | 2015-03-20 | Data processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106033473A true CN106033473A (en) | 2016-10-19 |
Family
ID=57149566
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201510126382.7A | Pending | CN106033473A (en) | 2015-03-20 | 2015-03-20 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106033473A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108241934A (en) * | 2016-12-23 | 2018-07-03 | 北京京东尚科信息技术有限公司 | Data query method and apparatus |
CN108921229A (en) * | 2018-07-17 | 2018-11-30 | 成都西加云杉科技有限公司 | Data reconstruction method and device |
CN108962386A (en) * | 2017-05-27 | 2018-12-07 | ***通信有限公司研究院 | A kind of data processing method, apparatus and system |
CN110555070A (en) * | 2018-06-01 | 2019-12-10 | 百度在线网络技术(北京)有限公司 | Method and apparatus for outputting information |
CN110785749A (en) * | 2018-06-25 | 2020-02-11 | 北京嘀嘀无限科技发展有限公司 | System and method for generating wide tables |
CN111178536A (en) * | 2019-11-26 | 2020-05-19 | 腾讯云计算(北京)有限责任公司 | Data information processing method and device, electronic equipment and storage medium |
CN113535817A (en) * | 2021-07-13 | 2021-10-22 | 浙江网商银行股份有限公司 | Method and device for generating characteristic broad table and training business processing model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1372667A (en) * | 1999-12-30 | 2002-10-02 | Ge资本商业财务公司 | Valuation prediction models in situations with missing inputs |
US20070168545A1 (en) * | 2006-01-18 | 2007-07-19 | Venkat Venkatsubra | Methods and devices for processing incomplete data packets |
CN104133866A (en) * | 2014-07-18 | 2014-11-05 | 国家电网公司 | Intelligent-power-grid-oriented missing data filling method |
Non-Patent Citations (2)
Title |
---|
张红霞: "缺失值填充:基于信息增益的方法", 《计算机工程与设计》 * |
鲁均云: "重复和不完整数据的清理方法研究及应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108241934A (en) * | 2016-12-23 | 2018-07-03 | 北京京东尚科信息技术有限公司 | Data query method and apparatus |
CN108241934B (en) * | 2016-12-23 | 2021-02-26 | 北京京东尚科信息技术有限公司 | Data query method and device |
CN108962386A (en) * | 2017-05-27 | 2018-12-07 | ***通信有限公司研究院 | A kind of data processing method, apparatus and system |
CN110555070A (en) * | 2018-06-01 | 2019-12-10 | 百度在线网络技术(北京)有限公司 | Method and apparatus for outputting information |
CN110555070B (en) * | 2018-06-01 | 2021-10-22 | 百度在线网络技术(北京)有限公司 | Method and apparatus for outputting information |
CN110785749A (en) * | 2018-06-25 | 2020-02-11 | 北京嘀嘀无限科技发展有限公司 | System and method for generating wide tables |
CN110785749B (en) * | 2018-06-25 | 2020-08-21 | 北京嘀嘀无限科技发展有限公司 | System and method for generating wide tables |
US11061882B2 (en) | 2018-06-25 | 2021-07-13 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for generating a wide table |
CN108921229A (en) * | 2018-07-17 | 2018-11-30 | 成都西加云杉科技有限公司 | Data reconstruction method and device |
CN111178536A (en) * | 2019-11-26 | 2020-05-19 | 腾讯云计算(北京)有限责任公司 | Data information processing method and device, electronic equipment and storage medium |
CN113535817A (en) * | 2021-07-13 | 2021-10-22 | 浙江网商银行股份有限公司 | Method and device for generating characteristic broad table and training business processing model |
CN113535817B (en) * | 2021-07-13 | 2024-05-14 | 浙江网商银行股份有限公司 | Feature broad table generation and service processing model training method and device |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1229909; Country of ref document: HK |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161019 |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: WD; Ref document number: 1229909; Country of ref document: HK |