CN109146083A - Feature coding method and apparatus - Google Patents

Feature coding method and apparatus

Info

Publication number
CN109146083A
Authority
CN
China
Prior art keywords
value
coding mode
vector
value set
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810887448.8A
Other languages
Chinese (zh)
Other versions
CN109146083B (en
Inventor
宋乐
李辉
葛志邦
黄鑫
王琳
朱冠胤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201810887448.8A
Publication of CN109146083A
Application granted
Publication of CN109146083B
Legal status: Active
Anticipated expiration

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Embodiments of this specification provide a feature coding method and apparatus. The method includes: obtaining a variable value of a feature variable related to a business objective, the variable value being of a non-numeric type; selecting, from multiple feature coding modes, a target coding mode corresponding to the variable value according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes, where the multiple value sets are divided according to the pre-assessed discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature coding modes are used to encode the values in the corresponding value sets into multiple vector spaces; encoding the variable value, using the target coding mode, into a target vector in a target vector space, the target vector space being the vector space among the multiple vector spaces that corresponds to the target coding mode; and determining a feature vector of the feature variable based on the target vector. In this way the model does not lose useful information, the length of the feature code is reduced, and the coding retains a degree of generalization.

Description

Feature coding method and apparatus
Technical field
One or more embodiments of this specification relate to the computer field, and in particular to a feature coding method and apparatus.
Background
In classical data modeling scenarios, much of the data is presented as variables of non-numeric types. For example, a user purchases a commodity; the commodity belongs to a first-level category and a second-level category, and has its own title and type; the commodity belongs to a shop (identified by the shop's nickname, nick, or its identifier, id); the user logs in from some Internet Protocol (IP) address, some Wireless Fidelity (Wi-Fi) network, or some physical device; and the transaction is ultimately found to be a sham transaction, a cash-out (arbitrage) transaction, a stolen-card transaction, or the like.
The scenario described above contains a large amount of behavioral information, and this information usually appears in the data structure as variables of non-numeric types. Classical machine learning algorithms generally require such variables to be feature-coded before they can be used as model input, and the way these non-numeric variables are coded often affects the final performance of the model.
In the prior art, one-hot coding is generally used to code variables of non-numeric types. One-hot coding enumerates all categories to form a vector whose length equals the number of categories; the vector position corresponding to the category that is hit is 1, and all other positions are 0.
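The one-hot coding just described can be sketched in a few lines. This is a minimal illustration, not code from the patent; the function name `one_hot` and the category list are hypothetical.

```python
def one_hot(value, categories):
    """Return a one-hot list for `value` over the ordered list `categories`."""
    index = categories.index(value)  # raises ValueError for an unseen value
    # The vector length equals the number of categories; only the hit position is 1.
    return [1 if i == index else 0 for i in range(len(categories))]


# Hypothetical first-level commodity categories, for illustration only.
categories = ["clothing", "electronics", "food"]
vector = one_hot("electronics", categories)
```

Because the vector length grows with the number of categories, a variable with hundreds of thousands of values yields vectors of the same length, which is the cost discussed next.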
When a non-numeric variable has a very large number of possible values, as with IP addresses or device ids, one-hot coding all values directly tends to produce a very large, sparse matrix. The model then has too many parameters, which makes subsequent model deployment very costly.
An improved scheme is therefore desired that lets the model keep useful information, reduces the length of the feature code, and retains a degree of generalization.
Summary of the invention
One or more embodiments of this specification describe a feature coding method and apparatus that let the model keep useful information, reduce the length of the feature code, and retain a degree of generalization.
In a first aspect, a feature coding method is provided, comprising:
obtaining a variable value of a feature variable related to a business objective, the variable value being of a non-numeric type;
selecting, from multiple feature coding modes, a target coding mode corresponding to the variable value according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes, wherein the multiple value sets are divided according to the pre-assessed discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature coding modes are used to encode the values in the corresponding value sets into multiple vector spaces;
encoding the variable value, using the target coding mode, into a target vector in a target vector space, the target vector space being the vector space among the multiple vector spaces that corresponds to the target coding mode; and
determining a feature vector of the feature variable based on the target vector.
In a possible embodiment, the discrimination of each possible value of the feature variable with respect to the business objective is assessed from the prior probability that the business objective is reached for that value.
In a possible embodiment, the multiple value sets include a first value set and a second value set, and the values in the first value set have a higher average discrimination with respect to the business objective than those in the second value set. The multiple feature coding modes include a first feature coding mode corresponding to the first value set and a second feature coding mode corresponding to the second value set, and the coding compression rate of the first feature coding mode is lower than that of the second feature coding mode.
Further, the first feature coding mode is a full coding mode that encodes the corresponding values into a first vector space whose dimension equals the number of values in the first value set;
the second feature coding mode is a compressed coding mode that encodes the corresponding values into a second vector space whose dimension is smaller than the number of values in the second value set.
Further, the first feature coding mode is one-hot coding and the second feature coding mode is hash coding.
Further, the second feature coding mode comprises:
mapping a value in the second value set to a numerical value in a numerical space using a numeric hash function;
taking the numerical value modulo a predetermined number p and mapping the result to a p-dimensional vector, where the predetermined number p is smaller than the number of values in the second value set.
Further, the multiple value sets are divided by the following steps:
sorting all possible values of the feature variable in descending order of their discrimination with respect to the business objective;
selecting, in that order, a first number of values as the first value set;
taking the values other than the first number of values as the second value set.
Further, the multiple value sets are divided by the following steps:
obtaining the discrimination of every possible value of the feature variable with respect to the business objective;
adding values whose discrimination is greater than or equal to a first threshold to the first value set;
adding values whose discrimination is less than the first threshold to the second value set.
Further, the second value set includes a first subset and a second subset, and the values in the first subset have a higher average discrimination with respect to the business objective than those in the second subset. The second feature coding mode includes a first compressed coding mode corresponding to the first subset and a second compressed coding mode corresponding to the second subset, and the coding compression rate of the first compressed coding mode is lower than that of the second compressed coding mode.
Further, the second feature coding mode is hash coding, and:
the first compressed coding mode comprises:
mapping a value in the first subset to a first numerical value using a numeric hash function;
taking the first numerical value modulo a first number and mapping the result to a vector whose dimension is the first number;
the second compressed coding mode comprises:
mapping a value in the second subset to a second numerical value using the numeric hash function;
taking the second numerical value modulo a second number and mapping the result to a vector whose dimension is the second number;
where the first number is greater than the second number.
In a possible embodiment, determining the feature vector of the feature variable based on the target vector comprises:
filling at least one other vector space, among the multiple vector spaces, other than the target vector space with a default value to obtain at least one default vector;
concatenating the target vector and the at least one default vector to obtain the feature vector.
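As a rough check on the resulting length, consider the case of exactly two value sets: S_1 under full coding and S_2 under compressed coding with modulus p. The symbols S_1, S_2, N, and d are introduced here for illustration and do not appear in the original. Under that assumption, the concatenated feature vector has dimension

```latex
d = |S_1| + p, \qquad p < |S_2| \;\Longrightarrow\; d < |S_1| + |S_2| = N
```

where N is the total number of possible values, that is, the length a direct one-hot code over all values would have.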
In a second aspect, a feature coding apparatus is provided, comprising:
an acquiring unit, configured to obtain a variable value of a feature variable related to a business objective, the variable value being of a non-numeric type;
a selecting unit, configured to select, from multiple feature coding modes, a target coding mode corresponding to the variable value obtained by the acquiring unit, according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes, wherein the multiple value sets are divided according to the pre-assessed discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature coding modes are used to encode the values in the corresponding value sets into multiple vector spaces;
a coding unit, configured to encode the variable value obtained by the acquiring unit, using the target coding mode selected by the selecting unit, into a target vector in a target vector space, the target vector space being the vector space among the multiple vector spaces that corresponds to the target coding mode;
a determining unit, configured to determine the feature vector of the feature variable based on the target vector obtained by the coding unit.
In a third aspect, a computer-readable storage medium is provided, storing a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, a computing device is provided, comprising a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method of the first aspect is implemented.
With the method and apparatus provided in the embodiments of this specification, after a variable value of a feature variable related to a business objective is obtained, the variable value being of a non-numeric type, a target coding mode corresponding to the variable value is first selected from multiple feature coding modes according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes. The variable value is then encoded, using the target coding mode, into a target vector in a target vector space, the target vector space being the vector space among the multiple vector spaces that corresponds to the target coding mode. The feature vector of the feature variable is then determined based on the target vector. The multiple value sets are divided according to the pre-assessed discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature coding modes are used to encode the values in the corresponding value sets into multiple vector spaces. As a result, the model does not lose useful information, the length of the feature code is reduced, and the coding retains a degree of generalization.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;
Fig. 2 shows a flowchart of a feature coding method according to an embodiment;
Fig. 3 shows a schematic block diagram of a feature coding apparatus according to an embodiment.
Detailed description
The solutions provided in this specification are described below with reference to the drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. As shown in Fig. 1, when a machine learning model is trained, training data is used as input to the model. The training data includes variables of non-numeric types, such as product names, shop names, and buyers' IP addresses, and these variables can be used as model input only after feature coding. The embodiments of this specification mainly concern a method for feature-coding variables of non-numeric types; the method takes into account the discrimination of each variable value with respect to the business objective.
It should be understood that the machine learning model shown in Fig. 1 is merely illustrative and does not limit the machine learning algorithm in the embodiments of this specification. For example, the machine learning algorithm may use supervised learning, unsupervised learning, or reinforcement learning; details are not repeated here.
Fig. 2 shows a flowchart of a feature coding method according to an embodiment. As shown in Fig. 2, the feature coding method in this embodiment includes the following steps. Step 21: obtain a variable value of a feature variable related to a business objective, the variable value being of a non-numeric type. Step 22: select, from multiple feature coding modes, a target coding mode corresponding to the variable value according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes, where the multiple value sets are divided according to the pre-assessed discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature coding modes are used to encode the values in the corresponding value sets into multiple vector spaces. Step 23: encode the variable value, using the target coding mode, into a target vector in a target vector space, the target vector space being the vector space among the multiple vector spaces that corresponds to the target coding mode. Step 24: determine the feature vector of the feature variable based on the target vector. The specific execution of each step is described below.
First, in step 21, a variable value of a feature variable related to the business objective is obtained, the variable value being of a non-numeric type. It should be understood that the feature variables related to a business objective may include feature variables of numeric types as well as of non-numeric types. Because the feature coding method in the embodiments of this specification is applied to feature variables of non-numeric types, step 21 refers only to obtaining the variable value of a non-numeric feature variable.
In addition, the same variable often behaves differently for different business objectives. Educational background, for example, is a very typical non-numeric variable: the higher the educational background, the lower the credit risk is likely to be, but a higher educational background may have little relationship with consumption preferences. When the business objective is to determine whether credit risk is low, educational background is a feature variable related to the business objective; when the business objective is to determine a user's consumption preferences, educational background is not a feature variable related to the business objective.
Then, in step 22, a target coding mode corresponding to the variable value is selected from multiple feature coding modes according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes, where the multiple value sets are divided according to the pre-assessed discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature coding modes are used to encode the values in the corresponding value sets into multiple vector spaces.
In one example, the discrimination of each possible value of the feature variable with respect to the business objective is assessed from the prior probability that the business objective is reached for that value.
Specifically, once the business objective to be solved is determined, the ability of each value of each feature variable to distinguish the business objective can be assessed, yielding the discrimination of each possible value of the feature variable with respect to the business objective. It should be understood that the stronger a value's ability to distinguish the business objective, the higher its discrimination. For example, in a cash-out (arbitrage) scenario, offenders often buy commodities from wholesale-type shops that are not flagship stores, sellers sometimes provide a cash-out service to buyers, and some commodities even exist solely for cash-out. In other words, each commodity and each seller has a different prior probability of cash-out. If a feature variable has N different values, the differing abilities of these N values to distinguish the business objective can be obtained.
Prior probability refers to a probability obtained from past experience and analysis. It can be divided into objective and subjective prior probability: a prior probability calculated from historical data is called an objective prior probability; when historical data is unavailable or incomplete, a prior probability obtained from people's subjective judgment and experience is called a subjective prior probability.
It should be understood that the prior probability mentioned in the embodiments of this specification may be either an objective or a subjective prior probability.
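The text does not give a formula for turning a prior probability into a discrimination score. The sketch below is one possible reading under a stated assumption: it estimates an objective prior probability for each value from historical records and scores a value by how far its rate deviates from the overall rate. The scoring rule, the function name `discrimination_scores`, and the sample data are all hypothetical.

```python
from collections import Counter


def discrimination_scores(records):
    """records: iterable of (value, hit) pairs; hit is True when the
    business objective (for example, a confirmed cash-out) was reached."""
    totals, hits = Counter(), Counter()
    for value, hit in records:
        totals[value] += 1
        hits[value] += int(hit)
    base_rate = sum(hits.values()) / sum(totals.values())
    # Assumed score: absolute gap between the value's empirical prior
    # probability and the overall rate. Other measures are equally plausible.
    return {v: abs(hits[v] / totals[v] - base_rate) for v in totals}


# Hypothetical transaction history keyed by shop.
history = [
    ("shop_a", True), ("shop_a", True),
    ("shop_b", False), ("shop_b", False),
    ("shop_c", True), ("shop_c", False),
]
scores = discrimination_scores(history)
```

In this toy history the overall rate is 0.5, so shop_a and shop_b (always or never hit) score 0.5 while shop_c (hit half the time) scores 0. The function assumes a non-empty history.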
In one example, the multiple value sets include a first value set and a second value set, and the values in the first value set have a higher average discrimination with respect to the business objective than those in the second value set. The multiple feature coding modes include a first feature coding mode corresponding to the first value set and a second feature coding mode corresponding to the second value set, and the coding compression rate of the first feature coding mode is lower than that of the second feature coding mode. That is, the higher a value's discrimination with respect to the business objective, the lower the coding compression rate of the feature coding mode applied to it.
It should be understood that the number of value sets in the multiple value sets is not limited in the embodiments of this specification; it may be two, three, or more.
Further, the first feature coding mode is a full coding mode that encodes the corresponding values into a first vector space whose dimension equals the number of values in the first value set; the second feature coding mode is a compressed coding mode that encodes the corresponding values into a second vector space whose dimension is smaller than the number of values in the second value set.
For example, the first feature coding mode is one-hot coding and the second feature coding mode is hash coding.
In one example, the second feature coding mode comprises:
mapping a value in the second value set to a numerical value in a numerical space using a numeric hash function;
taking the numerical value modulo a predetermined number p and mapping the result to a p-dimensional vector, where the predetermined number p is smaller than the number of values in the second value set.
In one example, the multiple value sets are divided by the following steps:
sorting all possible values of the feature variable in descending order of their discrimination with respect to the business objective;
selecting, in that order, a first number of values as the first value set;
taking the values other than the first number of values as the second value set.
For example, the feature variable represents commodities and there are 100,000 commodities in total. After the commodities are sorted from high to low by their discrimination with respect to the business objective, the commodities with, for example, the 100 highest discriminations can be selected first for one-hot coding.
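The ranking-based division above can be sketched as follows, assuming the discrimination scores are already available as a dictionary. The function name `split_top_k` and the sample scores are hypothetical.

```python
def split_top_k(scores, k):
    """Split values into (first_set, second_set) by descending discrimination."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # The first k values go to full (for example one-hot) coding,
    # the rest to a compressed coding mode.
    return ranked[:k], ranked[k:]


# Hypothetical discrimination scores for four commodities.
scores = {"item_1": 0.9, "item_2": 0.1, "item_3": 0.6, "item_4": 0.3}
first_set, second_set = split_top_k(scores, k=2)
```

Choosing k fixes the length of the one-hot part in advance, which is convenient when the model's input width is budgeted up front.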
In another example, the multiple value sets are divided by the following steps:
obtaining the discrimination of every possible value of the feature variable with respect to the business objective;
adding values whose discrimination is greater than or equal to a first threshold to the first value set;
adding values whose discrimination is less than the first threshold to the second value set.
For example, the feature variable represents commodities and there are 100,000 commodities in total. The discrimination of each commodity with respect to the business objective (for example, a number between 0 and 10) is determined from the commodity's ability to distinguish the business objective, and commodities with a discrimination greater than, for example, 5 can be selected first for one-hot coding.
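The threshold-based division can be sketched in the same style. The function name `split_by_threshold` and the sample scores are hypothetical; the comparison follows the listed steps (greater than or equal to the threshold).

```python
def split_by_threshold(scores, threshold):
    """Split values into (first_set, second_set) around a discrimination threshold."""
    first_set = {v for v, s in scores.items() if s >= threshold}
    second_set = set(scores) - first_set
    return first_set, second_set


# Hypothetical scores on the 0-to-10 scale used in the example above.
scores = {"item_1": 7.5, "item_2": 5.0, "item_3": 2.0}
first_set, second_set = split_by_threshold(scores, threshold=5.0)
```

Unlike the top-K split, the size of the first set here depends on the score distribution, so the one-hot length is only known after the scores are computed.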
Further, the second value set includes a first subset and a second subset, and the values in the first subset have a higher average discrimination with respect to the business objective than those in the second subset. The second feature coding mode includes a first compressed coding mode corresponding to the first subset and a second compressed coding mode corresponding to the second subset, and the coding compression rate of the first compressed coding mode is lower than that of the second compressed coding mode. That is, a value set that uses a compressed coding mode can be further divided into subsets that use different compressed coding modes, further compressing the dimension of the coded feature vector.
Specifically, the second feature coding mode is hash coding, and:
the first compressed coding mode comprises:
mapping a value in the first subset to a first numerical value using a numeric hash function;
taking the first numerical value modulo a first number and mapping the result to a vector whose dimension is the first number;
the second compressed coding mode comprises:
mapping a value in the second subset to a second numerical value using the numeric hash function;
taking the second numerical value modulo a second number and mapping the result to a vector whose dimension is the second number;
where the first number is greater than the second number.
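A minimal sketch of the two compressed coding modes follows, assuming SHA-256 as the numeric hash function (the text does not name one). The names `numeric_hash`, `compress`, `p1`, `p2`, and the device ids are hypothetical.

```python
import hashlib


def numeric_hash(value: str) -> int:
    """Deterministic numeric hash; SHA-256 is an illustrative assumption."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def compress(value: str, first_subset: set, p1: int, p2: int) -> list:
    """Hash-code `value` with modulus p1 if it is in the first subset, else p2."""
    if p1 <= p2:
        raise ValueError("the first number must be greater than the second")
    p = p1 if value in first_subset else p2
    vector = [0] * p
    vector[numeric_hash(value) % p] = 1
    return vector


# Hypothetical device ids; the first subset holds the more discriminative ones.
first_subset = {"dev_001", "dev_002"}
fine = compress("dev_001", first_subset, p1=16, p2=4)
coarse = compress("dev_999", first_subset, p1=16, p2=4)
```

The same hash function serves both subsets; only the modulus differs, so the more discriminative subset gets more buckets and fewer collisions.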
For example, one-hot coding is first applied to the top K of the N different values of a feature variable (sorted by discrimination in descending order), leaving N-K values uncoded. The remaining N-K values are again divided into two classes. The first class consists of values that have some ability to distinguish behavior, though weaker than that of the top K values (suppose there are M of them, namely the (K+1)-th to the (K+M)-th); the second class consists of values with no distinguishing ability at all (namely the (K+M+1)-th to the N-th).
For both classes of values, hash (hashing) coding is applied to compress the features, as follows:
The feature vector corresponding to a value is determined by the formula idx = Hash(val) % p, where val is a value of the feature variable and is of a non-numeric type, Hash() is a hash function, % is the modulo operation, p is a constant, and idx is the position set to 1 in the coded feature vector.
A numeric hash function maps the original non-numeric value into a very large numerical space; taking the result modulo p then maps the value into the interval [0, p-1]. The position it maps to is set to 1 and all other positions are set to 0. This completes the numeric coding of the non-numeric value.
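The formula idx = Hash(val) % p can be sketched directly. The hash function is not specified in the text, so SHA-256 is used here purely as an assumption; the names `numeric_hash` and `hash_encode` and the sample IP address are hypothetical.

```python
import hashlib


def numeric_hash(value: str) -> int:
    """Map a non-numeric value into a large numeric space (assumed: SHA-256)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_encode(value: str, p: int) -> list:
    """Return a p-dimensional vector with a single 1 at idx = Hash(val) % p."""
    idx = numeric_hash(value) % p  # idx lies in [0, p-1]
    vector = [0] * p
    vector[idx] = 1
    return vector


vec = hash_encode("192.168.0.1", p=8)
```

A hash from `hashlib` is used rather than Python's built-in `hash()` because string hashes from the built-in are salted per process by default, which would make the code change between training and serving runs.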
Because the expressive power of the two value ranges, K+1 to K+M and K+M+1 to N, is not the same, using the same hash mapping for both often fails to distinguish the two classes of features well. A different p can therefore be chosen for each class, according to the actual number of values in it, to control the granularity and the number of dimensions of the feature code. Hash coding in effect compresses the original features once, reducing the dimension of the feature space.
Then, in step 23, the variable value is encoded, using the target coding mode, into a target vector in a target vector space, the target vector space being the vector space among the multiple vector spaces that corresponds to the target coding mode. It should be understood that different coding modes (for example, one-hot coding and hash coding) correspond to different vector spaces; details are not repeated here.
Finally, in step 24, the feature vector of the feature variable is determined based on the target vector. It should be understood that the feature vectors corresponding to different values of a feature variable should generally have the same dimension, so in the embodiments of this specification the feature vector of the feature variable is further determined based on the target vector.
In one example, determining the feature vector of the feature variable based on the target vector comprises:
filling at least one other vector space, among the multiple vector spaces, other than the target vector space with a default value to obtain at least one default vector;
concatenating the target vector and the at least one default vector to obtain the feature vector.
For example, a feature variable has 100 values which, sorted in descending order of discrimination with respect to the business objective, are a, b, c, ..., d, e, f, g. It is predetermined that one-hot coding is used for the value set (a, b, c) and hash coding with dimension 4 is used for the remaining value set. One-hot coding the value a gives the target vector (1, 0, 0). Assuming the default value is 0, the default vector is (0, 0, 0, 0), and concatenation gives the feature vector (1, 0, 0, 0, 0, 0, 0). The dimension of the feature vector obtained by this layered coding (7 in this example) is much smaller than the dimension obtained by one-hot coding the entire value range directly (100 in this example).
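The worked example above can be reproduced with a short sketch. The constants mirror the example (one-hot set (a, b, c), hash dimension 4, default value 0); the hash function (SHA-256) and the name `encode` are assumptions made for illustration.

```python
import hashlib

ONE_HOT_VALUES = ["a", "b", "c"]  # first value set, coded one-hot
HASH_DIM = 4                      # dimension of the hash-coded space
DEFAULT = 0                       # default fill value


def encode(value: str) -> list:
    """Code `value` in its own vector space, default-fill the other, concatenate."""
    one_hot_part = [DEFAULT] * len(ONE_HOT_VALUES)
    hash_part = [DEFAULT] * HASH_DIM
    if value in ONE_HOT_VALUES:
        one_hot_part[ONE_HOT_VALUES.index(value)] = 1
    else:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        hash_part[int.from_bytes(digest[:8], "big") % HASH_DIM] = 1
    return one_hot_part + hash_part


feature_a = encode("a")  # the value from the worked example
feature_g = encode("g")  # a value from the remaining set, hash-coded
```

Every value yields a vector of the same length, 7, regardless of which coding mode it falls under; for a hash-coded value the first three positions stay at the default and exactly one of the last four is set.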
It should be noted that the numerical examples above are provided only to aid understanding and do not limit the embodiments of this specification.
It should be understood that applying similar coding to all variables of non-numeric types numericizes the entire variable space; once all non-numeric variables are coded, they can serve as input to a machine learning method and be applied in a specific business.
With the method provided in the embodiments of this specification, after a variable value of a feature variable related to a business objective is obtained, the variable value being of a non-numeric type, a target coding mode corresponding to the variable value is first selected from multiple feature coding modes according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes. The variable value is then encoded, using the target coding mode, into a target vector in a target vector space, the target vector space being the vector space among the multiple vector spaces that corresponds to the target coding mode. The feature vector of the feature variable is then determined based on the target vector. The multiple value sets are divided according to the pre-assessed discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature coding modes are used to encode the values in the corresponding value sets into multiple vector spaces. As a result, the model does not lose useful information, the length of the feature code is reduced, and the coding retains a degree of generalization.
According to an embodiment of another aspect, a feature coding apparatus is also provided. Fig. 3 shows a schematic block diagram of a feature coding apparatus according to an embodiment. As shown in Fig. 3, the apparatus 300 includes:
an acquiring unit 31, configured to obtain a variable value of a feature variable related to a business objective, the variable value being of a non-numeric type;
a selecting unit 32, configured to select, from multiple feature coding modes, a target coding mode corresponding to the variable value obtained by the acquiring unit 31, according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature coding modes, wherein the multiple value sets are divided according to the pre-assessed discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature coding modes are used to encode the values in the corresponding value sets into multiple vector spaces;
a coding unit 33, configured to encode the variable value obtained by the acquiring unit 31, using the target coding mode selected by the selecting unit 32, into a target vector in a target vector space, the target vector space being the vector space among the multiple vector spaces that corresponds to the target coding mode;
a determining unit 34, configured to determine the feature vector of the feature variable based on the target vector obtained by the coding unit 33.
In one example, the discrimination of each possible value of the feature variable with respect to the business objective is evaluated from the prior probability that the business objective is reached when the feature variable takes that value.
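The text above does not fix a formula for turning prior probabilities into a discrimination score. The following is a minimal sketch under one assumed definition: the absolute gap between the rate at which the objective is reached for a given value and the overall rate. The function name `discrimination_scores` and the `(value, reached)` sample layout are illustrative assumptions, not part of the specification.

```python
from collections import defaultdict

def discrimination_scores(samples):
    """Estimate a discrimination score per categorical value.

    samples: iterable of (value, reached) pairs, where reached is 1 if the
    business objective was reached for that sample and 0 otherwise.
    Assumed definition: |P(reached | value) - P(reached)|.
    """
    totals = defaultdict(int)  # number of samples per value
    hits = defaultdict(int)    # number of samples per value that reached the objective
    for value, reached in samples:
        totals[value] += 1
        hits[value] += int(reached)

    n = sum(totals.values())
    if n == 0:
        return {}
    base_rate = sum(hits.values()) / n
    return {v: abs(hits[v] / totals[v] - base_rate) for v in totals}

samples = [("a", 1), ("a", 1), ("b", 0), ("b", 0), ("c", 1), ("c", 0)]
scores = discrimination_scores(samples)
```

Under this assumed definition, a value whose hit rate matches the overall rate scores 0, which marks it as a candidate for the more aggressively compressed encoding.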
In one example, the multiple value sets include a first value set and a second value set, and the values in the first value set have a higher average discrimination with respect to the business objective than the values in the second value set. The multiple feature encoding modes include a first feature encoding mode corresponding to the first value set and a second feature encoding mode corresponding to the second value set, and the encoding compression ratio of the first feature encoding mode is lower than that of the second feature encoding mode.
Further, the first feature encoding mode is a full encoding mode that encodes the corresponding values into a first vector space, the dimension of the first vector space being equal to the number of values in the first value set;
the second feature encoding mode is a compression encoding mode that encodes the corresponding values into a second vector space, the dimension of the second vector space being smaller than the number of values in the second value set.
For example, the first feature encoding mode is one-hot encoding, and the second feature encoding mode is hash encoding.
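The selection between the two modes can be sketched as follows. This is an illustration under stated assumptions: the helper names `stable_hash` and `encode_value` are hypothetical, MD5 stands in for an unspecified numeric hash function, and any value outside the first value set is routed to the hash encoding.

```python
import hashlib

def stable_hash(value: str) -> int:
    # MD5 is used only as a deterministic numeric hash (an assumption);
    # Python's built-in hash() for str varies between processes.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

def encode_value(value, first_set, p):
    """Pick the encoding mode from set membership, then encode.

    first_set: ordered list of high-discrimination values (one-hot encoded).
    p: dimension of the hash space used for every other value.
    Returns (mode, vector).
    """
    if value in first_set:
        vector = [0] * len(first_set)          # dimension equals the set size
        vector[first_set.index(value)] = 1
        return "one-hot", vector
    vector = [0] * p                           # compressed space
    vector[stable_hash(value) % p] = 1
    return "hash", vector

first_set = ["Beijing", "Shanghai", "Hangzhou"]
mode_a, vec_a = encode_value("Shanghai", first_set, p=2)
mode_b, vec_b = encode_value("Lhasa", first_set, p=2)
```

Because unseen values also fall through to the hash branch in this sketch, the encoder still produces a vector for values that never appeared when the sets were built.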
Optionally, the second feature encoding mode includes:
mapping a value in the second value set to a numerical value in a numerical space using a numeric hash function;
taking the numerical value modulo a predetermined number p and mapping the result to a p-dimensional vector, where the predetermined number p is smaller than the number of values in the second value set.
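A minimal sketch of these two steps, assuming the modulus result is used as the index of the single nonzero entry in the p-dimensional vector. The names `numeric_hash` and `hash_encode` and the choice of MD5 are assumptions made for illustration.

```python
import hashlib

def numeric_hash(value: str) -> int:
    # Step 1: map the categorical value to an integer in a numerical space.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16)

def hash_encode(value: str, p: int) -> list:
    # Step 2: reduce the integer modulo p and place a 1 at that index.
    if p <= 0:
        raise ValueError("p must be positive")
    index = numeric_hash(value) % p
    vector = [0] * p
    vector[index] = 1
    return vector

cities = ["Lhasa", "Urumqi", "Xining", "Yinchuan", "Hohhot"]
encoded = {city: hash_encode(city, p=3) for city in cities}
```

Since p is smaller than the number of values, some values necessarily share a vector. That collision is the source of the compression, and it is tolerable here because the second value set holds the values with lower discrimination.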
In one example, the multiple value sets are divided by the following steps:
sorting all possible values of the feature variable in descending order of their discrimination with respect to the business objective;
selecting, in that order, a first number of values as the first value set;
taking the remaining values, other than the first number of values, as the second value set.
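The ranking-based split above can be sketched as follows, assuming the discrimination scores are already available as a dictionary. The function name `split_top_k` and the parameter `k` (the "first number") are illustrative.

```python
def split_top_k(scores, k):
    """Split values into (first_set, second_set) by rank.

    scores: dict mapping each possible value to its discrimination score.
    k: how many of the highest-scoring values go into the first set.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # descending by score
    return ranked[:k], ranked[k:]

scores = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}
first_set, second_set = split_top_k(scores, k=2)
```

Fixing `k` bounds the dimension of the one-hot part of the encoding in advance, whatever the score distribution looks like.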
In another example, the multiple value sets are divided by the following steps:
obtaining the discrimination of all possible values of the feature variable with respect to the business objective;
adding the values whose discrimination is greater than or equal to a first threshold to the first value set;
adding the values whose discrimination is less than the first threshold to the second value set.
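The threshold-based split can be sketched in the same way. The function name `split_by_threshold` is hypothetical, and the scores are assumed to be precomputed.

```python
def split_by_threshold(scores, threshold):
    """Split values into (first_set, second_set) around a score threshold.

    Values scoring at or above the threshold go to the first set;
    all others go to the second set.
    """
    first_set = [v for v, s in scores.items() if s >= threshold]
    second_set = [v for v, s in scores.items() if s < threshold]
    return first_set, second_set

scores = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}
first_set, second_set = split_by_threshold(scores, threshold=0.3)
```

Unlike a rank-based split, the size of the first set here depends on the data, so the one-hot dimension is only known after the scores have been computed.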
Further, the second value set includes a first subset and a second subset, and the values in the first subset have a higher average discrimination with respect to the business objective than the values in the second subset. The second feature encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and the encoding compression ratio of the first compression encoding mode is lower than that of the second compression encoding mode.
For example, the second feature encoding mode is hash encoding, and:
the first compression encoding mode includes:
mapping a value in the first subset to a first numerical value using a numeric hash function;
taking the first numerical value modulo a first number and mapping the result to a vector whose dimension is the first number;
the second compression encoding mode includes:
mapping a value in the second subset to a second numerical value using the same numeric hash function;
taking the second numerical value modulo a second number and mapping the result to a vector whose dimension is the second number;
where the first number is greater than the second number.
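A sketch of the two-level hash encoding, assuming one shared hash function and a one-hot placement of the modulus result. The names `numeric_hash`, `tiered_hash_encode`, `n1`, and `n2` are illustrative, and MD5 is an assumed stand-in for the numeric hash function.

```python
import hashlib

def numeric_hash(value: str) -> int:
    # The same deterministic hash is shared by both compression modes.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

def tiered_hash_encode(value, first_subset, n1, n2):
    """Hash-encode with a larger space for the more discriminative subset.

    first_subset: values of the second value set with higher discrimination.
    n1: modulus (and dimension) for the first subset.
    n2: modulus (and dimension) for all remaining values; must be < n1.
    """
    if n1 <= n2:
        raise ValueError("n1 must be greater than n2")
    size = n1 if value in first_subset else n2
    vector = [0] * size
    vector[numeric_hash(value) % size] = 1
    return vector

first_subset = {"Chengdu", "Wuhan"}
vec_hi = tiered_hash_encode("Chengdu", first_subset, n1=8, n2=4)
vec_lo = tiered_hash_encode("Lhasa", first_subset, n1=8, n2=4)
```

The larger modulus gives values in the first subset fewer collisions, so less of their information is lost than for values in the second subset.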
In one example, the determining unit 34 is specifically configured to:
fill at least one other vector space of the multiple vector spaces, other than the target vector space, with a default value to obtain at least one default vector;
concatenate the target vector and the at least one default vector to obtain the feature vector.
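The padding and concatenation step can be sketched as below, assuming the vector spaces are laid out in a fixed order and that the default value is 0. The function name `assemble_feature_vector` and its parameters are hypothetical.

```python
def assemble_feature_vector(target_index, target_vector, space_dims, default=0):
    """Build the full feature vector from one encoded segment.

    target_index: position of the target vector space among all spaces.
    target_vector: the encoded vector produced by the target encoding mode.
    space_dims: dimensions of all vector spaces, in their fixed order.
    default: fill value for every space other than the target space.
    """
    if len(target_vector) != space_dims[target_index]:
        raise ValueError("target vector does not match its space dimension")
    feature = []
    for i, dim in enumerate(space_dims):
        if i == target_index:
            feature.extend(target_vector)
        else:
            feature.extend([default] * dim)  # default vector for the unused space
    return feature

# Space 0: 3-dimensional one-hot space; space 1: 2-dimensional hash space.
from_one_hot = assemble_feature_vector(0, [0, 1, 0], [3, 2])
from_hash = assemble_feature_vector(1, [1, 0], [3, 2])
```

Every value of the feature variable then yields a vector of the same total length, whichever encoding mode produced its nonzero segment.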
With the apparatus provided by the embodiments of this specification, after the acquiring unit 31 obtains a variable value of a feature variable related to a business objective, where the variable value is of a non-numeric type, the selecting unit 32 first selects, from multiple feature encoding modes, the target encoding mode corresponding to the variable value obtained by the acquiring unit 31, according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature encoding modes. The encoding unit 33 then encodes the variable value, using the target encoding mode selected by the selecting unit 32, into a target vector in a target vector space, where the target vector space is the one of the multiple vector spaces that corresponds to the target encoding mode. Finally, the determining unit 34 determines the feature vector of the feature variable based on the target vector obtained by the encoding unit 33. The multiple value sets are divided according to the pre-evaluated discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature encoding modes are used to encode the values in the corresponding value sets into the multiple vector spaces. As a result, the model does not lose useful information, the length of the feature encoding is reduced, and the encoding retains a certain degree of generalization.
According to an embodiment of yet another aspect, a computer-readable storage medium is also provided, on which a computer program is stored. When the computer program is executed in a computer, it causes the computer to perform the method described with reference to Fig. 2.
According to an embodiment of still another aspect, a computing device is also provided, including a memory and a processor. Executable code is stored in the memory, and the processor implements the method described with reference to Fig. 2 when it executes the executable code.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.

Claims (24)

1. A feature encoding method, comprising:
obtaining a variable value of a feature variable related to a business objective, the variable value being of a non-numeric type;
selecting, from multiple feature encoding modes, a target encoding mode corresponding to the variable value according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature encoding modes, wherein the multiple value sets are divided according to the pre-evaluated discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature encoding modes are used to encode the values in the corresponding value sets into multiple vector spaces;
encoding the variable value, using the target encoding mode, into a target vector in a target vector space, the target vector space being the one of the multiple vector spaces that corresponds to the target encoding mode;
determining a feature vector of the feature variable based on the target vector.
2. The method of claim 1, wherein the discrimination of each possible value of the feature variable with respect to the business objective is evaluated from the prior probability that the business objective is reached when the feature variable takes that value.
3. The method of claim 1, wherein the multiple value sets include a first value set and a second value set, and the values in the first value set have a higher average discrimination with respect to the business objective than the values in the second value set; the multiple feature encoding modes include a first feature encoding mode corresponding to the first value set and a second feature encoding mode corresponding to the second value set, and the encoding compression ratio of the first feature encoding mode is lower than that of the second feature encoding mode.
4. The method of claim 3, wherein the first feature encoding mode is a full encoding mode that encodes the corresponding values into a first vector space, the dimension of the first vector space being equal to the number of values in the first value set;
the second feature encoding mode is a compression encoding mode that encodes the corresponding values into a second vector space, the dimension of the second vector space being smaller than the number of values in the second value set.
5. The method of claim 4, wherein the first feature encoding mode is one-hot encoding and the second feature encoding mode is hash encoding.
6. The method of claim 5, wherein the second feature encoding mode includes:
mapping a value in the second value set to a numerical value in a numerical space using a numeric hash function;
taking the numerical value modulo a predetermined number p and mapping the result to a p-dimensional vector, where the predetermined number p is smaller than the number of values in the second value set.
7. The method of any one of claims 3-6, wherein the multiple value sets are divided by the following steps:
sorting all possible values of the feature variable in descending order of their discrimination with respect to the business objective;
selecting, in that order, a first number of values as the first value set;
taking the remaining values, other than the first number of values, as the second value set.
8. The method of any one of claims 3-6, wherein the multiple value sets are divided by the following steps:
obtaining the discrimination of all possible values of the feature variable with respect to the business objective;
adding the values whose discrimination is greater than or equal to a first threshold to the first value set;
adding the values whose discrimination is less than the first threshold to the second value set.
9. The method of claim 4, wherein the second value set includes a first subset and a second subset, and the values in the first subset have a higher average discrimination with respect to the business objective than the values in the second subset; the second feature encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and the encoding compression ratio of the first compression encoding mode is lower than that of the second compression encoding mode.
10. The method of claim 9, wherein the second feature encoding mode is hash encoding, and:
the first compression encoding mode includes:
mapping a value in the first subset to a first numerical value using a numeric hash function;
taking the first numerical value modulo a first number and mapping the result to a vector whose dimension is the first number;
the second compression encoding mode includes:
mapping a value in the second subset to a second numerical value using the same numeric hash function;
taking the second numerical value modulo a second number and mapping the result to a vector whose dimension is the second number;
where the first number is greater than the second number.
11. The method of claim 1, wherein determining the feature vector of the feature variable based on the target vector includes:
filling at least one other vector space of the multiple vector spaces, other than the target vector space, with a default value to obtain at least one default vector;
concatenating the target vector and the at least one default vector to obtain the feature vector.
12. A feature encoding apparatus, comprising:
an acquiring unit, configured to obtain a variable value of a feature variable related to a business objective, the variable value being of a non-numeric type;
a selecting unit, configured to select, from multiple feature encoding modes, a target encoding mode corresponding to the variable value obtained by the acquiring unit, according to a predetermined correspondence between multiple value sets of the feature variable and the multiple feature encoding modes, wherein the multiple value sets are divided according to the pre-evaluated discrimination of each possible value of the feature variable with respect to the business objective, and the multiple feature encoding modes are used to encode the values in the corresponding value sets into multiple vector spaces;
an encoding unit, configured to encode, using the target encoding mode selected by the selecting unit, the variable value obtained by the acquiring unit into a target vector in a target vector space, the target vector space being the one of the multiple vector spaces that corresponds to the target encoding mode;
a determining unit, configured to determine a feature vector of the feature variable based on the target vector obtained by the encoding unit.
13. The apparatus of claim 12, wherein the discrimination of each possible value of the feature variable with respect to the business objective is evaluated from the prior probability that the business objective is reached when the feature variable takes that value.
14. The apparatus of claim 12, wherein the multiple value sets include a first value set and a second value set, and the values in the first value set have a higher average discrimination with respect to the business objective than the values in the second value set; the multiple feature encoding modes include a first feature encoding mode corresponding to the first value set and a second feature encoding mode corresponding to the second value set, and the encoding compression ratio of the first feature encoding mode is lower than that of the second feature encoding mode.
15. The apparatus of claim 14, wherein the first feature encoding mode is a full encoding mode that encodes the corresponding values into a first vector space, the dimension of the first vector space being equal to the number of values in the first value set;
the second feature encoding mode is a compression encoding mode that encodes the corresponding values into a second vector space, the dimension of the second vector space being smaller than the number of values in the second value set.
16. The apparatus of claim 15, wherein the first feature encoding mode is one-hot encoding and the second feature encoding mode is hash encoding.
17. The apparatus of claim 16, wherein the second feature encoding mode includes:
mapping a value in the second value set to a numerical value in a numerical space using a numeric hash function;
taking the numerical value modulo a predetermined number p and mapping the result to a p-dimensional vector, where the predetermined number p is smaller than the number of values in the second value set.
18. The apparatus of any one of claims 14-17, wherein the multiple value sets are divided by the following steps:
sorting all possible values of the feature variable in descending order of their discrimination with respect to the business objective;
selecting, in that order, a first number of values as the first value set;
taking the remaining values, other than the first number of values, as the second value set.
19. The apparatus of any one of claims 14-17, wherein the multiple value sets are divided by the following steps:
obtaining the discrimination of all possible values of the feature variable with respect to the business objective;
adding the values whose discrimination is greater than or equal to a first threshold to the first value set;
adding the values whose discrimination is less than the first threshold to the second value set.
20. The apparatus of claim 15, wherein the second value set includes a first subset and a second subset, and the values in the first subset have a higher average discrimination with respect to the business objective than the values in the second subset; the second feature encoding mode includes a first compression encoding mode corresponding to the first subset and a second compression encoding mode corresponding to the second subset, and the encoding compression ratio of the first compression encoding mode is lower than that of the second compression encoding mode.
21. The apparatus of claim 20, wherein the second feature encoding mode is hash encoding, and:
the first compression encoding mode includes:
mapping a value in the first subset to a first numerical value using a numeric hash function;
taking the first numerical value modulo a first number and mapping the result to a vector whose dimension is the first number;
the second compression encoding mode includes:
mapping a value in the second subset to a second numerical value using the same numeric hash function;
taking the second numerical value modulo a second number and mapping the result to a vector whose dimension is the second number;
where the first number is greater than the second number.
22. The apparatus of claim 12, wherein the determining unit is specifically configured to:
fill at least one other vector space of the multiple vector spaces, other than the target vector space, with a default value to obtain at least one default vector;
concatenate the target vector and the at least one default vector to obtain the feature vector.
23. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed in a computer, it causes the computer to perform the method of any one of claims 1-11.
24. A computing device, including a memory and a processor, characterized in that executable code is stored in the memory, and the processor implements the method of any one of claims 1-11 when it executes the executable code.
CN201810887448.8A 2018-08-06 2018-08-06 Feature encoding method and apparatus Active CN109146083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810887448.8A CN109146083B (en) 2018-08-06 2018-08-06 Feature encoding method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810887448.8A CN109146083B (en) 2018-08-06 2018-08-06 Feature encoding method and apparatus

Publications (2)

Publication Number Publication Date
CN109146083A true CN109146083A (en) 2019-01-04
CN109146083B CN109146083B (en) 2021-07-23

Family

ID=64791926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810887448.8A Active CN109146083B (en) 2018-08-06 2018-08-06 Feature encoding method and apparatus

Country Status (1)

Country Link
CN (1) CN109146083B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100002764A1 (en) * 2008-07-03 2010-01-07 National Cheng Kung University Method For Encoding An Extended-Channel Video Data Subset Of A Stereoscopic Video Data Set, And A Stereo Video Encoding Apparatus For Implementing The Same
CN106897776A (en) * 2017-01-17 2017-06-27 华南理工大学 A kind of continuous type latent structure method based on nominal attribute

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LUGUANYOU: "非数值型特征如何进行编码", 《非数值型特征如何进行编码?_LUGUANYOU的博客-CSDN博客(HTTPS://BLOG.CSDN.NET/LUGUANYOU/ARTICLE/DETAILS/80598122)》 *
叶倩怡 等: "基于Xgboost的商业销售预测", 《南昌大学学报(理科版)》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934628A (en) * 2019-03-08 2019-06-25 智者四海(北京)技术有限公司 Characteristic processing method and device
CN110675931A (en) * 2019-08-28 2020-01-10 吉林金域医学检验所有限公司 Information coding method, device, equipment and storage medium for detection report
CN110970100A (en) * 2019-11-04 2020-04-07 广州金域医学检验中心有限公司 Method, device and equipment for detecting item coding and computer readable storage medium
CN111611449A (en) * 2020-05-08 2020-09-01 百度在线网络技术(北京)有限公司 Information encoding method and device, electronic equipment and computer readable storage medium
CN111611449B (en) * 2020-05-08 2023-08-29 百度在线网络技术(北京)有限公司 Information encoding method, apparatus, electronic device, and computer-readable storage medium
CN113220947A (en) * 2021-05-27 2021-08-06 支付宝(杭州)信息技术有限公司 Method and device for encoding event characteristics

Also Published As

Publication number Publication date
CN109146083B (en) 2021-07-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20200927

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200927

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant