CN109255368A - Method, apparatus, electronic device, and storage medium for randomly selecting features - Google Patents

Method, apparatus, electronic device, and storage medium for randomly selecting features

Info

Publication number
CN109255368A
CN109255368A · CN201810892174.1A · CN109255368B
Authority
CN
China
Prior art keywords
metric
feature
value set
measurement value
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810892174.1A
Other languages
Chinese (zh)
Other versions
CN109255368B (en)
Inventor
叶俊锋
赖云辉
罗先贤
孙成
龙觉刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810892174.1A
Publication of CN109255368A
Application granted
Publication of CN109255368B
Active legal status (Current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211: Selection of the most significant subset of features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for randomly selecting features. The method comprises: determining the metric value of each candidate feature to obtain a first metric value set; normalizing the first metric value set; enlarging, by a preset algorithm, the differences between the metric values in the normalized first metric value set to obtain a second metric value set; and inputting the metric values in the second metric value set into a roulette-wheel model as the fitness of each candidate feature, the feature output by the roulette-wheel model being taken as the selected feature. By enlarging the differences between the features' metric values, the embodiments enlarge the differences between the features' selection probabilities, so that features with high metric values are selected with markedly higher probability than features with low metric values. This raises the probability that effective features are selected, so that an algorithm using the selected features can make full use of the effective feature information, improving the algorithm's accuracy.

Description

Method, apparatus, electronic device, and storage medium for randomly selecting features
Technical field
The present application relates to the technical field of data processing, and in particular to a method, an apparatus, an electronic device, and a storage medium for randomly selecting features.
Background
Feature selection, also known as feature subset selection or attribute selection, refers to selecting N features from M existing features so that a specific index of the system is optimized. Through feature selection, some of the most effective features can be selected from the original features to reduce the dimensionality of the data set, which is an important means of improving the performance of learning algorithms.
Existing feature selection methods compute a metric value for each feature, such as classification accuracy or AUC (Area Under the Curve) and other indexes for evaluating classification-algorithm performance, and then substitute the metric value of each feature into a roulette-wheel algorithm as its weight; the randomly output feature is the selected feature. In these methods, the weights of the features are not clearly differentiated, so features with high metric values and features with low metric values are selected with nearly the same probability. The selection probability of effective features cannot be raised, the algorithm cannot make full use of the information in the effective features, and the algorithm's accuracy is reduced.
Summary of the invention
The present application provides a method, an apparatus, an electronic device, and a computer-readable storage medium for randomly selecting features, which can solve the problem that the selection probability of effective features cannot be raised because the feature weights are not clearly differentiated. The technical solution is as follows:
In a first aspect, the present application provides a method for randomly selecting features, the method comprising:
determining the metric value of each candidate feature to obtain a first metric value set;
normalizing the first metric value set;
enlarging, by a preset algorithm, the differences between the metric values in the normalized first metric value set to obtain a second metric value set; and
inputting the metric values in the second metric value set into a roulette-wheel model as the fitness of each candidate feature, and taking the feature output by the roulette-wheel model as the selected feature.
Optionally, normalizing the first metric value set comprises: performing min-max normalization on the first metric value set.
Optionally, enlarging, by the preset algorithm, the differences between the metric values in the normalized first metric value set to obtain the second metric value set comprises: squaring each metric value in the normalized first metric value set to enlarge the differences between the metric values, obtaining the second metric value set.
Optionally, enlarging, by the preset algorithm, the differences between the metric values in the normalized first metric value set to obtain the second metric value set comprises:
clustering the metric values in the normalized first metric value set to obtain a plurality of clusters, each cluster including at least one metric value; and
performing difference-enlarging processing on the metric values in each cluster according to a preset strategy, obtaining the second metric value set.
Optionally, performing difference-enlarging processing on the metric values in each cluster according to the preset strategy comprises:
determining the boundary points of each cluster and the number of metric values each cluster includes;
determining the density of each cluster from its boundary points and the number of metric values it includes; and
judging whether the density of each cluster exceeds a preset density, and performing difference-enlarging processing on the metric values in the clusters whose density exceeds the preset density.
Optionally, performing difference-enlarging processing on the metric values in a cluster whose density exceeds the preset density comprises:
expanding the boundary of the cluster to be processed, the cluster to be processed being a cluster whose density exceeds the preset density;
determining a scale factor from the cluster's boundary before and after the expansion; and
enlarging the distances between the metric values in the cluster to be processed according to the scale factor.
Optionally, before inputting the metric values in the second metric value set into the roulette-wheel model as the fitness of each candidate feature, the method further comprises: performing Laplace smoothing on the second metric value set.
Inputting the metric values in the second metric value set into the roulette-wheel model as the fitness of each candidate feature then comprises: inputting the metric values in the Laplace-smoothed second metric value set into the roulette-wheel model as the fitness of each candidate feature.
In a second aspect, the present application provides an apparatus for randomly selecting features, the apparatus comprising:
a metric value determining module, configured to determine the metric value of each candidate feature and obtain a first metric value set;
a normalization module, configured to normalize the first metric value set;
a difference enlarging module, configured to enlarge, by a preset algorithm, the differences between the metric values in the normalized first metric value set and obtain a second metric value set; and
a feature selection module, configured to input the metric values in the second metric value set into a roulette-wheel model as the fitness of each candidate feature and take the feature output by the roulette-wheel model as the selected feature.
In a third aspect, the present application provides an electronic device, comprising: one or more processors;
a memory; and
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the method for randomly selecting features shown in the first aspect of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method for randomly selecting features shown in the first aspect of the present application.
The technical solutions provided by the embodiments of the present application have the following beneficial effect: by enlarging the differences between the features' metric values, the differences between the features' selection probabilities are enlarged, so that features with high metric values and features with low metric values are selected with clearly different probabilities. While the randomness of feature selection is preserved, the probability that effective features are selected is raised, so that an algorithm using the selected features can make full use of the effective feature information, improving the algorithm's accuracy.
Brief Description of the Drawings

To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below.
Fig. 1 is a schematic flow diagram of a method for randomly selecting features provided by an embodiment of the present application;
Fig. 2 is a schematic flow diagram of another method for randomly selecting features provided by an embodiment of the present application;
Fig. 3 is a schematic flow diagram of yet another method for randomly selecting features provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an apparatus for randomly selecting features provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of another apparatus for randomly selecting features provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device for randomly selecting features provided by an embodiment of the present application.
Detailed Description

Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the drawings, where identical or similar reference numbers throughout denote identical or similar elements, or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are only used to explain the present application and cannot be construed as limiting it.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the wording "comprising" used in the description of the present application means that the stated features, integers, steps, operations, elements, and/or components are present, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intermediate elements may also be present. In addition, "connection" or "coupling" as used herein may include a wireless connection or wireless coupling. The wording "and/or" used herein includes all of, or any unit of, and all combinations of, one or more of the associated listed items.
How the technical solutions of the present application solve the above technical problems is described in detail below with specific embodiments. The specific embodiments below may be combined with one another, and identical or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application are described below with reference to the drawings.
Embodiment one
An embodiment of the present application provides a method for randomly selecting features. As shown in Fig. 1, the method comprises steps S101, S102, S103, and S104.
Step S101: determine the metric value of each candidate feature, obtaining the first metric value set.
Candidate features are features determined in advance to influence the output of a learning algorithm, classification algorithm, or similar algorithm. For example, to predict a user's banking business, the following candidate features may be chosen: user attribute features describing basic personal characteristics, credit features describing the user's earning potential and income, consumption features describing the customer's main consumption habits and preferences, interest features describing the customer's hobbies, and social features describing the user's activity on social media. The user's banking business is predicted with the above features.
The metric value is an index for evaluating the performance of a classification or prediction algorithm, such as classification accuracy, the ROC curve (receiver operating characteristic curve), AUC, and so on.
For example, suppose the chosen metric is AUC; the AUC of each candidate feature can be computed by the following method.
Step 1: collect sample data, determine the candidate features, and build an attribute prediction model.
Collect the actual value of a certain attribute for each sample and classify the samples by this actual value. For example, samples whose actual value is 1 are positive samples and samples whose actual value is 0 are negative samples; a sample set is obtained in this way.
Determine multiple features for evaluating the attribute as the candidate features, and acquire each sample's feature value for each feature. For example, a person's morality attribute (good or bad) may be evaluated based on the person's credit features.
The attribute prediction model is in fact a pre-trained classifier: inputting the feature values corresponding to a candidate feature into the attribute prediction model yields the predicted probability value of the attribute.
Step 2: compute the AUC of each candidate feature.
Choose a feature A from the candidate features and input the feature value of feature A for every sample in the sample set into the attribute prediction model, obtaining the predicted probability value (score) of each sample's attribute, where the score indicates the probability that the sample is a positive sample.
Compute the probability that, within the sample set, a positive sample's score is greater than a negative sample's score. Specifically: form the N*M (positive, negative) sample pairs, where N is the number of positive samples and M is the number of negative samples; count the number L of pairs in which the positive sample's score is greater than the negative sample's score, with a pair counting 0.5 when the positive and negative scores are equal; and compute the AUC of feature A by the formula AUC = L / (N*M).
The AUC of every candidate feature is obtained by the above method.
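As a minimal sketch (the function and variable names are illustrative, not from the patent), the pairwise AUC computation described above can be written as:

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC = L / (N * M): the fraction of (positive, negative) score pairs
    in which the positive sample's score exceeds the negative sample's,
    with ties counted as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

For instance, with positive scores [0.9, 0.8] and negative scores [0.1, 0.8], three of the four pairs are wins and one is a tie, giving AUC = 3.5 / 4 = 0.875.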
Step S102: normalize the first metric value set.
The normalization method can be min-max normalization, log-function transformation, standard-deviation (z-score) normalization, an extremum method, and so on. Normalization removes the unit limitation of the metric values, converting them into dimensionless pure numbers so that indexes of different units or magnitudes can be compared and weighted.
Step S103: enlarge, by a preset algorithm, the differences between the metric values in the normalized first metric value set, obtaining the second metric value set.
The preset algorithm can be a square operation, a cube operation, proportional scaling of the first metric value set, and so on. For example, squaring each metric value in the normalized first metric value set enlarges the differences between the metric values: if the first metric value set is {0.5, 0.6, 0.8, 0.9}, the second metric value set obtained after squaring is {0.25, 0.36, 0.64, 0.81}. Clearly the square operation enlarges the differences between the metric values, which increases the proportional differences between the feature weights in the roulette-wheel model.
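The squaring example above can be sketched as follows (the helper name is illustrative):

```python
def widen_by_square(values):
    # Squaring values in [0, 1] keeps their order but stretches the gaps:
    # {0.5, 0.6, 0.8, 0.9} becomes {0.25, 0.36, 0.64, 0.81}.
    return [v * v for v in values]
```

The ratio of largest to smallest weight grows from 0.9/0.5 = 1.8 to 0.81/0.25 = 3.24, which is exactly the enlarged proportional difference the roulette-wheel step relies on.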
Step S104: input the metric values in the second metric value set into the roulette-wheel model as the fitness of each candidate feature, and take the feature output by the roulette-wheel model as the selected feature.
The detailed process of selecting a feature with the roulette-wheel model is as follows. First, take the metric values in the second metric value set as the fitness F_i (i.e., the weight) of each feature in the roulette-wheel algorithm; the probability that a feature is selected is proportional to its fitness. Suppose there are n features in total and the fitness of feature i is F_i. Then compute the selection probability q_i of each feature from its fitness: q_i = F_i / (F_1 + F_2 + ... + F_n). Next, obtain the cumulative probability of each feature from the selection probabilities: the cumulative probability of feature j is P_j = q_1 + q_2 + ... + q_j. Each feature's probability interval then follows from the cumulative probabilities: the interval of feature j is [P_{j-1}, P_j], where P_0 = 0. Table 1 gives an example of the selection probabilities and cumulative probabilities computed by a roulette-wheel model. Finally, generate a random number between 0 and 1 and determine the probability interval into which it falls. For example, if the generated random number is 0.01, then, with reference to Table 1, it falls into the probability interval [0, 0.18], so feature 1, to which this interval corresponds, is the feature selected by the roulette-wheel model.
Table 1
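The roulette-wheel selection described above can be sketched as a minimal Python illustration (the function name and the final fallback line, which guards against floating-point round-off, are assumptions, not part of the patent):

```python
import random

def roulette_select(fitness, rng=random.random):
    """Pick an index with probability proportional to its fitness:
    q_i = F_i / sum(F); feature j owns the interval [P_{j-1}, P_j)."""
    total = sum(fitness)
    r = rng() * total          # random point on the wheel
    cumulative = 0.0
    for i, f in enumerate(fitness):
        cumulative += f
        if r < cumulative:
            return i
    return len(fitness) - 1    # guard against floating-point round-off
```

With fitness values [3, 7, 10], the three features own the intervals [0, 0.15), [0.15, 0.5), and [0.5, 1.0) of the wheel, so a random draw of 0.1 selects feature 0, 0.4 selects feature 1, and 0.7 selects feature 2.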
In the method for randomly selecting features of this embodiment, enlarging the differences between the features' metric values enlarges the differences between the features' selection probabilities, so that features with high metric values and features with low metric values are selected with clearly different probabilities. While the randomness of feature selection is preserved, the probability that effective features are selected is raised, so that an algorithm using the selected features can make full use of the effective feature information, improving the algorithm's accuracy.
Steps S101-S104 select one feature. In practical applications, multiple features usually need to be selected for an algorithm. As shown in Fig. 2, a method for randomly selecting multiple features, built on steps S101-S104, is given, comprising:
Step S101: determine the metric value of each candidate feature, obtaining the first metric value set.
Step S102: normalize the first metric value set.
Step S103: enlarge, by a preset algorithm, the differences between the metric values in the normalized first metric value set, obtaining the second metric value set.
Step S104: input the metric values in the second metric value set into the roulette-wheel model as the fitness of each candidate feature, and take the feature output by the roulette-wheel model as a selected feature.
Step S105: put the feature output by the roulette-wheel model into a result set, and delete the metric value corresponding to this feature from the first metric value set.
Step S106: judge whether the number of features in the result set reaches a preset value. If the preset value is not reached, return to step S102; if it is reached, execute step S107 and output the result set.
The preset value is the number of features that need to be selected. The features in the output result set are the finally selected features.
By the above method, multiple features can be selected automatically from the candidate features.
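The full loop of steps S101-S107 might be sketched as below. This is an illustrative composite, not the patent's reference implementation: it assumes min-max normalization for S102, squaring for S103, and a crude add-one smoothing so that no weight is zero; all names are hypothetical.

```python
import random

def select_features(metrics, k, rng=random.random):
    """Illustrative loop over steps S101-S107: normalize (S102), widen by
    squaring (S103), smooth, roulette-pick one feature (S104), move it to
    the result set (S105), and repeat until k features are chosen (S106)."""
    remaining = dict(metrics)               # feature name -> raw metric value
    chosen = []
    while len(chosen) < k and remaining:
        names = list(remaining)
        vals = [remaining[n] for n in names]
        lo, hi = min(vals), max(vals)
        norm = [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in vals]
        fitness = [v * v + 1.0 for v in norm]   # squared, plus add-one smoothing
        r, acc, idx = rng() * sum(fitness), 0.0, len(names) - 1
        for i, f in enumerate(fitness):
            acc += f
            if r < acc:
                idx = i
                break
        chosen.append(names[idx])
        del remaining[names[idx]]            # remove the picked feature (S105)
    return chosen
```

Note that the weights are recomputed from the shrinking candidate pool on every round, matching the return to step S102 after each pick.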
Embodiment two
An embodiment of the present application provides another possible implementation, which, on the basis of Embodiment One, further includes the method shown in Embodiment Two.
Optionally, step S102 specifically comprises: performing min-max normalization on the first metric value set.
Min-max normalization maps the metric values onto the interval [0, 1]. It converts the metric values of the candidate features into a unified form so that, besides serving as the weights of the roulette-wheel algorithm, they also have their mutual differences enlarged. For example, when the chosen metric is AUC, whose value range is [0.5, 1], the differences between the features' metric values are not obvious; used directly as weights, their proportional differences in the total weight are too small. After min-max normalization, the AUC values in the first metric value set are mapped onto the interval [0, 1]; compared with the original range [0.5, 1], the differences between the metric values are enlarged to a certain extent.
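A min-max normalization sketch (the helper name is illustrative), mapping e.g. AUC values from [0.5, 1] onto [0, 1]:

```python
def min_max(values):
    # Map values linearly onto [0, 1]: the minimum becomes 0, the maximum 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For AUC values {0.5, 0.75, 1.0} the result is {0.0, 0.5, 1.0}: the smallest weight drops from 0.5 to 0, so the proportional spread between weights grows.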
Optionally, step S103 specifically comprises:
Step S1031: cluster the metric values in the normalized first metric value set, obtaining a plurality of clusters, each cluster including at least one metric value.
The clustering method can be the K-means algorithm, the K-medoids algorithm, the CLARANS algorithm, and so on, which are not repeated here.
Step S1032: perform difference-enlarging processing on the metric values in each cluster according to a preset strategy, obtaining the second metric value set.
The differences between the metric values in some intervals are already large enough and need no further difference-enlarging processing, while the differences between the metric values in other intervals are small and need to be enlarged. Therefore, the method of this embodiment clusters similar metric values into the same cluster and enlarges the differences between the metric values with different strategies for different clusters.
Further, step S1032 specifically comprises steps S201, S202, and S203.
Step S201: determine the boundary points of each cluster and the number of metric values each cluster includes.
The boundary points of a cluster are the maximum and minimum of its metric values.
Step S202: determine the density of each cluster from its boundary points and the number of metric values it includes.
The density of a cluster is ρ = num / (A_max - A_min), where A_max is the maximum value in the cluster, A_min is the minimum value in the cluster, and num is the number of metric values the cluster includes.
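The density formula can be sketched as follows (the helper name is illustrative):

```python
def cluster_density(cluster):
    # density = num / (A_max - A_min): metric values per unit of interval width.
    return len(cluster) / (max(cluster) - min(cluster))
```

A cluster such as {0.0, 0.25, 0.5} spans a width of 0.5 and contains 3 values, so its density is 6.0; a cluster packing the same 3 values into a narrower interval would have a higher density and be a candidate for difference-enlarging processing.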
Step S203: judge whether the density of each cluster exceeds the preset density, and perform difference-enlarging processing on the metric values in the clusters whose density exceeds the preset density.
Further, step S203 specifically comprises:
Step S2031: expand the boundary of the cluster to be processed, the cluster to be processed being a cluster whose density exceeds the preset density.
Step S2032: determine a scale factor from the cluster's boundary before and after the expansion.
Step S2033: enlarge the distances between the metric values in the cluster to be processed according to the scale factor.
Suppose the boundary points of the cluster to be processed are A_max and A_min and the boundary points after expansion are C_max and C_min, where C_max > A_max and C_min < A_min. The metric values in the cluster are stretched in equal proportion toward the boundary points C_max and C_min at both ends; the scale factor is k = (C_max - C_min) / (A_max - A_min), and the i-th metric value after stretching is C_i = C_min + k(A_i - A_min), where A_i is the i-th metric value before stretching.
The expanded boundary points of each cluster can be determined from its density: the larger a cluster's density, the larger its expansion, i.e., the spacings between C_max and A_max and between C_min and A_min can be set larger, but it must be ensured that the expanded ranges of the clusters do not overlap.
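The equal-proportion stretch of steps S2031-S2033 might look like this minimal sketch (names are illustrative; choosing C_min and C_max from the cluster density is left out):

```python
def stretch_cluster(values, c_min, c_max):
    """Stretch a cluster's metric values in equal proportion onto the wider
    interval [c_min, c_max]: k = (C_max - C_min) / (A_max - A_min) and
    C_i = C_min + k * (A_i - A_min)."""
    a_min, a_max = min(values), max(values)
    k = (c_max - c_min) / (a_max - a_min)
    return [c_min + k * (a - a_min) for a in values]
```

For example, the dense cluster {0.25, 0.5, 0.75} stretched onto [0.0, 1.0] becomes {0.0, 0.5, 1.0}: the spacing between neighbouring metric values doubles while their order is preserved.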
Optionally, as shown in Fig. 3, a step S105 is further included between steps S103 and S104: perform Laplace smoothing on the second metric value set.
Correspondingly, step S104 specifically comprises: inputting the metric values in the Laplace-smoothed second metric value set into the roulette-wheel model as the fitness of each candidate feature, and taking the feature output by the roulette-wheel model as the selected feature.
The zero-probability problem arises when, in computing the probability of an instance, some quantity x has never appeared in the observed sample base or training set, causing the probability of the whole instance to be 0. For example, in text classification, when a word has not appeared in the training samples, its probability is 0, and the occurrence probability of a text computed by multiplication is then also 0. This is unreasonable: the probability of an event cannot be dogmatically taken as 0 merely because the event has not been observed.
Laplace smoothing estimates the probability of an unobserved phenomenon by adding 1 to the numerator when computing the probability. When the training sample is very large, the change each component's estimated probability undergoes from the incremented count is negligible, yet the zero-probability problem is conveniently and effectively avoided.
Therefore, by performing Laplace smoothing on the second metric value set, the method of this embodiment avoids the zero-probability problem during feature selection and guarantees that every feature has a selection probability, so that no feature is lost.
Embodiment three
Based on the same inventive concept as Embodiments One and Two, an embodiment of the present application provides an apparatus for randomly selecting features. As shown in Fig. 4, the apparatus 40 may comprise a metric value determining module 401, a normalization module 402, a difference enlarging module 403, and a feature selection module 404.
The metric value determining module 401 is configured to determine the metric value of each candidate feature and obtain the first metric value set.
The normalization module 402 is configured to normalize the first metric value set.
The difference enlarging module 403 is configured to enlarge, by a preset algorithm, the differences between the metric values in the normalized first metric value set and obtain the second metric value set.
The feature selection module 404 is configured to input the metric values in the second metric value set into the roulette-wheel model as the fitness of each candidate feature, and take the feature output by the roulette-wheel model as the selected feature.
In the apparatus for randomly selecting features provided by this embodiment, enlarging the differences between the features' metric values enlarges the differences between the features' selection probabilities, so that features with high metric values and features with low metric values are selected with clearly different probabilities. While the randomness of feature selection is preserved, the probability that effective features are selected is raised, so that an algorithm using the selected features can make full use of the effective feature information, improving the algorithm's accuracy.
Optionally, the normalization module 402 is specifically configured to perform min-max normalization on the first metric value set.
Optionally, the difference enlarging module 403 is specifically configured to square each metric value in the normalized first metric value set to enlarge the differences between the metric values, obtaining the second metric value set.
Optionally, as shown in Fig. 5, the difference enlarging module 403 includes a clustering unit 501 and a difference processing unit 502.
The clustering unit 501 is configured to cluster the metric values in the normalized first metric value set, obtaining a plurality of clusters, each cluster including at least one metric value.
The difference processing unit 502 is configured to perform difference-enlarging processing on the metric values in each cluster according to a preset strategy, obtaining the second metric value set.
Further, as shown in Fig. 5, the difference processing unit 502 includes a density computing subunit 5021 and a difference processing subunit 5022.
The density computing subunit 5021 is configured to determine the boundary points of each cluster and the number of metric values each cluster includes, and to determine the density of each cluster from its boundary points and the number of metric values it includes.
The difference processing subunit 5022 is configured to judge whether the density of each cluster exceeds the preset density, and to perform difference-enlarging processing on the metric values in the clusters whose density exceeds the preset density.
Further, difference processing subelement 5022 is specifically used for: expanding the boundary of cluster to be processed, wherein cluster to be processed It is greater than the cluster of pre-set density for density;Boundary before being expanded according to cluster to be processed and the boundary after expansion determine sampling factor;Root Expand the distance between each metric in cluster to be processed according to sampling factor.
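A hedged sketch of the density test and boundary expansion described above; the density threshold, the widening factor, and rescaling about the cluster centre are illustrative assumptions, since the patent leaves these parameters open:

```python
def expand_dense_cluster(cluster, density_threshold=5.0, widen=1.5):
    """Density = number of metric values / boundary width.

    If the cluster is denser than the threshold, widen its boundary and
    rescale each metric value about the cluster centre by the sampling
    factor (new width / old width), spreading the values apart.
    """
    lo, hi = min(cluster), max(cluster)
    width = hi - lo or 1e-9           # guard degenerate one-point clusters
    density = len(cluster) / width
    if density <= density_threshold:
        return list(cluster)          # sparse enough: leave unchanged
    new_width = width * widen         # expanded boundary
    factor = new_width / width        # sampling factor
    centre = (lo + hi) / 2
    return [centre + (v - centre) * factor for v in cluster]
```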
Optionally, the device 40 of this embodiment further includes a smoothing module 405, configured to apply Laplace smoothing to the second measurement value set.
Correspondingly, the feature selection module 404 is specifically configured to input the metric values in the Laplace-smoothed second measurement value set into the roulette model as the fitness of each candidate feature.
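Laplace smoothing here can be read as adding a small constant to every metric value so that zero-fitness features keep a non-zero selection probability in the roulette model; the constant `alpha` is an illustrative assumption:

```python
def laplace_smooth(values, alpha=1e-3):
    """Add a small constant to each metric value.

    This keeps features whose metric squared to (near) zero from being
    permanently excluded by fitness-proportionate selection.
    """
    return [v + alpha for v in values]
```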
The device for randomly selecting features of this embodiment is based on the same inventive concept as the methods for randomly selecting features of Embodiments One and Two, and achieves the same technical effects, which are not repeated here.
Embodiment Four
An embodiment of the present application provides an electronic device. As shown in Fig. 6, the electronic device 600 includes a processor 601 and a memory 603, the processor 601 being connected to the memory 603, for example via a bus 602. Optionally, the electronic device 600 may further include a transceiver 604. Note that in practical applications the transceiver 604 is not limited to one, and the structure of the electronic device 600 does not limit the embodiments of the present application.
The processor 601 is applied in the embodiments of the present application to realize the functions of the metric value determining module 401, the standardization module 402, the difference expansion module 403 and the feature selection module 404 shown in Fig. 4. The transceiver 604 includes a receiver and a transmitter.
The processor 601 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules and circuits described in the present disclosure. The processor 601 may also be a combination realizing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 602 may include a path for transferring information between the above components. The bus 602 may be a PCI bus, an EISA bus, etc., and may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is shown in Fig. 6, but this does not mean that there is only one bus or only one type of bus.
The memory 603 may be a ROM or other static storage device capable of storing static information and instructions, a RAM or other dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
Optionally, the memory 603 stores the application code for executing the solution of the present application, and execution is controlled by the processor 601. The processor 601 executes the application code stored in the memory 603 to realize the actions of the device for randomly selecting features provided by the embodiment shown in Fig. 4.
Compared with the prior art, the electronic device provided by the embodiments of the present application expands the differences between the metric values of the features, and thereby the differences between their corresponding selection probabilities, so that the gap between the selection probabilities of high-metric and low-metric features becomes larger. While the randomness of feature selection is preserved, the probability that good features are selected is increased, so that the algorithm that finally uses the selected features can make full use of effective feature information, improving its accuracy.
Optionally, the processor 601 executes the application code stored in the memory 603 to realize the actions of the device for randomly selecting features provided by the embodiment shown in Fig. 5, which are not repeated here.
Embodiment Five
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method for randomly selecting features of Embodiment One is realized.
Compared with the prior art, the computer-readable storage medium provided by this embodiment expands the differences between the metric values of the features, and thereby the differences between their corresponding selection probabilities, so that the gap between the selection probabilities of high-metric and low-metric features becomes larger. While the randomness of feature selection is preserved, the probability that good features are selected is increased, so that the algorithm that finally uses the selected features can make full use of effective feature information, improving its accuracy.
Optionally, an embodiment of the present application also provides a computer-readable storage medium storing a computer program; when the program is executed by a processor, the method for randomly selecting features of Embodiment Two is realized, which is not repeated here.
It should be understood that although the steps in the flowcharts of the drawings are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless expressly stated herein, there is no strict restriction on the order of their execution, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is also not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The above are only some embodiments of the present application. It should be noted that persons of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (10)

1. A method for randomly selecting features, characterized by comprising:
determining a metric value for each candidate feature to obtain a first measurement value set;
normalizing the first measurement value set;
expanding, by a preset algorithm, the differences between the metric values in the normalized first measurement value set to obtain a second measurement value set;
inputting the metric values in the second measurement value set into a roulette model as the fitness of each candidate feature, and taking the feature output by the roulette model as the selected feature.
2. The method according to claim 1, wherein normalizing the first measurement value set comprises:
performing min-max normalization on the first measurement value set.
3. The method according to claim 1, wherein expanding, by a preset algorithm, the differences between the metric values in the normalized first measurement value set to obtain a second measurement value set comprises:
squaring each metric value in the normalized first measurement value set to expand the differences between the metric values, obtaining the second measurement value set.
4. The method according to claim 1, wherein expanding, by a preset algorithm, the differences between the metric values in the normalized first measurement value set to obtain a second measurement value set comprises:
clustering the metric values in the normalized first measurement value set to obtain multiple clusters, each cluster containing at least one metric value;
applying difference-expanding processing to the metric values in each cluster according to a preset strategy to obtain the second measurement value set.
5. The method according to claim 4, wherein applying difference-expanding processing to the metric values in each cluster according to a preset strategy comprises:
determining the boundary points of each cluster and the number of metric values each cluster contains;
determining the density of each cluster from its boundary points and the number of metric values it contains;
judging whether the density of each cluster exceeds a preset density, and applying difference-expanding processing to the metric values in clusters whose density exceeds the preset density.
6. The method according to claim 5, wherein applying difference-expanding processing to the metric values in a cluster whose density exceeds the preset density comprises:
expanding the boundary of a cluster to be processed, wherein the cluster to be processed is a cluster whose density exceeds the preset density;
determining a sampling factor from the boundary of the cluster to be processed before expansion and the boundary after expansion;
expanding the distances between the metric values in the cluster to be processed according to the sampling factor.
7. The method according to any one of claims 1 to 6, wherein before inputting the metric values in the second measurement value set into the roulette model as the fitness of each candidate feature, the method further comprises: applying Laplace smoothing to the second measurement value set;
and wherein inputting the metric values in the second measurement value set into the roulette model as the fitness of each candidate feature comprises: inputting the metric values in the Laplace-smoothed second measurement value set into the roulette model as the fitness of each candidate feature.
8. A device for randomly selecting features, characterized by comprising:
a metric value determining module, configured to determine a metric value for each candidate feature to obtain a first measurement value set;
a standardization module, configured to normalize the first measurement value set;
a difference expansion module, configured to expand, by a preset algorithm, the differences between the metric values in the normalized first measurement value set to obtain a second measurement value set;
a feature selection module, configured to input the metric values in the second measurement value set into a roulette model as the fitness of each candidate feature, and to take the feature output by the roulette model as the selected feature.
9. An electronic device, characterized by comprising:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to execute the method for randomly selecting features according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store computer instructions which, when run on a computer, cause the computer to execute the method for randomly selecting features according to any one of claims 1 to 7.
CN201810892174.1A 2018-08-07 2018-08-07 Method, device, electronic equipment and storage medium for randomly selecting characteristics Active CN109255368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810892174.1A CN109255368B (en) 2018-08-07 2018-08-07 Method, device, electronic equipment and storage medium for randomly selecting characteristics

Publications (2)

Publication Number Publication Date
CN109255368A true CN109255368A (en) 2019-01-22
CN109255368B CN109255368B (en) 2023-12-22

Family

ID=65049766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810892174.1A Active CN109255368B (en) 2018-08-07 2018-08-07 Method, device, electronic equipment and storage medium for randomly selecting characteristics

Country Status (1)

Country Link
CN (1) CN109255368B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080219498A1 (en) * 2007-03-05 2008-09-11 Siemens Corporate Research, Inc. Visual discrimination model for single image applications
CN101853485A (en) * 2010-06-04 2010-10-06 浙江工业大学 Non-uniform point cloud simplification processing method based on neighbor communication cluster type
CN101853526A (en) * 2010-06-04 2010-10-06 浙江工业大学 Density self-adapting non-uniform point cloud simplifying treatment method
CN103049750A (en) * 2013-01-11 2013-04-17 广州广电运通金融电子股份有限公司 Character recognition method
CN103942318A (en) * 2014-04-25 2014-07-23 湖南化工职业技术学院 Parallel AP propagating XML big data clustering integration method
CN105139035A (en) * 2015-08-31 2015-12-09 浙江工业大学 Mixed attribute data flow clustering method for automatically determining clustering center based on density
AU2014277847A1 (en) * 2014-12-22 2016-07-07 Canon Kabushiki Kaisha A method or computing device for configuring parameters of a feature extractor
CN106503731A (en) * 2016-10-11 2017-03-15 南京信息工程大学 A kind of based on conditional mutual information and the unsupervised feature selection approach of K means
CN106778814A (en) * 2016-11-24 2017-05-31 郑州航空工业管理学院 A kind of method of the removal SAR image spot based on projection spectral clustering
CN106874923A (en) * 2015-12-14 2017-06-20 阿里巴巴集团控股有限公司 A kind of genre classification of commodity determines method and device
CN107943830A (en) * 2017-10-20 2018-04-20 西安电子科技大学 A kind of data classification method suitable for higher-dimension large data sets
CN108038500A (en) * 2017-12-07 2018-05-15 东软集团股份有限公司 Clustering method, device, computer equipment, storage medium and program product

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816034A (en) * 2019-01-31 2019-05-28 清华大学 Signal characteristic combines choosing method, device, computer equipment and storage medium
CN109816034B (en) * 2019-01-31 2021-08-27 清华大学 Signal characteristic combination selection method and device, computer equipment and storage medium
CN111782734A (en) * 2019-04-04 2020-10-16 华为技术服务有限公司 Data compression and decompression method and device
CN111782734B (en) * 2019-04-04 2024-04-12 华为技术服务有限公司 Data compression and decompression method and device
CN110223264A (en) * 2019-04-26 2019-09-10 中北大学 Image difference characteristic attribute fusion availability distributed structure and synthetic method based on intuition possibility collection
CN110223264B (en) * 2019-04-26 2022-03-25 中北大学 Image difference characteristic attribute fusion validity distribution structure based on intuition possibility set and synthesis method

Also Published As

Publication number Publication date
CN109255368B (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN107103171B (en) Modeling method and device of machine learning model
CN104239351B (en) A kind of training method and device of the machine learning model of user behavior
CN106646158A (en) Transformer fault diagnosis improving method based on multi-classification support vector machine
CN109255368A (en) Randomly select method, apparatus, electronic equipment and the storage medium of feature
CN108051660A (en) A kind of transformer fault combined diagnosis method for establishing model and diagnostic method
CN105224872A (en) 2016-01-06 A kind of user's anomaly detection method based on neural network clustering
CN111198938A (en) Sample data processing method, sample data processing device and electronic equipment
CN105354595A (en) Robust visual image classification method and system
Zhou et al. Convolutional neural networks based pornographic image classification
CN110334356A (en) Article matter method for determination of amount, article screening technique and corresponding device
CN108846097A (en) The interest tags representation method of user, article recommended method and device, equipment
CN110096499A (en) A kind of the user object recognition methods and system of Behavior-based control time series big data
CN103617146B (en) A kind of machine learning method and device based on hardware resource consumption
CN109918444A (en) Training/verifying/management method/system, medium and equipment of model result
CN106980639B (en) Short text data aggregation system and method
CN109272312A (en) Method and apparatus for transaction risk detecting real-time
CN109472307A (en) A kind of method and apparatus of training image disaggregated model
CN111626360B (en) Method, apparatus, device and storage medium for detecting boiler fault type
CN105224954A (en) A kind of topic discover method removing the impact of little topic based on Single-pass
CN116611911A (en) Credit risk prediction method and device based on support vector machine
CN109766333A (en) Data processing empty value method, apparatus and terminal device
CN112597699B (en) Social network rumor source identification method integrated with objective weighting method
CN109241146A (en) Student's intelligence aid method and system under cluster environment
CN108280224A (en) Ten thousand grades of dimension data generation methods, device, equipment and storage medium
CN104331398B (en) Generate the method and device of synonymous word alignment dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant