CN110458199A - Sampling method based on Kohonen neural network clustering - Google Patents

Sampling method based on Kohonen neural network clustering

Info

Publication number
CN110458199A
Authority
CN
China
Prior art keywords
sampling
sample
neural network
enterprise
attribute feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910641516.7A
Other languages
Chinese (zh)
Inventor
王妍
卿枫
陈云鹏
檀雷雷
胡菁
樊珑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-07-16
Filing date
2019-07-16
Publication date
2019-11-15
Application filed by Communication University of China
Priority to CN201910641516.7A
Publication of CN110458199A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/10 Tax strategies

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a sampling method based on Kohonen neural network clustering, comprising the following steps: recording the data and extracting attribute features; performing Kohonen neural network clustering on the extracted attribute features; determining the total sample size from the relative error; assigning a different sample size to each class according to the class each sample falls into after clustering and the attribute features of that class; and, once the sample size of each class is determined, performing stratified sampling within each class with correspondingly assigned weights to obtain the final sample. The method ensures that the sample points do not concentrate on one type of enterprise in the population, while still drawing the enterprises that actually need to be surveyed. It mitigates the influence on the sampling result of the "Pareto effect" found among export enterprises and of the limitations of traditional sampling methods, and it avoids drawing a large number of enterprises with identical attribute features.

Description

Sampling method based on Kohonen neural network clustering
Technical field
The present invention relates to the technical field of neural networks, and more particularly to a sampling method based on Kohonen neural network clustering.
Background art
Technical trade measures correspond to the term "Technical Barriers to Trade" (TBT) in the WTO system. They are generally non-tariff measures. As the globalized economy continues to develop, the role of tariffs in international trade in goods keeps shrinking, while the influence of technical trade measures on international trade grows, and they have become an effective means for countries to pursue economic and political goals. In practice, technical trade measures take three main forms, namely technical regulations, standards, and conformity assessment procedures, which together form the first barrier that foreign-trade goods meet on entering a market. Chinese export enterprises are increasingly affected by technical trade measures, so a sample survey of export enterprises is needed in order to understand, at low cost but comprehensively, how Chinese export enterprises are affected by these measures.
A sample survey is one of the common survey methods. It is a non-exhaustive survey: a part of the whole set of research objects (the population) is drawn as a sample, the sample is surveyed in full, and the population is estimated from it. By the way the sample is drawn, sampling divides into non-probability sampling and probability sampling. This document is mainly concerned with probability sampling, in which some units are drawn from the population at random according to a procedure designed in advance. Compared with non-probability sampling, probability sampling allows the error to be controlled in a probabilistic sense. For each particular problem, various sampling methods can be derived on this basis, each with its own advantages and drawbacks. When the problem is fairly simple, for example when only a single sampling frame is sampled, the conclusions drawn by the different methods and the representativeness of their samples may not differ much. When several sampling frames are involved, however, each frame cannot simply be sampled on its own: there may be hidden connections between the frames, and sampling them separately may make the sample unrepresentative of the overall data structure, which biases the estimate of the population.
Therefore, how to provide a sampling method based on Kohonen neural network clustering that keeps the sample points from being one-sided within the population and still draws the enterprises that need to be sampled is a problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of this, the present invention provides a sampling method based on Kohonen neural network clustering. The method ensures that the sample points do not concentrate on one type of enterprise in the population, while still drawing the enterprises that actually need to be surveyed.
To achieve the above object, the invention provides the following technical solution:
A sampling method based on Kohonen neural network clustering, comprising the following steps:
determining the total sample size from the relative error;
recording the data and extracting attribute features;
performing Kohonen neural network clustering on the extracted attribute features to obtain the class of each sample;
assigning a different sample size to each class according to the class each sample falls into after clustering and the attribute features of that class;
after the sample size of each class is determined, assigning a sampling weight to each city within each class in proportion to the city counts, and performing stratified sampling within each class according to these weights to obtain the final sample.
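The final step above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: city weights are taken to be proportional to the number of enterprises of the class located in each city, rounding uses the largest-remainder rule, and all function names and the record layout are hypothetical.

```python
import random
from collections import defaultdict

def allocate_proportionally(total, counts):
    """Split `total` across the keys of `counts` in proportion to their counts,
    using largest-remainder rounding so the parts add up to `total`."""
    population = sum(counts.values())
    exact = {k: total * c / population for k, c in counts.items()}
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = total - sum(alloc.values())
    # Hand the remaining units to the keys with the largest fractional parts.
    for k in sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)[:leftover]:
        alloc[k] += 1
    return alloc

def stratified_sample(enterprises, class_sizes, seed=0):
    """enterprises: list of (enterprise_id, class_label, city) tuples.
    class_sizes: {class_label: sample size already decided for that class}."""
    rng = random.Random(seed)
    by_class_city = defaultdict(lambda: defaultdict(list))
    for ent_id, label, city in enterprises:
        by_class_city[label][city].append(ent_id)
    sample = []
    for label, n_class in class_sizes.items():
        cities = by_class_city[label]
        counts = {city: len(ids) for city, ids in cities.items()}
        n_class = min(n_class, sum(counts.values()))  # cannot draw more than exist
        for city, n_city in allocate_proportionally(n_class, counts).items():
            # Simple random sampling without replacement inside each city.
            sample.extend(rng.sample(cities[city], n_city))
    return sample
```

Largest-remainder rounding is one simple way to keep the per-city sizes consistent with the class total; other rounding rules would serve equally well.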
Preferably, in the above sampling method based on Kohonen neural network clustering, the attribute features include, but are not limited to: export amount, number of export destination countries, number of exported product types, and city of location.
Preferably, in the above sampling method based on Kohonen neural network clustering, the Kohonen neural network is a neural network with only an input layer and a hidden layer; each node in the hidden layer represents one class to be formed; for each input unit, exactly one hidden-layer node produces an output during competitive learning, and that node is the class of the input unit.
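The competitive learning described above can be illustrated with a minimal sketch. It assumes a one-dimensional line of hidden nodes, a Gaussian neighbourhood, and linearly decaying learning rate and radius; none of these hyperparameters, nor the function names, come from the patent text.

```python
import math
import random

def nearest_node(weights, x):
    """Index of the node whose weight vector is closest to x (the competition winner)."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))

def train_kohonen(data, n_nodes, epochs=30, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D Kohonen map; returns one weight vector per hidden node."""
    rng = random.Random(seed)
    # Start each node at a randomly chosen input vector.
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            decay = 1.0 - step / total_steps
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 1e-3)
            winner = nearest_node(weights, x)
            for j, w in enumerate(weights):
                # The winner moves most; nodes further along the line move less.
                h = math.exp(-((j - winner) ** 2) / (2.0 * sigma ** 2))
                for d in range(len(w)):
                    w[d] += lr * h * (x[d] - w[d])
            step += 1
    return weights

def assign_classes(data, weights):
    """Class of each input = index of its winning hidden node."""
    return [nearest_node(weights, x) for x in data]
```

In the embodiment the map would have 12 hidden nodes, and each input vector would hold an enterprise's standardized export amount, number of export destination countries, and number of exported product types.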
Preferably, in the above sampling method based on Kohonen neural network clustering, the specific steps for determining the total sample size from the relative error are:
According to the relation between relative error and sample size in sampling theory:
Transforming formula (1) gives the following relation:
The final formula for determining the total sample size is as follows:
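The relations referred to above are not reproduced in this text. As a hedged reconstruction rather than the patent's own notation, one standard sampling-theory relation for estimating a proportion reproduces the unknown-population bounds quoted in the embodiment (1537 and 3602). Here r is the relative error, P the estimated proportion, n the sample size, N the population size, and t the normal quantile (assumed to be 1.96, i.e. 95% confidence); the symbols and the finite-population form are assumptions.

```latex
% Relative error of an estimated proportion (large population):
r = t \sqrt{\frac{1 - P}{n P}}
% Solving for the sample size:
n_0 = \frac{t^2 (1 - P)}{r^2 P}
% A common finite-population adjustment when N is known:
n = \frac{n_0}{1 + n_0 / N}
```

With t = 1.96, P = 0.5 and r = 0.05 this gives n_0 = 1536.64, and with P = 0.4 and r = 0.04 it gives n_0 = 3601.5.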
As can be seen from the above technical solution, compared with the prior art, the present disclosure provides a sampling method based on Kohonen neural network clustering. Its sampling purpose differs from that of traditional sampling methods: the purpose here is to draw the enterprises that are actually affected by technical trade measures so that follow-up field surveys can be carried out. The invention takes full account of the "Pareto effect" among export enterprises, the connections between enterprise attributes, and the limitations of traditional sampling methods, and provides an improved sampling algorithm for export enterprise data. The method ensures that the sample points do not concentrate on one type of enterprise in the population, while still drawing the enterprises that actually need to be surveyed.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the workflow diagram of the invention;
Fig. 2 is the flowchart of the improved sampling algorithm of the invention;
Fig. 3 is a line chart of the class means obtained after standardizing the data of each enterprise;
Fig. 4 shows the relationship between sampling error and sample size in the invention;
Fig. 5 compares, in terms of export amount, the results of traditional stratified sampling with the results of the improved sampling algorithm of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.
The embodiment of the invention discloses an improved sampling algorithm based on the result of a Kohonen neural network algorithm. It ensures that the sample points do not concentrate on one type of enterprise in the population, while still drawing the enterprises that actually need to be surveyed. It mitigates the influence on the sampling result of the "Pareto effect" found among export enterprises and of the limitations of traditional sampling methods, and it avoids drawing a large number of enterprises with identical attribute features. In this sampling process, the export enterprise data are first processed and the attribute information of each enterprise is extracted; the enterprises are then clustered with the enterprise attributes as input variables; and, on the basis of the clustering result, traditional sampling is carried out within each class.
Taking the export enterprises of Guangdong Province as an example, as shown in Fig. 1, the method flow of the improved sampling algorithm based on the Kohonen neural network result mainly comprises the following steps: (1) from the Guangdong export enterprise data, compile four variables for each enterprise: export amount, number of export destination countries, number of exported product types, and city of location; (2) perform Kohonen neural network clustering on export amount, number of export destination countries, and number of exported product types, dividing the enterprises into 12 classes; (3) take the Guangdong export enterprises as the sampling population and determine the total sample size according to sampling theory; (4) determine the sample size within each class; (5) determine the sample size of each region within each class; (6) draw the sample in each region by simple random sampling; (7) compare the distribution over export amount of the samples drawn by the improved sampling method and by the traditional sampling methods.
The defining property of stratified sampling is that, after stratification, differences within a stratum are small and differences between strata are large. The purpose of a clustering algorithm is to partition the data into clusters such that features are similar within a cluster and differ greatly between clusters. Given this match, the result of a clustering algorithm is well suited as the basis for stratification. The invention therefore clusters the enterprises before the traditional stratified sampling step. This takes full account of the "Pareto effect" among Guangdong export enterprises and of the connections between enterprise attributes, and it introduces a suitable stratification factor, so that every type of enterprise is covered and the enterprises strongly affected by technical trade measures are sampled with emphasis. The detailed flowchart is shown in Fig. 2. Taking the Guangdong data as an example, clustering divides the enterprises into 12 classes, and the attribute differences between the classes are shown in Fig. 3. Existing research indicates that the enterprises strongly affected by technical trade measures are those with a large export amount, many export destination countries, and many exported product types. From Fig. 3, the classes most affected by technical trade measures are class 1 and class 4, so they need to be allocated a larger sample size in the subsequent sampling.
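The per-class comparison plotted in Fig. 3 (class means of standardized attributes) could be computed along these lines. This is a minimal sketch: the z-score standardization, the function names, and the example figures are assumptions for illustration, not data from the patent.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column of `rows` (a list of equal-length numeric tuples)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def class_means(rows, labels):
    """Mean standardized value of every attribute, per cluster label."""
    groups = defaultdict(list)
    for z, label in zip(standardize(rows), labels):
        groups[label].append(z)
    return {label: tuple(mean(col) for col in zip(*zs)) for label, zs in groups.items()}

# Hypothetical enterprises: (export amount, destination countries, product types).
rows = [(100.0, 10, 5), (80.0, 8, 4), (1.0, 1, 1), (2.0, 1, 1)]
labels = [1, 1, 2, 2]
means = class_means(rows, labels)
```

A class whose means are well above zero on all three attributes, as class 1 is in this toy data, is the kind of class the text singles out for a larger sample allocation.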
Fig. 4 shows the relationship between sampling error and sample size. Existing research weighs sampling error against sampling cost and controls the relative error at 4%-5%; with the estimated proportion P of enterprises affected by technical trade measures at 40%-50%, a preliminary range for the sample size can be given. Taking export enterprises nationwide as an example, the sample size ranges from 1529 to 3559 when the population size is known and from 1537 to 3602 when it is unknown. To keep the sampling scientifically sound while reducing cost, and following the conservative principle that a sample should err on the large side rather than the small side, a sample size of 2700 is appropriate for Guangdong Province in this survey. Combining this with the earlier analysis of the differences between the enterprise attributes of each class, the sample sizes allocated to classes 1 to 12 are, respectively: 810, 270, 135, 540, 135, 113, 135, 112, 112, 112, 112, 112.
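The quoted ranges can be checked numerically. The sketch below assumes the standard proportion formula n0 = t^2 (1 - P) / (r^2 P) with t = 1.96 and the usual finite-population adjustment n = n0 / (1 + n0 / N); the population size N = 300000 is a hypothetical placeholder, since the text does not state it.

```python
import math

T_95 = 1.96  # assumed normal quantile for 95% confidence

def sample_size(rel_error, p, population=None, t=T_95):
    """Sample size needed to estimate a proportion p within relative error rel_error."""
    n0 = t ** 2 * (1.0 - p) / (rel_error ** 2 * p)
    if population is not None:
        # Finite-population adjustment (assumed form).
        n0 = n0 / (1.0 + n0 / population)
    return math.ceil(n0)

# Unknown population: loosest and tightest corners of the stated ranges.
print(sample_size(0.05, 0.5), sample_size(0.04, 0.4))  # 1537 3602

# Known population, with a hypothetical N.
N_ASSUMED = 300_000
print(sample_size(0.05, 0.5, N_ASSUMED), sample_size(0.04, 0.4, N_ASSUMED))  # 1529 3559
```

The unknown-population results match the 1537 to 3602 range in the text. The hypothetical N also happens to reproduce 1529 and 3559, but that only shows the figures are consistent with a population of roughly this size, not what the actual population was.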
Fig. 5 shows the result of pooling the samples drawn by the three sampling methods, sorting them together by export amount from largest to smallest, and then counting them segment by segment. For example, among the sampled enterprises in the top 10% by export amount, 62.3% were drawn by the improved sampling algorithm and the remaining 37.7% by traditional sampling. Fig. 5 shows that, whether viewed overall or segment by segment, among the top 40% of enterprises by export amount the number drawn by Kohonen stratified sampling is far higher than the number drawn by simple random sampling or city-stratified sampling, whose samples rank lower overall in export amount.
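The pooled comparison behind Fig. 5 can be sketched as follows: pool the samples from each method, sort by export amount in descending order, cut the pooled list into equal segments, and report each method's share per segment. The function name and the toy amounts are hypothetical.

```python
def share_by_segment(samples, n_segments=10):
    """samples: {method_name: list of export amounts drawn by that method}.
    Returns one dict per segment (largest amounts first) mapping each method
    to its share of the enterprises in that segment."""
    pooled = sorted(
        ((amount, method) for method, amounts in samples.items() for amount in amounts),
        key=lambda pair: pair[0],
        reverse=True,
    )
    n = len(pooled)
    shares = []
    for s in range(n_segments):
        chunk = pooled[s * n // n_segments:(s + 1) * n // n_segments]
        counts = {method: 0 for method in samples}
        for _, method in chunk:
            counts[method] += 1
        size = len(chunk) or 1  # avoid division by zero for empty segments
        shares.append({m: c / size for m, c in counts.items()})
    return shares

# Toy data only: the improved method draws mostly large exporters.
toy = {"improved": [100, 90, 80, 70, 5], "traditional": [60, 4, 3, 2, 1]}
top_half, bottom_half = share_by_segment(toy, n_segments=2)
```

With `n_segments=10`, the first entry corresponds to the "top 10% by export amount" segment discussed in the text.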
In conclusion simple simple random sampling and existing by the enterprise that the stratified sampling in area extracts certain Limitation, that is, most of enterprise extracted out are the enterprise that export amount of money is small, exporting country is few, exporting is few, and kohonen points Layer sampling improves simple random sampling and stratified random smapling since export enterprise, Guangdong Province data have " pareto efficiency " Lead to above-mentioned drawback, and allows and be really extracted by the serious " three high " enterprise of skill trade measure.It is analyzed according to existing data, Export amount of money is small, exporting country is few, exporting lack enterprise on by skill trade measure effect affirm it is bigger than export amount of money, Exporting country is more, exporting multiple enterprises wants small.Therefore, kohonen stratified sampling is more suitable for technology trade than tradition sampling The investigation that easy measure influences Enterprises for Export in China.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be cross-referenced. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the invention. The invention is therefore not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A sampling method based on Kohonen neural network clustering, characterized by comprising the following steps: determining the total sample size from the relative error;
recording the data and extracting attribute features;
performing Kohonen neural network clustering on the extracted attribute features to obtain the class of each sample;
assigning a different sample size to each class according to the class each sample falls into after clustering and the attribute features of that class;
after the sample size of each class is determined, assigning a sampling weight to each city within each class in proportion to the city counts, and performing stratified sampling within each class according to these weights to obtain the final sample.
2. The sampling method based on Kohonen neural network clustering according to claim 1, characterized in that the attribute features include, but are not limited to: export amount, number of export destination countries, number of exported product types, and city of location.
3. The sampling method based on Kohonen neural network clustering according to claim 1, characterized in that the Kohonen neural network is a neural network with only an input layer and a hidden layer; each node in the hidden layer represents one class to be formed; for each input unit, exactly one hidden-layer node produces an output during competitive learning, and that node is the class of the input unit.
4. The sampling method based on Kohonen neural network clustering according to claim 1, characterized in that the specific steps for determining the total sample size from the relative error are:
According to the relation between relative error and sample size in sampling theory:
Transforming formula (1) gives the following relation:
The final formula for determining the total sample size is as follows:

Priority Applications (1)

Application Number  Priority Date  Filing Date  Title
CN201910641516.7A  2019-07-16  2019-07-16  Sampling method based on Kohonen neural network clustering

Publications (1)

Publication Number Publication Date
CN110458199A (en)  2019-11-15

Family

ID=68481346

Family Applications (1)

Application Number  Title  Priority Date  Filing Date  Status  Publication
CN201910641516.7A  Sampling method based on Kohonen neural network clustering  2019-07-16  2019-07-16  Pending  CN110458199A (en)

Country Status (1)

Country Link
CN (1) CN110458199A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102716A (en) * 2014-07-17 2014-10-15 哈尔滨理工大学 Imbalance data predicting method based on cluster stratified sampling compensation logic regression
CN104794335A (en) * 2015-04-15 2015-07-22 同济大学 General multistage space sampling method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
叶伟: "《市场调查与预测》", 31 August 2011, 北京理工大学出版社 *
周晓苏等: "《会计研究中的数据挖掘方法》", 30 April 2009, 天津:南开大学出版社 *
张家林: "《证券投资人工智能:人工智能时代的财富管理变革》", 31 January 2017, 中国经济出版社 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967771A (en) * 2020-08-18 2020-11-20 深圳市维度统计咨询股份有限公司 Data quality management method and device based on big data and storage medium
CN112215640A (en) * 2020-10-09 2021-01-12 浪潮卓数大数据产业发展有限公司 Network retail platform shop sampling method based on statistical estimation
CN112215640B (en) * 2020-10-09 2022-07-26 浪潮卓数大数据产业发展有限公司 Network retail platform shop sampling method based on statistical estimation

Similar Documents

Publication Publication Date Title
CN110852856B (en) Invoice false invoice identification method based on dynamic network representation
CN104462184B (en) A kind of large-scale data abnormality recognition method based on two-way sampling combination
Nachman et al. Neural resampler for Monte Carlo reweighting with preserved uncertainties
CN110458199A (en) Sampling method based on Kohonen neural network clustering
CN109034194A (en) Transaction swindling behavior depth detection method based on feature differentiation
CN111062425B (en) Unbalanced data set processing method based on C-K-SMOTE algorithm
CN107132266A (en) A kind of Classification of water Qualities method and system based on random forest
CN113516228B (en) Network anomaly detection method based on deep neural network
CN110348608A (en) A kind of prediction technique for improving LSTM based on fuzzy clustering algorithm
CN111291779A (en) Vehicle information identification method and system, memory and processor
CN110244099A (en) Stealing detection method based on user's voltage
CN104850868A (en) Customer segmentation method based on k-means and neural network cluster
CN109886352A (en) A kind of unsupervised appraisal procedure of airspace complexity
CN103942415A (en) Automatic data analysis method of flow cytometer
CN109345684A (en) A kind of multinational paper money number recognition methods based on GMDH-SVM
Guo et al. An improved oversampling method for imbalanced data–SMOTE based on Canopy and K-means
CN117725437B (en) Machine learning-based data accurate matching analysis method
CN108763926B (en) Industrial control system intrusion detection method with safety immunity capability
CN109389517B (en) Analysis method and device for quantifying line loss influence factors
CN110321376A (en) A kind of data fabrication investigation method based on Ben Fute law
CN113392877A (en) Daily load curve clustering method based on ant colony algorithm and C-K algorithm
CN105824785A (en) Rapid abnormal point detection method based on penalized regression
CN113435536A (en) Electricity charge data preprocessing method, device, terminal equipment and medium
CN113837865A (en) Method for extracting multi-dimensional risk feature strategy
CN107563421A (en) One kind loss similitude feeder line sorting technique

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191115