CN114841269A - Sparse data-based machine learning model construction method and storage medium - Google Patents

Sparse data-based machine learning model construction method and storage medium

Info

Publication number
CN114841269A
CN114841269A
Authority
CN
China
Prior art keywords
data
sample
machine learning
learning model
business
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210489017.2A
Other languages
Chinese (zh)
Inventor
靳谊
蔡明渊
付晨雨
邹新明
丁治凯
张星亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FUJIAN RONGJI SOFTWARE CO LTD
Original Assignee
FUJIAN RONGJI SOFTWARE CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FUJIAN RONGJI SOFTWARE CO LTD filed Critical FUJIAN RONGJI SOFTWARE CO LTD
Priority to CN202210489017.2A priority Critical patent/CN114841269A/en
Publication of CN114841269A publication Critical patent/CN114841269A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sparse-data-based machine learning model construction method and a storage medium. The method comprises the following steps: acquiring auxiliary data according to preset keywords; acquiring business data and determining the data items corresponding to the input items and the data item corresponding to the output item; labeling the business data in which the value of the data item corresponding to the output item is not null; obtaining a first sample from the business data; taking the labeled business data in the first sample as a second sample; synthesizing feature items through a feature synthesis technique and merging them into the second sample as input items; balancing positive and negative samples in the second sample with the Synthetic Minority Oversampling Technique (SMOTE) and taking the newly synthesized sample data as a third sample; combining the second sample and the third sample into a fourth sample; and training on the fourth sample with a preset machine learning algorithm to obtain a machine learning model. The invention can improve the accuracy of the machine learning model.

Description

Sparse data-based machine learning model construction method and storage medium
The present application is a divisional application of the invention patent entitled "Method for constructing a machine learning model and computer-readable storage medium", filed on September 10, 2019 under application number 201910850536.5.
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method for constructing a machine learning model and a computer-readable storage medium.
Background
The sampling methods most commonly used in port inspection and quarantine supervision are set out in the "Administrative Regulations on Entry-Exit Inspection and Quarantine Procedures" and the "Table of Inspection and Quarantine Batch Sampling Ratios and Processing Time Limits" issued by the General Administration of Quality Supervision in October 2017. These specify, for different product categories, the field inspection and quarantine ratio and the laboratory testing ratio (the sampling and sample-submission ratio) under different conditions. Analysis shows that, apart from some products whose sampling requirements are differentiated by product risk grade, enterprise risk grade, and registration and filing requirements, imported goods in most product categories are supervised and spot-checked at ports by simple random sampling. Such differences in sampling ratios as do exist derive essentially from the inspection and quarantine declaration items of imported commodities (now the import declaration items of the customs declaration).
In practice, beyond enterprise risk and product risk grading, many commodities, particularly fresh goods and bulk goods not transported by cold chain, are affected by environmental factors, price factors, or transport conditions and exhibit differentiated non-conformity. In international transport (especially sea transport), conditions at the ports where a vessel calls are complex, and infection by biological vectors easily occurs while goods and containers are loaded and unloaded. Some critical process and environment information is not reflected in the import declaration items, so omissions easily occur during port risk identification.
The main manifestations are as follows:
1. Because the country's coastline and border are long and the environments of individual ports differ greatly, the relevant health quarantine content and methods differ to some extent. At southern ports, environmental factors such as temperature and humidity favor the survival and propagation of vectors, so vector organisms such as rodents and mosquitoes require focused spot checks, detection, prevention, and control; at northern ports, especially in the extremely low temperatures of winter, the environment cannot support the survival and reproduction of mosquitoes, and the spot-check, detection, and control requirements for vector organisms can reasonably be relaxed within a certain range.
2. Because temperature and humidity change with the seasons, the applicable spot-check and testing methods for commodities imported through the same port in different seasons, especially fresh goods, are inconsistent. For example, fresh fruit imported in summer without cold-chain transport often requires attention, during inspection, to property changes caused by high temperature and humidity. In such a high-temperature, high-humidity environment, the sampling and testing requirements should differ from those in other seasons.
3. Because of changes in voyage duration, fresh and bulk goods may rot or deteriorate en route, affecting their properties. When the voyage or transport period exceeds the shelf life of fresh commodities or the normal transport time, the sampling and inspection requirements for those commodities differ from those under a normal voyage.
4. Based on epidemic-area information issued by authoritative bodies (such as the WHO), goods whose means of transport has called at a port in an epidemic area should be treated differently from other products in the sampling and inspection process.
5. Because a country or region of origin may host some epidemic disease, whether temporarily or long term, the difference between commodities from epidemic-area countries/regions and those from non-epidemic-area countries/regions should be reflected in their sampling, testing, and control.
6. Because declaring enterprises, production enterprises, consignees, and similar parties vary in integrity and qualification, concealment and under-reporting occur, and enterprises with low integrity ratings and high risk show a different distribution from other enterprises.
7. When a commodity's price deviates greatly from the average market price for such goods, it should be treated differently from other products in the sampling process.
In addition, analysis of the officially published 2018 customs statistics for the "dual randomness, one disclosure" program shows that the hit rate of post-release random audits by port supervision departments was as high as 20.14%, exceeding 60% at some ports. These figures indicate that the effectiveness of commodity sampling in port supervision (i.e., sampling the non-conforming commodities) still needs to be improved.
In summary, port supervision must take local and temporal conditions into account and, based on multi-dimensional comprehensive analysis of risk factors, precisely customize the sampling methods for each kind of commodity so as to achieve accurate real-time sampling.
The existing import and export commodity batch-sampling model determines the commodity batch-sampling rate (the inspection sampling rate and the testing sampling rate) through a comprehensive decision over a predefined set of multidimensional batch-sampling factors (such as commodity type, commodity attributes, country or region, enterprise category, and entry/exit flag).
The existing import and export commodity batch-sampling algorithm is implemented as a fixed program: the batch-sampling factors and sampling rate are its inputs, sampling is carried out with algorithms such as random hit and normal correction, and a "selected" or "not selected" result is output for each commodity. The specific process is as follows:
Step 1: Look up whether a batch-sampling record already exists for the sampling factors (derived from the commodity declaration items, such as commodity type, commodity attributes, country or region, enterprise category, and entry/exit flag); if not, create a batch-sampling record keyed by the sampling factors of the currently declared commodity, initializing the selected count and the declared count in the record to 0 and 0 respectively.
The purpose of recording the sampling state is to avoid over- or under-sampling when the sample size is small and the randomness of the algorithm is too strong.
Step 2: Look up whether a corresponding sampling rule is configured for the sampling factors. If no rule is configured, or a rule exists but its sampling rate is set to 0, set the commodity's sampling flag to "not selected" and end the process; if a rule is found, read the sampling rate configured in it.
Step 3: From the sampling record, compute the actual sampling rate as (selected count / declared count) × 100%, and check for over- or under-sampling:
① If the actual sampling rate is greater than the preset sampling rate, set the commodity's sampling flag to "not selected" and update the sampling record (declared count plus 1);
② If the actual sampling rate equals the preset sampling rate, perform a random draw at the preset rate, set the commodity's sampling flag according to the result, and update the sampling record;
③ If the actual sampling rate is less than the preset sampling rate, set the commodity's sampling flag to "selected" and update the sampling record (declared count plus 1, selected count plus 1).
Step 4: Repeat the above steps for each of the other declared commodities.
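As a rough sketch, the per-record decision of the batch-sampling (batch drawing) procedure above might look like the following. This is a hypothetical Python rendering, not the patent's actual program; the record keeps a selected count and a declared count, and the rate comparison follows steps ① to ③:

```python
import random

def batch_sampling_decision(selected, declared, preset_rate, rng=random.random):
    """One pass of the existing batch-sampling check (sketch of steps 2-3).

    `selected` / `declared` are the counters from the batch-sampling record,
    `preset_rate` is the configured sampling rate in [0, 1].  Returns the
    decision plus the updated counters.
    """
    actual_rate = selected / declared if declared else 0.0
    declared += 1                              # this declaration is counted
    if preset_rate <= 0:                       # no rule, or rate set to 0
        return False, selected, declared
    if actual_rate > preset_rate:              # over-sampled: skip this one
        return False, selected, declared
    if actual_rate < preset_rate:              # under-sampled: force selection
        return True, selected + 1, declared
    if rng() < preset_rate:                    # exactly on target: random draw
        return True, selected + 1, declared
    return False, selected, declared
```

A fresh record at (0, 0) counts as under-sampled, so the first declaration under a nonzero rate is always selected, after which the actual rate converges toward the preset rate.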
The existing import and export commodity batch-sampling algorithm has the following defects:
(1) The sampling rules involve too few sampling factors to fully reflect the degree of risk of each commodity in a complex business environment, and personalized settings for specific risks cannot be applied.
(2) The sampling rate for a given set of sampling factors cannot be raised or lowered dynamically according to each commodity's attributes, characteristics, and environmental factors and to changes in the corresponding inspection results, so real-time, accurate, dynamic sampling decisions cannot be achieved.
(3) Accumulated import and export commodity big data cannot be used to comprehensively analyze each commodity's attributes, characteristics, environmental factors, and inspection results, extract the risk factors that influence non-conformity, compute their weights, and derive a batch-sampling rate through a comprehensive decision.
(4) Machine learning cannot be used to accumulate and learn from the big-data analysis results and the comprehensive decision results, so the batch-sampling model cannot be corrected intelligently, the manual workload cannot be reduced, and the precision of the model cannot be improved.
(5) Web crawler technology cannot be used to acquire, accurately and in real time, early-warning information on sudden epidemic outbreaks around the world, subject it to big-data analysis and machine learning, and apply it to the batch-sampling model, so the capability to respond rapidly to sudden epidemic early-warning events is limited.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for constructing a machine learning model, and a computer-readable storage medium, that can improve the accuracy of the machine learning model.
To solve the above technical problem, the invention adopts the following technical scheme. A method for constructing a machine learning model comprises the following steps:
acquiring data according to preset keywords to obtain auxiliary data;
acquiring business data, and determining the data items corresponding to the input items and the data item corresponding to the output item in the business data;
labeling the business data in which the value of the data item corresponding to the output item is not null;
acquiring a first sample according to the business data;
taking the labeled business data in the first sample as a second sample;
synthesizing feature items through a feature synthesis technique according to corresponding data items in the auxiliary data and the business data, and merging the feature items into the second sample as input items;
performing positive and negative sample balancing on the second sample through the Synthetic Minority Oversampling Technique (SMOTE), and taking the newly synthesized sample data as a third sample;
combining the second sample and the third sample to obtain a fourth sample;
and training on the fourth sample through a preset machine learning algorithm to obtain a machine learning model.
The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps as described above.
The invention has the following beneficial effects. By collecting auxiliary data and, in combination with the related input items in the business data, generating feature items through a feature synthesis technique and merging them into the second sample, the number of input items can be increased and the prediction accuracy of the constructed machine learning model improved; by balancing positive and negative samples, the overfitting problem can be avoided and the accuracy of the model improved. When the constructed machine learning model is applied to the port supervision of imported goods, it can solve the problems of blind sampling and high miss rates brought by the random sampling methods of the prior art, and can greatly raise the effective sampling rate of import and export commodities, improving both the efficiency of port supervision and the overall customs-clearance efficiency of the port, and thereby reducing the warehousing and logistics costs borne by foreign trade enterprises.
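As an illustration of the feature-synthesis step described above, the following sketch joins a single auxiliary-data fact onto each business record as a new input item. The field names and the epidemic-area flag are hypothetical examples, not taken from the patent:

```python
def synthesize_features(business_rows, epidemic_countries):
    """Merge an auxiliary-data fact into each business record as a new
    feature item.  Here the (hypothetical) fact is whether the declared
    country of origin currently lies in an epidemic area."""
    enriched = []
    for row in business_rows:
        row = dict(row)  # copy, so the caller's records are not mutated
        row["origin_in_epidemic_area"] = int(
            row.get("origin_country") in epidemic_countries)
        enriched.append(row)
    return enriched
```

In practice the same pattern would be repeated for each auxiliary category (trade, price, environment), each join adding one or more synthesized input items.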
Drawings
FIG. 1 is a flow chart of a method for constructing a machine learning model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method according to a first embodiment of the present invention;
fig. 3 is a schematic diagram of an analysis result of the decision tree model according to the first embodiment of the present invention.
Detailed Description
To explain the technical content, objects, and effects of the present invention in detail, the following description is given with reference to the accompanying drawings and the embodiments.
The key idea of the invention is to make effective use of historical business data and to construct a machine learning model by applying technologies such as web crawling, big-data analysis, and sample balancing.
Explanation of terms:
Machine learning: Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and more. It studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. It is the core of artificial intelligence and the fundamental way to make computers intelligent, with applications throughout the field.
Focused web crawler: a focused web crawler (also known as a topic crawler) selectively crawls pages related to a predefined topic. Compared with a general-purpose web crawler, a focused crawler only needs to crawl pages related to its topic, which greatly saves hardware and network resources; because the number of saved pages is small, they can be updated quickly, and the crawler can well satisfy the needs of specific groups of users for information in specific fields.
Sparse data: in databases, sparse data is data whose two-dimensional tables contain a large number of null values; that is, data in which most of the values in the data set are missing or zero. Sparse data is by no means useless; its information is merely incomplete, and a large amount of useful information can be mined from it by appropriate means.
Referring to fig. 1, a method for constructing a machine learning model includes:
acquiring data according to preset keywords to obtain auxiliary data;
acquiring business data, and determining the data items corresponding to the input items and the data item corresponding to the output item in the business data;
labeling the business data in which the value of the data item corresponding to the output item is not null;
acquiring a first sample according to the business data;
taking the labeled business data in the first sample as a second sample;
synthesizing feature items through a feature synthesis technique according to corresponding data items in the auxiliary data and the business data, and merging the feature items into the second sample as input items;
performing positive and negative sample balancing on the second sample through the Synthetic Minority Oversampling Technique (SMOTE), and taking the newly synthesized sample data as a third sample;
combining the second sample and the third sample to obtain a fourth sample;
and training on the fourth sample through a preset machine learning algorithm to obtain a machine learning model.
From the above description, the beneficial effects of the present invention are: the accuracy of the machine learning model can be improved.
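The positive/negative balancing step in the list above corresponds to the Synthetic Minority Oversampling Technique (SMOTE). A minimal sketch of its core idea, interpolating between a minority-class sample and one of its nearest minority neighbours, might look like this (illustrative only, not the full SMOTE algorithm):

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """SMOTE-style sketch: create n_new synthetic points, each placed on the
    segment between a random minority sample and one of its k nearest
    minority neighbours.  `minority` is a list of numeric tuples."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p != base),
                            key=lambda p: sq_dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic
```

The synthesized points (the "third sample" of the method) lie inside the convex hull of the minority samples, so they enlarge the minority class without merely duplicating records.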
Further, the acquiring of data according to preset keywords specifically comprises:
crawling data with a focused web crawler according to a preset crawling strategy, the strategy comprising preset URLs, keywords, and a crawling time range;
and classifying, cleaning, and deduplicating the crawled data with data cleaning techniques to obtain auxiliary data of preset categories, which are stored in databases or data tables of the corresponding categories.
From the above description, by adopting crawler technology, the auxiliary data can be acquired from the Internet accurately and in real time and used to assist in training on the business data.
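A toy sketch of the classify/clean/deduplicate stage might look as follows; the category names and keywords are illustrative, echoing the epidemic and price examples used later in the embodiment:

```python
def clean_and_classify(records, category_keywords):
    """Classify crawled text snippets into category buckets by keyword,
    dropping blanks and verbatim duplicates.  `category_keywords` maps a
    category name (e.g. "epidemic") to its trigger keywords; a snippet
    matching several categories lands in every matching bucket."""
    seen = set()
    buckets = {cat: [] for cat in category_keywords}
    for text in records:
        text = text.strip()
        if not text or text in seen:
            continue                      # de-duplicate verbatim repeats
        seen.add(text)
        for cat, words in category_keywords.items():
            if any(w in text for w in words):
                buckets[cat].append(text)
    return buckets
```

Each bucket would then be written to the database or table of its category, as the step above describes.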
Further, after the business data is acquired, the method further comprises:
filling in missing values in the business data;
and converting the nonlinear data in the business data into linear data.
Further, the filling in of missing values in the business data specifically comprises:
respectively calculating the missing-value ratio of the data values of each data item in the business data;
and if the missing-value ratio of a data item is below a preset threshold, selecting a missing-value filling method according to the business attribute of the data item and filling in the item's missing values with that method.
As can be seen from the above description, filling in the missing values ensures that model training fully covers the historical data.
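A minimal sketch of the threshold-gated filling described above follows; the 30% default threshold and the per-item filler values are assumptions, since the patent leaves the filling method to each item's business attribute:

```python
def fill_missing(rows, threshold=0.3, fillers=None):
    """Compute the missing-value ratio of each data item and, where the
    ratio is below `threshold` and a filler is configured for the item,
    replace None values with that filler.  `rows` is a list of dicts."""
    fillers = fillers or {}
    cols = {k for row in rows for k in row}
    for col in cols:
        missing = sum(1 for row in rows if row.get(col) is None)
        ratio = missing / len(rows)
        if ratio < threshold and col in fillers:
            for row in rows:
                if row.get(col) is None:
                    row[col] = fillers[col]
    return rows
```

Items whose missing ratio exceeds the threshold are deliberately left alone, matching the method's rule that only sufficiently complete items are filled.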
Further, the converting of the nonlinear data in the business data into linear data specifically comprises:
acquiring the data values of a nonlinear data item in the business data;
deduplicating the data values, and associating the deduplicated values one-to-one with a series of characters carrying numeric subscripts, the subscripts increasing sequentially;
and replacing each data value in the nonlinear data item with the numeric subscript of its corresponding character.
As can be seen from the above description, this effectively solves the problem that some algorithms cannot handle text attributes.
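The deduplicate-and-index conversion above is essentially label encoding. A sketch (first-seen ordering and subscripts starting from 0 are assumptions; the text only requires sequentially increasing subscripts):

```python
def encode_categorical(values):
    """Map each distinct text value to an increasing integer subscript and
    return the encoded sequence plus the mapping."""
    mapping = {}
    encoded = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)   # subscripts increase sequentially
        encoded.append(mapping[v])
    return encoded, mapping
```

The returned mapping must be kept so that the same value is encoded identically when new business data arrives at prediction time.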
Further, the obtaining of a first sample according to the business data specifically comprises:
if the business data needs to be learned by category, classifying the business data by business category and taking the business data of the category with the larger data volume as the first sample;
and if the business data does not need to be learned by category, taking all the business data as the first sample.
From the above description, constructing the sample by classifying business data that contains multiple data types ensures the accuracy of the constructed machine learning model.
Further, after the fourth sample is trained through a preset machine learning algorithm to obtain a machine learning model, the method further includes:
optimizing the machine learning model.
From the above description, the prediction accuracy of the model can be improved by performing model optimization.
Further, after the fourth sample is trained through a preset machine learning algorithm to obtain the machine learning model, the method further comprises:
applying the machine learning model to the corresponding business scenario.
Further, the applying of the machine learning model to the corresponding business scenario specifically comprises:
acquiring new business data, and using the values of the data items corresponding to the input items of the new business data as the input variables of the machine learning model to obtain the output variable;
and obtaining the value of the data item corresponding to the output item of the new business data from the output variable.
From the above description, by applying the machine learning model, the value of the output item of the business data can be obtained quickly and accurately.
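Applying the trained model to new business data, as described above, reduces to assembling the input-item vector and mapping the model's numeric output back to a business value. A schematic sketch, in which the `predict` interface and the label mapping are illustrative assumptions rather than the patent's API:

```python
def apply_model(model, new_rows, input_items, label_names):
    """Feed the input-item values of new business records to a trained
    model and translate its numeric outputs back to business labels.
    `model` is any object exposing predict(list_of_feature_vectors)."""
    X = [[row[k] for k in input_items] for row in new_rows]
    preds = model.predict(X)
    return [label_names[p] for p in preds]
```

Any scikit-learn-style classifier would fit this interface; for testing, a trivial stub with a `predict` method suffices.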
The invention also proposes a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps as described above.
Example one
Referring to figs. 2-3, a first embodiment of the invention provides a method for constructing a machine learning model that can be applied to machine learning analysis of large-scale sparse data, for example: market supervision analysis along dimensions such as goods, production enterprise, and country of origin; food and drug safety analysis and supervision; pre-delivery testing analysis by production enterprises; laboratory analysis of the test items of commodities awaiting testing and prediction of non-conformity; and e-commerce platform prediction of merchant integrity, basic capability, and similar attributes from merchant behavior, user orders, and user feedback.
This embodiment is described using a commodity sampling model as an example. The model can ultimately be applied to the sampling of import and export commodities, addressing the differences among cargo types, regions, environments, and other factors in port supervision and similar fields; it trains an accurate sampling model that matches current objective conditions and can be adjusted dynamically, achieving accurate real-time batch sampling, improving sampling accuracy, and reducing missed and erroneous inspections.
As shown in fig. 2, the method comprises the following steps:
s1: and acquiring data according to preset keywords to obtain auxiliary data. Specifically, a web crawler focusing technology is used for customizing the crawling strategy of various types of data, the initialized crawling strategy comprises an initial URL (uniform resource locator), a crawling target (keyword), a crawling time range and the like, and auxiliary data required by model training is acquired from the Internet; and then, by adopting a data cleaning means, classifying, cleaning, removing the duplicate and other operations on the crawled data to obtain auxiliary data of different categories, and respectively storing the auxiliary data in databases or data tables of corresponding categories.
In this embodiment, the collection keywords are developed mainly around the data sources. For epidemic data, for example, the main keywords are "epidemic area", "epidemic", "harm", "transmission", "vector", and the like. The auxiliary data that may be collected includes, but is not limited to, the data in Table 1.
Table 1:
[Table 1 appears as an image in the original publication; per the surrounding text it lists categories of auxiliary data such as epidemic information, trade information, commodity price information, and environmental-factor information.]
After being distinguished by type, the information is stored in the epidemic information, trade information, commodity price information, environmental-factor information, and other related databases/tables, respectively.
Furthermore, to ensure that the data acquisition scope is not limited by the initial configuration, the data sources and the crawled keywords can be adjusted through manual intervention. The manual intervention mainly re-targets business-related information by relocating the data sources, website URLs, data keywords, year range, and so on, and supplements missing information by widening the acquisition scope.
S2: Acquire business data, and determine the data items corresponding to the input items and the data item corresponding to the output item in the business data. The business data acquired in this step is historical business data.
In this embodiment, for the business scenario, structured data on cargo declaration and inspection from the port supervision process is extracted as the business data, mainly comprising customs declaration information, cargo declaration information, container declaration information, package declaration information, license information, and inspection item and inspection result information; the inspection results are registered in real time during historical commodity inspection. Furthermore, data correlation analysis can be performed by big-data statistical means, drawing scatter plots, residual plots, and the like to build an understanding of the structured declaration and inspection data.
This embodiment mainly uses machine learning (model training) on historical data to predict and mark unqualified commodities in real time, so as to achieve advance prejudgment and accurate sampling. Therefore, the data item corresponding to the output item is the one used to judge whether a commodity is qualified: the inspection result in the historical service data serves as the output variable of the machine learning model, and the other data items in the historical service data serve as the input items, i.e., the input variables of the machine learning model.
The candidate values of the inspection result include "pass", "fail", "rework rectification pass", and "contract (letter of credit) pass". The latter two values derive from a secondary inspection performed after the first inspection fails; therefore, in the data analysis process, they are also treated as "fail".
S3: and performing data cleaning and data conversion on the service data, namely supplementing missing values in the service data, and converting nonlinear data in the service data into linear data. Further, missing value filling is performed only on the data items corresponding to the input items in the service data.
At present, when port supervision organizations supervise import and export data, the declared data contain a large number of nullable items, and most declared information consists of discontinuous variables. Most machine learning algorithms are not friendly to missing values, and the conventional practice is to discard records with null values during training. To ensure that model training covers the historical data comprehensively, high-dimensional sparse data must be handled without reducing the model's prediction accuracy. Therefore, the missing values in the service data need to be supplemented. In addition, data with nonlinear properties do not meet the requirements of the model algorithms and need to be converted into linear data.
Specifically, for missing value supplementation, the missing value proportion of each data item is calculated (number of empty values / total number of values × 100%). In general, a proportion smaller than 10% is treated as the few-missing-values case, and a proportion greater than 10% as the many-missing-values case. For a data item with few missing values, an attribute-filling method can be adopted, inferring the missing values mainly from business knowledge and experience. Further, data items with special attributes may be processed by special-value filling, mean filling, K-Means, and the like. For example, a data item representing age may be filled by calculation from an associated data item (e.g., one representing identification card information) using special-value filling; for numeric data items, mean filling can be adopted, computing the mean of all existing values of the item and filling with it; for data items that should be assigned according to similar data, a K-Means clustering algorithm can be adopted: a suitable centroid is chosen from experience, a clustering result is obtained by running K-Means, and the largest cluster is selected as the basis for filling the missing values. In practice, after targeted analysis of the business meaning and data type of each field, different methods are adopted to process null data.
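The missing-value proportion and the mean-filling branch above can be sketched in a few lines. The data, field, and 10% threshold below are illustrative, not from the actual declaration schema.

```python
# Compute the missing-value proportion of a data item, then mean-fill it
# when the proportion falls in the "few missing values" case (< 10%).
def missing_ratio(values):
    return sum(v is None for v in values) / len(values)

def mean_fill(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# hypothetical numeric item, 1 of 12 values missing -> ratio ~8.3%
weights = [10.0, 12.0, None, 14.0, 12.0, 10.0, 12.0, 14.0, 12.0, 12.0, 10.0, 12.0]
ratio = missing_ratio(weights)
if ratio < 0.10:                 # few missing values: fill with the item mean
    weights = mean_fill(weights)
print(round(ratio, 3))  # 0.083
```

Items above the threshold would instead get the business-specific treatments named in the text (special-value filling, K-Means assignment, and so on).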
For data conversion, nonlinear data is converted into linear data by constructing a simulation row. Specifically, the data values in a nonlinear data item of the service data are obtained; the data values are then deduplicated, and the deduplicated values are associated one-to-one with characters carrying numerical subscripts, the subscripts increasing sequentially, so that the same data value is always associated with the same subscripted character. Preferably, the subscripts form an arithmetic progression with a common difference of 1. Finally, each data value in the nonlinear data item is replaced with the numerical subscript of its corresponding character.
For example, after the data values in a nonlinear data item a are deduplicated, 5 candidate values {AAisj, B8s99h, C19snx, 78sqs, Z8SU} remain. The set is labeled with subscripted characters: d1 = AAisj, d2 = B8s99h, d3 = C19snx, d4 = 78sqs, d5 = Z8SU, so the candidate values of a can be expressed as {d1, d2, d3, d4, d5}. Substituting the subscripts for the corresponding data generates a mirror simulation row a1 of a, initialized value by value from a; the candidate values of a1 are {1, 2, 3, 4, 5}, all numeric. When performing model calculations, data item a1 is used in place of data item a to participate in model training. That is, every "AAisj" in the original a is replaced with "1", every "B8s99h" with "2", and so on. This data conversion effectively resolves the problem that some algorithms do not recognize text attributes.
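The simulation-row construction above is, in effect, an ordered integer encoding of a categorical column; a minimal sketch, reusing the example values:

```python
# Build the mirror simulation row a1: deduplicate the values of the
# non-linear item a (preserving first-seen order), assign consecutive
# subscripts starting at 1, then replace each value with its subscript.
a = ["AAisj", "B8s99h", "AAisj", "C19snx", "78sqs", "Z8SU", "B8s99h"]

codes = {v: i for i, v in enumerate(dict.fromkeys(a), start=1)}
a1 = [codes[v] for v in a]
print(a1)  # [1, 2, 1, 3, 4, 5, 2]
```

The mapping `codes` must be kept alongside the model so new business data is encoded with the same subscripts at prediction time.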
S4: and performing label marking on the service data of which the value of the data item of the corresponding output item is not null, for example, marking the service data as label.
Since the historical data is used as the data source for model training, the data value of the inspection result needs to be obtained in the training sample. In this step, the service data whose inspection result is not empty is labeled.
S5: acquiring a first sample according to the service data; specifically, if the business data needs to be classified and learned, classifying the business data according to business categories, then selecting a category with a large data volume, and taking the business data in the category as a first sample; and if the business data does not need to be classified and learned, taking the business data as a first sample.
In this embodiment, the business data may be initially classified according to the category of the product, and to ensure the trainable degree of the model, a category with a large data size is preferentially selected, and the business data in the category is labeled as the first sample a.
After the business data are classified, a machine learning model is obtained through training according to the business data in the classification with larger data volume, and the business data in other subsequent classifications can be trained based on the model.
S6: and acquiring the business data marked with the label in the first sample as a second sample.
Referring to the officially published batch extraction proportion of each type of goods, the proportion for most products is between 5% and 30%, which means that the inspection result of most business data in the historical data may be empty. Therefore, the business data in the first sample is classified according to the label: the labeled data forms the second sample Y, and the unlabeled data forms the N sample. In supervised learning, model training mainly uses the second sample Y.
S7: synthesizing feature items through a feature synthesis technology according to corresponding data items in the auxiliary data and the service data, and combining the feature items into the second sample as input items; i.e. the auxiliary data and the data items associated with the service data, are combined into the service data by matching these data items and generating feature items from some of the auxiliary data.
For example, assume the auxiliary data is epidemic situation data, containing the area where an epidemic occurs, the epidemic influence range, the epidemic information, the epidemic risk level, and the epidemic processing mode. The occurrence area and influence range are matched against the country information in the business data (country of origin, transit country, and trade country of the goods), and the epidemic information, risk level, and processing mode in the auxiliary data are merged into the second sample as input items, generating four new feature items: (i) whether from an epidemic area, candidate values {yes, no}; (ii) epidemic name, candidate values {yellow fever, Ebola hemorrhagic fever, dengue fever, cholera, Rift Valley fever, avian influenza, ...}; (iii) risk level, candidate values {high risk, medium risk, low risk}; (iv) processing method, candidate values {spray treatment, fumigation treatment, medicament treatment, radiation, cold treatment, heat treatment, destruction, ...}.
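The matching and merging in S7 can be sketched as a lookup join from the declaration's country fields into the epidemic auxiliary data. The country codes, field names, and records below are hypothetical.

```python
# Derive the four new feature items for one declaration record by matching
# its country fields (origin / transit / trade) against epidemic data.
epidemic_data = {
    "CD": {"name": "Ebola hemorrhagic fever", "risk": "high risk",
           "treatment": "fumigation treatment"},
}

def synthesize_features(record, epidemic_data):
    for country in (record["origin"], record["transit"], record["trade"]):
        if country in epidemic_data:
            e = epidemic_data[country]
            return {"from_epidemic_area": "yes", "epidemic_name": e["name"],
                    "risk_level": e["risk"], "processing_method": e["treatment"]}
    return {"from_epidemic_area": "no", "epidemic_name": None,
            "risk_level": None, "processing_method": None}

record = {"origin": "CD", "transit": "FR", "trade": "CN"}
print(synthesize_features(record, epidemic_data)["risk_level"])  # high risk
```

The four returned keys correspond to feature items (i) through (iv) and would be appended to each row of the second sample.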
S8: and (3) carrying out positive and negative sample equalization processing on the second sample by synthesizing a few types of oversampling technologies, and taking newly synthesized sample data as a third sample.
In this embodiment, a positive sample is service data whose inspection result is qualified, and a negative sample is service data whose inspection result is unqualified. A positive-to-negative training ratio of 3:1 is common in machine learning, but the ratio in the second sample of this embodiment is usually between 5:1 and 10:1, i.e., poorly balanced. If positive and negative samples are extremely unbalanced during training, the predictions overfit the majority class (overfitting meaning that the hypotheses become overly strict in order to remain consistent with the training data), and the model's accuracy index loses reference value. To avoid overfitting of the training result, a sample equalization means is needed to balance the positive and negative samples in the data to be trained.
In this embodiment, SMOTE is adopted to equalize the positive and negative samples: the SMOTE algorithm synthesizes minority-class samples, the newly synthesized sample data is marked as the third sample M, and special marks are added to this data to distinguish it from the second sample Y.
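The core SMOTE operation can be shown in miniature: each synthetic minority sample lies at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is only the interpolation step on toy 2-D points, not the full algorithm (no k-NN search over large data, no categorical handling).

```python
# SMOTE-style interpolation: new = x + gap * (neighbour - x), gap in [0, 1).
import random

def smote_point(x, neighbour, rng):
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]

rng = random.Random(42)
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]]    # illustrative minority class
x = minority[0]
neighbour = min((m for m in minority if m is not x),
                key=lambda m: sum((a - b) ** 2 for a, b in zip(m, x)))
new_sample = smote_point(x, neighbour, rng)
print(new_sample)  # lies on the segment between x and its nearest neighbour
```

In practice a library implementation (e.g. the SMOTE class in imbalanced-learn) would be used instead of hand-rolled code.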
S9: and combining the second sample and the third sample to obtain a fourth sample. At this time, the minority sample in the fourth sample M1 is the sum of the minority sample in the second sample Y and the new summed third sample M, the majority sample in the fourth sample M1 is the majority sample in the second sample Y, and the positive and negative sample ratios in M1 are balanced.
S10: and training the fourth sample through a preset machine learning algorithm to obtain a machine learning model. The method includes the steps that a suitable machine learning algorithm can be selected according to data conditions and business realization targets, the algorithm adopted in the embodiment mainly comprises a random forest, XGboost, LightGBM and the like, and a model algorithm with high business combination degree is selected through repeated simulation training of different algorithms.
The first training adopts a 2:1 sample ratio. During training, multiple algorithms are combined by voting through Bagging and Boosting methods, thereby improving the prediction accuracy of the model.
Further, the fourth sample is split according to a preset ratio (6:3:1) into a training sample data set, a test sample data set, and a validation sample data set. The training set is used to train the model; the test set tracks errors during training to prevent over-training; the validation set evaluates the final model, its error giving a realistic estimate of the model's prediction capability. Splitting into three mutually exclusive sets both reduces overfitting of the training result and allows a relatively realistic training result to be observed through validation.
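The 6:3:1 split into mutually exclusive sets can be sketched as a shuffle followed by two slices; the seed and sample count are illustrative.

```python
# Shuffle the fourth sample, then slice it 60% / 30% / 10% into
# training, test, and validation sets (mutually exclusive by construction).
import random

def split_631(samples, seed=0):
    rows = list(samples)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    a, b = int(n * 0.6), int(n * 0.9)
    return rows[:a], rows[a:b], rows[b:]

train, test, valid = split_631(range(100))
print(len(train), len(test), len(valid))  # 60 30 10
```

For the imbalanced data discussed above, a stratified split (per-class shuffling) would preserve the positive/negative ratio in each subset.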
S11: optimizing the machine learning model; comparing the model prediction results, and searching for the optimal solution to determine the finally selected model.
In this embodiment, the optimization method includes:
firstly, adjusting the proportion of positive and negative samples; in the model tuning process, training the model by respectively adopting positive and negative sample ratios of 3:1, 4:1, 5:1 and the like;
reducing the dimension of the data; through analyzing the feature importance of various goods models, performing data dimension reduction in a targeted manner, and selecting features which influence the prejudgment on the unqualified current goods category;
thirdly, algorithm adjustment; during model training, random forest, XGBoost, LightGBM, and other algorithms are each used to execute training; according to the training results of the different algorithms, the best-performing algorithm is then adopted for model calculation;
fourthly, parameter adjustment; taking the LightGBM algorithm as an example, the parameters to adjust concern training speed, model accuracy, overfitting control, the maximum depth of the tree, the minimum number of records a leaf may hold, the proportion of data used per iteration, regularization, and so on; the parameters are sorted by their influence on overall model performance and then adjusted in that order;
after each model training, comparing training results, and analyzing prediction data with judgment errors by adopting a tree model. And adjusting the tuning mode of the model according to the analysis result.
The method comprises the following specific steps:
The first step is as follows: after the current model is trained, the model appends a column to the training data set marking the prediction result for each training record. By comparing the data values of the prediction result and the inspection result in the historical data, the inaccurately predicted records are found for secondary analysis; this wrongly predicted data set is marked data1.
The second step is as follows: data1 is analyzed with a decision tree (C4.5) model to look for regularities in the data. In the tree model's analysis result, the fields and corresponding values that affect the judgment can be found by analyzing the branch nodes and branches of the tree. Taking fig. 3 as an example, since the prejudgment result of the first branch of the TRADE_COUNTRY_CODE field points 100% to "pass", the proportion within the whole training sample of the data whose TRADE_COUNTRY_CODE is among 554, 158, 392, 410, 152, 792, 056, 784, 484, 642, 268, 882, 591 must be analyzed, to confirm whether the model is over-trained because the proportion of positive samples pointed to by these values is too large.
The third step: according to the analysis result, if the proportion of data whose TRADE_COUNTRY_CODE is among 554, 158, 392, 410, 152, 792, 056, 784, 484, 642, 268, 882, 591 exceeds 1/3 of the positive samples, the value proportion of the TRADE_COUNTRY_CODE field in the training sample is adjusted, and the next training is performed.
The fourth step: if obvious abnormity cannot be identified through the decision tree model and the data distribution proportion of each node in the output tree structure is normal, the model needs to be adjusted through methods of adjusting the positive and negative ratios, parameters, algorithms and the like of the training samples.
S12: and applying the machine learning model to a corresponding business scene. Specifically, new business data is obtained, then a data value of a data item of an input item corresponding to the new business data is used as an input variable of the machine learning model to obtain an output variable, and finally a data value of a data item of an output item corresponding to the new business data is obtained according to the output variable.
The model is applied to the port supervision link of foreign trade commodities, the accuracy of sampling hit (prediction of unqualified condition) can be greatly improved, the integral customs clearance efficiency of the port is improved, and the detention time and the cost loss of foreign trade enterprises on the port are reduced.
Further, when the model is deployed, automatic packing and periodic release of the model can be realized in a parameter configuration mode, and the model can be applied to a service scene in time to realize dynamic adjustment of the model.
Further, in the early stage of operation, the prejudgment result of the machine learning model may differ from actual operation results because of the quality, quantity, and other limitations of the training samples. Therefore, a parallel transition mode is adopted during practical application to reduce the risk of the model's initial running stage; the parallel transition time is marked as a verification period D. The sampling scheme used during the verification period is as follows:
firstly, performing data training on a sample data set N (N samples in step S6) without a label, namely performing data training on the N samples by using a dynamic sampling machine learning model, calculating the sampling proportion prejudgment of the model on the current goods category, and marking the model prejudgment sampling proportion as r;
then, the model's prejudged sampling rate is compared with the officially set sampling rate (marked f), and the higher of the two is selected as the batch extraction rate used during the verification period D;
if r < f, in the actual sampling operation the sampling result is first predicted by the model, and for batches not hit by the model's prediction, random sampling is additionally performed at the proportion f − r; if r ≥ f, the sampling result is predicted by the model alone.
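The verification-period rule above can be sketched as follows; the batch identifiers, rates, and counts are illustrative only.

```python
# Verification-period batch extraction: if the model's predicted sampling
# proportion r already meets the official proportion f, use the model's
# flagged batches; otherwise top up with random draws at proportion f - r.
import random

def verification_sampling(flagged, unflagged, r, f, rng):
    if r >= f:
        return list(flagged)
    total = len(flagged) + len(unflagged)
    n_extra = round((f - r) * total)          # shortfall drawn at random
    return list(flagged) + rng.sample(list(unflagged), n_extra)

rng = random.Random(1)
flagged = ["b1", "b2"]                        # model-predicted inspections, r = 0.10
unflagged = [f"b{i}" for i in range(3, 21)]   # 18 remaining batches
picked = verification_sampling(flagged, unflagged, r=0.10, f=0.20, rng=rng)
print(len(picked))  # 4
```

With r = 0.10 and f = 0.20 over 20 batches, the two model-flagged batches are supplemented by two random draws, matching the official 20% rate.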
Furthermore, the sampling result and the registration result of the new data sample are input into the dynamic sampling machine learning model in real time, so that the model can realize online learning and quickly respond to business changes. The dynamic adjustment of the model is realized through online learning, and manual intervention is not needed in the training process. When the quality of the commodity has periodic change or emergencies, the model has short adjustment period and strong adaptability.
Further, after step S11, the machine learning model may be evaluated, mainly for the superiority, inferiority, performance and extensibility of the model.
Specifically, the superiority of the model is evaluated by analyzing indexes such as training accuracy and recall. For the accuracy index, the sum of the number of positive samples predicted positive and the number of negative samples predicted negative is divided by the total number of samples, giving the model's accuracy; the closer the result is to 1, the better the model. For the recall index, the number of actual positive samples that the model predicts as positive is divided by the total number of actual positive samples in the original data, giving the model's recall; the closer the result is to 1, the better the model.
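Written out from standard confusion-matrix counts, the two indexes are:

```python
# accuracy = (TP + TN) / total; recall = TP / (TP + FN)
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    return tp / (tp + fn)

# illustrative counts: 70 qualified predicted qualified, 20 unqualified
# predicted unqualified, 6 false alarms, 4 misses
print(accuracy(70, 20, 6, 4))  # 0.9
print(round(recall(70, 4), 4))  # 0.9459
```

For the imbalanced data in this scenario, recall on the unqualified (minority) class is the more informative of the two, since high accuracy can be achieved by always predicting "pass".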
Whether the model meets the hardware performance requirements is evaluated by monitoring the overall running duration of the model and the consumption of hardware resources. The performance result is considered comprehensively from the actual sample data volume, the hardware configuration, and the business-level real-time requirement on prediction results. The running duration is affected by large sample volumes; when the duration does not meet the standard, capacity is expanded on the hardware side to optimize performance. Generally, on the premise of satisfying business needs, the higher the CPU utilization efficiency, the faster the relative running speed.
Extensibility is evaluated mainly on data volume, dynamic parameter adjustment, and the like. The model should maintain its existing operational capability while supporting a data increase of at least 5 times the current business volume, and support dynamic addition of fields such as feature parameters, giving the model high extensibility for adapting to business change.
Further, the machine learning model may also be visualized. Through visualization, the readability of the business model becomes very strong, the logic is clearer, the conclusion is clearer, and business personnel can understand the operation process of the model and the result of data analysis, so that business analysis and decision making are supported.
In this embodiment, the risk factors affecting whether the current goods are unqualified, and their weights, can be calculated from the feature importance analysis output by the model, and the threshold intervals of the various risk factors are divided with data analysis means. The analysis results provide significant reference value for subsequent risk-level supervision. For example, among the risk factors for commodity G, the country of origin carries a large weight: if the unqualified proportion of G-type commodities produced by country X reaches 80%, the combination of country X and goods G can be listed as a key observation object in risk-classification supervision, and a certain sampling proportion can even be added on top of the model's prejudgment during sampling to better control the risk. High-risk feature combinations (e.g., combinations of country of origin, commodity category, season, and transportation route) can be predicted in advance through weight calculation and risk classification of the risk factors, and these combinations can then be given differentiated processing in practical application.
The embodiment effectively utilizes the accumulated import and export commodity declaration and inspection data, combines environmental factors, epidemic situation information and the like, and utilizes technologies such as big data analysis, machine learning and the like to mine factors related to risk generation, so that the model can realize intelligent dynamic adjustment in an online learning mode, and accurate and dynamic sampling decision can be realized for each batch of declared commodities in a targeted manner. The problem that the batch drawing rule is too limited and the rule cannot be dynamically adjusted is solved.
Through the use of the crawler technology, the emergent epidemic disease early warning information around the world can be accurately acquired in real time, and the information is subjected to big data analysis and machine learning processing and is used as a commodity batch extraction model, so that the quick response handling capacity of the emergent epidemic disease early warning event can be improved. The problems that rule distribution is lagged, risks cannot be captured in time and the like caused by manual epidemic situation maintenance rules in the prior art are solved.
By loading the machine learning model, the accuracy of unqualified-goods prediction is improved, effectively solving the problems of large workload and inaccurate control caused by random batch extraction. Analysis of the model's prediction results shows that the model's effective rate in predicting commodity risk exceeds 85%, a marked improvement in control accuracy compared with the 2%-30% control efficiency of traditional random sampling.
Example two
This embodiment is a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring data according to preset keywords to obtain auxiliary data;
acquiring service data, and determining a data item corresponding to an input item and a data item corresponding to an output item in the service data;
labeling the service data of which the value of the data item corresponding to the output item is not null;
acquiring a first sample according to the service data;
acquiring service data marked by a label in the first sample as a second sample;
synthesizing feature items through a feature synthesis technology according to corresponding data items in the auxiliary data and the service data, and combining the feature items into the second sample as input items;
performing positive and negative sample equalization processing on the second sample by synthesizing a few types of oversampling technologies, and taking newly synthesized sample data as a third sample;
combining the second sample and the third sample to obtain a fourth sample;
and training the fourth sample through a preset machine learning algorithm to obtain a machine learning model.
Further, the acquiring data according to the preset keyword specifically includes:
crawling data through a focused web crawler technology according to a preset crawling strategy, wherein the crawling strategy comprises a preset URL (uniform resource locator), keywords and a crawling time range;
and classifying, cleaning and removing the weight of the crawled data by a data cleaning technology to obtain auxiliary data of preset categories, and storing the auxiliary data in databases or data tables of corresponding categories respectively.
Further, after the acquiring the service data, the method further includes:
supplementing missing values in the service data;
and converting the nonlinear data in the service data into linear data.
Further, the supplementing missing values in the service data specifically include:
respectively calculating missing value proportion of data values of all data items in the service data;
and if the missing value proportion of a data item is smaller than a preset threshold value, selecting a corresponding missing value filling method according to the service attribute of the data item, and supplementing the missing value of the data item according to the missing value filling method.
Further, the converting the non-linear data in the service data into linear data specifically includes:
acquiring a data value in a nonlinear data item in the service data;
removing the duplication of the data values, and associating the data values after the duplication removal with a plurality of characters in a one-to-one correspondence manner, wherein the characters are provided with digital subscripts, and the digital subscripts of the characters are sequentially increased in an increasing manner;
and respectively replacing each data value in the nonlinear data item with the numerical subscript of the character corresponding to the data value.
Further, the obtaining a first sample according to the service data specifically includes:
if the business data need to be classified and learned, classifying the business data according to business categories, and acquiring the business data in the classification with larger data volume as a first sample;
and if the business data does not need to be classified and learned, taking the business data as a first sample.
Further, after the fourth sample is trained through a preset machine learning algorithm to obtain a machine learning model, the method further includes:
optimizing the machine learning model.
Further, after the fourth sample is trained through a preset machine learning algorithm to obtain a machine learning model, the method further includes:
and applying the machine learning model to a corresponding business scene.
Further, the applying the machine learning model to the corresponding service scenario specifically includes:
acquiring new business data, and taking a data value of a data item of an input item corresponding to the new business data as an input variable of the machine learning model to obtain an output variable;
and obtaining the data value of the data item of the output item corresponding to the new service data according to the output variable.
In summary, in the machine learning model construction method and computer-readable storage medium provided by the present invention: collecting auxiliary data, matching it with the associated input items in the service data, and generating feature items by feature synthesis to merge into the second sample increases the number of input items and improves the prediction accuracy of the constructed machine learning model; performing positive and negative sample equalization on the samples avoids the overfitting problem and improves the model's accuracy; supplementing missing values ensures comprehensive coverage of historical data by model training; converting nonlinear data in the service data into linear data effectively solves the problem that some algorithms do not recognize text attributes; and optimizing the model further improves its prediction accuracy. When the constructed machine learning model is applied in the port supervision process for imported goods, it can solve the blind sampling and high missed-inspection rate brought by the random sampling method in the prior art, and can greatly improve the effective sampling rate for import and export commodities, improving port supervision efficiency and overall port clearance efficiency while reducing the economic burden that storage and logistics transportation costs place on foreign trade enterprises.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims (9)

1. A method for constructing a machine learning model based on sparse data is characterized by comprising the following steps:
crawling data using a focused web crawler according to a preset crawling strategy, wherein the crawling strategy comprises preset URLs (uniform resource locators), keywords and a crawling time range;
classifying, cleaning and deduplicating the crawled data through data cleaning techniques to obtain auxiliary data of preset categories, and storing the auxiliary data in databases or data tables of the corresponding categories, wherein the auxiliary data comprises epidemic information of epidemic areas, trade information, commodity price information and environmental factor information;
acquiring business data, and determining the data items corresponding to the input items and the data item corresponding to the output item in the business data, wherein the business data is structured data for goods declaration and inspection, comprising declaration information of a customs declaration form, goods declaration information, container declaration information, packaging declaration information, license information, inspection items and inspection results, and the candidate values of the inspection result comprise qualified and unqualified;
performing data cleaning and data conversion on the business data;
acquiring a first sample according to the business data;
taking, as a second sample, the business data in the first sample for which the value of the data item corresponding to the output item is not null;
matching the auxiliary data against the business data to obtain the associated data items, generating feature items according to the associated data items in the auxiliary data, and merging the feature items into the second sample as input items;
performing positive and negative sample equalization processing on the second sample to obtain a fourth sample, wherein a positive sample is business data whose inspection result is qualified and a negative sample is business data whose inspection result is unqualified;
training the fourth sample through a preset machine learning algorithm to obtain a machine learning model, wherein the machine learning algorithm comprises a random forest algorithm, an XGBoost algorithm and a LightGBM algorithm;
and applying the machine learning model to a corresponding business scenario, wherein the business scenario is port supervision of foreign trade commodities.
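The feature-merging step of claim 1 (matching auxiliary data to associated data items in the business data and merging the resulting feature items into the second sample) can be sketched in pure Python as follows. The field names (`origin_country`, `epidemic_level`) are illustrative assumptions, not taken from the patent:

```python
# Minimal sketch of merging auxiliary feature items into the second sample.
# Field names (origin_country, epidemic_level) are illustrative assumptions.

def merge_features(second_sample, auxiliary_data, key="origin_country"):
    """Join auxiliary feature items into each business record on a shared key."""
    # Index the auxiliary data by the associated data item for O(1) lookup.
    aux_index = {row[key]: row for row in auxiliary_data}
    merged = []
    for record in second_sample:
        enriched = dict(record)  # copy so the original sample is unchanged
        aux = aux_index.get(record.get(key))
        if aux is not None:
            # Every auxiliary field except the join key becomes a new input item.
            for field, value in aux.items():
                if field != key:
                    enriched[field] = value
        merged.append(enriched)
    return merged

business = [{"goods": "mask", "origin_country": "X", "result": "qualified"}]
auxiliary = [{"origin_country": "X", "epidemic_level": 3}]
print(merge_features(business, auxiliary))
```

Each matched auxiliary field becomes an additional input item, which is how the patent increases the input-item count for sparse business data.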
2. The sparse data-based machine learning model construction method according to claim 1, wherein the performing data cleaning and data conversion on the business data specifically comprises:
supplementing missing values in the business data;
and converting the nonlinear data in the business data into linear data.
3. The sparse data-based machine learning model construction method according to claim 2, wherein the supplementing missing values in the business data specifically comprises:
respectively calculating the missing-value proportion of the data values of each data item in the business data;
and if the missing-value proportion of a data item is smaller than a preset threshold, selecting a corresponding missing-value filling method according to the business attribute of the data item, and supplementing the missing values of the data item according to the selected filling method.
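The two steps of claim 3 can be sketched as follows: compute the missing-value proportion per data item, and fill only items whose proportion falls below the preset threshold. The fill value and threshold here are illustrative assumptions; the patent leaves the per-attribute filling method open:

```python
# Sketch of claim 3: per-item missing-value proportion, then threshold-gated fill.
# The threshold (0.3) and the default fill value are illustrative assumptions.

def missing_ratio(records, item):
    """Fraction of records whose value for `item` is missing (None)."""
    values = [r.get(item) for r in records]
    return sum(v is None for v in values) / len(values)

def fill_missing(records, item, threshold=0.3, default="unknown"):
    if missing_ratio(records, item) >= threshold:
        return records  # too sparse: leave the data item untouched
    filled = []
    for r in records:
        r = dict(r)
        if r.get(item) is None:
            r[item] = default  # stand-in for a per-attribute fill method
        filled.append(r)
    return filled

rows = [{"port": "A"}, {"port": None}, {"port": "B"}, {"port": "A"}]
print(missing_ratio(rows, "port"))  # 0.25
print(fill_missing(rows, "port", threshold=0.3)[1]["port"])  # unknown
```

In practice the `default` would be replaced by a mode, mean, or business-rule value chosen per data item, as the claim specifies.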
4. The sparse data-based machine learning model construction method according to claim 2, wherein the converting the nonlinear data in the business data into linear data specifically comprises:
acquiring the data values in a nonlinear data item of the business data;
deduplicating the data values, and associating the deduplicated data values with a plurality of characters in one-to-one correspondence, wherein each character carries a numeric subscript and the numeric subscripts increase sequentially;
and replacing each data value in the nonlinear data item with the numeric subscript of the character corresponding to that data value.
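The conversion in claim 4 amounts to a label encoding: each distinct text value is mapped to a sequentially increasing integer subscript, which is then substituted back into the data item. A minimal sketch, with an illustrative `country` field:

```python
# Sketch of claim 4: map each distinct value of a nonlinear (text) data item
# to a sequentially increasing integer subscript, then replace the values.

def encode_nonlinear(records, item):
    # Deduplicate in first-seen order and assign increasing subscripts.
    mapping = {}
    for r in records:
        v = r[item]
        if v not in mapping:
            mapping[v] = len(mapping)  # subscripts 0, 1, 2, ...
    # Replace each value with its subscript, leaving the input untouched.
    encoded = [dict(r, **{item: mapping[r[item]]}) for r in records]
    return encoded, mapping

rows = [{"country": "X"}, {"country": "Y"}, {"country": "X"}]
enc, mapping = encode_nonlinear(rows, "country")
print([r["country"] for r in enc])  # [0, 1, 0]
```

After this substitution the data item is numeric, so algorithms that cannot consume text attributes can train on it directly.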
5. The sparse data-based machine learning model construction method according to claim 1, wherein the acquiring a first sample according to the business data specifically comprises:
if classified learning is required for the business data, classifying the business data by business category, and taking the business data in the category with the larger data volume as the first sample;
and if classified learning is not required, taking the business data as the first sample.
6. The sparse data-based machine learning model construction method according to claim 1, wherein the performing positive and negative sample equalization processing on the second sample to obtain a fourth sample specifically comprises:
performing positive and negative sample equalization processing on the second sample through the synthetic minority oversampling technique, and taking the newly synthesized sample data as a third sample, wherein the synthetic minority oversampling technique is the SMOTE algorithm;
and merging the second sample and the third sample to obtain the fourth sample.
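The SMOTE step of claim 6 synthesizes new minority-class points by interpolating between a minority sample and one of its minority-class neighbors. A minimal pure-Python sketch (production use would rely on a library implementation such as imbalanced-learn; the nearest-neighbor choice here is simplified to one neighbor):

```python
import random

# Minimal SMOTE-style sketch of claim 6: synthesize minority-class samples by
# interpolating between a minority point and its nearest minority neighbor,
# then merge the synthetic "third sample" back in to form the fourth sample.

def nearest(point, others):
    """Nearest neighbor by squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

def smote(minority, n_new, rng=random.Random(0)):
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        nb = nearest(x, [m for m in minority if m != x])
        gap = rng.random()
        # The new point lies on the segment between x and its neighbor.
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
third_sample = smote(minority, n_new=2)
fourth_sample = minority + third_sample  # merged, as in claim 6
print(len(fourth_sample))  # 5
```

Because the synthetic points are interpolations rather than duplicates, the balanced fourth sample reduces the overfitting risk that simple oversampling would introduce.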
7. The sparse data-based machine learning model construction method according to claim 1, wherein after the training the fourth sample through a preset machine learning algorithm to obtain a machine learning model, the method further comprises:
optimizing the machine learning model.
8. The sparse data-based machine learning model construction method according to claim 1, wherein the applying the machine learning model to a corresponding business scenario specifically comprises:
acquiring new business data, and taking the data values of the data items corresponding to the input items in the new business data as input variables of the machine learning model to obtain an output variable;
and obtaining, according to the output variable, the data value of the data item corresponding to the output item of the new business data.
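The inference step of claim 8 can be sketched as follows: extract the input-item values from a new business record, feed them to the trained model, and map the numeric output variable back to the output data item's candidate values. The `DummyModel` and its decision rule are placeholders for whatever classifier claim 1 produced:

```python
# Sketch of claim 8: feed the input items of a new business record into the
# trained model and map the output variable back to an inspection result.
# DummyModel and its rule are illustrative stand-ins for the trained model.

def apply_model(model, new_record, input_items):
    x = [new_record[item] for item in input_items]
    y = model.predict(x)
    # Map the numeric output variable back to the output data item's value.
    return "qualified" if y == 1 else "unqualified"

class DummyModel:
    def predict(self, x):
        return 1 if sum(x) < 5 else 0  # toy decision rule

print(apply_model(DummyModel(), {"weight": 2, "price": 1},
                  ["weight", "price"]))  # qualified
```

The predicted inspection result can then drive the sampling decision in the port-supervision scenario the patent targets.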
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN202210489017.2A 2019-09-10 2019-09-10 Sparse data-based machine learning model construction method and storage medium Pending CN114841269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210489017.2A CN114841269A (en) 2019-09-10 2019-09-10 Sparse data-based machine learning model construction method and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210489017.2A CN114841269A (en) 2019-09-10 2019-09-10 Sparse data-based machine learning model construction method and storage medium
CN201910850536.5A CN110569904B (en) 2019-09-10 2019-09-10 Method for constructing machine learning model and computer-readable storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910850536.5A Division CN110569904B (en) 2019-09-10 2019-09-10 Method for constructing machine learning model and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114841269A true CN114841269A (en) 2022-08-02

Family

ID=68778873

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201910850536.5A Active CN110569904B (en) 2019-09-10 2019-09-10 Method for constructing machine learning model and computer-readable storage medium
CN202210489019.1A Pending CN114841270A (en) 2019-09-10 2019-09-10 Method for constructing commodity sampling model and computer readable storage medium
CN202210489017.2A Pending CN114841269A (en) 2019-09-10 2019-09-10 Sparse data-based machine learning model construction method and storage medium

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN201910850536.5A Active CN110569904B (en) 2019-09-10 2019-09-10 Method for constructing machine learning model and computer-readable storage medium
CN202210489019.1A Pending CN114841270A (en) 2019-09-10 2019-09-10 Method for constructing commodity sampling model and computer readable storage medium

Country Status (1)

Country Link
CN (3) CN110569904B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242364A (en) * 2020-01-07 2020-06-05 上海钧正网络科技有限公司 Neural network-based vehicle fault and comfort prediction method, device, terminal and medium
CN111506575B (en) * 2020-03-26 2023-10-24 第四范式(北京)技术有限公司 Training method, device and system for network point traffic prediction model
CN113762688A (en) * 2021-01-06 2021-12-07 北京沃东天骏信息技术有限公司 Business analysis system, method and storage medium
CN112905581A (en) * 2021-03-22 2021-06-04 杭州联众医疗科技股份有限公司 Machine learning data storage system
CN113343674B (en) * 2021-07-09 2022-04-01 北京海泰方圆科技股份有限公司 Method, device, equipment and medium for generating text error correction model training corpus
CN116797053B (en) * 2023-08-25 2023-11-10 深圳普菲特信息科技股份有限公司 Chemical production data analysis method, system and medium based on neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843848A (en) * 2016-03-14 2016-08-10 上海沪景信息科技有限公司 Business monitoring data processing method, system, database system and electronic equipment
CN108520335A (en) * 2018-03-20 2018-09-11 顺丰科技有限公司 Inspect object prediction method, apparatus, equipment and its storage medium by random samples
CN108847285B (en) * 2018-05-09 2021-05-28 吉林大学 Down syndrome screening method for pre-pregnancy and mid-pregnancy based on machine learning
CN108491911B (en) * 2018-05-24 2024-05-14 中华人民共和国上海出入境检验检疫局 Integrated movable workbench based on inspection and quarantine big data
CN109299434B (en) * 2018-09-04 2019-06-14 重庆公共运输职业学院 Cargo customs clearance big data is intelligently graded and sampling observation rate computing system
CN109241669A (en) * 2018-10-08 2019-01-18 成都四方伟业软件股份有限公司 A kind of method for automatic modeling, device and its storage medium
CN109657931A (en) * 2018-11-29 2019-04-19 平安科技(深圳)有限公司 Air control model modeling, business risk appraisal procedure, device and storage medium
CN109871862A (en) * 2018-12-28 2019-06-11 北京航天测控技术有限公司 A kind of failure prediction method based on synthesis minority class over-sampling and deep learning

Also Published As

Publication number Publication date
CN110569904B (en) 2022-05-17
CN114841270A (en) 2022-08-02
CN110569904A (en) 2019-12-13

Similar Documents

Publication Publication Date Title
CN110569904B (en) Method for constructing machine learning model and computer-readable storage medium
Kong et al. Deep‐stacking network approach by multisource data mining for hazardous risk identification in IoT‐based intelligent food management systems
CN107239891B (en) Bidding auditing method based on big data
CN109002492B (en) Performance point prediction method based on LightGBM
Hidasi‐Neto et al. Global and local evolutionary and ecological distinctiveness of terrestrial mammals: identifying priorities across scales
Rahmaty et al. Customer churn modeling via the grey wolf optimizer and ensemble neural networks
CN108470022A (en) A kind of intelligent work order quality detecting method based on operation management
Sanabila et al. Ensemble learning on large scale financial imbalanced data
CN114638498A (en) ESG evaluation method, ESG evaluation system, electronic equipment and storage equipment
Tabassum et al. Detecting online recruitment fraud using machine learning
Montobbio et al. Labour-saving automation and occupational exposure: a text-similarity measure
Narenthiran et al. An analysis of various factors influencing postharvest losses of the fruit and vegetable supply chain
Rodríguez et al. Analysis of machine learning integration into supply chain management
CN111159328A (en) Information knowledge fusion system and method
CN112506930B (en) Data insight system based on machine learning technology
CN114861655A (en) Data mining processing method, system and storage medium
Windarti et al. Prediction analysis student graduate using multilayer perceptron
Raj et al. Automated Cyberstalking Classification using Social Media
Ahmed et al. Comparative performance of tree based machine learning classifiers in product backorder prediction
Duman Implementation of XGBoost Method for Healthcare Fraud Detection
Gupta et al. Comparative analysis of machine learning models for fake news classification
CN111967937A (en) E-commerce recommendation system based on time series analysis and implementation method
Vollset et al. Making use of external company data to improve the classification of bank transactions
Kuyucuk et al. Using multi-label classification methods to analyze complaints against cargo services during the COVID-19 outbreak: Comparing survey-based and word-based labeling
Pushparani et al. Big data analytics using weight estimation algorithm for oil palm plantation domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination