CN112116184A - Factory risk estimation using historical inspection data - Google Patents

Factory risk estimation using historical inspection data

Info

Publication number
CN112116184A
CN112116184A
Authority
CN
China
Prior art keywords
data
plant
risk
features
risk score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910771217.5A
Other languages
Chinese (zh)
Inventor
B·T·阮
V·C·T·阮
Current Assignee
Inspectorio Co ltd
Original Assignee
Inspectorio Co ltd
Priority date
Filing date
Publication date
Application filed by Inspectorio Co ltd filed Critical Inspectorio Co ltd
Publication of CN112116184A publication Critical patent/CN112116184A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Abstract

Plant risk estimation using historical inspection data is provided. In various embodiments, data for a plant is received, wherein the data includes historical inspection data for the plant. A plurality of features is extracted from the data. The features are provided to a trained classifier. A risk score corresponding to a probability that the plant will fail to meet a predetermined performance metric is obtained from the trained classifier.

Description

Factory risk estimation using historical inspection data
Technical Field
Embodiments of the present disclosure relate to plant risk estimation, and more particularly, to plant risk estimation using historical inspection data.
Disclosure of Invention
According to embodiments of the present disclosure, a method of plant risk estimation and a computer program product for plant risk estimation are provided. In various embodiments, data for a plant is received, wherein the data includes historical inspection data for the plant. A plurality of features is extracted from the data. The features are provided to a trained classifier. A risk score corresponding to a probability that the plant will fail to meet a predetermined performance metric is obtained from the trained classifier.
In various embodiments, the data is preprocessed. In various embodiments, the pre-processed data comprises aggregated data. In various embodiments, preprocessing the data further includes filtering the data prior to aggregation.
In various embodiments, the data also includes a performance history of the plant. In various embodiments, the data also includes geographic information of the plant. In various embodiments, the data also includes a ground-truth risk score. In various embodiments, the data also includes product data for the plant. In various embodiments, the data spans a predetermined time window.
In various embodiments, providing the plurality of features to the trained classifier includes sending the plurality of features to a remote risk prediction server, and obtaining the risk score from the trained classifier includes receiving the risk score from the risk prediction server.
In various embodiments, extracting the plurality of features includes removing features having a low correlation with the target variable. In various embodiments, extracting the plurality of features includes applying a dimensionality reduction algorithm. In various embodiments, extracting the plurality of features from the data includes applying an artificial neural network. In various embodiments, applying the artificial neural network includes receiving a first feature vector as an input, and outputting a second feature vector having a lower dimensionality than the first feature vector.
In various embodiments, a risk score is provided to the user. In various embodiments, providing the risk score to the user includes sending the risk score to a mobile or web application. In various embodiments, the sending is performed via a wide area network.
In various embodiments, the trained classifier includes an artificial neural network. In various embodiments, the trained classifier includes a support vector machine. In various embodiments, obtaining the risk score from the trained classifier includes applying a gradient boosting algorithm.
In various embodiments, the risk score is related to the probability by a linear mapping. In various embodiments, obtaining the risk score includes applying a scorecard model.
In various embodiments, the performance of the trained classifier is measured by comparing the risk score to the ground-truth risk score, and the parameters of the trained classifier are optimized according to that performance. In various embodiments, optimizing the parameters of the trained classifier includes modifying a hyper-parameter of the trained machine learning model. In various embodiments, optimizing the parameters of the trained classifier includes replacing a first machine learning algorithm with a second machine learning algorithm that includes a hyper-parameter configured to improve the performance of the trained classifier.
Drawings
FIG. 1 is a schematic diagram of an example system for plant risk estimation, according to an embodiment of the present disclosure.
FIG. 2 illustrates a process for plant risk estimation according to an embodiment of the present disclosure.
FIG. 3 illustrates a process for training a plant risk estimation system according to an embodiment of the present disclosure.
FIG. 4 illustrates a process for updating a plant risk estimation system according to an embodiment of the present disclosure.
FIG. 5 illustrates a process for training a plant risk estimation system according to an embodiment of the present disclosure.
FIG. 6 illustrates a process for training a plant risk estimation system according to an embodiment of the present disclosure.
FIG. 7 illustrates a process for training a plant risk estimation system according to an embodiment of the present disclosure.
FIG. 8 depicts a compute node according to an embodiment of the present disclosure.
Detailed Description
Factory risk assessment is an important step in evaluating potential manufacturing partners. It generally involves the manual application of statistical methods on an annual basis, an approach that is expensive and time consuming and does not provide timely insight into plant risk.
To address these and other shortcomings of alternative approaches, the present disclosure provides a framework for estimating a risk of failure of a plant using historical inspection data.
As used herein, the term risk refers to the probability that a plant will not meet predetermined quality and quantity metrics. In other words, risk refers to the chance that a manufacturing partner will fail to meet overall performance goals. This risk is an important aspect of assessing potential and current manufacturing partners, and is also an important criterion in deciding what level of supervision must be provided for a particular partner. For example, manufacturing partners identified as high risk may require additional inspections or other enhanced quality control measures. Various factors contribute to the overall risk, for example, the chance of injury, of equipment failure, of adverse weather events, or of adverse labor conditions. Potential instability events, such as a management change, a worker strike, or a previous bankruptcy, may also contribute to classifying a plant as high risk, as may certain operating conditions and workplace trends, such as a long-term lack of inspection, a lack of production planning, a lack of leadership commitment to quality, quality instability, a lack of authority granted to the Quality Assurance (QA) team, and a lack of automation in the manufacturing process.
It should be noted that a manufacturing partner's risk may change and improve if it begins to perform well in inspections, e.g., showing a low failure rate and consistent quality, or if it improves the overall quality of its machinery, management, or working environment.
In an embodiment of the present disclosure, plant risk estimation is performed by obtaining data related to a plant, extracting a plurality of features from the data, providing the features to a trained classifier, and obtaining a risk score from the trained classifier that indicates a probability that the plant will fail to meet a predetermined performance metric. In some embodiments, feature vectors are generated and input into a trained classifier, which in some embodiments comprises a machine learning model.
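The pipeline described above (receive data, extract features, classify, obtain a risk score) can be sketched roughly as follows. This is a minimal illustration on synthetic data using a random forest, one of the classifiers the disclosure contemplates; the three feature columns and the labeling rule are assumptions, not the actual model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative only: synthetic historical inspection features per factory.
# Assumed columns: past failure rate, inspection count (scaled), mean defect rate.
rng = np.random.default_rng(0)
X_train = rng.random((200, 3))
# Label 1 = factory failed to meet the performance metric, 0 = met it.
y_train = (X_train[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

def risk_score(features):
    """Return the probability that a factory misses the performance metric."""
    return clf.predict_proba(np.asarray(features).reshape(1, -1))[0, 1]

score = risk_score([0.9, 0.2, 0.4])  # a factory with a high past failure rate
```

A factory with a high past failure rate should receive a higher score than one with a low past failure rate, which is the basic behavior a trained classifier is expected to exhibit here.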
In embodiments of the present disclosure, data in various formats may be obtained. Data may be structured or unstructured and may include information stored in multiple media. The data may be entered manually into the computer or may be obtained automatically from a file by the computer. It will be appreciated that a variety of methods are known for obtaining data via a computer, including but not limited to parsing written documents or text files using optical character recognition, text parsing techniques (e.g., looking up key/value pairs using regular expressions), and/or natural language processing, crawling web pages, and obtaining values for various measurements from databases (e.g., relational databases), XML files, CSV files, or JSON objects.
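As a small illustration of one of the parsing techniques above, key/value pairs can be pulled from a plain-text report with a regular expression. The report fields shown are hypothetical.

```python
import re

# A hypothetical fragment of a plain-text inspection report.
report = """
factory_id: F-1023
inspection_date: 2019-03-14
defect_rate: 0.042
result: FAIL
"""

# Extract key/value pairs line by line with a regular expression.
pairs = dict(re.findall(r"^(\w+):\s*(\S+)$", report, flags=re.MULTILINE))
defect_rate = float(pairs["defect_rate"])
```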
In some embodiments, factory or inspection data may be obtained directly from an inspection management system or other system that includes a database. In some embodiments, the inspection management system is configured to store information related to the plant and/or the inspection. The inspection management system may collect and store various types of information relating to the plant and the inspections, such as information relating to purchase orders, inspection bookings, assignments, reports, corrective and preventive actions (CAPA), inspection results, and other data obtained during an inspection. It will be appreciated that a large amount of data may be available, and in some embodiments, only a subset of the available data is used as input to the predictive model. The subset of data may contain a sufficient number of attributes to successfully predict plant risk.
As used herein, an inspection booking refers to a request for a future inspection on a proposed date. The inspection booking may be initiated by a vendor, brand, or retailer and may contain information on purchase orders corresponding to the future inspection. As used herein, an assignment refers to an inspection booking that has been confirmed. The assignment may include confirmation of the proposed date of the inspection booking, as well as identification of the assigned inspector and information related to the booking.
Data may be obtained via a data pipeline that collects data from various sources of plant and inspection data. The data pipeline may be implemented via an Application Programming Interface (API) with permissions to access and obtain desired data and compute various characteristics of the data. The API may be internally facing, e.g. it may provide access to an internal database containing plant or inspection data, or the API may be externally facing, e.g. it may provide access to plant or inspection data from an external brand, retailer or plant. In some embodiments, the data is provided by an entity wishing to obtain a prediction from a predictive model. The provided data may be input into the model in order to obtain prediction results, and may also be stored to train and test various predictive models.
Plant and inspection data may also be aggregated, and statistical analysis may be performed on the data. According to embodiments of the present disclosure, data may be aggregated and analyzed in various ways, including, but not limited to, summing the values of a given measurement over a given time window (e.g., 7 days, 14 days, 30 days, 60 days, 90 days, 180 days, or one year), obtaining the maximum, minimum, mean, median, and mode of the distribution of values of the given measurement over the given time window, and obtaining a measure of the prevalence of certain values or ranges of values in the data. For any feature or measurement of the data, the variance, standard deviation, skewness, kurtosis, hyper-skewness, hyper-tailedness, and various percentile values (e.g., 5%, 10%, 25%, 50%, 75%, 90%, 95%, 99%) of the distribution of the feature or measurement over a given time window may also be measured.
The data may also be filtered before aggregation or statistical analysis is performed. Data may be grouped by certain characteristics, and statistical analysis may be performed on the subset of the data having those characteristics. For example, the above metrics may be calculated only for data relating to passed or failed inspections, to during-production (DUPRO) inspections, or to inspections exceeding a minimum sample size.
Aggregation and statistical analysis may also be performed on data resulting from previous aggregation or statistical analysis. For example, statistics for a given measurement may be computed over multiple consecutive time windows, and the resulting values may be analyzed to characterize their variation over time. For example, the average plant failure rate may be calculated for consecutive 7-day windows, and the change in the average failure rate may be measured across those windows.
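This windowed aggregation and re-aggregation can be sketched as follows, assuming the inspection results live in a pandas DataFrame with a date index (the daily pass/fail values are synthetic):

```python
import pandas as pd

# Hypothetical daily inspection results for one factory (1 = inspection failed).
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=28, freq="D"),
    "failed": [0, 1, 0, 0, 1, 1, 0] * 4,
}).set_index("date")

# First-level aggregation: mean failure rate over consecutive 7-day windows.
weekly_rate = df["failed"].resample("7D").mean()

# Second-level aggregation: how the weekly rate itself changes over time.
weekly_change = weekly_rate.diff()
```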
In an embodiment of the present disclosure, the plant data includes information related to a risk score for the plant. Examples of suitable data for predicting a risk score include: data obtained from previous inspections at the same plant, data obtained from inspections at other plants having products or product lines similar to those of the plant, data obtained from a plant across multiple inspections, data relating to future inspection bookings (e.g., geographic location, time, entity performing the inspection, and/or type of inspection), data relating to the business operations of the plant, data relating to the product quality of the plant, general information about the plant, data relating to the sustainability of the plant or other similar plants, and/or data relating to the performance of the plant or other similar plants. The data may include the results of past inspections (e.g., whether the inspection passed). The data may include information obtained from customer reviews of products or product lines similar to those produced by the plant and/or customer reviews of products or product lines originating at the plant. It will be appreciated that for certain metrics, a plant may be divided into various departments, with a separate metric obtained for each department.
Examples of data relating to plant risk include: the number of orders placed at the plant, the quantity of orders, the quality of orders, the monetary value of orders, general information about orders, a description of each product at the plant (e.g., the product's Stock Keeping Unit (SKU), size, style, color, quantity, and packaging method), the financial performance of the plant, the number of items inspected at the plant during inspection of a process (such as process, packaging, and measurement), information about the Acceptable Quality Limits (AQL) of a process at the plant (e.g., the number of samples used to test quality), inspection results of past inspections at the plant, inspection results of past inspections of a particular product or product line, inspection results at other plants having similar products, inspection results of past inspections at business partners of the plant, values of various metrics collected during inspection, the geographical location of the plant, the size of the plant, the operating conditions and operating time of the plant, and the aggregations and statistical measures of the data mentioned above.
As used herein, the style of a product or product line refers to the unique appearance of the item based on the corresponding design. A style may have a unique identification (ID) within a particular brand, retailer, or factory. The style ID may be used as an identifying feature by which other measurements may be aggregated in order to extract meaningful features related to inspection results and risk calculations.
It will be appreciated that a large number of features may be extracted by various methods, such as manual feature extraction, to calculate or extract from the obtained data features that have significant relevance to the target variable (e.g., the estimated risk score). Features may be extracted directly from the data, or may require processing and/or further computation in order to be formatted in a manner from which the desired metrics can be extracted. For example, given the results of various inspections at the factory over the last year, one may wish to calculate the percentage of inspections that failed within that time period. In some embodiments, extracting features produces a feature vector that can be post-processed by applying dimensionality reduction algorithms (such as principal component analysis or linear discriminant analysis) or by inputting the feature vector into a neural network, thereby reducing the size of the vector and improving the performance of the overall system.
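A minimal sketch of the dimensionality-reduction step, applying principal component analysis to hypothetical 50-dimensional feature vectors (the dimensions and component count are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical extracted feature vectors: 100 factories x 50 raw features.
features = rng.random((100, 50))

# Reduce each 50-dimensional vector to 10 dimensions before classification.
pca = PCA(n_components=10)
reduced = pca.fit_transform(features)
```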
In some embodiments, the trained classifier is a random decision forest. However, it will be appreciated that various other classifiers are also suitable for use in accordance with the present disclosure, including linear classifiers, Support Vector Machines (SVMs), gradient-boosted classifiers, or neural networks, such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs).
Suitable artificial neural networks include, but are not limited to, feed-forward neural networks, radial basis function networks, self-organizing maps, learning vector quantization, recurrent neural networks, Hopfield networks, Boltzmann machines, echo state networks, long short-term memory, bidirectional recurrent neural networks, hierarchical recurrent neural networks, stochastic neural networks, modular neural networks, associative neural networks, deep belief networks, convolutional neural networks, convolutional deep belief networks, large memory storage and retrieval neural networks, deep Boltzmann machines, deep stacking networks, tensor deep stacking networks, spike-and-slab restricted Boltzmann machines, compound hierarchical-deep models, deep coding networks, multilayer kernel machines, or deep Q networks.
In some embodiments, the estimated risk score takes values within a specified range, e.g., the range [0, 100]. For example, a factory with perfect performance that never fails an inspection may receive a score of 0, while a factory with poor performance that fails every inspection may receive a score of 100. In some embodiments, the estimated risk score may be compared to a threshold to generate a binary value indicating whether the plant is considered high risk (e.g., 0 if the score is below the threshold and 1 otherwise). The threshold may be heuristically selected, or may be adaptively calculated during training of the machine learning model. In some embodiments, the estimated risk score comprises a vector indicating the probabilities of various types of estimated risk.
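The score mapping and thresholding described above might be sketched as follows; the linear map and the 50-point threshold are arbitrary illustrative choices, not values from the disclosure.

```python
def to_risk_score(probability):
    """Map a failure probability in [0, 1] linearly onto the [0, 100] range."""
    return 100.0 * probability

def is_high_risk(probability, threshold=50.0):
    """Binarize the score: 1 if the factory is considered high risk, else 0."""
    return 1 if to_risk_score(probability) >= threshold else 0
```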
The performance of a machine learning model according to embodiments of the present disclosure may be tested for new data, and the machine learning model may be updated in order to improve its performance. In some embodiments, updating the machine learning model includes modifying a hyper-parameter of the model. In some embodiments, updating the machine learning model includes using a different machine learning method than the machine learning method currently used in the model, and modifying the hyper-parameters of the different machine learning method in order to achieve the desired performance.
In some embodiments of the present disclosure, a plant is classified as either high performance or low performance. Plants classified as high performance have a low risk, while plants classified as low performance have a high risk.
In an embodiment of the present disclosure, historical inspection data from a given time window is used to estimate risk of the plant. It will be appreciated that various time windows may be used, for example, three months, six months, nine months, or one year. In some embodiments, the assessment may be updated at a regular frequency (e.g., weekly, biweekly, or monthly). Obtaining an updated risk estimate for the plant will help brands and retailers reduce their potential risk when working with the plant.
In some embodiments, the predicted risk outcome is converted into a corresponding risk score for the plant, where the risk score represents the company's performance as of the risk assessment date. As described above, the overall performance of the plant may correspond to the plant's conformance with predetermined performance criteria (e.g., with respect to volume or quality).
In embodiments of the present disclosure, a machine learning model is trained by assembling a training data set that includes inspection data for plants during various time windows and corresponding performance evaluations of those plants over their respective time windows. In some embodiments, the performance assessment comprises an expert assessment.
In some embodiments, the performance assessment includes feedback (e.g., from a customer or business partner of the plant) on previously estimated risk measures. In some embodiments, the plant is assigned a label that indicates whether the plant is low performing (and thus high risk), e.g., 1 indicates high risk and 0 indicates low risk. The data collection process produces an initial training data set to which machine learning techniques may be applied to generate an optimal model for predicting plant risk.
In some embodiments, training the machine learning model includes a feature extraction step. In some embodiments, the selected features to be extracted have a high correlation with the target variable. In some embodiments, the number of features is reduced in order to lower the computational cost of training and deploying the risk estimation model. In some embodiments, multiple machine learning and classification methods are tested on the training data set, and the model with the most desirable performance is selected for deployment in the risk estimation model. It will be appreciated that various machine learning algorithms may be used for risk assessment, including logistic regression models, random forests, Support Vector Machines (SVMs), deep neural networks, or boosting methods (e.g., gradient boosting, CatBoost). The hyper-parameters of each model may be learned to achieve the desired performance. It will be appreciated that the performance of a machine learning model may be measured by different metrics. In some embodiments, the metrics used to measure the performance of a machine learning model include accuracy, precision, recall, AUC, and/or F1 score.
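This model-comparison step can be sketched by training two candidate classifiers on synthetic data and selecting the one with the higher AUC; the candidates and metric are illustrative picks from the lists above, not the disclosure's actual selection.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled factory inspection data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Score each candidate by AUC on held-out data and keep the best.
aucs = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

best_name = max(aucs, key=aucs.get)
```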
In embodiments of the present disclosure, hyper-parameters for various machine learning risk estimation models are learned and the performance of each model is measured. In some embodiments, the metrics used to measure the performance of the machine learning model include accuracy, precision, recall, AUC, and/or F1 scores. In some embodiments, the initial data set is divided into three subsets: a training data set, a validation data set, and a test data set.
In some embodiments, 60% of the data is used for the training data set, 20% is used for the validation set, and the remaining 20% is used for the test data set. In some embodiments, cross-validation techniques are used to estimate the performance of each risk estimation model. The performance results may be verified by subjecting the trained risk prediction model to new test data.
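The 60/20/20 split described above can be obtained with two successive random splits; the data here is synthetic and the split routine is an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = (rng.random(1000) > 0.5).astype(int)

# First split off 20% of the whole as the test set, then carve the
# validation set (20% of the whole = 25% of the remainder) out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```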
It will be appreciated that predicting the risk of a plant is useful for achieving dynamic risk-based quality control. For example, given the risk of a particular inspection or a particular plant, a specific inspection workflow or template may be automatically generated based on the requirements of the plant or of a business partner of the plant. The calculated risk may be applied to the critical path or the time-and-action plan of a purchase order or style in order to modify the number of samples required. Based on the calculated level of risk for a particular plant, the inspection team may assess whether they should decline or confirm an inspection booking. The estimated risk may also be leveraged to determine the nature of the inspection. For example, for a high risk plant, the inspection may be performed by an internal independent team, while a low risk plant may perform the inspection itself through the person responsible for plant performance.
Referring now to FIG. 1, a schematic diagram of an exemplary system for plant risk estimation is shown, according to an embodiment of the present disclosure. The historical plant inspection data 103 is input into the plant risk prediction server 105 and an estimated risk score 107 is obtained. The plant inspection data 103 may be obtained from the plant 101, from the inspection database 109, or from any combination of sources. As described above, the inspection data may include data related to inspection at the plant, data related to the performance of the plant, and/or data related to the plant in general. In some embodiments, the estimated risk score 107 is sent to the mobile or web application 108 where it can be used for further analysis or decision making. The mobile application may be implemented on a smartphone, tablet, or other mobile device, and may run on various operating systems (e.g., iOS, Android, or Windows). In various embodiments, the estimated risk score 107 is sent to the mobile or web application 108 via a wide area network.
Referring now to FIG. 2, a process for factory risk estimation is shown, according to an embodiment of the present disclosure. The plant inspection data 210 is input into the plant risk prediction system 220 to obtain a predicted risk result 260. In some embodiments, the verification data is obtained from various sources, as discussed above. In some embodiments, the plant risk prediction system 220 employs a machine learning model to estimate the risk associated with the plant. In some embodiments, the plant risk prediction system 220 is deployed on a server. In some embodiments, the server is a remote server. In some embodiments, the plant risk estimation process 200 includes performing a data processing step 230 on the input plant data. Data processing may include aggregating data, obtaining statistical measures of the data, and formatting various forms of the data in a manner from which features may be extracted. In some embodiments, the process 200 includes performing a feature extraction step 240 on the input data to extract various features. In some embodiments, the feature extraction step 240 is performed on the data that has been processed at step 230. In some embodiments, a feature vector is output. In some embodiments, the features extracted at 240 are input into a classifier at 250. In some embodiments, the classifier includes a trained machine learning model. In some embodiments, the classifier outputs the prediction 260. In some embodiments, steps 230, 240, and 250 are performed by the plant risk prediction system 220. The steps of process 200 may be performed locally at the plant site, may be performed by a remote server (e.g., a cloud server), or may be shared between a local computing device and a remote server.
Referring now to FIG. 3, a process for training a plant risk estimation system is shown, according to an embodiment of the present disclosure. The steps of process 300 may be performed to train a plant risk estimation model. In some embodiments, the model is deployed on a prediction server. The steps of process 300 may be performed locally at the plant site, may be performed by a remote server (e.g., a cloud server), or may be shared between a local computing device and a remote server. At 302, an initial training data set is created. In some embodiments, the training data set may include historical inspection data for a large number of plants. The inspection data may be based on the results of inspections and may include various values corresponding to measurements made during the inspections, as discussed above. In some embodiments, the training data set includes evaluation data corresponding to the inspection data for each plant. The evaluation data may include performance scores within a range of possible scores, binary scores indicating whether a particular plant passed or failed an inspection, or a number of scores in various categories indicating the performance of the plant in those categories. In some embodiments, the evaluation data comprises an expert assessment. In some embodiments, the inspection data and corresponding evaluation data are time-stamped. In some embodiments, the data for a plant may be aggregated over a given length of time or a given number of inspections. In some embodiments, the data obtained for a plant is collected only from inspections during a given time window.
It will be appreciated that a well-performing plant may be considered low risk, while a poorly performing plant may be considered high risk. In some embodiments, a list of plants and their inspection results may be obtained, with the evaluation data serving as labels for the inspection data. For example, in embodiments where the evaluation data is assigned a binary value, a value of 1 may indicate that the plant is high risk, while a value of 0 may indicate that the plant is low risk.
Useful features are then extracted from the initial training data set. The extracted features may correspond to different time windows (e.g., three months, six months, nine months, or one year). The importance of each feature in estimating the final risk outcome of the plant is calculated. In some embodiments, the importance of each feature is calculated by measuring the correlation of the feature with a target label (e.g., the assessment data). At 306, a plurality of machine learning models are trained on a training data set, and the performance of each model is evaluated. It will be appreciated that acceptable machine learning models include, in addition to those described above, a CatBoost classifier, a neural network (e.g., a neural network with 4 fully connected hidden layers and a ReLU activation function), a decision tree, an extreme gradient boosting machine, a random forest classifier, an SVM, and logistic regression. The hyperparameters of each model may be adjusted to optimize the performance of the model. In some embodiments, the metrics used to measure the performance of the machine learning models include accuracy, precision, recall, AUC, or F1 score. The features that are most useful for performing the desired estimation are chosen. At 308, the performance of the machine learning models is compared. The model with the best performance is selected at 310. At 312, the selected model is deployed onto a prediction server.
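The evaluation metrics named above (accuracy, precision, recall, F1) can be computed from a model's predictions with the standard definitions, independent of any particular model. A minimal sketch:

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = high risk)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def scores(y_true, y_pred):
    """Return the metrics used to compare candidate models."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Candidate models could then be ranked by, for example, their F1 score on held-out data.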
Referring now to FIG. 4, a process for updating a plant risk estimation system is shown, according to an embodiment of the present disclosure. In some embodiments of process 400, an existing plant risk prediction model is updated. In some embodiments, updating the predictive model includes inputting new data and modifying parameters of the learning system accordingly to improve the performance of the system. In some embodiments, a new machine learning model may be selected to perform the estimation. The plant risk prediction model may be updated at regular intervals (e.g., monthly, bimonthly, or quarterly), or may be updated as a certain amount of new data is accumulated. It will be appreciated that the updated risk estimation system provides a more accurate risk estimation than existing methods.
In some embodiments, customer feedback regarding previous predictions 402 and/or new data 404 is collected and used to generate a new data set 406 having labels corresponding to the data for each plant. The customer feedback 402 may include a ground-truth risk score, an indication of the accuracy of previous predictions (such as which predictions made by the prediction model were incorrect), and corrected results for those predictions. The data obtained from the customer feedback can be used to create new labels for the plant's inspection data. The new data 404 may include new inspection data and evaluation data for a large number of plants. It will be appreciated that the new data set 406 may be constructed in a similar manner to the initial data set described above. In some embodiments, the new data set 406 is combined with the existing training data set 408 to create a new training data set 410. In some embodiments, the performance of the latest version of the trained risk prediction model 424 (including the plant risk predictor 412) is measured on the new training data set. In some embodiments, if the performance of the latest version of the trained risk prediction model 424 and predictor 412 is below a certain threshold, then a feature re-engineering step 414 is performed, and/or a new machine learning model 418 is introduced prior to retraining 416. The threshold may be heuristically selected, or may be adaptively calculated during training.
It will be appreciated that the method of retraining the predictive model at 416 may be similar to the method used in initially training the plant risk estimation model, as described above. The process of retraining the predictive model may be repeated multiple times until the performance of the model on the new training data set reaches an acceptable threshold. In some embodiments, the latest version of the trained risk prediction model 424 is updated at 420 with the new model trained at 416. The updated risk prediction model may then be deployed on prediction server 422. The existing training data set 408 may also be updated to reflect the newly acquired data.
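The retrain-until-acceptable loop described above can be sketched generically. The training and evaluation callables are placeholders for the actual model pipeline; the retry budget is an assumption:

```python
def retrain_until_acceptable(train_fn, evaluate_fn, threshold, max_rounds=5):
    """Repeatedly retrain a model until its performance on the new
    training data set reaches the threshold, or the retry budget runs out.

    train_fn    -- callable returning a freshly trained model
    evaluate_fn -- callable scoring a model on the new data set
    """
    model = None
    for _ in range(max_rounds):
        model = train_fn()
        if evaluate_fn(model) >= threshold:
            break  # performance is acceptable; stop retraining
    return model
```

In the actual process, `train_fn` could also perform feature re-engineering or swap in a new model class between rounds, as described for steps 414 and 418.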
Referring now to FIGS. 5-7, various processes for training a plant risk estimation system are illustrated, in accordance with embodiments of the present disclosure. In various embodiments of the present disclosure, generating a trained risk assessment system includes four main steps: data collection, feature extraction, model training, and risk prediction. In some embodiments, data collection includes creating an initial training data set using the method described above. In some embodiments, feature extraction comprises extracting a plurality of useful features from the initial training data set. The extracted features may be a subset of the large number of features that may be extracted from the initial training data set. In some embodiments, the importance of each feature to the risk prediction calculation is measured. In some embodiments, the features that are least relevant to the prediction calculation are not used in the risk prediction model. In some embodiments, determining the relevance of a feature to the prediction calculation includes measuring the correlation of the feature with the risk prediction outcome. In some embodiments, a fixed number of features are extracted. In some embodiments, the feature extraction step comprises manual feature extraction. In some embodiments, dimensionality reduction techniques (e.g., principal component analysis or linear discriminant analysis) may be applied to the extracted features. In some embodiments, the extracted features are passed through a neural network, resulting in feature vectors with reduced dimensionality. Model training includes measuring the performance of a plurality of machine learning models on the extracted features. The model with the most desirable performance may be chosen to perform risk prediction.
Referring now to FIG. 5, a process for training a plant risk estimation system is shown, according to an embodiment of the present disclosure. In some embodiments, manual feature extraction 502 is performed on an initial training data set 501 that includes factory inspection data. Features may be extracted for each plant or for various departments within the plant in the manner described above. Features may be extracted based on inspection data during a particular time window (e.g., one year). In some embodiments, a feature vector corresponding to the inspection data for each plant is generated from the feature extraction step. In some embodiments, a label is assigned to each feature vector. In some embodiments, the labels are obtained from the initial training data set 501. In some embodiments, the label is a binary value indicating whether the plant is at high risk or low risk. In some embodiments, the risk estimation of a plant is transformed into a binary classification problem, where the plant may be classified as high risk or non-high risk. These categories correspond to plants with low and high performance, respectively. At 503, various machine learning models (e.g., support vector machines, decision trees, random forests, or neural networks) and boosting methods (e.g., CatBoost or XGBoost) may be tested on the initial training data set.
In training the various machine learning models and boosting methods, the initial training data set may be divided into a training data set and a test data set. For example, 80% of the initial training data set may be used to create the training data set, with the remaining 20% being used to form the test data set. In some embodiments, the initial training data set may be divided into a training data set, a test data set, and a validation data set. In some embodiments, the hyperparameters of the machine learning models and boosting methods are adjusted to achieve the most desirable performance. The model with the most desirable performance may then be chosen to provide a risk estimation for the input plant data. In some embodiments, the selected model is deployed on a prediction server to provide future risk predictions.
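The 80/20 split described above can be sketched as a shuffle-and-slice over indices. This is a minimal illustration; real pipelines typically use a library utility such as scikit-learn's `train_test_split`:

```python
import random

def train_test_split(samples, labels, test_frac=0.2, seed=42):
    """Randomly partition (samples, labels) into train and test subsets,
    keeping each sample paired with its label."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)          # deterministic shuffle
    n_test = int(len(samples) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx])
```

A further split of the training portion would yield the validation set mentioned above.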
In some embodiments of the present disclosure, a feature vector is calculated from the inspection data for a plant. The feature vector is input into a risk prediction model and a predicted risk probability is obtained. The probability may be compared to a given threshold to determine whether the plant should be classified as high risk. In some embodiments, a plant is considered to be high risk if the predicted probability is greater than or equal to the threshold. In some embodiments, a risk score is obtained based on the calculated probability. In some embodiments, the risk score is a function of the average failed-inspection rate at the plant. In some embodiments, the risk score takes values within a predetermined range (e.g., [0, 100]). For example, a plant with perfect performance that never fails an inspection may receive a score of 0, while a plant with poor performance that fails every inspection may receive a score of 100. In some embodiments, testing the risk prediction model includes comparing the predicted risk score and/or the classification of the plant as high risk to known data.
In some embodiments, the risk score R is obtained based on the calculated probability p using the following procedure:
A range [A, B] is selected that defines the lower and upper limits of the risk score. For example, the risk score R may be taken to lie in the range [0, 100], where R = 0 represents the lowest possible risk for the plant (e.g., the plant has perfect performance with no failed inspections during a given time window), and R = 100 represents the highest possible risk for the plant (e.g., the plant has failed all of its inspections and has poor performance). Assuming that the predicted probability p lies in the unit interval [0, 1], a mapping F may be determined to assign the predicted probability to the corresponding risk score R:
F: [0, 1] → [A, B] (Equation 1)

For a given p, F maps the predicted probability to the corresponding risk score:

F(p) = R (Equation 2)

F is selected such that F(0) = A and F(1) = B. For example, a linear mapping may be used:

F(p) = (1 − p) × A + p × B (Equation 3)
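A minimal sketch of such a linear mapping, written so that F(0) = A and F(1) = B as stated above (the function name and default range are assumptions):

```python
def probability_to_risk_score(p, a=0.0, b=100.0):
    """Linearly map a predicted high-risk probability p in [0, 1]
    to a risk score in [a, b], with F(0) = a and F(1) = b."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return (1.0 - p) * a + p * b
```

For example, with the default range [0, 100], a predicted probability of 0.5 maps to a risk score of 50.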
In some embodiments, the risk score R may be calculated by using a scorecard model as follows:
Applying a machine learning technique to a training data set yields a list L = {f1, f2, f3, …, fN}, in which each fi corresponds to a feature associated with the value being predicted. Next, a binning process may be performed to transform the numerical features in the list L into categorical values, where multiple attributes are grouped together under one value, and the categorical features may be regrouped and merged. It will be appreciated that grouping similar attributes with similar predictive strengths may improve the accuracy of the prediction model. For example, one extracted feature from the training data set may be the plant's average failed-inspection rate over the past 180 days. Since the extracted feature is a rate, it can take any value in the unit interval [0, 1]. The feature may be transformed from a numerical feature into a categorical feature by applying the binning process to the feature values. For example, the values may be transformed into one of the following groups:
a) less than 2%
b) [2%, 5%)
c) [5%, 10%)
d) [10%, 15%)
e) 15% or greater.
For example, if a plant has an average failed-inspection rate of 3.4%, it will be assigned to group (b). In this way, continuous and discrete feature values may be categorized into multiple groups.
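The grouping above can be sketched as a simple binning function. The bin edges follow the listed groups; the group labels are illustrative:

```python
# Upper bounds (exclusive) for groups (a)-(d); values >= 0.15 fall in group (e).
BIN_EDGES = [(0.02, "a"), (0.05, "b"), (0.10, "c"), (0.15, "d")]

def bin_failure_rate(rate):
    """Map a continuous failed-inspection rate in [0, 1] to a categorical group."""
    for upper, label in BIN_EDGES:
        if rate < upper:
            return label
    return "e"
```

For the example in the text, a rate of 3.4% (0.034) falls in group (b).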
A Weight of Evidence (WOE) may be calculated for each category and may replace the categorical value in later calculations. WOE is the logarithm of the ratio of the proportion of favorable events to the proportion of unfavorable events, and measures the predictive strength of a feature's attributes in distinguishing high-performance plants from low-performance plants. For each feature, an Information Value (IV) may also be obtained. IV is the sum, over the feature's groups, of the difference between the percentages of favorable and unfavorable events multiplied by the group's WOE. IV is a useful metric for determining the importance of variables in the predictive model. It will be appreciated that during the feature engineering stage, an IV may be calculated for each feature in the list L to verify that the feature has a good information value and is therefore relevant to the prediction problem.
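Under the usual definitions, WOE and IV for a binned feature can be computed from per-group counts of favorable and unfavorable outcomes (e.g., passed vs. failed inspections; the exact event definitions are an assumption here):

```python
import math

def woe_iv(good_counts, bad_counts):
    """Return per-group WOE values and the feature's total Information Value.

    WOE_j = ln( (good_j / total_good) / (bad_j / total_bad) )
    IV    = sum_j (good_j / total_good - bad_j / total_bad) * WOE_j
    """
    total_good, total_bad = sum(good_counts), sum(bad_counts)
    woes, iv = [], 0.0
    for g, b in zip(good_counts, bad_counts):
        pg, pb = g / total_good, b / total_bad
        w = math.log(pg / pb)   # assumes no empty groups
        woes.append(w)
        iv += (pg - pb) * w
    return woes, iv
```

A group with more than its share of favorable events gets a positive WOE; each term of the IV sum is non-negative, so a larger IV indicates a more predictive feature.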
Using all of the features f1, f2, f3, …, fN, a logistic regression model can be trained to classify the plant as either high performance or low performance, and the regression coefficients {β1, β2, β3, …, βN} corresponding to each feature, along with the intercept term α, can be obtained. Finally, for each fi in the list L, the corresponding score points can be calculated using the following formula, where Ki is the number of attribute groups in feature fi, N is the number of (most important) features selected, WoEj is the Weight of Evidence of the j-th group of feature fi determined during the binning process, and Factor and Offset are scaling parameters that ensure the final score is within the selected range.
Score_(i,j) = (β_i × WoE_j + α/N) × Factor + Offset/N, for j = 1, …, K_i
Finally, the risk score for the plant may be calculated as the sum of the score points over all features fi:

R = Score_1 + Score_2 + … + Score_N
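Putting the scorecard pieces together, the per-feature score points and the final risk score can be sketched as follows. The sign and scaling conventions are assumptions based on standard scorecard formulations, since the patent's formula is published only as an image:

```python
def score_points(beta_i, woe_j, alpha, n, factor, offset):
    """Score points for feature f_i when the plant's value falls into the
    attribute group with Weight of Evidence woe_j."""
    return (beta_i * woe_j + alpha / n) * factor + offset / n

def plant_risk_score(betas, woes, alpha, factor, offset):
    """Total risk score: the sum of score points over all N features,
    using the WOE of the group each feature value falls into."""
    n = len(betas)
    return sum(score_points(b, w, alpha, n, factor, offset)
               for b, w in zip(betas, woes))
```

For example, with two features, β = [0.5, 1.0], selected WOEs [0.2, −0.3], α = 1.0, Factor = 20, and Offset = 100, the two score points are 62 and 54, giving a total risk score of 116.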
Referring now to FIG. 6, a process for training a plant risk estimation system is shown, according to an embodiment of the present disclosure. In some embodiments, features are obtained from inspection data 601 of the plant using manual feature extraction 602. It will be appreciated that feature extraction may yield a large number of extracted features for each plant, and thus a large feature vector. The number of extracted features may run into the hundreds. Reducing the dimensionality of the feature vectors may result in more efficient training, deployment, and operation of the predictive models. In some embodiments, the dimensionality of the feature vector is reduced at 603 by computing the correlation of each feature with the target variable and keeping only those features that have a high correlation with the target variable. In some embodiments, the dimensionality of the feature vectors is reduced at 603 by applying a dimensionality reduction algorithm, such as Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA), to the vectors. In some embodiments, the resulting lower-dimensional feature vectors computed for multiple plants at 604 are input into various machine learning and/or gradient boosting models, and the model with the most desirable performance is chosen, as described above.
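Keeping only the features most correlated with the target, as described above, can be sketched with a plain Pearson correlation. This is a minimal illustration; a production system would typically use a library implementation:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(feature_columns, target, k):
    """Rank feature columns by |correlation with target| and keep the top k."""
    ranked = sorted(feature_columns,
                    key=lambda name: abs(pearson(feature_columns[name], target)),
                    reverse=True)
    return ranked[:k]
```

Using the absolute value keeps strongly negatively correlated features as well, since they are equally informative for the classifier.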
Referring now to FIG. 7, a process for training a plant risk estimation system is shown, according to an embodiment of the present disclosure. In some embodiments, features are obtained from inspection data 701 using manual feature extraction 702. In some embodiments, the feature extraction step produces a feature vector. In some embodiments, the feature vectors are input into the neural network at 703. In some embodiments, the neural network comprises a deep neural network. In some embodiments, the neural network includes an input layer, a plurality of fully connected hidden layers, and an output layer having a predetermined activation function. In some embodiments, the activation function comprises a ReLU or sigmoid activation function, but it will be appreciated that a variety of activation functions may be suitable. The output of the neural network may be treated as a new feature vector and may be input into various machine learning models at 704 using steps similar to those described above. In some embodiments, the new feature vector has a smaller dimension than the input feature vector.
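Passing a feature vector through a small fully connected network to obtain a lower-dimensional vector can be sketched as below. The weights here are fixed toy values chosen for illustration; in practice they would be learned during training:

```python
def relu(v):
    """ReLU activation applied elementwise."""
    return [max(0.0, x) for x in v]

def dense(v, weights, biases):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]

def compress(features):
    """Map a 4-dimensional feature vector to a 2-dimensional one
    via one hidden ReLU layer and a linear output layer."""
    w1 = [[0.5, -0.2, 0.1, 0.0],
          [0.0, 0.3, -0.1, 0.2],
          [0.1, 0.1, 0.1, 0.1]]
    b1 = [0.0, 0.1, -0.05]
    hidden = relu(dense(features, w1, b1))
    w2 = [[1.0, 0.0, 0.5],
          [0.0, 1.0, -0.5]]
    b2 = [0.0, 0.0]
    return dense(hidden, w2, b2)
```

The 2-dimensional output would then play the role of the "new feature vector" fed to the downstream models at 704.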
Table 1 lists a number of features that can be extracted from the plant's inspection data using the methods described above. In various exemplary embodiments, gradient boosting on decision trees is applied, for example using CatBoost. In exemplary embodiments of the present disclosure, these features have a high correlation with the target variable.
TABLE 1 [rendered as images in the original publication; the table lists the features extractable from the plant's inspection data]
It will be appreciated that various additional features and statistical measures may be used in accordance with the present disclosure.
Referring now to FIG. 8, a schematic diagram of an example of a compute node is shown. The computing node 10 is only one example of a suitable computing node and is not intended to suggest any limitation as to the scope of use or functionality of the embodiments described herein. In any event, computing node 10 is capable of implementing and/or performing any of the functions set forth above.
In the computing node 10, there is a computer system/server 12, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with computer system/server 12 include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems or devices, and the like.
Computer system/server 12 may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer system/server 12 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
As shown in FIG. 8, the computer system/server 12 in the computing node 10 is shown in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 to the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, Peripheral Component Interconnect (PCI) bus, Peripheral Component Interconnect Express (PCIe), and Advanced Microcontroller Bus Architecture (AMBA).
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessed by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. The computer system/server 12 may also include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be provided for reading from and writing to non-removable, nonvolatile magnetic media (not shown, and commonly referred to as "hard drives"). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable, non-volatile optical disk (such as a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each may be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the disclosure.
By way of example, and not limitation, a program/utility 40 having a set (at least one) of program modules 42, and an operating system, one or more application programs, other program modules, and program data may be stored in memory 28. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods in the embodiments as described herein.
The computer system/server 12 may also communicate with one or more external devices 14 (such as a keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer system/server 12, and/or any device (e.g., network card, modem, etc.) that enables the computer system/server 12 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks, such as a Local Area Network (LAN), a general Wide Area Network (WAN), and/or a public network (e.g., the internet) via the network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that although not shown, other hardware and/or software components may be used in conjunction with the computer system/server 12. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data archive storage systems, and the like.
The present disclosure may be embodied as systems, methods, and/or computer program products. The computer program product may include computer-readable storage medium(s) having computer-readable program instructions thereon for causing a processor to perform aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device such as a punch card or an in-groove projection structure having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium as used herein is not to be interpreted as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or an electrical signal transmitted through an electrical wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in the computer-readable storage medium in the respective computing/processing device.
Computer-readable program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry, including, for example, programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), may execute the computer-readable program instructions in order to perform aspects of the present disclosure by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium having stored thereon the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The description of the various embodiments of the present disclosure has been presented for purposes of illustration but is not intended to be exhaustive or limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques found in the marketplace, or to enable others skilled in the art to understand the embodiments disclosed herein.
1. A system, comprising:
a computing node comprising a computer-readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of the computing node to cause the processor to perform a method comprising:
receiving data of a plant, the data including historical inspection data of the plant;
extracting a plurality of features from the data;
providing the plurality of features to a trained classifier;
obtaining, from the trained classifier, a risk score corresponding to a probability that the plant will fail to meet a predetermined performance metric.
2. The system of item 1, the method further comprising preprocessing the data.
3. The system of item 2, wherein the pre-processed data comprises aggregated data.
4. The system of item 3, wherein preprocessing the data further comprises filtering the data prior to aggregation.
5. The system of item 1, wherein the data further comprises a performance history of the plant.
6. The system of item 1, wherein the data further comprises geographic information of the plant.
7. The system of item 1, wherein the data further comprises a ground-truth risk score.
8. The system of item 1, wherein the data further comprises product data for the plant.
9. The system of item 1, wherein the data spans a predetermined time window.
10. The system of item 1, wherein
providing the plurality of features to the trained classifier comprises sending the plurality of features to a remote risk prediction server, and
obtaining the risk score from the trained classifier comprises receiving the risk score from the remote risk prediction server.
11. The system of item 1, wherein extracting the plurality of features comprises removing features having low correlation with a target variable.
12. The system of item 1, wherein extracting the plurality of features comprises applying a dimensionality reduction algorithm.
13. The system of item 1, wherein extracting the plurality of features from the data comprises applying an artificial neural network.
14. The system of item 13, wherein applying the artificial neural network comprises receiving a first feature vector as an input, and outputting a second feature vector having a lower dimensionality than the first feature vector.
15. The system of item 1, the method further comprising:
providing the risk score to a user.
16. The system of item 15, wherein providing the risk score to the user comprises sending the risk score to a mobile or web application.
17. The system of item 16, wherein the sending is performed via a wide area network.
18. The system of item 1, wherein the trained classifier comprises an artificial neural network.
19. The system of item 1, wherein the trained classifier comprises a support vector machine.
20. The system of item 1, wherein obtaining a risk score from the trained classifier comprises applying a gradient boosting algorithm.
21. The system of item 1, wherein the risk score is related to the probability by a linear mapping.
22. The system of item 1, wherein obtaining a risk score comprises applying a scorecard model.
23. The system of item 1, wherein the method further comprises:
measuring performance of the trained classifier by comparing the risk score to a ground truth risk score;
optimizing parameters of the trained classifier according to the measured performance.
24. The system of item 23, wherein optimizing the parameters of the trained classifier comprises modifying hyper-parameters of the trained machine learning model.
25. The system of item 24, wherein optimizing the parameters of the trained classifier comprises replacing a first machine learning algorithm with a second machine learning algorithm having hyper-parameters configured to improve the performance of the trained classifier.
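By way of non-limiting illustration only, the flow recited in items 1 and 18–21 above — extracting features from a plant's inspection data, applying a trained classifier, and relating the resulting probability to a risk score by a linear mapping — may be sketched as follows. All function names, feature definitions, and model weights below are assumptions chosen for the example and form no part of the disclosure:

```python
import math

def extract_features(inspections):
    # Aggregate a plant's historical inspection records into a feature vector.
    total = len(inspections)
    fails = sum(1 for r in inspections if r["result"] == "fail")
    defects = sum(r["defect_count"] for r in inspections)
    return [fails / total, defects / total, math.log(1 + total)]

def trained_classifier(features, weights=(2.5, 0.8, -0.3), bias=-1.0):
    # Stand-in for the trained classifier: a logistic model returning the
    # probability that the plant fails the predetermined performance metric.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def risk_score(probability, lo=0.0, hi=100.0):
    # Linear mapping from probability to risk score (cf. item 21).
    return lo + (hi - lo) * probability

history = [
    {"result": "pass", "defect_count": 2},
    {"result": "fail", "defect_count": 7},
    {"result": "pass", "defect_count": 1},
]
p = trained_classifier(extract_features(history))
print(round(risk_score(p), 1))
```

A plant whose history contains failed inspections and many defects receives a correspondingly elevated score on the assumed 0–100 scale.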
26. A method, comprising:
receiving data of a plant, the data including historical inspection data of the plant;
extracting a plurality of features from the data;
providing the plurality of features to a trained classifier;
obtaining, from the trained classifier, a risk score corresponding to a probability that the plant will fail to meet a predetermined performance metric.
27. The method of item 26, further comprising preprocessing the data.
28. The method of item 27, wherein the pre-processed data comprises aggregated data.
29. The method of item 28, wherein preprocessing the data further comprises filtering the data prior to aggregation.
30. The method of item 26, wherein the data further comprises a performance history of the plant.
31. The method of item 26, wherein the data further comprises geographic information of the plant.
32. The method of item 26, wherein the data further comprises a ground truth risk score.
33. The method of item 26, wherein the data further comprises product data for the plant.
34. The method of item 26, wherein the data spans a predetermined time window.
35. The method of item 26, wherein
providing the plurality of features to the trained classifier comprises sending the plurality of features to a remote risk prediction server, and
obtaining the risk score from the trained classifier comprises receiving the risk score from the remote risk prediction server.
36. The method of item 26, wherein extracting the plurality of features comprises removing features having low correlation with a target variable.
37. The method of item 26, wherein extracting the plurality of features comprises applying a dimensionality reduction algorithm.
38. The method of item 26, wherein extracting a plurality of features from the data comprises applying an artificial neural network.
39. The method of item 38, wherein applying an artificial neural network comprises receiving a first feature vector as an input, and outputting a second feature vector having a lower dimensionality than the first feature vector.
40. The method of item 26, further comprising:
providing the risk score to a user.
41. The method of item 40, wherein providing the risk score to the user comprises sending the risk score to a mobile or web application.
42. The method of item 41, wherein the sending is performed via a wide area network.
43. The method of item 26, wherein the trained classifier comprises an artificial neural network.
44. The method of item 26, wherein the trained classifier comprises a support vector machine.
45. The method of item 26, wherein obtaining a risk score from the trained classifier comprises applying a gradient boosting algorithm.
46. The method of item 26, wherein the risk score is related to the probability by a linear mapping.
47. The method of item 26, wherein obtaining a risk score comprises applying a scorecard model.
48. The method of item 26, further comprising:
measuring performance of the trained classifier by comparing the risk score to a ground truth risk score;
optimizing parameters of the trained classifier according to the measured performance.
49. The method of item 48, wherein optimizing the parameters of the trained classifier comprises modifying hyper-parameters of the trained machine learning model.
50. The method of item 49, wherein optimizing the parameters of the trained classifier comprises replacing a first machine learning algorithm with a second machine learning algorithm having hyper-parameters configured to improve the performance of the trained classifier.
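The feature-extraction refinement of items 36 and 37 — removing features having low correlation with a target variable — admits a minimal, non-limiting sketch. The threshold value, column names, and toy data below are assumptions for illustration only:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx ** 0.5 * vy ** 0.5)

def select_features(columns, target, threshold=0.5):
    # Keep only feature columns whose |r| against the target variable meets
    # the threshold; the remainder are removed as weakly correlated.
    return {name: col for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold}

columns = {
    "fail_rate":    [0.1, 0.4, 0.2, 0.8, 0.6],   # tracks the target
    "random_noise": [0.5, 0.5, 0.5, 0.5, 0.6],   # mostly uninformative
}
target = [0, 1, 0, 1, 1]  # 1 = plant missed the performance metric
kept = select_features(columns, target)
print(sorted(kept))
```

Only the informative column survives the filter; in practice the same idea extends to a dimensionality reduction step such as that recited in item 37.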
51. A computer program product for plant risk estimation, the computer program product comprising a computer-readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method comprising:
receiving data of a plant, the data including historical inspection data of the plant;
extracting a plurality of features from the data;
providing the plurality of features to a trained classifier;
obtaining, from the trained classifier, a risk score corresponding to a probability that the plant will fail to meet a predetermined performance metric.
52. The computer program product of item 51, the method further comprising preprocessing the data.
53. The computer program product of item 52, wherein the pre-processed data comprises aggregated data.
54. The computer program product of item 53, wherein preprocessing the data further comprises filtering the data prior to aggregation.
55. The computer program product of item 51, wherein the data further comprises a performance history of the plant.
56. The computer program product of item 51, wherein the data further comprises geographic information of the plant.
57. The computer program product of item 51, wherein the data further comprises a ground truth risk score.
58. The computer program product of item 51, wherein the data further comprises product data for the plant.
59. The computer program product of item 51, wherein the data spans a predetermined time window.
60. The computer program product of item 51, wherein
providing the plurality of features to the trained classifier comprises sending the plurality of features to a remote risk prediction server, and
obtaining the risk score from the trained classifier comprises receiving the risk score from the remote risk prediction server.
61. The computer program product of item 51, wherein extracting the plurality of features comprises removing features that have a low correlation with a target variable.
62. The computer program product of item 51, wherein extracting the plurality of features comprises applying a dimensionality reduction algorithm.
63. The computer program product of item 51, wherein extracting the plurality of features from the data comprises applying an artificial neural network.
64. The computer program product of item 63, wherein applying an artificial neural network comprises receiving a first feature vector as an input, and outputting a second feature vector having a lower dimensionality than the first feature vector.
65. The computer program product of item 51, the method further comprising:
providing the risk score to a user.
66. The computer program product of item 65, wherein providing the risk score to the user comprises sending the risk score to a mobile or web application.
67. The computer program product of item 66, wherein the sending is performed via a wide area network.
68. The computer program product of item 51, wherein the trained classifier comprises an artificial neural network.
69. The computer program product of item 51, wherein the trained classifier comprises a support vector machine.
70. The computer program product of item 51, wherein obtaining a risk score from the trained classifier comprises applying a gradient boosting algorithm.
71. The computer program product of item 51, wherein the risk score is related to the probability by a linear mapping.
72. The computer program product of item 51, wherein obtaining a risk score comprises applying a scorecard model.
73. The computer program product of item 51, wherein the method further comprises:
measuring performance of the trained classifier by comparing the risk score to a ground truth risk score;
optimizing parameters of the trained classifier according to the measured performance.
74. The computer program product of item 73, wherein optimizing the parameters of the trained classifier comprises modifying hyper-parameters of the trained machine learning model.
75. The computer program product of item 74, wherein optimizing the parameters of the trained classifier comprises replacing a first machine learning algorithm with a second machine learning algorithm having hyper-parameters configured to improve the performance of the trained classifier.
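Items 73 through 75 describe measuring the performance of the trained classifier against a ground truth risk score and optimizing its parameters accordingly. A minimal, non-limiting sketch follows; the one-hyper-parameter toy model, the error metric, and the candidate values are assumptions for illustration and do not reflect any particular embodiment:

```python
def mean_absolute_error(predicted, ground_truth):
    # Performance of the classifier, measured against ground truth risk scores.
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / len(predicted)

def toy_classifier(features, scale):
    # One-hyper-parameter stand-in model: risk score = scale * feature, capped at 100.
    return [min(100.0, scale * f) for f in features]

def tune(features, ground_truth, candidates):
    # Select the hyper-parameter value that minimises error against ground truth,
    # i.e. the "optimizing parameters according to performance" step.
    return min(candidates,
               key=lambda s: mean_absolute_error(toy_classifier(features, s),
                                                 ground_truth))

features = [0.2, 0.5, 0.9]
ground_truth = [20.0, 50.0, 90.0]
best_scale = tune(features, ground_truth, [10, 50, 100, 200])
print(best_scale)
```

Replacing the toy model with a different candidate algorithm inside the same loop corresponds to the algorithm substitution of item 75.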

Claims (10)

1. A system, comprising:
a computing node comprising a computer-readable storage medium having program instructions embodied therewith, the program instructions executable by a processor of the computing node to cause the processor to perform a method comprising:
receiving data of a plant, the data including historical inspection data of the plant;
extracting a plurality of features from the data;
providing the plurality of features to a trained classifier;
obtaining, from the trained classifier, a risk score corresponding to a probability that the plant will fail to meet a predetermined performance metric.
2. The system of claim 1, the method further comprising preprocessing the data.
3. The system of claim 2, wherein the pre-processed data comprises aggregated data.
4. The system of claim 3, wherein preprocessing the data further comprises filtering the data prior to aggregation.
5. The system of claim 1, wherein the data further comprises a performance history of the plant.
6. The system of claim 1, wherein the data further comprises geographic information of the plant.
7. The system of claim 1, wherein the data further comprises a ground truth risk score.
8. The system of claim 1, wherein the data further comprises product data for the plant.
9. The system of claim 1, wherein the data spans a predetermined time window.
10. The system of claim 1, wherein
providing the plurality of features to the trained classifier comprises sending the plurality of features to a remote risk prediction server, and
obtaining the risk score from the trained classifier comprises receiving the risk score from the remote risk prediction server.
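The preprocessing of claims 3, 4 and 9 — filtering inspection data to a predetermined time window before aggregating it per plant — may be illustrated by the following non-limiting sketch; the record layout, window length, and reference date are assumptions chosen for the example:

```python
from datetime import date, timedelta

def preprocess(records, window_days=365, as_of=date(2019, 6, 21)):
    # Filter records to the predetermined time window (claim 9), then
    # aggregate per plant (claim 3); filtering precedes aggregation (claim 4).
    cutoff = as_of - timedelta(days=window_days)
    recent = [r for r in records if r["date"] >= cutoff]
    summary = {}
    for r in recent:
        stats = summary.setdefault(r["plant"], {"inspections": 0, "failures": 0})
        stats["inspections"] += 1
        stats["failures"] += 1 if r["result"] == "fail" else 0
    return summary

records = [
    {"plant": "A", "date": date(2019, 5, 1), "result": "fail"},
    {"plant": "A", "date": date(2017, 1, 1), "result": "fail"},  # outside the window
    {"plant": "B", "date": date(2019, 2, 10), "result": "pass"},
]
print(preprocess(records))
```

The aggregated per-plant counts would then feed the feature extraction step of claim 1.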
CN201910771217.5A 2019-06-21 2019-08-21 Factory risk estimation using historical inspection data Pending CN112116184A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962864947P 2019-06-21 2019-06-21
US62/864,947 2019-06-21

Publications (1)

Publication Number Publication Date
CN112116184A true CN112116184A (en) 2020-12-22

Family

ID=68162714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910771217.5A Pending CN112116184A (en) 2019-06-21 2019-08-21 Factory risk estimation using historical inspection data

Country Status (3)

Country Link
CN (1) CN112116184A (en)
CA (1) CA3050951A1 (en)
WO (1) WO2020257782A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111024898B (en) * 2019-12-30 2021-07-06 中国科学技术大学 Vehicle exhaust concentration standard exceeding judging method based on Catboost model
CN112561082A (en) * 2020-12-22 2021-03-26 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model
CN113177585B (en) * 2021-04-23 2024-04-05 上海晓途网络科技有限公司 User classification method, device, electronic equipment and storage medium
CN113888019A (en) * 2021-10-22 2022-01-04 山东大学 Personnel dynamic risk assessment method and system based on neural network
CN114066077B (en) * 2021-11-22 2022-09-13 哈尔滨工业大学 Environmental sanitation risk prediction method based on emergency event space warning sign analysis
CN117422306A (en) * 2023-10-30 2024-01-19 广州金财智链数字科技有限公司 Cross-border E-commerce risk control method and system based on dynamic neural network
CN117726181B (en) * 2024-02-06 2024-04-30 山东科技大学 Collaborative fusion and hierarchical prediction method for typical disaster risk heterogeneous information of coal mine

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9671776B1 (en) * 2015-08-20 2017-06-06 Palantir Technologies Inc. Quantifying, tracking, and anticipating risk at a manufacturing facility, taking deviation type and staffing conditions into account
CN108415393A (en) * 2018-04-19 2018-08-17 中江联合(北京)科技有限公司 A kind of GaAs product quality consistency control method and system
CN109492945A (en) * 2018-12-14 2019-03-19 深圳壹账通智能科技有限公司 Business risk identifies monitoring method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9489630B2 (en) * 2014-05-23 2016-11-08 DataRobot, Inc. Systems and techniques for predictive data analytics

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241805A * 2019-07-19 2021-01-19 Inspectorio Co., Ltd. Defect prediction using historical inspection data
CN112598519A (en) * 2020-12-28 2021-04-02 深圳市佑荣信息科技有限公司 System and method for accounts receivable pledge transfer registered property based on NLP technology
CN113822755A (en) * 2021-09-27 2021-12-21 武汉众邦银行股份有限公司 Method for identifying credit risk of individual user by using feature discretization technology
CN113822755B (en) * 2021-09-27 2023-09-05 武汉众邦银行股份有限公司 Identification method of credit risk of individual user by feature discretization technology

Also Published As

Publication number Publication date
WO2020257782A1 (en) 2020-12-24
CA3050951A1 (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN112116184A (en) Factory risk estimation using historical inspection data
CN108564286B (en) Artificial intelligent financial wind-control credit assessment method and system based on big data credit investigation
US8990145B2 (en) Probabilistic data mining model comparison
US20240078475A1 (en) Attributing reasons to predictive model scores with local mutual information
US20230377037A1 (en) Systems and methods for generating gradient-boosted models with improved fairness
AU2020202909A1 (en) Machine learning classification and prediction system
US20190108471A1 (en) Operational process anomaly detection
CN112116185A (en) Test risk estimation using historical test data
Nazari-Shirkouhi et al. A hybrid approach using Z-number DEA model and Artificial Neural Network for Resilient supplier Selection
CN112241805A (en) Defect prediction using historical inspection data
US11526261B1 (en) System and method for aggregating and enriching data
US11062236B2 (en) Self-learning analytical attribute and clustering segmentation system
US20230281563A1 (en) Earning code classification
Stødle et al. Data‐driven predictive modeling in risk assessment: Challenges and directions for proper uncertainty representation
US12008497B2 (en) Demand sensing and forecasting
US20230410208A1 (en) Machine learning-based, predictive, digital underwriting system, digital predictive process and corresponding method thereof
US20210090101A1 (en) Systems and methods for business analytics model scoring and selection
US20210357699A1 (en) Data quality assessment for data analytics
CN115545481A (en) Risk level determination method and device, electronic equipment and storage medium
CN115062687A (en) Enterprise credit monitoring method, device, equipment and storage medium
CN113987351A (en) Artificial intelligence based intelligent recommendation method and device, electronic equipment and medium
US11004156B2 (en) Method and system for predicting and indexing probability of financial stress
CN117036008B (en) Automatic modeling method and system for multi-source data
US11586705B2 (en) Deep contour-correlated forecasting
Rudnichenko et al. Intelligent System for Processing and Forecasting Financial Assets and Risks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40035231

Country of ref document: HK