CN107194490B - Predictive modeling optimization - Google Patents

Predictive modeling optimization

Info

Publication number
CN107194490B
Authority
CN
China
Prior art keywords
data
client
predictive model
data processing
processing operation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611262212.2A
Other languages
Chinese (zh)
Other versions
CN107194490A (en)
Inventor
A.麦克沙恩
J.多恩胡
B.拉米
A.卡米
N.杜利安
A.阿卜杜勒拉赫曼
L.奥洛格姆
F.马利
M.凯雷斯
E.马凯德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Business Objects Software Ltd
Original Assignee
Business Objects Software Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/261,215 (US10789547B2)
Application filed by Business Objects Software Ltd filed Critical Business Objects Software Ltd
Publication of CN107194490A
Application granted
Publication of CN107194490B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201Market modelling; Market analysis; Collecting market data
    • G06Q30/0202Market predictions or forecasting for commercial activities

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Techniques are described for identifying an input training data set stored within an underlying data platform, and sending instructions to the data platform, the instructions executable by the data platform to train a predictive model based on the input training data set by delegating one or more data processing operations to a plurality of nodes on the data platform.

Description

Predictive modeling optimization
Cross Reference to Related Applications
The present application claims priority from U.S. provisional patent application serial No. 62/307,971, entitled "Predictive modeling optimization," and U.S. provisional patent application serial No. 62/307,671, entitled "Unified client for distributed processing platform," both of which were filed on March 14, 2016. Both provisional applications are hereby incorporated by reference in their entirety. This application is related to U.S. patent application No. jj, filed _, entitled "Unified client for distributed processing platform," which is hereby incorporated by reference in its entirety.
Technical Field
The present specification relates to optimizing predictive modeling.
Background
Predictive modeling is the process of analyzing data using statistical and mathematical methods, finding patterns, and generating models that can help predict specific results. For business purposes, predictive models are typically built on samples of historical data and may then be applied to different data sets, typically containing current data or events.
Disclosure of Invention
The innovative aspects of the subject matter described in this specification can be embodied in methods that include the actions of: identifying an input training data set stored within an underlying data platform; and sending instructions to the data platform, the instructions executable by the data platform to train a predictive model based on the input training data set by delegating one or more data processing operations to a plurality of nodes on the data platform. Other embodiments of these aspects include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, configured to perform the actions of the methods.
These and other embodiments may each optionally include one or more of the following features. For example, a predictive model is applied to the business data set to identify one or more outcomes, each outcome associated with a probability of occurrence. The data platform includes an open source cluster computing framework. The open source cluster computing framework includes Apache Spark. The method does not require transferring the input training data set out of the data platform. The one or more processing operations include calculating one or more statistics associated with the input training data set to reduce a number of variables used to generate the predictive model. The one or more processing operations include encoding data of the input training data set, including converting alphanumeric data into numeric data. The one or more processing operations include performing covariance matrix calculations and matrix inversion calculations with respect to the input training data set. The one or more processing operations include slicing the input training data set and scoring the predictive model with respect to the slices. The one or more processing operations include recalculating the one or more statistics based on the one or more results. The one or more processing operations include iteratively evaluating performance of the predictive model based on structural risk minimization.
Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. For example, the learning phase of predictive modeling may generally be reduced to a tenth or less of the time required by conventional learning techniques. Performance and scalability limitations arising from traditional learning techniques may be transferred from a prediction server or desktop computer to a database server or data platform, such as a distributed processing platform (e.g., Apache Hadoop). Embodiments of the subject matter can be introduced to existing predictive modeling software without major architectural changes. Data transmission requirements may be reduced or eliminated compared to conventional learning techniques; consequently, training may be performed on larger data sets and the solution may be extended to big data. The optimized training process also enables scalability to wider data sets (e.g., resulting from the data preparation phase). For example, a training data set with 50,000 columns may be employed in an embodiment to train a predictive model.
Moreover, training of traditional models is typically performed on the client side, requiring large data sets to be communicated from the data store to the client and thus consuming a large amount of network bandwidth. In some embodiments, at least some of the processing is performed on a distributed processing platform (e.g., a Hadoop cluster) and some is performed by a client application (e.g., a modeler), thus avoiding the network bandwidth that would be required to transfer large data sets to the client application and run modeling jobs solely on the client side. In some instances, more data intensive and/or processing intensive steps may be performed on the cluster to take advantage of the cluster's greater processing power. Also, because the cluster may be closer to the data store in the network topology, performing the more data-intensive operations on the cluster may avoid consuming network bandwidth that would otherwise be spent communicating large amounts of data back and forth between the data store and the modeler, as may occur using conventional training techniques. Embodiments may also provide security advantages in that analysis within a database (e.g., on a cluster) may avoid communicating data over potentially insecure communication channels. Moreover, sensitive and/or private data, such as Personally Identifiable Information (PII), may be more securely processed on the cluster than on other systems.
Embodiments also provide further advantages with respect to machine learning that may be employed in predictive modeling. For example, at least some of the more complex and/or processing intensive internal steps used in machine learning, such as encoding and/or other data preparation operations, may be performed without any user interaction, e.g., these steps may be hidden from the end user. Embodiments may also employ one or more optimizations that may be implemented lazily. Such optimizations may include reducing the dimensionality of the data set being analyzed to provide high performance of the modeler. Although a simpler model may fit a particular training set less closely, simpler models (e.g., with reduced dimensionality) are generally more useful and robust in processing new data, in accordance with the principles of Structural Risk Minimization (SRM).
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
Drawings
Figs. 1 and 2 depict example environments for in-database modeling.
Figs. 3A-3D depict example process flows for modeling within a database.
FIG. 4 depicts an example process for modeling within a database.
FIG. 5 depicts an example computing system that may be used to implement the techniques described herein.
Fig. 6 depicts an example system including a unified client for a distributed processing platform according to an embodiment of the present disclosure.
Fig. 7A depicts an example system including an application employing a unified client according to an embodiment of the present disclosure.
Fig. 7B depicts an example flow diagram of a process for employing a unified client for data processing in accordance with an embodiment of the present disclosure.
Fig. 8 depicts an example class diagram according to an embodiment of the present disclosure.
Detailed Description
There are many different approaches to predictive modeling. For example, regression models predicted values, while classification distinguishes hidden groups in the data. Furthermore, there are a number of machine learning algorithms, techniques, and implementations that range from off-the-shelf methods (e.g., the k-means algorithm in R) to proprietary methods. In particular, proprietary methods can utilize machine learning techniques such as Vapnik-Chervonenkis theory and structural risk minimization to build better quality and more universally applicable models. The quality and robustness of a model can be analyzed based on: i) quality, e.g., how well the model describes the existing data, which is achieved by minimizing the empirically-defined error; and ii) reliability or robustness, e.g., how well the model will predict when applied to new data, which is achieved by minimizing unreliability. Conventional predictive modeling solutions rely on database connections, such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC), to connect to a relational database management system (RDBMS), pull data back into memory, and then process the data.
As such, predictive modeling can be data intensive. In particular, the data preparation phase and the learning (training) phase may require many scans of the same data and many calculations for each individual input parameter. For example, a cross-statistics step in the algorithm may require computing statistics for each input variable and each target variable. As shown in the table below, for an input data set with N input variables, T target variables, and R rows, the cross-statistics calculation is performed N × T × R times.
| Row number | Input variable 1 | Input variable 2 | Input variable 3 | Input variable N | Target variable 1 | Target variable 2 |
|---|---|---|---|---|---|---|
| 1 | A | 12 | 76.2 | Complete | 99.67 | Product D |
| 2 | R | 87 | 98.2 | In preparation | 142.32 | Product X |
| R | B | 4 | 62.5 | Complete | 150.1 | Product A |
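For example, with N = 200 input variables, T = 2 target variables, and R = 1,000,000 rows, the cross-statistics step would perform 200 × 2 × 1,000,000 = 400 million calculations.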
Traditional architectural designs utilize a layered approach where the data source sits at one level and the data processing at another architectural level. This separation may also be represented as a landscape where the data resides in a database (a database server computer or server cluster) and the data is processed on a separate machine (e.g., a server or desktop computer). In some examples, communication between layers is via SQL, and connectivity is enabled using technologies such as JDBC and ODBC. However, when this architecture is applied to predictive modeling software, it introduces performance and scalability limitations, as the entire training data set needs to be transmitted from the database across the network to a different machine for processing. Depending on the algorithm or method employed, the performance penalty of transmitting the full training data set may be incurred multiple times during the learning (training) phase. Still further, in some examples, architectures relying on data transport may limit performance and scalability when data processing occurs on hardware (such as a user's desktop computer or a single server computer) that typically has less processing power than a database server/cluster or an Apache Hadoop cluster. Furthermore, the data transmission approach may not scale well with increasing throughput requirements (e.g., the number of models to build in a day and the number of users for whom the system builds models).
Embodiments provide automated predictive modeling within a database that overcomes, or at least mitigates, the shortcomings of conventional architectural designs. The modeling may be performed in a big data environment to overcome performance and scalability limitations of modeling within traditional architectures, such as the limitations described above. Traditional modeling may be performed on the client side, thus requiring large data sets to be communicated from the data store to the client and consuming a large amount of network bandwidth. In some embodiments, at least some of the processing is performed on the cluster, and some is performed by a client application (e.g., a modeler), thus avoiding the network bandwidth that would be required to transfer large data sets to the client application and run modeling jobs solely on the client side. In some instances, more data intensive and/or processing intensive steps may be performed on the cluster to take advantage of the cluster's greater processing power. Moreover, because the cluster may be closer to the data storage device in the network topology, having the cluster perform the more data intensive operations may avoid consuming network bandwidth that would otherwise be spent communicating large amounts of data back and forth between the data storage device and the modeler. As described herein, in-database modeling may be modeling performed at least in part in a cluster (e.g., a distributed processing platform) that also stores the data being analyzed. Thus, modeling within a database may provide a security advantage, given that analysis within the database may avoid communicating data over communication channels that may not be secure. Also, sensitive and/or private data such as Personally Identifiable Information (PII) may be more securely processed on the cluster than on other systems.
Modeling in a database
FIG. 1 illustrates an example environment 100 for modeling within a database. In particular, environment 100 includes a server computing system 102 and a data platform 104. The server computing system 102 may include one or more computing systems, including a cluster of computing systems. The data platform 104 may include one or more computing systems (e.g., nodes), including a plurality of user-based computing systems. Server computing system 102 may include an automated modeler 106, the automated modeler 106 including a modeling service 108. The data platform 104 may include an RDBMS 110, one or more Structured Query Language (SQL) engines 112, and a data repository 114. The engine 112 may be described as a big data SQL engine. In some examples, the engine 112 may include Apache Spark or Apache Hive. Although embodiments of the present disclosure are discussed herein with reference to the data platform 104 as an example distributed processing platform (e.g., the Hadoop framework developed by the Apache Software Foundation), it is contemplated that embodiments of the present disclosure may be implemented using any suitable distributed processing platform. Although the server computing system 102 is described as a server, the system 102 and/or the modeling service 108 may act as a client when interacting with the data platform 104.
FIG. 2 illustrates an example environment 200 for in-database modeling, similar to environment 100. The environment 200 includes an automated analysis module 202 and a cluster 204. Cluster 204 may contain a distributed processing platform for data processing. In some embodiments, cluster 204 is an Apache Hadoop cluster. The automated analysis module 202 includes a modeler 206. In some embodiments, modeler 206 is a C++ modeler. Modeler 206 may include a connection module 208 and a driver 210. In some embodiments, the connection module 208 is an ODBC connection module. In some embodiments, the driver 210 is a Spark Driver (JNI) module. In some examples, cluster 204 includes a data warehouse 212, a cluster manager 214, a module 216 associated with native modeling steps, and a distributed file system 218. In some embodiments, the data warehouse 212 is an Apache Hive data warehouse. The connection module 208 may establish a connection (e.g., an ODBC connection) to the data warehouse 212. In some embodiments, cluster manager 214 is a YARN cluster manager. Driver 210 may create a (e.g., YARN) connection to cluster manager 214. In some embodiments, module 216 is an Apache Spark module and the associated modeling steps are native Spark modeling steps. In some embodiments, the file system 218 is an Apache Hadoop Distributed File System (HDFS). In some embodiments, the automated analysis module 202 is in communication with the cluster 204. In particular, the connection module 208 communicates with the (e.g., Apache Hive) data warehouse 212, and the (e.g., Spark) driver 210 communicates with the (e.g., YARN) cluster manager 214. The input training data set (e.g., a business data set) may be transmitted over one or both of the connections established by the connection module 208 and/or driver 210. Still further, the data warehouse 212 and the module 216 may communicate with the distributed file system 218, e.g., for in-database modeling. In some embodiments, communication between the cluster 204 and the automated analysis module 202 may employ a unified client, as described below.
The analysis module 202 may use an ODBC connection to interact with the (e.g., Hive) data warehouse 212 to retrieve result sets of processing performed on the cluster 204 by the native modeling step(s) (e.g., Spark job(s)). The YARN connection may be used to request that a job be run on the cluster 204, e.g., via a native modeling step. The results of the native modeling step (e.g., Spark job(s)) may be written to the file system 218 (e.g., HDFS). In some instances, the results may be copied from the file system 218 to the data warehouse 212 to be accessible to the automated analysis module 202 via the unified client.
In some examples, in-database modeling performed by environment 100 may be associated with an approach of performing data processing in proximity to the data source. In some examples, in-database modeling of environment 100 is associated with the use of in-database processing for predictive modeling. Predictive modeling may include generating database-specific code (e.g., SQL or stored procedures) to delegate a modeling process (e.g., a modeling process within the environment 100) in a language optimized for the data platform 104.
In some examples, the in-database modeling associated with environment 100 may include a data preparation phase, a learning (training) phase, a scoring phase, and/or a retraining phase. The data preparation phase is associated with cleansing the data and processing outliers in the data. The data preparation phase may also involve increasing the number of input variables using data manipulation (e.g., by using SQL window functions) to help discover patterns in the data, for example, finding patterns of purchasing behavior within a month rather than patterns on the order of minutes. The learning (training) phase is associated with the application of algorithms and techniques to the input training data set. In some examples, the process of building the model may be iterative in order to identify a suitable model. This can be performed by software using business domain knowledge or by manually changing the model inputs. Furthermore, the learning (training) phase may be associated with concepts such as overfitting and robustness. Still further, the results of the modeling may include outputs that may be used in the scoring phase. The scoring phase is associated with the application of the trained model. The model may be embedded into a business application or used as a microservice to predict results for a given input. The retraining phase is associated with ensuring that existing models remain accurate and provide accurate predictions for new data, including model comparisons and re-triggering of the learning process to account for more recent data.
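As an illustration of the data manipulation described above, the following is a minimal Spark sketch, assuming a hypothetical transactions table whose name and columns are illustrative rather than taken from the patent; a SQL window function derives a monthly purchase-count input variable.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("PrepSketch").getOrCreate()
import spark.implicits._

// Hypothetical transactions table (names assumed for illustration).
Seq(("c1", "2016-03-02", 10.0), ("c1", "2016-03-20", 25.0), ("c2", "2016-03-05", 7.5))
  .toDF("customer_id", "purchase_date", "amount")
  .createOrReplaceTempView("transactions")

// A SQL window function adds a monthly purchase-count input variable,
// exposing month-level purchasing patterns rather than minute-level ones.
val enriched = spark.sql("""
  SELECT customer_id, purchase_date, amount,
         COUNT(*) OVER (PARTITION BY customer_id,
                        substr(purchase_date, 1, 7)) AS purchases_in_month
  FROM transactions""")
enriched.show()
```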
Performance characteristics of modeling within a database
In some embodiments, the data preparation phase of modeling within the database may increase the number of input variables to produce a statistically more robust model with better lift. For example, increasing the number of input variables (e.g., columns) of a data source by a factor of 10, from 200 to 2,000 variables, may be used to discover patterns within time windows of minutes or days. In some examples, data manipulation functions may use SQL processing in the modeling software to generate the additional input variables. This, in turn, increases the size of the data and the performance requirements of the learning process.
In some embodiments, automated machine learning methods that require minimal input from users and minimal machine learning knowledge during the model learning/training phase, such as structural risk minimization, enable scalability for higher throughput and an overall simpler process, allowing more roles in the enterprise to use predictive modeling. The results of the automated model building process may use quantitative measures to indicate model quality (error) and robustness (for new data sets) to help the user find the best model.
In some embodiments, the in-database modeling approach provides for delegating the data intensive steps of the predictive modeling process to an underlying data platform, such as an Apache Hadoop cluster and/or database. The data intensive steps are mainly those that would otherwise require transmission of the full training data set. In some embodiments, the in-database modeling approach minimizes the number of processing steps, including reusing results from the learning (training) phase in the data source (e.g., the underlying data platform). This reduces the processing cost of recalculation in subsequent steps. For example, the results of a processing step may be cached (stored in a temporary table) for later reuse.
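A minimal sketch of this caching idea, continuing the hypothetical Spark session from the sketch above (the table and view names are assumptions):

```scala
// Cache an intermediate result and register it as a temporary table so
// that later steps reuse it instead of recomputing it.
val crossStats = spark.table("transactions").groupBy("customer_id").count()
crossStats.cache()                                  // keep in executor memory
crossStats.createOrReplaceTempView("cross_stats")   // "temporary table" for reuse
spark.table("cross_stats").show()                   // a later step reads the cache
```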
In some embodiments, parameters associated with the data source (e.g., the data platform 104 and/or the data repository 114) may be used to assist modeling within the database. In some examples, the database platform associated with the data platform 104 may include a native low-level language library (e.g., in C++) whose functionality may be used to support modeling within the database. For example, as described further below, the covariance matrix calculation step, when run on a (e.g., big data) data source, can be delegated to the Apache Spark MLlib (machine learning library). Still further, in some examples, the RDBMS 110, such as Teradata, includes functionality to optimize matrix calculations.
In some embodiments, the steps of modeling within a database may be recorded, including each step's runtime, CPU usage, and memory footprint, to enable performance tuning of the in-database modeling. In some embodiments, the in-database modeling may be transparent to end users utilizing existing software, thus providing for the use of the same (or a similar) user interface and database connections.
In some embodiments, configuration settings may be used to further tune the operation of various modeling steps in the data source to further aid performance. For example, when a modeling step is delegated to Apache Spark, the number of Spark executors, the number of cores, and the memory allocated can be fine-tuned.
Process flow modeling within a database
Linear or polynomial regression analysis can be used to estimate the relationships between variables and form the basis for regression and classification model building. The linear regression model is represented in the following form:
Y = b0 + b1X1 + b2X2 + b3X3 + …
where X1, X2, X3, … are predictor variables (features) and Y is a target variable.
A linear regression model is defined when the coefficients (b1, b2, b3, …) and the intercept (b0) corresponding to each variable are known.
FIG. 3A illustrates an example process flow 300 of in-database modeling, e.g., as performed by environment 100 and/or environment 200. In step 302, data preparation and cross-statistics calculation of the data are performed. For example, data manipulation is applied to increase the number of input variables, typically using SQL. Still further, the data manipulation may include combining input variables. For example, the variables "age" and "marital status" may be combined because, together, they may have a greater predictive impact on the target variable "salary".
Data preparation may further include slicing the data so that a model derived from one slice may be compared to another slice as part of the learning (training) phase to check robustness. Data preparation may further include processing data outliers such as "null" values in the database. In some examples, such values may be retained and categorized. Data preparation may further include variable binning to reduce the number of discrete values associated with the data and to group similar or related values together (e.g., into bins). Cross-statistics computation may include computing statistics such as the count and distribution of particular input variable values with respect to each target variable. This can be used to assist the variable reduction process in reducing the number of input variables.
In step 304, data encoding is performed. Specifically, data encoding converts alphanumeric data into numbers. For example, a sample SQL formula that encodes the "AGE" variable is (AGE - AVG(AGE)) / SQRT(VAR(AGE)).
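A minimal sketch of this encoding step, assuming a Spark DataFrame with an AGE column; it mirrors the SQL formula above but is illustrative rather than the patented implementation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, var_samp}

val spark = SparkSession.builder().master("local[*]").appName("EncodeSketch").getOrCreate()
import spark.implicits._

// Toy input; in the patent's setting this would be the training table.
val df = Seq(23, 35, 41, 52, 29).toDF("AGE")

// Mirror the SQL formula (AGE - AVG(AGE)) / SQRT(VAR(AGE)).
val stats = df.agg(avg(col("AGE")).as("m"), var_samp(col("AGE")).as("v")).first()
val (m, v) = (stats.getDouble(0), stats.getDouble(1))

val encoded = df.withColumn("AGE_encoded", (col("AGE") - m) / math.sqrt(v))
encoded.show()
```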
In step 306, covariance matrix calculation is performed. The covariance matrix is a matrix whose element at position (i, j) is the covariance between the i-th and j-th variables. For example, the covariance between variable X1 and variable X2 is defined as:

cov(X1, X2) = E[(X1 - E[X1])(X2 - E[X2])]
Further, a matrix inversion calculation is performed. Specifically, the coefficients may be calculated using the following formula:

β̂ = (Z′Z)⁻¹ Z′y

where C = Z′Z is the covariance matrix of all predictors, β̂ is the vector of coefficients (b1, b2, …), and Z′ represents the transpose of matrix Z. The constant term b0 is the difference between the mean of y and the mean of the predictions obtained from the estimate Xβ̂.
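The text notes above that the covariance step can be delegated to Spark MLlib. A hedged sketch under that assumption: compute the covariance matrix with MLlib's RowMatrix and invert it locally with Breeze. The toy row values and the predictor/target covariance vector zy are assumptions, not data from the patent.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import breeze.linalg.{inv, DenseMatrix, DenseVector}

val spark = SparkSession.builder().master("local[*]").appName("CovSketch").getOrCreate()

// Toy encoded predictor rows (the output of the encoding step 304).
val rows = spark.sparkContext.parallelize(Seq(
  Array(1.0, 2.0, 3.0), Array(2.0, 1.0, 1.0), Array(4.0, 3.0, 2.0),
  Array(3.0, 5.0, 1.0), Array(5.0, 4.0, 6.0)))

// Step 306: the covariance matrix C is computed in a distributed fashion.
val mat = new RowMatrix(rows.map(r => Vectors.dense(r)))
val cov = mat.computeCovariance()

// The k x k covariance matrix is small, so the inversion can run locally.
val c = new DenseMatrix(cov.numRows, cov.numCols, cov.toArray)
// zy stands in for the vector of predictor/target covariances (assumed).
val zy = DenseVector(0.9, 0.4, 0.1)
val coefficients = inv(c) * zy  // analogous to the matrix formula above
println(coefficients)
```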
In step 308, scoring of the predictive model is performed against the data slices previously generated, to check the robustness of the predictive model. In step 310, recalculation of the cross statistics using the predicted values is performed. In step 312, a performance comparison is performed. In particular, the performance of the predictive model is iteratively evaluated based on structural risk minimization. In some embodiments, the results of the processing steps may be cached (stored in a temporary table) for later reuse and/or use by other steps. As shown in the example of FIG. 3A, a (e.g., customized) cache may enable results to be shared among the various processing steps. Although the example of FIG. 3A describes the use of ODBC, JSON, SQL, and HDFS for data connectivity, data formats, query languages, and file systems, respectively, embodiments support the use of other technologies, protocols, and/or formats. Optionally, the data processing steps may be performed in parallel on the cluster, as in the example of steps 310 and 312 shown in FIG. 3A. For example, multiple Spark jobs may be run in parallel by multiple Spark instances running within the cluster.
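A minimal sketch of such parallelism from the driver side, continuing the hypothetical session above: Spark actions submitted from separate threads run as concurrent jobs. The step functions are stand-ins for steps 310 and 312, not the patented implementation.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Stand-ins for the recalculation (310) and performance (312) steps.
def recomputeCrossStatistics(): Long = spark.table("cross_stats").count()
def evaluatePerformance(): Long = spark.table("transactions").count()

// Actions submitted from separate threads become concurrent Spark jobs
// within a single SparkContext (e.g., under the FAIR scheduler).
val combined = Future(recomputeCrossStatistics()).zip(Future(evaluatePerformance()))
val (statsCount, perfCount) = Await.result(combined, Duration.Inf)
```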
Figs. 3B through 3D illustrate example process flows for modeling within a database. In these examples, at least a portion of the data processing is performed on the client side (e.g., in an application or other client process separate from the cluster). For example, at least a portion of this processing may be performed by the automated analysis modeler 202. In the examples of FIGS. 3B-3D, the automated analysis modeler 202 is a C++ modeler. In some embodiments, modeler 202 may utilize a unified client to interact with a cluster (e.g., with a distributed processing platform such as a Hadoop cluster). The operation of the unified client with respect to the cluster is described further below.
Modeler 202 may use a unified client to request various jobs to run on the cluster, serially or in parallel. In the examples of figs. 3B to 3D, the jobs are Spark jobs. These jobs may be requested by modeler 202 through a unified client that includes a Spark client as a sub-client, as described below. Other types of jobs may also be run to perform various data processing steps. In some embodiments, the results of the various steps may be stored in the data warehouse 212, and the modeler 202 may retrieve the results from the data warehouse 212. In the examples of figs. 3B through 3D, the data warehouse 212 is a Hive data warehouse. Embodiments also support the use of other types of data warehouses.
As shown in fig. 3B, modeler 202 may request (e.g., trigger) a Spark job via the (e.g., YARN) driver 210, and a Spark job 314 (e.g., cross statistics) may be run on the cluster. The job results may be written to the (e.g., Hive) data warehouse 212, and the modeler 202 may read the results from the data warehouse 212. Further processing may be performed thereafter.
As shown in FIG. 3C, further processing may include any suitable number of job types running on the cluster. As shown in the example, the jobs may include a job 316 for encoding data, a job 318 for matrix processing (e.g., using MLlib), a job 320 for scoring a formula, another job 322 for cross statistics, and a job 324 for performance. Other types of jobs are also supported by the embodiments. After each job, the results of the data processing step may be written to the data warehouse 212. Modeler 202 may retrieve the results from the data warehouse 212, perform some local processing, and determine another job to execute on the cluster based on the results of the local processing. In this manner, modeler 202 may optionally perform local data processing while performing certain data processing steps using the cluster. In some embodiments, a (e.g., customized) cache may be used to share results between jobs running on the cluster, as described with reference to fig. 3A. In some embodiments, the cache is a workspace used by a unified client, as described below.
In some embodiments, a flexible configuration may be used to specify jobs to be run on the cluster. FIG. 3D illustrates an example of metadata in JSON format that may be used to configure an example Spark job. Other file formats may also be used to configure jobs. In some embodiments, the format and/or schema of the metadata is flexible and/or generic across multiple jobs, or across all jobs. Thus, new jobs may reuse the same schema and/or the same type of schema.
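FIG. 3D itself is not reproduced here; the following is a hypothetical sketch of what such generic job metadata might look like, with every field name an assumption rather than a value taken from the figure (the sparkConf keys are standard Spark properties):

```json
{
  "jobName": "cross_statistics",
  "jarPath": "hdfs:///jobs/modeling-steps.jar",
  "mainClass": "com.example.modeling.CrossStatistics",
  "input":  { "hiveTable": "training_data" },
  "output": { "hiveTable": "cross_stats_result" },
  "sparkConf": {
    "spark.executor.instances": "8",
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g"
  }
}
```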
Process for modeling within a database
FIG. 4 illustrates an example process 400 for modeling within a database. Process 400 may, for example, be performed by environment 100 and/or environment 200 or other data processing apparatus. Process 400 may also be implemented as instructions stored on a computer storage medium, and execution of the instructions by one or more data processing apparatus causes the one or more data processing apparatus to perform some or all of the operations of process 400.
An input training data set stored within the underlying data platform is identified (402). An instruction is sent to the data platform, the instruction executable by the data platform to train a predictive model based on the input training data set by delegating one or more data processing operations to a plurality of nodes on the data platform (404). In some embodiments, the instructions may specify data processing jobs to be executed on the cluster 204 to train or otherwise determine the predictive model, as in the examples of FIGS. 3A-3D. The result set(s) for the job(s) may be retrieved from the data warehouse 212 (406). In some examples, local processing (e.g., on the client-side modeler) may be performed based at least in part on the retrieved result set(s) (408). A determination may be made as to whether additional processing jobs are to be performed to determine the predictive model (410). If so, the process may return to 404 and another instruction set may be sent to request that a job be run on cluster 204, and/or additional local processing may be performed. If no additional processing is to be performed to determine the predictive model, the predictive model may be provided (412). The predictive model may be applied to a data set (e.g., a business data set) to make predictions about the data, e.g., to identify outcome(s) associated with probabilities of subsequent occurrence of particular data values in the data set (414).
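A minimal self-contained sketch of this client-side loop; every type and function name below is a hypothetical stand-in for the components described in process 400, not a published API.

```scala
// Hypothetical stand-ins for the components of process 400 (all names assumed).
case class JobSpec(name: String, inputTable: String)
case class StepResult(done: Boolean, model: String)

def submitToCluster(job: JobSpec): Unit = println(s"running ${job.name}")  // 404
def fetchFromWarehouse(job: JobSpec): Seq[String] = Seq("stats")           // 406
def processLocally(rows: Seq[String]): StepResult =
  StepResult(done = true, model = "m1")                                    // 408
def planNextJob(result: StepResult): Option[JobSpec] =
  if (result.done) None else Some(JobSpec("next_step", "tmp_table"))       // 410

var next: Option[JobSpec] = Some(JobSpec("cross_statistics", "training_data")) // 402
var model: Option[String] = None
while (next.isDefined) {
  val job = next.get
  submitToCluster(job)                        // run the job on the cluster
  val results = fetchFromWarehouse(job)       // read result set from the warehouse
  val state = processLocally(results)         // optional client-side processing
  next = planNextJob(state)                   // decide whether more jobs are needed
  if (next.isEmpty) model = Some(state.model) // 412: provide the predictive model
}
```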
Although fig. 4 depicts an example in which processes are performed in a particular order (e.g., jobs run first on the cluster, followed by local processing) and sequentially, embodiments are not limited thereto. Embodiments support modeling that includes any number of data processing steps (jobs) that are executed on the cluster 204 or locally at the automated analysis module 202, and that may be executed sequentially or in parallel.
Unified client
In a distributed processing platform, such as the one used to perform the modeling described herein, large data sets may be stored and processed in batch mode. In the example of Hadoop, the Hadoop ecosystem initially included MapReduce and the Hadoop Distributed File System (HDFS), and has evolved over time to support other processing engines (e.g., Hive, Impala, Spark, Tez, etc.), other languages (e.g., PIG, HQL, HiveQL, SQL, etc.), and other storage schemas (e.g., Parquet, etc.). In particular, the addition of the Spark engine significantly increased the distributed processing efficiency of Hadoop over previous versions that supported the MapReduce architecture but not Spark. The Spark engine can handle complex processes with many underlying iterations, such as those used in machine learning.
By supporting a technology "zoo" with many different processing engines, languages, and storage schemas, distributed processing platforms present an engineering challenge as organizations attempt to integrate the platform into specific organizational contexts and/or workflows. For example, an information technology group in a business may want to produce an optimal data processing solution suited to the business' specific needs, and to do so they can utilize and/or combine the different technologies supported by the platform. The disparate technologies supported by the platform may complement each other and/or may operate concurrently with each other. Traditionally, a large amount of ad hoc and/or proprietary code would need to be written in order for an application to combine and/or coordinate the operation of the multiple technologies supported by the platform. Such code can be difficult to maintain across versions of an application as the design and/or logic of the application changes. Embodiments provide a unified client that acts as a single interface for interacting with all subsystems supported by the distributed processing platform, and facilitates the consumption of the wide variety of services provided by the distributed processing platform. By combining the different subsystems in a single session, the unified client also operates to overcome individual limitations (e.g., performance limitations, processing capacity, etc.) inherent in each subsystem and/or technology of a distributed processing platform.
Spark technology has been designed to support long-running jobs in batch mode. Spark supports job execution through shell scripts (e.g., spark-submit). The configuration of shell scripts presents challenges in its own right. For example, shell scripts impose a number of script arguments and prerequisites, such as the client-side Hadoop XML configuration and the presence of specific Hadoop environment variables.
From the perspective of a client application, utilizing Spark may be difficult for various reasons. For example, Spark is difficult to embed in application runtime landscapes. The traditional way of submitting a Spark job involves creating a custom command line and running it in a separate process. Moreover, a Spark job is traditionally stand-alone and runs in one shot; it is not possible to return to the client workflow (e.g., to take an intermediate step) and then continue the Spark job from the point where it was interrupted. Thus, Spark cannot easily be used in an interactive and/or stateful manner in conventional platforms. Also, the Spark connection description conventionally does not exist as a separate concept. Instead, the Spark interface handles Spark job submissions whose configuration includes connection-related information along with other parameters. Furthermore, Spark traditionally does not provide a type of connection repository comparable to the connection repositories found in the context of an RDBMS. For at least these reasons, in conventional solutions the Spark interface is difficult to embed, difficult to configure, and may only handle job runs in batch mode, precluding intermediate interactions with the client application.
To alleviate, and in some instances eliminate, the limitations listed above with respect to existing disparate interfaces in a distributed processing platform, embodiments provide for enhanced service consumption in a distributed processing platform. In particular, embodiments provide an embeddable Spark client (e.g., driver) so that the Spark driver can be loaded into an application process, even a non-JVM process. In some embodiments, the Spark runtime is based on byte code and the Spark client may be configurable at runtime. The Spark driver may consume predefined Spark connection descriptors that are persisted in a dedicated repository to simplify connection configuration. Spark job runtimes can be specific to each application domain. Spark job runtimes may be stored in a dedicated repository and may be deployable to (e.g., Hadoop) clusters at runtime. In some embodiments, the Spark client provides interactive and/or stateful connections. Spark connections may be established to enable submission of successive jobs with intermediate states maintained in a virtual workspace. Internally, the Spark connection may correspond to a SparkContext instance.
In some embodiments, at least some (or all) of the Hadoop-specific client interfaces may be merged into a single-point client component as a unified client. The unified client enables seamless association of various services (e.g., Hive, SparkSQL, Spark, MapReduce, etc.) to enable complex and/or heterogeneous data processing chains. Via the unified client, the Spark driver may be aligned with other drivers (e.g., Hive client, HDFS client, etc.) at the same technology feature level.
Fig. 6 depicts an example system including a unified client for a distributed processing platform according to embodiments of the present disclosure. As shown in the example of fig. 6, the system may include one or more distributed systems 602 in a distributed processing platform. In some examples, distributed system(s) 602 include Hadoop system(s). Embodiments also support other types of distributed system(s) 602. Distributed system(s) 602 may include subsystems and/or engines, such as MapReduce 606, Hive engine 608, Spark engine 610, SparkSQL 612, and storage 614 (e.g., HDFS).
The system can include a unified client 604. The unified client 604 can include sub-clients, such as a MapReduce client 616, a Hive client 618, a Spark client 620, a SparkSQL client 622, and/or a storage client 624. The unified client 604 may also include any other suitable type of sub-client, such as a Simple Concurrent Object Oriented Programming (SCOOP) client. The sub-clients may also include HDFS clients. In some embodiments, the sub-clients may include one or more other (e.g., generic) SQL clients to support SQL implementation(s) other than SparkSQL, such as Cloudera Impala™. Each of the various sub-clients of the unified client 604 may be configured to interface with a respective subsystem of the distributed system(s) 602. For example, the MapReduce client 616 may be configured to interface with MapReduce 606, the Hive client 618 with the Hive engine 608, the Spark client 620 with the Spark engine 610, the SparkSQL client 622 with SparkSQL 612, and the storage client 624 with storage 614.
In some embodiments, the Spark client 620 may access a Spark job repository 626. The unified client 604 may access and employ a data workspace 628 and/or unified metadata 630 (e.g., tables, RDDs, and/or file schemas). In some embodiments, the unified client 604 may access a unified connection repository 632. The unified connection repository 632 may include one or more of Hive connections 634 (e.g., employing ODBC and/or JDBC), SparkSQL connections 636 (e.g., employing ODBC and/or JDBC), native Spark connections 638, and/or native HDFS connections 640. In some instances, there may be a pairing between a SparkSQL connection 636 and a native Spark connection 638. In some instances, there may be a pairing between a native Spark connection 638 and a native HDFS connection 640.
The unified connection repository 632 may also be described as a connection metadata repository. The unified connection repository 632 may store metadata indicating pairings between different connections (e.g., connections of different types). A pairing may enable an interface between different sub-clients, such as the MapReduce client 616, Hive client 618, Spark client 620, SparkSQL client 622, or storage client 624. During a particular unified session, an application may invoke a number of different sub-clients and may receive and/or transmit data via the various sub-clients. Connection pairings defined at the metadata level in the unified connection repository 632 enable the combination of sub-clients for a particular unified session. Defining connection pairings at the metadata level also enables switching between the sub-clients used during a session. For example, a session may be initiated using one sub-client (e.g., the SparkSQL client), and, using the same unified session, the initial sub-client may be associated (e.g., linked) with one or more other sub-clients that may also be used. Switching between sub-clients may be performed lazily because each sub-client shares a minimal common interface and thus becomes interoperable. For example, the Spark sub-client may interoperate with the Hive SQL sub-client or the HDFS client. The actual selection of a sub-client may be determined at run-time by the specific session configuration. The association (e.g., linking) between the sub-clients may be performed in a seamless manner without additional authorization or authentication of client credentials. Authentication may be handled by a "single sign-on" method (e.g., using Kerberos) that can authenticate a unified client session once so that it is used across all sub-clients. In some embodiments, the metadata and/or data flowing from a given step in the chain may not be persisted and may instead be sent to the next sub-client in the processing chain. Embodiments enable different sub-client interfaces to be combined in a seamless manner for use during a unified session. Each sub-client may be attached to a common interface, and interoperability may thus be provided between the sub-clients. This is further described with reference to fig. 8.
Fig. 8 depicts an example class diagram 800 according to an embodiment of the present disclosure. In some embodiments, the unified client interface may be implemented according to the class diagram 800. In the example, class diagram 800 includes a hierarchical arrangement of classes 802, 804, 806, 808, 810, 812, and 814. As shown in the example, each class may include various member methods and member fields. For example, the UnifiedConnection class 802 includes the member methods subConnectionList() and createWorkspace(). In some examples, each job addresses a specific sub-client, such as SparkSQL or HDFS. Each job, such as an instance of the HDFSJob class 808, SQLJob class 810, SparkJob class 812, and/or MapReduceJob class 814, may implement the interface AbstractClient 806. The following is an example flow of commands through such an embodiment. 1) The UnifiedConnection 802 may be instantiated. 2) A stateful instance of the Workspace class 804 may be created, in which staging data may reside. 3) Jobs may be added to the workspace. In some examples, the JSON may include input and output parameters that may refer to existing results. 4) Job compilation may be triggered (e.g., to build a job graph based on topological dependencies). In some instances, the system may confirm that the job graph is well formed. 5) The job plan may run within the unified connection context. Intermediate and/or temporary data may be stored within the workspace. In the example of fig. 8, "subConnectionId," "ApplicationRuntimeId," and/or "MapReduceRuntimeId" may refer to a unified client repository in which connections are predefined and/or in which Spark or MapReduce runtimes are stored.
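Following the command flow 1) through 5) above, a hedged usage sketch; only subConnectionList() and createWorkspace() appear in the class diagram as described, so every other method name and body below is an illustrative assumption.

```scala
// Minimal stubs mirroring the class diagram; beyond subConnectionList() and
// createWorkspace(), the method names are illustrative assumptions.
class JobPlan { def run(): Unit = println("running job graph") }       // step 5
class Workspace {                                                       // step 2
  private var jobs = Vector.empty[String]
  def addJob(jobJson: String): Unit = jobs :+= jobJson                  // step 3
  def compile(): JobPlan = { require(jobs.nonEmpty); new JobPlan }      // step 4
}
class UnifiedConnection(connectionId: String) {                         // step 1
  def subConnectionList(): Seq[String] = Seq("hive", "spark", "hdfs")
  def createWorkspace(): Workspace = new Workspace
}

val ws = new UnifiedConnection("conn-1").createWorkspace()
ws.addJob("""{"type":"SparkJob","name":"train"}""")  // JSON with in/out parameters
ws.compile().run()
```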
Referring back to fig. 6, the linking of sub-clients may include receiving data at a first sub-client, which then provides the data for processing by a second sub-client. Although the examples herein may describe linking two sub-clients together during a unified session, embodiments enable linking any suitable number of sub-clients to process data sequentially. The linking of sub-clients may be serial, where data is passed from one sub-client to another sub-client and then to yet another sub-client, and so on. Linking may also enable parallel processing, where multiple sub-clients process the same data at least partially contemporaneously. Linking may involve branching, where processing is performed in parallel in multiple sub-clients and/or multiple sub-client chains. Chaining may also include the merging and/or rejoining of branched chains for further processing.
Pairing of connections may occur at runtime and may be based on a first connection being paired with a connection directed to a second (e.g., Hadoop) subsystem, e.g., one served by a different sub-client than that used for the first connection. Embodiments provide a unified client for combining different types of data processing technologies, e.g., corresponding to different sub-clients, to provide a data processing solution that is more feature rich than traditional solutions. Through the unified client, embodiments also provide a solution that enables greater flexibility in data processing by leveraging multiple capabilities of the (e.g., Hadoop) platform.
Unified connection repository 632 may store metadata for one or more interface-specific connections. In some instances, connections may be paired with each other only if such connections point to the same subsystem of distributed system(s) 602. In some examples, the native Spark connection description includes at least an XML Hadoop file in YARN mode, which is deployed at runtime to the class path of Spark runtime to properly configure the YARN and/or Hadoop components.
In some examples, the Spark client may be stored in a repository separate from the Spark job runtime packages (e.g., jar files). If the Spark and/or Hadoop versions are compatible, a job artifact may be run using any Spark connection.
In some embodiments, the unified client 604 exposes the various individual interfaces it includes. A unified client consumer (e.g., an application) may initiate a given connection to a particular interface (e.g., a Hive client). Depending on the predetermined connection pairing, the unified client consumer may automatically access other service interface(s) to establish a heterogeneous data processing graph, as shown in the example of fig. 7A. In some instances, a credential may be requested to enable access to the paired connection.
A unified connection (e.g., a set of paired connections) can be bound to the virtual data workspace 628, and the virtual data workspace 628 can include state information for a unified session between the unified client 604 and the distributed system(s) 602. For example, the data workspace 628 may include state information, such as one or more intermediate states maintained in the form of references and/or identifiers to Hive tables, in-memory Resilient Distributed Datasets (RDDs), HDFS filenames, and/or client resources. This information may enable a stateful connection to be maintained. Maintaining references to in-memory RDDs in the state information may enable different jobs (e.g., Spark or otherwise) to be linked to each other. For example, a first Spark job may return an RDD reference as a result, and another job may consume the result by passing in the RDD reference as an argument. Given that RDDs may be large, a job may pass in and/or return a reference to an RDD rather than the RDD itself. The presence of state information in the data workspace 628 may also enable an automatic purge to be performed at the end of a session. For example, at least some of the state information may be deleted at the end of the session, such as references (e.g., Hive tables) created to retrieve results to the unified client 604 and/or application. Embodiments enable data to be passed from one processing step to another along the dataflow graph shown in fig. 7A.
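A minimal sketch of the chaining pattern just described, where a job returns an RDD reference that the next job consumes; the workspace structure and job function names are assumptions, not the patented API.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.collection.mutable

val spark = SparkSession.builder().master("local[*]").appName("ChainSketch").getOrCreate()
val sc = spark.sparkContext

// Workspace keyed by reference IDs: large RDDs stay on the cluster and
// only their references travel between jobs (names assumed).
val workspace = mutable.Map[String, RDD[_]]()

def encodingJob(input: RDD[String]): String = {
  workspace("rdd:encoded") = input.map(_.toUpperCase)  // stand-in transform
  "rdd:encoded"                                        // return a reference
}

def trainingJob(inputRef: String): String = {
  val encoded = workspace(inputRef).asInstanceOf[RDD[String]]
  workspace("rdd:model") = encoded.map(_.length)       // stand-in "training"
  "rdd:model"
}

val ref = encodingJob(sc.parallelize(Seq("a", "bb", "ccc")))
trainingJob(ref)  // the next job consumes the reference, not the RDD itself
```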
Fig. 6 provides an example of a processing chain as reflected by the unified connection repository 632. For example, a particular session of interaction between the unified client 604 and the distributed system(s) 602 may employ a Spark engine and a Hive engine in a particular manner (e.g., using SparkSQL), and may utilize HDFS. Depending on the requirements to be met in a single session of component processing at the unified client 604, the stepwise processing may include, on the application side, transferring the data set resulting from intermediate processing and pushing the data set to the distributed system(s) 602. This may be followed by Spark processing of the data set. The unified client 604 may enable the application to link the execution of these various processing steps in a seamless manner. The steps may further include a data preparation step using the HiveQL language. The use of the unified client 604 eliminates the need to port these data preparation jobs to SparkSQL or other languages. For example, the unified client 604 enables an application to perform data preparation using Hive, perform various modeling steps using the Spark engine, and retrieve various results to the application using Hive and/or Spark. The application may then perform intermediate processing of the result(s). The steps may alternate between the unified client side and the distributed system side(s). For the distributed system side processing, embodiments enable the combination of any number of operations, including operations in MapReduce, Spark, Hive, HDFS, etc., in any order.
Although the examples herein describe the use of a unified client with a single distributed processing platform (e.g., Hadoop), embodiments are not so limited. In some embodiments, a unified client may be used to facilitate data processing across multiple distributed processing platforms. In such instances, the unified connection repository 632 may include metadata describing a connection pairing between two HDFS connections, e.g., to facilitate the transfer and/or copying of data from one distributed processing platform to another. In such instances, the unified client 604 may include an HDFS client as a sub-client to handle such cross-platform data transfers.
In some embodiments, the coupling or pairing of connections may be user-specific, e.g., one or more specific associations between connections may be established and stored for a particular user. In one example, connection pairing and/or association may be made between: an ODBC connection to Hive, SparkSQL, etc.; a Spark connection (e.g., including configuration files and attributes); and an HDFS connection. A unified client connection may include the three connections associated together. A unified client connection configuration may be the same for all users, or there may be user-specific values to provide flexibility. For example, the ODBC connection may be generic to all users, with more specific ODBC connections for user 1 and user 2. For user 1, a particular ODBC connection may include information for the Spark configuration and the HDFS configuration. For user 2, the particular ODBC connection may include information for the Spark configuration and the HDFS configuration. As another example, a generic (e.g., technical user) ODBC connection may be used, but with a custom Spark configuration for user 2. For user 1, the connection may be a generic ODBC connection with a Spark profile and an HDFS configuration. For user 2, the connection may be a generic ODBC connection with a Spark profile, a customized add-on configuration for user 2, and an HDFS configuration.
FIG. 7A depicts an example system including an application 702 employing a unified client 604, according to embodiments of the present disclosure. As shown in the example illustrated in FIG. 7A, the system may include an application 702. The application 702 may include a unified client 604 and a unified client workspace 704 (e.g., a data workspace 628). In some instances, the unified client 604 is embedded (e.g., in-process) within the application 702. For example, the unified client 604 may be loaded at runtime as a library to provide the application 702 with interface capabilities to the various subsystems of the distributed system(s) 602.
In some examples, the unified client workspace 704 includes data structure metadata 706 and one or more references 708 to tables, HDFS, and/or RDDs. The unified client 604 may be configured to access and employ the unified client workspace 704 to perform its various operations. The unified client 604 may run one or more queries in HQL 710 (e.g., for data materialization). The unified client 604 can submit a job, such as a Spark job 712 (e.g., for data transformation), and receive an output RDD reference from the Spark job 712. The unified client 604 may run SQL, such as SparkSQL 714 (e.g., for data retrieval), and receive the result(s) from SparkSQL 714. The unified client 604 may run a PUT command via the HDFS command 716 (e.g., for data upload). The unified client 604 may submit a job and RDD and/or HDFS reference(s) to the Spark job 718 (e.g., for data transformation).
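A minimal sketch of these four sub-client operations follows, with pyhive, the hdfs package, and pyspark assumed as stand-ins for the actual sub-clients; hosts, ports, table names, and file paths are placeholders.

```python
# Minimal sketch of the four FIG. 7A sub-client operations.
from pyhive import hive
from hdfs import InsecureClient
from pyspark.sql import SparkSession

# HQL 710: data materialization via a Hive query.
hive.connect(host="hadoop-master", port=10000).cursor().execute(
    "CREATE TABLE staged AS SELECT * FROM source_table")

# HDFS command 716: data upload via a PUT-style operation.
InsecureClient("http://hadoop-master:50070").upload(
    "/staging/input.csv", "input.csv")

# Spark job 712: data transformation; the returned RDD is the output reference.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
output_rdd = spark.sql("SELECT * FROM staged").rdd.map(lambda row: (row[0], 1))

# SparkSQL 714: data retrieval of the transformed results.
results = spark.createDataFrame(output_rdd, ["key", "count"]).collect()
```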
In some instances, each data reference hosted by the workspace 704 has metadata describing its structure. The unified client 604 may be configured to manage multiple connections to different subsystems of the distributed system(s) 602 (e.g., Hadoop). If a unified client consumer needs to build a data processing graph across subsystems, the unified client 604 stages transition data in a staging area that is part of the data workspace. After the unified connection is closed, the contents of the temporary workspace are automatically purged by the unified client component.
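This purge-on-close behavior can be pictured as a context-managed staging area, as in the following hypothetical sketch; the UnifiedConnection class and directory layout are illustrative only.

```python
# Hypothetical staging-area lifecycle: transition data lives in a staging
# directory inside the data workspace and is purged when the unified
# connection closes.
import shutil
import tempfile

class UnifiedConnection:
    def __enter__(self):
        # staging area for transition data exchanged across subsystems
        self.staging_dir = tempfile.mkdtemp(prefix="unified-staging-")
        return self

    def __exit__(self, exc_type, exc, tb):
        # contents of the temporary workspace are purged automatically on close
        shutil.rmtree(self.staging_dir, ignore_errors=True)

with UnifiedConnection() as conn:
    # build a data processing graph across subsystems here;
    # intermediate data would be written under conn.staging_dir
    pass
```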
The unified client 604 may provide applications or other consumers with a single point of access to the distributed system(s) 602. The various subsystems of the distributed system(s) 602 can provide different benefits, and the unified client 604 can enable an application to utilize and/or combine the different benefits of each subsystem in a seamless, efficient manner, without having to write a large amount of ad hoc, subsystem-specific code.
The unified client 604 enables the creation of a unified session for the application 702 to interface with the distributed system(s) 602. When a unified session is created from the unified client 604, the unified client 604 may create a unified connection that pairs and/or otherwise combines different individual connection types (e.g., to Hive, Spark, HDFS, MapReduce, etc.). To achieve this unified connection, embodiments may specify the native Spark connection description as a set of schemas.
Traditionally, Spark connections are established using shell scripts that do not separate connection establishment from job submission. In some embodiments, the task of establishing the Spark connection is separate from the task of job submission. Conventionally, Spark is configured to run jobs in batch mode and does not enable interactive sessions. In some embodiments, the unified client 604 enables an interactive Spark session between the application 702 and the distributed system(s) 602. For example, the unified client 604 may cause the distributed system(s) 602 to initiate a Spark job, interrupt the job to perform some intermediate step(s), and continue the Spark job after the intermediate step(s) are performed.
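The following PySpark sketch illustrates this interactive pattern under stated assumptions: a first stage runs, a partial result is handed to the application for an intermediate decision, and processing then continues in the same session. The data and the bucketing logic are invented for illustration.

```python
# Illustrative interactive session: run, pause for an intermediate step,
# then continue in the same Spark session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-session").getOrCreate()

# Initiate the Spark job: first stage of processing, cached for reuse.
stage1 = (spark.range(1_000_000)
          .selectExpr("id % 10 AS bucket")
          .groupBy("bucket").count()
          .cache())

# "Interrupt" the job: retrieve a partial result for an intermediate step.
partial = stage1.collect()
keep = [row["bucket"] for row in partial if row["count"] > 50_000]

# Continue the job in the same session, applying the intermediate decision.
final = stage1.filter(stage1.bucket.isin(keep)).collect()
```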
Traditionally, information describing Spark connections may be inconveniently scattered across multiple locations, such as XML files, Hadoop variables, and so forth. In some embodiments, a single Spark connection descriptor may include the various Spark connection information, providing a more convenient way for clients to access it. The Spark connection descriptor may be stored in the Spark job repository 626. The unified client 604 may access the Spark job repository 626 to retrieve Spark connection descriptors and create and/or restore Spark connections based on the connection information therein. In this way, embodiments provide a unified client 604 that treats Spark similarly to the other engines supported by the distributed system(s) 602, thus facilitating application processing using Spark. Instead of requiring ad hoc and/or specialized code to be written to interact with each of the different subsystems, the unified client 604 provides a single interface through which the application 702 can interact with the various subsystems in a similar manner.
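A single-location descriptor might resemble the sketch below; the keys and the build_session helper are hypothetical, not the actual format of the Spark job repository 626.

```python
# Hypothetical single Spark connection descriptor consolidating settings
# that would traditionally be scattered across XML files and Hadoop variables.
from pyspark.sql import SparkSession

SPARK_CONNECTION_DESCRIPTOR = {
    "master": "yarn",                       # cluster manager
    "app_name": "unified-client-session",
    "properties": {
        "spark.executor.instances": "4",
        "spark.executor.memory": "4g",
    },
}

def build_session(desc):
    """Create or restore a Spark connection from a descriptor."""
    builder = SparkSession.builder.master(desc["master"]).appName(desc["app_name"])
    for key, value in desc["properties"].items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```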
The specific linking of the sub-clients shown in FIG. 7A, e.g., HQL 710 to Spark job 712 to SparkSQL 714, etc., is provided as an example, and embodiments are not limited to this example. In general, any suitable number and type of sub-clients may be linked in any order, serially and/or in parallel, to perform data processing. In the example of FIG. 7A, Spark job 712 processes the data and provides the results of the processing to both SparkSQL 714 and another Spark job 718, as an example of a branch for parallel processing as described above. A particular sub-client may be used to perform a particular type of operation during a given linking. For example, some sub-clients may be used to retrieve data from storage, while other sub-clients may be used to transform data in some manner. After a processing step has been performed, metadata may be returned to the unified client 604 to indicate the results of the processing or to indicate that the processing has been performed. Such returned metadata may include references to results, such as the output RDD reference returned from Spark job 712 as shown in FIG. 7A. The results of the various processing steps performed by the various sub-clients may be correlated with one another through the use of such references.
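The branch from Spark job 712 to both SparkSQL 714 and Spark job 718 can be approximated in PySpark by caching one job's output reference and consuming it from two downstream steps, as in this illustrative sketch; the data and expressions are invented.

```python
# Illustrative branch: one job's cached output reference feeds two
# parallel consumers.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark job 712: transformation whose output reference is shared downstream.
output_ref = spark.range(100).selectExpr("id", "id * 2 AS doubled").cache()

# Branch 1 (SparkSQL 714-style retrieval).
branch_a = output_ref.filter("doubled > 100").count()

# Branch 2 (Spark job 718-style further transformation).
branch_b = output_ref.rdd.map(lambda row: row.id).sum()
```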
FIG. 7B depicts an example flow diagram of a process for data processing with a unified client, in accordance with embodiments of the present disclosure. The operations of the process may be performed by the application 702, the unified client 604, and/or other software modules running on a client computing device, a device of the distributed processing platform, or elsewhere.
A request is received (720) indicating that data processing is to be performed in a distributed processing platform using the unified client 604. In some examples, the request may be received from an application 702 invoking the unified client 604.
The sub-client(s) of the unified client 604 to perform the data processing steps are determined (722). In some instances, the data processing flows and chains may be predetermined to address a particular problem. In some instances, data processing flows and chains may be determined at runtime through flexible input configurations and/or based on the results of data processing. For example, if processing the data set in one sub-client is determined to be less costly than processing it in other sub-clients, the lower-cost sub-client may be selected at runtime. The data processing step is performed (724) using the determined sub-client(s), and the results may be provided for further processing. In some embodiments, a reference to the result may be provided (726), so that other sub-clients may perform further processing steps on the result data.
A determination is made as to whether additional processing is needed (728). If not, the results of the last processing step may be provided (730), e.g., to the application 702. If further processing is required, the process may return to (722) to determine another sub-client, which may be the same as or different from the sub-client used in the previous step. The processing steps may be performed sequentially by a sequence of (same or different) sub-clients, and/or may be performed in parallel by multiple sub-clients of the same or different types.
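Taken together, steps 720 through 730 amount to a small control loop. The following self-contained sketch mirrors that loop; the SubClient class, its cost attribute, and the stopping test are hypothetical stand-ins for the embodiment's actual components.

```python
# Compact, self-contained sketch of the FIG. 7B control flow.
class SubClient:
    def __init__(self, name, cost):
        self.name, self.cost = name, cost

    def execute(self, data_ref):
        # (724) perform a processing step; (726) return a reference to the result
        return {"produced_by": self.name, "value": data_ref.get("value", 0) + 1}

def run_unified_processing(initial_ref, sub_clients, max_steps=3):
    result_ref = initial_ref                                  # (720) request received
    for _ in range(max_steps):
        # (722) determine a sub-client, e.g., the lower-cost one at runtime
        sub_client = min(sub_clients, key=lambda c: c.cost)
        result_ref = sub_client.execute(result_ref)
        if result_ref["value"] >= max_steps:                  # (728) more processing?
            break
    return result_ref                                         # (730) provide final result

final = run_unified_processing({"value": 0},
                               [SubClient("hive", 2), SubClient("spark", 5)])
```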
In some examples, at least some of the data processing may be performed on the client side, e.g., external to the distributed processing platform. For example, results may be retrieved from the Hadoop processor via the Get Results flow shown in FIG. 7A. Local processing may be performed on the received results, and the results of the local processing may be sent for further processing by another sub-client. Embodiments enable at least some of the processing steps to be performed outside of the distributed processing platform (e.g., a Hadoop system).
Example computing device
FIG. 5 illustrates an example of a computing device 500 and a mobile computing device 550 with which the techniques described herein may be used. Computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 550 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, and other similar computing devices. The components described herein, their connections and relationships, and their functions are meant to be exemplary only, and are not intended to limit embodiments described and/or claimed in this document. At least one computing device 500 and/or 550, or one or more components thereof, may be included in any of the computing devices, systems, and/or platforms described herein.
Computing device 500 includes a processor 502, memory 504, a storage device 506, a high-speed interface 508 connecting to memory 504 and high-speed expansion ports 510, and a low-speed interface 512 connecting to low-speed bus 514 and storage device 506. Each of the components 502, 504, 506, 508, 510, and 512 is interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 502 can process instructions for execution within the computing device 500, including instructions stored in the memory 504 or on the storage device 506, to display graphical information for a GUI on an external input/output device, such as display 516 coupled to high-speed interface 508. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 500 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
The memory 504 stores information within the computing device 500. In one implementation, the memory 504 is a volatile memory unit or units. In another implementation, the memory 504 is a non-volatile memory unit or units. The memory 504 may also be another form of computer-readable medium, such as a magnetic or optical disk.
The storage device 506 is capable of providing mass storage for the computing device 500. In one embodiment, the storage device 506 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state storage device, or an array of devices including devices in a storage area network or other configurations. The computer program product may be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods such as those described above. The information carrier is a computer-or machine-readable medium, such as the memory 504, the storage device 506, or memory on processor 502.
The high-speed controller 508 manages bandwidth-intensive operations for the computing device 500, while the low-speed controller 512 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one embodiment, the high-speed controller 508 is coupled to memory 504, display 516 (e.g., via a graphics processor or accelerator), and to high-speed expansion ports 510, which may accept various expansion cards (not shown). In an embodiment, low-speed controller 512 is coupled to storage device 506 and low-speed expansion port 514. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., via a network adapter.
The computing device 500 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 520, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 524. Further, it may be implemented in a personal computer such as laptop computer 522. Alternatively, components from computing device 500 may be combined with other components in a mobile device (not shown), such as device 550. Each of such devices may contain one or more of computing devices 500, 550, and an entire system may be made up of multiple computing devices 500, 550 communicating with each other.
Computing device 550 includes a processor 552, memory 564, an input/output device such as a display 554, a communication interface 566, and a transceiver 568, among other components. The device 550 may also be equipped with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 550, 552, 564, 554, 566, and 568 is interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
The processor 552 can execute instructions within the computing device 550, including instructions stored in the memory 564. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 550, such as control of user interfaces, applications run by device 550, and wireless communication by device 550.
Processor 552 may communicate with a user via control interface 558 and display interface 556, which is coupled to a display 554. The display 554 may be, for example, a TFT LCD (thin film transistor liquid crystal display) or OLED (organic light emitting diode) display, or other suitable display technology. The display interface 556 may comprise appropriate circuitry for driving the display 554 to present graphical and other information to a user. The control interface 558 may receive commands from a user and convert them for submission to the processor 552. In addition, an external interface 562 may be provided in communication with processor 552, so as to enable near-range communication of device 550 with other devices. External interface 562 may provide, for example, for wired communication in some embodiments, or for wireless communication in other embodiments, and multiple interfaces may also be used.
The memory 564 stores information within the computing device 550. The memory 564 may be implemented as one or more of the following: one or more computer-readable media, one or more volatile memory units, or one or more non-volatile memory units. Expansion memory 574 may also be provided and connected to device 550 via expansion interface 572, which may comprise, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 574 may provide additional storage space for device 550, or may also store applications or other information for device 550. Specifically, expansion memory 574 may include instructions to carry out or supplement the processes described above, and may also include secure information. Thus, for example, expansion memory 574 may be provided as a security module for device 550, and may be programmed with instructions that permit secure use of device 550. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one embodiment, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 564, expansion memory 574, memory on processor 552, or a propagated signal that may be received, for example, over transceiver 568 or external interface 562.
Device 550 may communicate wirelessly via communication interface 566, which may include digital signal processing circuitry as necessary. Communication interface 566 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication can occur, for example, via radio frequency transceiver 568. Further, short-range communication may occur, such as using a Bluetooth, WiFi, or other such transceiver (not shown). In addition, a GPS (Global positioning System) receiver module may provide additional navigation-and location-related wireless data to device 550, which may optionally be used by applications running on device 550.
Device 550 may also communicate audibly using audio codec 560, which may receive spoken information from a user and convert it to usable digital information. Audio codec 560 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 550. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.), and may also include sound generated by applications running on device 550.
The computing device 550 may be implemented in a number of different forms, as shown. For example, it may be implemented as a cellular telephone 580. It may also be implemented as part of a smart phone 582, personal digital assistant, or other similar mobile device.
Various embodiments of the systems and techniques described here can be implemented in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include embodiments implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) display screen) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other types of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or touch input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this disclosure includes certain features, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features of example embodiments of the disclosure. Certain features that are described in the context of separate embodiments of the disclosure can also be provided in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be provided in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. For example, the operations recited in the claims can be performed in a different order and still achieve desirable results, and various forms of the flows shown above may be used, with steps reordered, added, or removed. Accordingly, other embodiments are within the scope of the following claims.

Claims (16)

1. A computer-implemented method performed by at least one processor, the method comprising:
identifying, by the at least one processor, an input training data set stored within a data warehouse of a distributed processing platform comprising a plurality of subsystems;
sending, by the at least one processor, an instruction from a client application to the distributed processing platform to request at least one of the plurality of subsystems to be executed to perform at least one data processing operation to determine a predictive model based on the input training dataset, each of the at least one of the plurality of subsystems being executed within a cluster and receiving at least a portion of the input training dataset from the data warehouse, wherein the at least one data processing operation includes slicing the input training dataset to determine one or more slices and scoring the predictive model with respect to the one or more slices to compute cross-statistics for the predictive model;
receiving, by the client application, a result set of at least one data processing operation from a data warehouse, the result set being stored in the data warehouse by a respective subsystem, wherein the result set comprises cross statistics calculated on slice input data, including statistics calculating a distribution of input variable values relative to target variables, to reduce a number of input variables used to train a predictive model;
running, by the client application, a local process based on the result set and determining, based on results of the local process, whether additional data processing operations are to be performed to determine a predictive model; and
providing, by the client application, the predictive model to determine one or more outcomes, each outcome associated with a probability of occurrence of a data set value.
2. The computer-implemented method of claim 1, wherein the instructions are sent from the client application to the distributed processing platform via a unified client comprising a plurality of sub-clients, each sub-client configured to interface with a respective subsystem of the distributed processing platform.
3. The computer-implemented method of claim 1, further comprising:
executing, by the at least one processor, at least one local data processing operation on the client application to determine the predictive model;
wherein the at least one local data processing operation accepts input comprising a result set resulting from at least one data processing operation performed on the distributed processing platform.
4. The computer-implemented method of claim 1, wherein the method is independent of data transmission of the input training data set from the distributed processing platform.
5. The computer-implemented method of claim 1, wherein the at least one data processing operation includes computing one or more statistics associated with the input training data set to reduce a number of variables used to generate the predictive model.
6. The computer-implemented method of claim 5, wherein the at least one data processing operation further comprises recalculating the one or more statistics based on the one or more results.
7. The computer-implemented method of claim 1, wherein the at least one data processing operation comprises encoding data of the input training data set, which includes converting alphanumeric data into numeric data.
8. The computer-implemented method of claim 1, wherein the at least one data processing operation comprises performing a covariance matrix calculation and a matrix inversion calculation with respect to the input training data set.
9. The computer-implemented method of claim 1, wherein the at least one data processing operation comprises iteratively evaluating performance of the predictive model based on a structural risk minimization.
10. A system for predictive modeling optimization, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, the memory storing instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising:
identifying an input training data set stored within a data warehouse of a distributed processing platform comprising a plurality of subsystems;
sending an instruction from a client application to the distributed processing platform to request at least one of the plurality of subsystems to be executed to perform at least one data processing operation to determine a predictive model based on the input training dataset, each of the at least one of the plurality of subsystems executing within a cluster and receiving at least a portion of the input training dataset from the data warehouse, wherein the at least one data processing operation includes slicing the input training dataset to determine one or more slices and scoring the predictive model with respect to the one or more slices to compute cross-statistics for the predictive model;
receiving a result set of at least one data processing operation from a data warehouse, the result set being stored in the data warehouse by a respective subsystem, wherein the result set comprises cross statistics calculated on slice input data, including statistics calculating a distribution of input variable values relative to target variables, to reduce a number of input variables used to train a predictive model;
running a local process based on the result set and determining whether additional data processing operations are to be performed to determine a predictive model based on results of the local process; and
the predictive model is provided to determine one or more outcomes, each outcome associated with a probability of occurrence of a data set value.
11. The system of claim 10, wherein the instructions are sent from the client application to the distributed processing platform via a unified client comprising a plurality of sub-clients, each configured to interface with a respective subsystem of the distributed processing platform.
12. The system of claim 10, the operations further comprising:
running at least one local data processing operation on the client application to determine the predictive model;
wherein the at least one local data processing operation accepts input comprising a result set resulting from the at least one data processing operation executing on the distributed processing platform.
13. A non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform operations comprising:
identifying an input training data set stored within a data warehouse of a distributed processing platform comprising a plurality of subsystems;
sending an instruction from a client application to the distributed processing platform to request at least one of the plurality of subsystems to be executed to perform at least one data processing operation to determine a predictive model based on the input training dataset, each of the at least one of the plurality of subsystems executing within a cluster and receiving at least a portion of the input training dataset from the data warehouse, wherein the at least one data processing operation includes slicing the input training dataset to determine one or more slices and scoring the predictive model with respect to the one or more slices to compute cross-statistics for the predictive model;
receiving a result set of at least one data processing operation from a data warehouse, the result set being stored in the data warehouse by a respective subsystem, wherein the result set comprises cross statistics calculated on slice input data, including statistics calculating a distribution of input variable values relative to target variables, to reduce a number of input variables used to train a predictive model;
running a local process based on the result set and determining whether additional data processing operations are to be performed to determine a predictive model based on results of the local process; and
the predictive model is provided to determine one or more outcomes, each outcome associated with a probability of occurrence of a data set value.
14. A non-transitory computer readable storage medium as recited in claim 13, wherein the at least one data processing operation comprises computing one or more statistics associated with the input training data set to reduce a number of variables used to generate the predictive model.
15. A non-transitory computer readable storage medium as recited in claim 14, wherein the at least one data processing operation further comprises recalculating the one or more statistics based on the one or more results.
16. A non-transitory computer readable storage medium as recited in claim 13, wherein the at least one data processing operation comprises performing covariance matrix calculations and matrix inversion calculations on the input training data set.
CN201611262212.2A 2016-03-14 2016-12-30 Predictive modeling optimization Active CN107194490B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201662307971P 2016-03-14 2016-03-14
US62/307,971 2016-03-14
US15/261,215 2016-09-09
US15/261,215 US10789547B2 (en) 2016-03-14 2016-09-09 Predictive modeling optimization

Publications (2)

Publication Number Publication Date
CN107194490A CN107194490A (en) 2017-09-22
CN107194490B true CN107194490B (en) 2022-08-12

Family

ID=59870919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611262212.2A Active CN107194490B (en) 2016-03-14 2016-12-30 Predictive modeling optimization

Country Status (1)

Country Link
CN (1) CN107194490B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256924A (en) * 2018-02-26 2018-07-06 University of Shanghai for Science and Technology A product marketing forecast device
CN108305103A (en) * 2018-02-26 2018-07-20 University of Shanghai for Science and Technology A product marketing forecast method using a support vector machine model based on parameter optimization
CN110427356B (en) * 2018-04-26 2021-08-13 China Mobile (Suzhou) Software Technology Co., Ltd. Parameter configuration method and equipment
CN108648072A (en) * 2018-05-18 2018-10-12 Shenzhen Gray Cat Technology Co., Ltd. Internet finance lending risk evaluation system based on dynamic user credit rating

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1549171A (en) * 2003-05-15 2004-11-24 Ji Yongping Apparatus for realizing a fixed standard for the high-tech market based on network computation
CN101203170A (en) * 2005-06-02 2008-06-18 Medipattern Corp. System and method of computer-aided detection
CN101408976A (en) * 2007-10-11 2009-04-15 通用电气公司 Enhanced system and method for volume based registration
CN103502899A (en) * 2011-01-26 2014-01-08 谷歌公司 Dynamic predictive modeling platform
CN103676794A (en) * 2012-09-04 2014-03-26 上海杰之能信息科技有限公司 Energy consumption monitoring method and energy consumption monitoring system
CN103794234A (en) * 2012-10-30 2014-05-14 北京航天长峰科技工业集团有限公司 Massive video-based event trace quick search platform
CN104899561A (en) * 2015-05-27 2015-09-09 华南理工大学 Parallelized human body behavior identification method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2787699A (en) * 1998-02-26 1999-09-15 Sun Microsystems, Inc. Method and apparatus for dynamic distributed computing over a network
US8131648B2 (en) * 1999-10-20 2012-03-06 Tivo Inc. Electronic content distribution and exchange system
US8107540B2 (en) * 2005-07-11 2012-01-31 Cheetah Technologies, L.P. Image complexity computation in packet based video broadcast systems
US8145593B2 (en) * 2008-12-11 2012-03-27 Microsoft Corporation Framework for web services exposing line of business applications
CN101727389B * 2009-11-23 2012-11-14 ZTE Corp. Automatic test system and method for a distributed integrated service
CN101751782A * 2009-12-30 2010-06-23 Peking University Shenzhen Graduate School Crossroad traffic event automatic detection system based on multi-source information fusion
US8489632B1 (en) * 2011-06-28 2013-07-16 Google Inc. Predictive model training management
CN103631815B * 2012-08-27 2018-01-12 Shenzhen Tencent Computer Systems Co., Ltd. Method, apparatus, and system for implementing checkpoints in parallel and concurrent computing
WO2015006170A1 (en) * 2013-07-12 2015-01-15 FREILICH, Arthur A computer system storing content into application independent objects
CN104408967B * 2014-11-26 2016-08-24 Zhejiang Zhongnan Intelligent Technology Co., Ltd. A cloud-computing-based parking management system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1549171A (en) * 2003-05-15 2004-11-24 Ji Yongping Apparatus for realizing a fixed standard for the high-tech market based on network computation
CN101203170A (en) * 2005-06-02 2008-06-18 Medipattern Corp. System and method of computer-aided detection
CN101408976A (en) * 2007-10-11 2009-04-15 通用电气公司 Enhanced system and method for volume based registration
CN103502899A (en) * 2011-01-26 2014-01-08 谷歌公司 Dynamic predictive modeling platform
CN103676794A (en) * 2012-09-04 2014-03-26 上海杰之能信息科技有限公司 Energy consumption monitoring method and energy consumption monitoring system
CN103794234A (en) * 2012-10-30 2014-05-14 北京航天长峰科技工业集团有限公司 Massive video-based event trace quick search platform
CN104899561A (en) * 2015-05-27 2015-09-09 华南理工大学 Parallelized human body behavior identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Large-scale network security situation analysis and prediction system YHSAS; Han Weihong et al.; Netinfo Security; 2012-08-10 (No. 08); pp. 21-24 *

Also Published As

Publication number Publication date
CN107194490A (en) 2017-09-22

Similar Documents

Publication Publication Date Title
US10789547B2 (en) Predictive modeling optimization
US20230252028A1 (en) Data serialization in a distributed event processing system
US10628212B2 (en) Incremental parallel processing of data
US11475007B2 (en) Dynamic self-reconfiguration of nodes in a processing pipeline
US10169433B2 (en) Systems and methods for an SQL-driven distributed operating system
US10176236B2 (en) Systems and methods for a distributed query execution engine
US8682876B2 (en) Techniques to perform in-database computational programming
US9128991B2 (en) Techniques to perform in-database computational programming
CN107194490B (en) Predictive modeling optimization
US20210124739A1 (en) Query Processing with Machine Learning
WO2016018947A1 (en) Systems and methods for a query optimization engine
US20220138195A1 (en) User defined functions for database query languages based on call-back functions
US11609927B2 (en) Storing feature sets using semi-structured data storage
US11042530B2 (en) Data processing with nullable schema information
Bharany et al. A Comparative Study of Cloud Data Portability Frameworks for Analyzing Object to NoSQL Database Mapping from ONDM's Perspective
US20230297346A1 (en) Intelligent data processing system with metadata generation from iterative data analysis
US20230127192A1 (en) Devices, systems, and methods for type inferencing code scripted in a dynamic language
Khalifa Achieving consumable big data analytics by distributing data mining algorithms
Mattson et al. GABB Introduction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant