CN107423823B - R language-based machine learning modeling platform architecture design method - Google Patents

Publication number
CN107423823B
Authority
CN
China
Prior art keywords
operator
machine learning
component
operators
oozie
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710684578.7A
Other languages
Chinese (zh)
Other versions
CN107423823A (en)
Inventor
竹登虎
勇萌哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Youe Data Co ltd
Original Assignee
Chengdu Youe Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Youe Data Co ltd
Priority to CN201710684578.7A
Publication of CN107423823A
Application granted
Publication of CN107423823B
Current legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an architecture design method for a machine learning modeling platform based on the R language. A visual machine learning operator based on the R language is built; an Oozie component distributes the R operators that make up the machine learning operator to different Hadoop cluster computing nodes; the computing nodes access data managed by an HDFS component and compute according to the logical relation of the machine learning operator to obtain its final result. The method achieves distributed computation of R language-based visual machine learning operators, so that the modeling platform gains the rich machine learning operators and the efficient, flexible programming system of the R language. The Oozie flow control component adaptively schedules the R operators to different Hadoop cluster computing nodes, which balances cluster load and supports high-volume concurrent modeling computation by multiple users.

Description

R language-based machine learning modeling platform architecture design method
Technical Field
The invention belongs to the field of big data analysis and processing, and in particular relates to an architecture design method for a machine learning modeling platform based on the R language, used to perform distributed computation on machine learning operators.
Background
A big data analysis and processing platform is built on a distributed computing architecture and machine learning operators, and is intended to solve data mining modeling problems at very large data scales. In actual use, however, small input data volumes and small-scale modeling requirements turn out to be the main form of use. For small inputs the distributed processing architecture offers no obvious efficiency advantage and instead introduces a noticeable data interaction delay. In addition, because the number of machine learning operators that currently support distributed computing is limited, the modeling capability of the platform under a fully distributed processing architecture is weaker than that of a single machine.
The R language is a common single-machine modeling tool in the field of data mining, with rich machine learning operators and an efficient, flexible programming system. If the advantages of the R language could be combined with the distributed architecture of the current platform, the platform's range of operators would be greatly enriched, modeling execution efficiency on small data volumes would improve, and the limited multi-user access capacity of a single machine would be overcome.
However, a machine learning operator in the R language is a single-machine program and can run on only one computer. If different R operators are allocated to different computers, data cannot be passed between successive operators. Moreover, R operators do not currently support reading distributed data, and their output cannot be stored automatically in a distributed cluster. If many users execute machine learning tasks on one server, the load on that server may exceed its limit and degrade the user experience.
Disclosure of Invention
The invention aims to provide an R language-based machine learning modeling platform architecture design method that solves the technical problem that R operators cannot be computed in a distributed manner.
The technical scheme adopted by the invention is as follows:
A machine learning modeling platform architecture design method based on the R language, characterized in that a visual machine learning operator based on the R language is built, the R operators in the machine learning operator are distributed to different Hadoop cluster computing nodes by an Oozie component, and the Hadoop cluster computing nodes access data managed by an HDFS component and compute according to the logical relation of the machine learning operator to obtain the final result of the machine learning operator.
Further, the method is realized by the following specific steps:
S201: building a visual machine learning operator based on the R language on a modeling platform, wherein the machine learning operator comprises n R operators and data flows from the 1st operator to the nth operator;
S202: dynamically distributing the n R operators to different Hadoop cluster computing nodes by using an Oozie component;
S203: the computing node hosting the 1st operator downloads the source data from a modeling data source managed by the HDFS component, calls the local R runtime environment to execute the data processing function of the 1st operator, and, after the computation is complete, uploads the result to a temporary path TmpPath managed by the HDFS component;
S204: the 2nd to (n-1)th operators are computed in sequence: the computing node hosting each R operator downloads data from the temporary path TmpPath, calls the local R runtime environment to execute the data processing function of that R operator, and, after the computation is complete, uploads the result to the temporary path TmpPath, each uploaded result overwriting the data previously stored in TmpPath;
S205: the computing node hosting the nth operator downloads data from the temporary path TmpPath, calls the local R runtime environment to execute the data processing function of the nth operator, and, after the computation is complete, uploads the result to a path ModelPath managed by the HDFS component; the data stored in ModelPath is the final computation result of the machine learning operator.
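The data flow of steps S203 to S205 can be illustrated with a minimal Python sketch. It assumes an in-memory dictionary in place of HDFS and plain Python functions in place of R operators; the names `run_chain`, `hdfs` and `source_path` are hypothetical and do not come from the patent.

```python
from typing import Callable, Dict, List

TMP_PATH = "TmpPath"      # temporary path shared by intermediate operators
MODEL_PATH = "ModelPath"  # path that receives the final result


def run_chain(operators: List[Callable], hdfs: Dict[str, object],
              source_path: str) -> object:
    """Run n operators in order, passing data through one temporary path."""
    n = len(operators)
    if n == 0:
        raise ValueError("at least one operator is required")
    for i, op in enumerate(operators, start=1):
        # Operator 1 reads the modeling data source; later ones read TmpPath.
        in_path = source_path if i == 1 else TMP_PATH
        data = hdfs[in_path]       # stands in for the download from HDFS
        result = op(data)          # stands in for the local R run
        # Operator n writes ModelPath; earlier ones overwrite TmpPath.
        out_path = MODEL_PATH if i == n else TMP_PATH
        hdfs[out_path] = result    # stands in for the upload to HDFS
    return hdfs[MODEL_PATH]
```

With three stand-in operators (sort, scale, sum) and a source of `[3, 1, 2]`, the chain leaves `[10, 20, 30]` in TmpPath and `60` in ModelPath. Because every intermediate operator writes to the same key, only the most recent intermediate result is ever kept, which mirrors the overwrite behaviour described in S204.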
Further, in step S201, the step of building a machine learning operator is as follows:
S301: a user adds R operators to the web application modeling platform using the native package format of R operators, and classifies them by function;
S302: the web application modeling platform sets up a classification directory according to the classification result, and manages and displays it visually;
S303: the user freely drags n R operators from the classification directory into a workflow editing area and connects them according to the required logical relation, which completes the construction of the machine learning operator.
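The connections drawn in S303 amount to a directed graph over R operators. As a hedged sketch (the patent does not say how the platform stores or orders the connections), the following Python code derives an execution order from a list of connections using the standard library `graphlib` module (Python 3.9 or later). The function name `execution_order` and the edge representation are assumptions made for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def execution_order(edges: List[Tuple[str, str]]) -> List[str]:
    """Return operator names in an order that respects every connection.

    Each edge (a, b) means the output of operator a feeds operator b.
    """
    predecessors: Dict[str, Set[str]] = {}
    for upstream, downstream in edges:
        predecessors.setdefault(upstream, set())
        predecessors.setdefault(downstream, set()).add(upstream)
    # static_order() raises graphlib.CycleError if the connections form a loop.
    return list(TopologicalSorter(predecessors).static_order())
```

For the linear flow used throughout this document, `execution_order([("A", "B"), ("B", "C")])` yields `["A", "B", "C"]`. Rejecting cycles at this stage is a design choice of the sketch: a looped workflow has no 1st or nth operator and could not be scheduled as a chain.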
Further, in step S202, the step of allocating the R operator by the Oozie component is:
S401: writing a Shell script file for each R operator, the script receiving the configuration parameters set for the R operator in the modeling platform;
S402: uploading all Shell script files to the storage address configured for the Oozie component in the HDFS component;
S403: according to the logical relation of the machine learning operator, the scheduling unit of the Oozie component generates an Oozie scheduling configuration file that invokes the Shell script file corresponding to each R operator;
S404: starting the Oozie component, which schedules and distributes the Shell script files corresponding to the R operators across the Hadoop cluster computing nodes according to the Oozie scheduling configuration file.
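Step S403 can be made concrete with a small generator for an Oozie workflow definition. This is a sketch under assumptions, not the patent's implementation: the schema versions, the action names `op-1`, `op-2`, ..., the kill node name `fail` and the function `build_workflow` are all chosen for illustration. The element names (`workflow-app`, `start`, `action`, `shell`, `exec`, `file`, `ok`, `error`, `kill`, `end`) follow the Oozie workflow and shell action schemas.

```python
import xml.etree.ElementTree as ET
from typing import List

WORKFLOW_NS = "uri:oozie:workflow:0.5"    # schema version is an assumption
SHELL_NS = "uri:oozie:shell-action:0.2"   # schema version is an assumption


def build_workflow(name: str, scripts: List[str], script_dir: str) -> ET.Element:
    """Build a linear Oozie workflow with one shell action per script."""
    if not scripts:
        raise ValueError("at least one script is required")
    root = ET.Element("workflow-app", {"name": name, "xmlns": WORKFLOW_NS})
    ET.SubElement(root, "start", {"to": "op-1"})
    for i, script in enumerate(scripts, start=1):
        action = ET.SubElement(root, "action", {"name": f"op-{i}"})
        shell = ET.SubElement(action, "shell", {"xmlns": SHELL_NS})
        ET.SubElement(shell, "job-tracker").text = "${jobTracker}"
        ET.SubElement(shell, "name-node").text = "${nameNode}"
        ET.SubElement(shell, "exec").text = script
        # Ship the script from its HDFS storage address to the executing node.
        ET.SubElement(shell, "file").text = f"{script_dir}/{script}#{script}"
        # Chain each action to the next one; the last action goes to "end".
        next_node = f"op-{i + 1}" if i < len(scripts) else "end"
        ET.SubElement(action, "ok", {"to": next_node})
        ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "R operator failed"
    ET.SubElement(root, "end", {"name": "end"})
    return root
```

Serializing the result with `ET.tostring(root, encoding="unicode")` gives text that could be saved as a workflow definition file. The sketch does not choose computing nodes; it only encodes the order of the Shell actions, leaving placement to the cluster.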
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
1. Cluster cooperative computing of R language-based visual machine learning operators is realized.
2. The modeling platform gains the rich machine learning operators and the efficient, flexible programming system of the R language; the Oozie flow control component adaptively schedules the R operators to different Hadoop cluster computing nodes, which balances cluster load and supports high-volume concurrent modeling computation by multiple users.
3. The rich open-source machine learning operators of the R language are integrated, strengthening the platform's support for machine learning algorithms.
4. HDFS provides data sharing between operators, so that after one operator finishes, the operator on the next node can obtain its output and carry out the corresponding computation; this preserves the integrity and continuity of the flow between single-machine machine learning operators in a cluster environment.
5. Common R operators are visually packaged, so users can graphically combine any operators according to their mining requirements without writing R scripts by hand; rich model evaluation and model storage functions are provided, so that R modeling functions can be developed and models shared for reuse efficiently.
Drawings
The invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is an overall architecture diagram of the present invention;
FIG. 2 is a flow chart of the present invention.
Detailed Description
All of the features disclosed in this specification, or all of the steps in any method or process so disclosed, may be combined in any combination, except combinations of features and/or steps that are mutually exclusive.
The present invention will be described in detail with reference to FIG. 1 and FIG. 2.
Specific Example 1
A machine learning modeling platform architecture design method based on R language comprises the following steps:
Step 1: A user adds R operators to the modeling platform, which is a web modeling platform, using the native package format of R operators, and classifies them by function. The web application modeling platform sets up a classification directory according to the classification result and manages and displays it visually. The user freely drags n R operators from the classification directory into the workflow editing area and connects them according to the required logical relation, which completes the construction of the machine learning operator. Data flows from the 1st operator to the nth operator.
Step 2: A Shell script file is written for each R operator. The script receives the configuration parameters set for the R operator in the modeling platform, calls the corresponding R operator, and passes the received configuration parameters to it. The Shell script files of all R operators are copied to the same public path on every Hadoop cluster computing node, and an R runtime environment is installed on each computing node. All Shell script files are uploaded to the storage address configured for the Oozie component in the HDFS component. According to the logical relation of the machine learning operator in the current platform, the scheduling unit generates an Oozie scheduling configuration file that invokes the Shell script file of each R operator; each scheduling unit is composed of Shell actions, and the Oozie component passes operator parameters and activates operator functions by calling the Shell script file of each R operator. The Oozie component is then started and, according to the Oozie scheduling configuration file, schedules and distributes the Shell script files of the R operators across the Hadoop cluster computing nodes.
Step 3: The computing node hosting the 1st operator downloads the source data from a modeling data source managed by the HDFS component, calls the local R runtime environment to execute the data processing function of the 1st operator, and, after the computation is complete, uploads the result to a temporary path TmpPath managed by the HDFS component. The 2nd to (n-1)th operators are then computed in sequence: the computing node hosting each operator downloads data from the temporary path TmpPath, calls the local R runtime environment to execute the data processing function of that operator, and, after the computation is complete, uploads the result to the temporary path TmpPath, each uploaded result overwriting the data previously stored there. Finally, the computing node hosting the nth operator downloads data from the temporary path TmpPath, calls the local R runtime environment to execute the data processing function of the nth operator, and, after the computation is complete, uploads the result to a path ModelPath managed by the HDFS component.
Step 4: The Oozie component periodically initiates heartbeat monitoring. If it detects that the process hosting the nth operator has finished, it returns an instruction indicating that the machine learning flow is complete, and the modeling platform displays that the flow has finished. The data stored under ModelPath is the final computation result of the machine learning operator. The modeling platform provides a progress bar that displays the computation progress of the machine learning operator; after the progress bar shows completion, the model file and other outputs generated by modeling can be downloaded via a link provided by the modeling platform.
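The Shell script written for each R operator in Step 2 is not reproduced in the patent. The following Python sketch renders one plausible wrapper, assuming the standard `hdfs dfs -get`, `hdfs dfs -put -f` and `Rscript` commands are available on the computing node. The `--name value` argument convention, the local file names `input.data` and `output.data`, and the function `render_operator_script` are hypothetical.

```python
import shlex
from typing import Dict


def render_operator_script(r_script: str, params: Dict[str, str],
                           input_path: str, output_path: str) -> str:
    """Render a Shell wrapper that downloads input, runs R, and uploads output."""
    # Pass each configuration parameter to the R script as "--name value".
    args = " ".join(
        f"--{name} {shlex.quote(str(value))}"
        for name, value in sorted(params.items())
    )
    run_r = (f"Rscript {shlex.quote(r_script)} "
             f"--input input.data --output output.data {args}").rstrip()
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        # Download the upstream result from HDFS to the working directory.
        f"hdfs dfs -get {shlex.quote(input_path)} input.data",
        # Run the operator in the node's local R runtime environment.
        run_r,
        # Upload the result; -f overwrites what the previous operator stored.
        f"hdfs dfs -put -f output.data {shlex.quote(output_path)}",
    ]
    return "\n".join(lines) + "\n"
```

Using the same path for `input_path` and `output_path` reproduces the intermediate case, where an operator reads TmpPath and then overwrites it; the last operator would instead be rendered with ModelPath as its output path.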
Specific Example 2
Step 1: A user adds R operators to the web application modeling platform using the native package format of R operators, and classifies them by function. The web application modeling platform sets up a classification directory according to the classification result and manages and displays it visually. The user freely drags 3 R operators from the classification directory (operator A, operator B and operator C) into the workflow editing area and connects them according to the required logical relation, which completes the construction of the machine learning operator. Data flows from operator A to operator B to operator C.
Step 2: The Oozie component dynamically distributes the 3 operators to different Hadoop cluster computing nodes: operator A to computing node A, operator B to computing node B, and operator C to computing node C.
Step 3: Computing node A downloads the source data from a modeling data source managed by the HDFS component, calls the local R runtime environment to execute the data processing function of operator A, and, after the computation is complete, uploads the result to a temporary path TmpPath managed by the HDFS component. Computing node B downloads data from the temporary path TmpPath, calls the local R runtime environment to execute the data processing function of operator B, and, after the computation is complete, uploads the result to the temporary path TmpPath, overwriting the data previously stored there. Computing node C downloads data from the temporary path TmpPath, calls the local R runtime environment to execute the data processing function of operator C, and, after the computation is complete, uploads the result to the path ModelPath managed by the HDFS component.
Step 4: The Oozie component periodically initiates heartbeat monitoring. If it detects that the process hosting the nth operator (here operator C) has finished, it returns an instruction indicating that the machine learning flow is complete, and the modeling platform displays that the flow has finished. The data stored under ModelPath is the final computation result of the machine learning operator. The modeling platform provides a progress bar that displays the computation progress of the machine learning operator; after the progress bar shows completion, the model file and other outputs generated by modeling can be downloaded via a link provided by the modeling platform.
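The periodic monitoring and progress bar of Step 4 can be sketched as a polling loop. The patent does not specify how status is obtained, so the sketch takes a caller-supplied function `get_actions`; that function, the status labels `PENDING`, `RUNNING`, `DONE` and `FAILED`, and the name `wait_for_flow` are all hypothetical.

```python
import time
from typing import Callable, Dict, List


def wait_for_flow(get_actions: Callable[[], List[Dict[str, str]]],
                  on_progress: Callable[[float], None],
                  interval_s: float = 5.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll operator statuses until the last operator finishes or one fails.

    get_actions returns one {"name": ..., "status": ...} record per operator,
    in data-flow order. Returns True if the last operator completed.
    """
    while True:
        actions = get_actions()
        if not actions:
            raise ValueError("the flow has no operators")
        statuses = [a["status"] for a in actions]
        # Fraction of finished operators, as shown on a progress bar.
        on_progress(statuses.count("DONE") / len(statuses))
        if "FAILED" in statuses:
            return False
        if statuses[-1] == "DONE":
            return True
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. Stopping on the first failed operator is a choice of the sketch; without it, a failure upstream of operator C would leave the loop polling indefinitely.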
The working principle of the invention is as follows: a machine learning operator is built on the platform; an R runtime environment is installed on each Hadoop cluster computing node; a Shell script file is written for each R operator, copied to each computing node, and uploaded to a storage address managed by the HDFS (Hadoop Distributed File System) component; the Oozie component schedules and distributes the Shell script files of the R operators across the cluster; the computing nodes download data from HDFS for computation; and once the R operator at the end of the data flow has finished, its result is taken as the final computation result of the machine learning operator.

Claims (3)

1. A machine learning modeling platform architecture design method based on the R language, characterized in that: a visual machine learning operator based on the R language is built, the R operators in the machine learning operator are distributed to different Hadoop cluster computing nodes by an Oozie component, and the Hadoop cluster computing nodes access data managed by an HDFS component and compute according to the logical relation of the machine learning operator to obtain the final result of the machine learning operator;
the method comprises the following specific steps:
S201: building a visual machine learning operator based on the R language on a modeling platform, wherein the machine learning operator comprises n R operators and data flows from the 1st operator to the nth operator;
S202: dynamically distributing the n R operators to different Hadoop cluster computing nodes by using an Oozie component;
S203: the computing node hosting the 1st operator downloads the source data from a modeling data source managed by the HDFS component, calls the local R runtime environment to execute the data processing function of the 1st operator, and, after the computation is complete, uploads the result to a temporary path TmpPath managed by the HDFS component;
S204: the 2nd to (n-1)th operators are computed in sequence: the computing node hosting each R operator downloads data from the temporary path TmpPath, calls the local R runtime environment to execute the data processing function of that R operator, and, after the computation is complete, uploads the result to the temporary path TmpPath, each uploaded result overwriting the data previously stored in TmpPath;
S205: the computing node hosting the nth operator downloads data from the temporary path TmpPath, calls the local R runtime environment to execute the data processing function of the nth operator, and, after the computation is complete, uploads the result to a path ModelPath managed by the HDFS component; the data stored in ModelPath is the final computation result of the machine learning operator.
2. The R language-based machine learning modeling platform architecture design method according to claim 1, characterized in that: in step S201, the machine learning operator is built as follows:
S301: a user adds R operators to the web application modeling platform using the native package format of R operators, and classifies them by function;
S302: the web application modeling platform sets up a classification directory according to the classification result, and manages and displays it visually;
S303: the user freely drags n R operators from the classification directory into a workflow editing area and connects them according to the required logical relation, which completes the construction of the machine learning operator.
3. The R language-based machine learning modeling platform architecture design method according to claim 1, characterized in that: in step S202, the Oozie component allocates the R operators as follows:
S401: writing a Shell script file for each R operator, the script receiving the configuration parameters set for the R operator in the modeling platform;
S402: uploading all Shell script files to the storage address configured for the Oozie component in the HDFS component;
S403: according to the logical relation of the machine learning operator, the scheduling unit of the Oozie component generates an Oozie scheduling configuration file that invokes the Shell script file corresponding to each R operator;
S404: starting the Oozie component, which schedules and distributes the Shell script files corresponding to the R operators across the Hadoop cluster computing nodes according to the Oozie scheduling configuration file.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201710684578.7A CN107423823B (en) | 2017-08-11 | 2017-08-11 | R language-based machine learning modeling platform architecture design method

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201710684578.7A CN107423823B (en) | 2017-08-11 | 2017-08-11 | R language-based machine learning modeling platform architecture design method

Publications (2)

Publication Number | Publication Date
CN107423823A (en) | 2017-12-01
CN107423823B (en) | 2020-11-10

Family

ID=60437649

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201710684578.7A | R language-based machine learning modeling platform architecture design method (Active; CN107423823B (en)) | 2017-08-11 | 2017-08-11

Country Status (1)

Country | Link
CN (1) | CN107423823B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109063842A (en) * | 2018-07-06 | 2018-12-21 | 无锡雪浪数制科技有限公司 | A kind of machine learning platform of compatible many algorithms frame
CN109920486B (en) * | 2019-01-10 | 2021-03-16 | 江苏理工学院 | Method for improving molecular dynamics batch modeling efficiency based on Shell language
CN109871809A (en) * | 2019-02-22 | 2019-06-11 | 福州大学 | A kind of machine learning process intelligence assemble method based on semantic net
CN114072820A (en) * | 2019-06-04 | 2022-02-18 | 瑞典爱立信有限公司 | Executing machine learning models
CN111240662B (en) * | 2020-01-16 | 2024-01-09 | 同方知网(北京)技术有限公司 | Spark machine learning system and method based on task visual drag
CN111753040A (en) * | 2020-06-30 | 2020-10-09 | 北京超图软件股份有限公司 | Method, device and system for processing geospatial data
CN111948992B (en) * | 2020-08-05 | 2021-09-10 | 上海微亿智造科技有限公司 | Method and system for performing multistage progressive modeling on industrial batch type big data
CN113542352B (en) * | 2021-06-08 | 2024-04-09 | 支付宝(杭州)信息技术有限公司 | Node joint modeling method and node

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN103024027A (en) * | 2012-12-07 | 2013-04-03 | 中国电信股份有限公司云计算分公司 | Data mining achieving method and system based on cloud computing
CN103399887A (en) * | 2013-07-19 | 2013-11-20 | 蓝盾信息安全技术股份有限公司 | Query and statistical analysis system for mass logs
CN103838617A (en) * | 2014-02-18 | 2014-06-04 | 河海大学 | Method for constructing data mining platform in big data environment
CN104714830A (en) * | 2015-04-03 | 2015-06-17 | 普元信息技术股份有限公司 | System and method for achieving cross-platform application development based on native development language

Also Published As

Publication number | Publication date
CN107423823A (en) | 2017-12-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant