CN112306586A - Data processing method, device, equipment and computer storage medium - Google Patents

Data processing method, device, equipment and computer storage medium

Info

Publication number
CN112306586A
Authority
CN
China
Prior art keywords
data
engine
data processing
middleware
execution request
Prior art date
Legal status
Pending
Application number
CN202011314975.3A
Other languages
Chinese (zh)
Inventor
吴梓煜
周可
卢明杰
邸帅
卢道和
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202011314975.3A
Publication of CN112306586A
Priority to PCT/CN2021/130864
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/448Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4488Object-oriented
    • G06F9/449Object-oriented method invocation or resolution

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

An embodiment of the application provides a data processing method, a data processing apparatus, electronic equipment, and a computer storage medium. The method comprises the following steps: in an interactive environment of the Python language, sending an engine execution request to data computing middleware based on a Magic function, wherein the data computing middleware is used for calling at least two data processing engines, and each data processing engine is a computing engine or a storage engine; calling the data processing engine corresponding to the engine execution request through the data computing middleware; and performing data processing on pre-acquired data to be processed based on the called data processing engine, and/or managing the data processing engine corresponding to the engine execution request.

Description

Data processing method, device, equipment and computer storage medium
Technical Field
The present application relates to the field of financial technology (Fintech) big data, and relates to, but is not limited to, a data processing method, apparatus, electronic device, and computer storage medium.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology; however, the financial industry's requirements for security and real-time performance also place higher demands on these technologies.
Currently, in the field of financial technology, if a computing engine or a storage engine needs to be called in an interactive environment of the Python language, the computing engine or storage engine may be connected either directly or through the Apache Livy service. However, in the scheme of connecting the computing engine or storage engine directly, a dependency package needs to be installed for each set of computing or storage engines, which is complex to implement; and the scheme of connecting the computing engine or storage engine through the Apache Livy service can only be realized for the SPARK distributed computing engine, so the application range is limited.
Disclosure of Invention
Embodiments of the present application provide a data processing method and apparatus, an electronic device, and a computer storage medium, which can solve the prior-art problem that implementation is complex or is limited to the SPARK distributed computing engine.
The technical scheme of the embodiment of the application is realized as follows:
an embodiment of the present application provides a data processing method, including:
in an interactive environment of Python language, determining a Magic function according to a data processing engine needing to be called, and sending an engine execution request to a target interface of data computing middleware based on the Magic function, wherein the data computing middleware is used for realizing the calling of at least two data processing engines, and the data processing engines are computing engines or storage engines; the target interface is an interface determined according to the Magic function;
calling a data processing engine corresponding to the engine execution request through the data computing middleware; and performing data processing on the data to be processed acquired in advance based on the called data processing engine, and/or managing the data processing engine corresponding to the engine execution request.
In some embodiments of the present application, the data computing middleware comprises a resource manager;
the method further comprises the following steps: and in the process of carrying out data processing on the pre-acquired data to be processed and/or managing the data processing engine corresponding to the engine execution request, carrying out resource isolation management on resources required by the data processing engine according to preset resource attributes by using the resource manager.
In some embodiments of the present application, the predetermined resource attribute is a user to which the resource belongs or a source of the resource.
In some embodiments of the present application, the data computing middleware further comprises at least one engine manager;
the calling of the data processing engine corresponding to the engine execution request through the data computing middleware comprises:
in a case where the resource manager allows creation of a resource for the engine execution request, creating, with the at least one engine manager, a program for calling a data processing engine, and calling the data processing engine corresponding to the engine execution request based on the created program.
In some embodiments of the present application, the method further comprises:
acquiring load information of the at least one engine manager through the resource manager; determining a target engine manager for receiving the engine execution request among the at least one engine manager according to the load information of the at least one engine manager;
said creating with said at least one engine manager a program for invoking a data processing engine, comprising:
creating, with the target engine manager, a program for invoking a data processing engine.
In some embodiments of the present application, the data computing middleware further comprises a gateway and at least one ingress node;
the sending an engine execution request to a target interface of data computing middleware based on the Magic function comprises: sending an engine execution request to the gateway of the data computing middleware based on the Magic function and through the target interface;
the obtaining, by the resource manager, load information of the at least one engine manager includes:
when the gateway forwards the engine execution request to a corresponding entry node of the at least one entry node according to the identifier of the data processing engine carried by the engine execution request, sending a load information acquisition request to the resource manager by using the corresponding entry node of the at least one entry node to acquire the load information of the at least one engine manager.
In some embodiments of the present application, the method further comprises:
and synchronizing the data to be processed and/or variables between the interactive environment of the Python language and the data computing middleware through a newly-added file transmission interface.
In some embodiments of the present application, the Python language interactive environment comprises a synchronization module;
the method further comprises the following steps:
and mounting the same storage system on the synchronization module and the data calculation middleware, and realizing synchronization of the data to be processed and/or the variable between the interaction environment of the Python language and the data calculation middleware through data read-write operation on the storage system.
In some embodiments of the present application, the implementing synchronization of the to-be-processed data and/or the variable between the interaction environment of the Python language and the data computing middleware by performing data read/write operations on the storage system includes:
in an interactive environment of Python language, acquiring a first code for enabling a data calculation middleware to store data to the storage system, sending the first code to the data calculation middleware, and storing the data to be processed and/or the variable to the storage system by using the data calculation middleware;
and in an interactive environment of Python language, loading the data to be processed and/or the variable stored in the storage system by using a synchronization module.
In some embodiments of the present application, saving the to-be-processed data and/or the variable to a storage system by using the data computing middleware includes:
compressing the data to be processed and/or the variable by using the data calculation middleware to obtain compressed data; and saving the compressed data to the storage system by using the data computing middleware.
In some embodiments of the present application, the implementing synchronization of the to-be-processed data and/or the variable between the interaction environment of the Python language and the data computing middleware by performing data read/write operations on the storage system includes:
in an interactive environment of Python language, the data to be processed and/or the variable are/is saved in the storage system by utilizing the synchronization module;
in an interactive environment of Python language, acquiring a second code for enabling a data computing middleware to read data from the storage system, sending the second code to the data computing middleware, and loading the to-be-processed data and/or the variable stored in the storage system by using the data computing middleware.
An embodiment of the present application provides a data processing apparatus, the apparatus includes:
a first processing module, configured to determine a Magic function according to a data processing engine to be called in an interactive environment of the Python language, and to send an engine execution request to a target interface of data computing middleware based on the Magic function, wherein the data computing middleware is used for calling at least two data processing engines, and each data processing engine is a computing engine or a storage engine; the target interface is an interface determined according to the Magic function;
a second processing module, configured to call a data processing engine corresponding to the engine execution request through the data computing middleware, and to perform data processing on pre-acquired data to be processed based on the called data processing engine, and/or manage the data processing engine corresponding to the engine execution request.
In some embodiments of the present application, the data computing middleware comprises a resource manager;
the second processing module is further configured to perform resource isolation management on resources required by the data processing engine according to a preset resource attribute by using the resource manager in a process of performing data processing on the pre-acquired data to be processed and/or managing the data processing engine corresponding to the engine execution request.
In some embodiments of the present application, the predetermined resource attribute is a user to which the resource belongs or a source of the resource.
In some embodiments of the present application, the data computing middleware further comprises at least one engine manager;
the second processing module is configured to invoke, through the data computing middleware, a data processing engine corresponding to the engine execution request, and includes:
in a case where the resource manager allows creation of a resource for the engine execution request, creating, with the at least one engine manager, a program for calling a data processing engine, and calling the data processing engine corresponding to the engine execution request based on the created program.
In some embodiments of the present application, the second processing module is further configured to:
acquiring load information of the at least one engine manager through the resource manager; determining a target engine manager for receiving the engine execution request among the at least one engine manager according to the load information of the at least one engine manager;
said creating with said at least one engine manager a program for invoking a data processing engine, comprising:
creating, with the target engine manager, a program for invoking a data processing engine.
In some embodiments of the present application, the data computing middleware further comprises a gateway and at least one ingress node;
the first processing module is used for sending an engine execution request to a target interface of the data computing middleware based on the Magic function, and comprises: sending an engine execution request to the gateway of the data computing middleware based on the Magic function and through the target interface;
the second processing module is configured to obtain load information of the at least one engine manager through the resource manager, and includes:
when the gateway forwards the engine execution request to a corresponding entry node of the at least one entry node according to the identifier of the data processing engine carried by the engine execution request, sending a load information acquisition request to the resource manager by using the corresponding entry node of the at least one entry node to acquire the load information of the at least one engine manager.
In some embodiments of the present application, the first processing module is further configured to implement synchronization of the data to be processed and/or variables between the interaction environment of the Python language and the data computing middleware through an additional file transfer interface.
In some embodiments of the present application, the Python language interactive environment comprises a synchronization module;
the first processing module is further configured to mount the same storage system on the synchronization module and the data computing middleware, and to implement synchronization of the data to be processed and/or the variable between the Python language interaction environment and the data computing middleware through data read-write operations on the storage system.
In some embodiments of the present application, the first processing module is configured to implement synchronization of the to-be-processed data and/or the variable between the interaction environment of the Python language and the data computing middleware through data read-write operation on the storage system, and includes:
in an interactive environment of Python language, acquiring a first code for enabling a data calculation middleware to store data to the storage system, sending the first code to the data calculation middleware, and storing the data to be processed and/or the variable to the storage system by using the data calculation middleware;
and in an interactive environment of Python language, loading the data to be processed and/or the variable stored in the storage system by using a synchronization module.
In some embodiments of the present application, the first processing module, configured to save the data to be processed and/or the variable to a storage system by using the data computing middleware, includes:
compressing the data to be processed and/or the variable by using the data calculation middleware to obtain compressed data; and saving the compressed data to the storage system by using the data computing middleware.
In some embodiments of the present application, the first processing module is configured to implement synchronization of the to-be-processed data and/or the variable between the interaction environment of the Python language and the data computing middleware through data read-write operation on the storage system, and includes:
in an interactive environment of Python language, the data to be processed and/or the variable are/is saved in the storage system by utilizing the synchronization module;
in an interactive environment of Python language, acquiring a second code for enabling a data computing middleware to read data from the storage system, sending the second code to the data computing middleware, and loading the to-be-processed data and/or the variable stored in the storage system by using the data computing middleware.
An embodiment of the present application provides an electronic device, which includes:
a memory for storing executable instructions;
and the processor is used for realizing any one of the data processing methods when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium, which stores executable instructions and is configured to, when executed by a processor, implement any one of the data processing methods described above.
In the embodiment of the application, in an interactive environment of Python language, a Magic function is determined according to a data processing engine needing to be called, an engine execution request is sent to a target interface of data computing middleware based on the Magic function, the data computing middleware is used for realizing the calling of at least two data processing engines, and the data processing engines are computing engines or storage engines; the target interface is an interface determined according to the Magic function; calling a data processing engine corresponding to the engine execution request through the data computing middleware; and performing data processing on the data to be processed acquired in advance based on the called data processing engine, and/or managing the data processing engine corresponding to the engine execution request.
It can be seen that, in the embodiment of the application, the Magic function can be determined according to the requirement for calling a data processing engine, the target interface of the data computing middleware can be determined according to the determined Magic function, and various types of computing engines or storage engines can then be called; therefore, a dependency package does not need to be installed for each set of computing or storage engines, the framework is not limited to the SPARK distributed computing engine, the implementation is simple, and the application range is expanded.
Drawings
FIG. 1 is a schematic diagram of a scene architecture of an interactive environment of Python language in the related art;
FIG. 2 is an alternative flow chart of a data processing method provided by the embodiments of the present application;
FIG. 3 is a diagram illustrating an overall architecture for implementing a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a Linkis magic module in an embodiment of the present application;
FIG. 5 is a block flow diagram of a call engine in an embodiment of the present application;
FIG. 6 is an alternative schematic diagram of the implementation of data and/or variable synchronization in embodiments of the present application;
FIG. 7 is another alternative schematic diagram of implementing data and/or variable synchronization in an embodiment of the present application;
FIG. 8 is a further alternative schematic diagram of implementing data and/or variable synchronization in an embodiment of the present application;
FIG. 9 is a schematic diagram of an alternative structure of a data processing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an alternative structure of an electronic device provided in an embodiment of the present application.
Detailed Description
The Python language is the most commonly used language in the fields of machine learning and data analysis. FIG. 1 is a schematic diagram of a scene architecture of an interactive environment of the Python language in the related art. As shown in FIG. 1, a user usually performs data analysis in an interactive environment of the Python language, such as an IPython environment; an exemplary IPython environment is the Jupyter Lab interactive development environment, and modeling operations in the field of machine learning are usually carried out in a Graphics Processing Unit (GPU) cluster. In FIG. 1, Jupyter Lab runs in the GPU cluster, and functions such as IPython Runtime, Hadoop Setting, MYSQL Setting, SPARK Setting, and HIVE Setting can be realized on Jupyter Lab; engines such as the Hadoop Distributed File System (HDFS), HIVE, SPARK, ANACONDA, Hadoop, TiDB, and HBASE can be connected to Jupyter Lab.
In fields such as finance and marketing, most data are structured data stored in a HIVE data warehouse based on a Hadoop cluster, and a user needs to synchronize the data to the corresponding environment before data analysis and modeling. Users also need to perform computation using big data computing engines. Therefore, a user needs to install the configurations of various engines and connect various types of computing and storage engines, such as HDFS, MYSQL, SPARK, and HIVE, through different functions; meanwhile, such connection resources need to be actively maintained by the user, and the data pulled from different types of storage engines is returned in different formats and needs to be further converted into the data formats required for data analysis and modeling. To use the various storage and computing engines in the IPython environment, corresponding functions therefore need to be implemented for data analysis and modeling.
In the related art, if the interactive environment of Python language needs to call a computing engine or a storage engine, the following two schemes can be adopted.
1) Connecting a computing engine or a storage engine in a direct connection mode; specifically, the dependency packages of the various computing and storage engines on the execution side can be configured in the environment where Jupyter Lab (IPython) runs, for example, dependency packages related to configurations such as Hadoop, SPARK, and HIVE are installed, and different computing engines or storage engines are connected by installing the corresponding Python packages.
2) Connecting a computing engine or a storage engine through the Apache Livy service; the Apache Livy service has the capability of connecting the SPARK distributed computing engine, can manage a plurality of connections, such as SPARK Sessions, and provides REST interfaces for executing code and creating connections. Other services can perform SPARK engine computations on a Hadoop cluster by sending HyperText Transfer Protocol (HTTP) requests to Livy.
In the scheme of connecting a computing engine or a storage engine in a direct connection mode, a corresponding dependency package has to be installed for each set of computing or storage engines; meanwhile, the scheme lacks unified management of connection resources, and the connection resources must be actively released by users.
The scheme of connecting a computing engine or a storage engine through the Apache Livy service can only connect the SPARK computing engine, and storage or computing engines can only be extended under the SPARK framework, so the functionality is limited.
In view of the above technical problems, the technical solutions of the embodiments of the present application are provided.
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The data processing method of the embodiment of the application can be applied to an electronic device. The electronic device can be implemented as a server, which may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Networks (CDN), and big data and artificial intelligence platforms.
The following describes an exemplary data processing method according to an embodiment of the present application.
Fig. 2 is an optional flowchart of the data processing method according to the embodiment of the present application, and as shown in fig. 2, the flowchart may include:
step 201: in an interactive environment of Python language, determining a Magic function according to a data processing engine needing to be called, sending an engine execution request to a target interface of a data computing middleware based on the Magic function, wherein the data computing middleware is used for realizing the calling of at least two data processing engines, and the data processing engines are computing engines or storage engines; the target interface is an interface determined according to the Magic function.
In the embodiment of the present application, the interactive environment in Python language may be Jupyter Notebook (Notebook), Jupyter Lab, or other types of interactive environments.
Here, Jupyter Notebook (previously called IPython Notebook) is an interactive notebook; it is essentially a Web application that facilitates creating and sharing literate-programming documents, and supports real-time code, mathematical equations, visualization, and the lightweight markup language Markdown. Its applications include data cleaning and conversion, numerical simulation, statistical modeling, machine learning, and the like.
In the embodiment of the application, the data processing engine to be called can be determined according to the data processing requirement, and a Magic function is then generated; the Magic function is applied in Python-language tools such as Jupyter and IPython, and is an encapsulated function that is convenient for data analysts to use.
The data computing middleware may be Linkis or other data computing middleware, where Linkis is a data computing middleware that connects multiple computing and storage engines, provides a unified Restful interface to the outside, and accepts the submission and execution of scripts in Structured Query Language (SQL), Pyspark, HiveQL, Scala, and the like. In some embodiments, Linkis may also provide services such as global User-Defined Functions (UDF) and material management in its public services (publicservice).
In the embodiment of the application, the data processing engine may be an engine such as HDFS, HIVE, SPARK, ANACONDA, Hadoop, TiDB, HBASE, and the like.
In some embodiments, the engine execution request may be a request for data processing using a computing engine or a storage engine; for example, when the Magic function is a function such as %%spark, %%pyspark, %%hive, %%sparkql, or %python, the function of the Magic function is to use a computing engine or a storage engine.
In some embodiments, the engine execution request may be a request to manage a computing engine or a storage engine; for example, when the Magic function is a function such as %enginelist, %log, %status, %enginekill, %jobkill, or %joblist, the function of the Magic function is management of a computing engine or a storage engine.
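For illustration, a management request of this kind might be issued from a notebook cell as follows; this is a sketch only, in which the magic names are those listed above and no arguments are assumed, since the exact signatures are not given here:
%enginelist
%status
%log
%jobkill
Each line magic, when executed, causes the encapsulated function to send the corresponding engine execution request to the data computing middleware.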
In some embodiments, a Linkis Magic module that interacts with the data computing middleware may be established in the interactive environment of the Python language; referring to fig. 3, in the interactive environment of the Python language, a Linkis Magic Package may be created based on the IPython Runtime, and the Linkis Magic module may be established based on the Linkis Magic Package.
The Linkis Magic module is connected with the big data computing middleware Linkis, and can submit and execute various types of tasks through the encapsulated Magic functions; tasks submitted and executed by the Linkis Magic module include, but are not limited to, tasks of big data clusters, such as SPARK tasks or HIVE tasks; in some embodiments, the Linkis Magic module may also submit and execute Python, Shell, and other tasks by interfacing with the big data computing middleware Linkis.
The Linkis Magic module interfaces with the big data computing middleware Linkis through a complete set of Magic functions. Referring to fig. 3 and 4, the Linkis Magic module comprises an encapsulation function module (also called a Magic module), which is the uppermost-layer module of the Linkis Magic module; the encapsulation function module defines a set of IPython Magic functions, and a user can interact with the big data computing middleware Linkis through the Magic functions defined by the encapsulation function module.
The principle is that when a Cell is declared with a Magic function, the corresponding function is invoked for that Cell; the function receives the code of the Cell and runs it. IPython provides a standard mechanism for custom Magic functions: a custom Python class is annotated as a magic class and is loaded when used. The Magic module is the uppermost-layer module of Linkis Magic, and a user interacts with Linkis by calling this module.
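As an illustration of this mechanism only (not the Linkis Magic implementation itself; the class name, magic name, and behavior are assumptions), a custom cell magic can be declared and registered in IPython as follows:
from IPython import get_ipython
from IPython.core.magic import Magics, magics_class, cell_magic

@magics_class
class DemoEngineMagics(Magics):
    @cell_magic
    def pyspark(self, line, cell):
        # the function receives the code of the Cell as a string; instead of
        # running it locally, it could be forwarded to the data computing middleware
        print("would submit to the middleware:", cell)

# load the custom magics into the running IPython session
get_ipython().register_magics(DemoEngineMagics)
After registration, a cell beginning with %%pyspark is handed to the pyspark method above rather than being executed as ordinary Python code.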
The Linkis Magic module further comprises a Linkis client module (also called the Linkis Client module), which connects the encapsulation function module with the big data computing middleware Linkis. The Linkis client module can receive a Magic function provided by the encapsulation function module and can interact with the big data computing middleware Linkis through HTTP requests or Web Socket (WS) requests; in this case, the encapsulation function module interacts with a target interface of the big data computing middleware Linkis called by the Linkis client module. The target interface may be an upper-layer interface such as a Rest interface; in some embodiments, the Linkis client module may also encapsulate the Linkis upper-layer interfaces that need to be used.
In some embodiments, the Linkis client module interfaces various upper layer interfaces provided by the big data computing middleware Linkis, and the following lists the main target interfaces interfaced by the Magic function:
/api/rest_j/v1/entrance/${execID}/status corresponds to %status
/api/rest_j/v1/entrance/${execID}/log?fromLine=${fromLine}&size=${size} corresponds to %log
/api/rest_j/v1/entrance/${execID}/progress corresponds to %progress
/api/rest_j/v1/entrance/${execID}/kill corresponds to %jobkill
In practical application, an HTTP POST request for submitting execution can be sent to the big data computing middleware Linkis, and in some embodiments, the format of the request data is the JavaScript Object Notation (JSON) format.
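For illustration, such a submission might look like the following sketch; the execute endpoint, the request fields other than executeApplicationName, and the response layout are assumptions made here for readability, while the status path is the one listed above:
import requests

GATEWAY = "http://linkis-gateway:9001"  # assumed gateway address

# submit code for execution (endpoint name and most fields are assumptions)
resp = requests.post(
    GATEWAY + "/api/rest_j/v1/entrance/execute",
    json={
        "executeApplicationName": "spark",                 # engine identifier
        "executionCode": 'spark.sql("show tables").show()',
        "runType": "pyspark",
    },
)
exec_id = resp.json()["data"]["execID"]                    # assumed response layout

# poll the task status via the interface listed above (corresponds to %status)
status = requests.get(GATEWAY + "/api/rest_j/v1/entrance/" + exec_id + "/status").json()
print(status)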
Step 202: calling a data processing engine corresponding to the engine execution request through the data computing middleware; and performing data processing on the data to be processed acquired in advance based on the called data processing engine, and/or managing the data processing engine corresponding to the engine execution request.
In the embodiment of the application, the data computing middleware can determine the type of an engine execution request, and when the engine execution request is a request for data processing by using a computing engine or a storage engine, data processing can be performed on pre-acquired data to be processed based on a called data processing engine; when the engine execution request is a request for managing a calculation engine or a storage engine, a data processing engine corresponding to the engine execution request may be managed.
In practical applications, steps 201 to 202 may be implemented by a processor of an electronic device, and the processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above processor function may also be another electronic device, and the embodiments of the present application are not limited thereto.
It can be seen that in the embodiment of the present application, various types of compute engines or storage engines can be called based on the Magic function, so that it is not necessary to install a dependency package for each set of compute engine or storage engine, and the implementation is not limited to the SPARK distributed compute engine, which is simple and extends the application range.
Further, the embodiment of the application can view or manage task logs, task states and the like of various tasks based on the Magic function, wherein the tasks can be tasks such as data analysis or machine learning modeling.
In some embodiments of the present application, the data computing middleware comprises a Resource manager (Resource Management).
The data processing method further comprises the following steps: and in the process of carrying out data processing on the pre-acquired data to be processed and/or managing the data processing engine corresponding to the engine execution request, carrying out resource isolation management on resources required by the data processing engine according to preset resource attributes by using a resource manager.
Here, the preset resource attribute may be a user to which the resource belongs, or may be a source of the resource; the source of a resource represents a device or system that provides the resource; in some embodiments, for a resource from the same source, the number of users to which the resource belongs may be one or multiple.
In the related art, different resource isolation cannot be realized by adopting a scheme of connecting a computing engine or a storage engine in a direct connection mode and a scheme of connecting the computing engine or the storage engine by adopting an Apache Livy service; in the embodiment of the present application, by setting the resource manager, resource isolation management can be performed according to the preset resource attribute, and the management and control capability of the resource required by the data processing engine can be enhanced.
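As an illustration of the resource-isolation idea only (the data structures, quota, and keys are assumptions; this is not the middleware's actual implementation), resources can be grouped by the preset resource attribute, that is, the owning user or the source, and each group checked against its own quota:
from collections import defaultdict

class DemoResourceManager:
    def __init__(self, quota_per_key: int):
        self.quota = quota_per_key
        self.used = defaultdict(int)   # resource count per user or per source

    def allow_create(self, key: str) -> bool:
        # a heavy load by one user or source does not affect the others
        return self.used[key] < self.quota

    def create(self, key: str) -> None:
        if not self.allow_create(key):
            raise RuntimeError("resource quota exhausted for " + key)
        self.used[key] += 1

manager = DemoResourceManager(quota_per_key=2)
manager.create("user_a")
manager.create("user_b")   # user_b's resources are isolated from user_a's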
In some embodiments of the present application, the data computing middleware further comprises at least one Engine manager (Engine Management).
Accordingly, the implementation manner of calling the data processing engine corresponding to the engine execution request through the data computing middleware may be:
in the case where the resource manager allows creation of a resource for an engine execution request, a program for calling a data processing engine is created with at least one engine manager, and the data processing engine corresponding to the engine execution request is called based on the created program.
In one embodiment, after the resource manager of the data computing middleware receives an engine execution request, it may determine whether to allow creation of a resource for the engine execution request; for example, when it is determined that the user corresponding to the engine execution request can create a new resource, it is determined that creation of the resource for the engine execution request is allowed; otherwise, it is determined that creation of the resource for the engine execution request is not allowed.
In the case where the resource manager allows creation of a resource for an engine execution request, the resource manager may transmit response information allowing creation of the resource to at least one engine manager, and the at least one engine manager may create a program for calling the data processing engine upon receiving the response information.
It can be seen that the embodiment of the application can allow the data processing engine corresponding to the engine execution request to be called under the condition that the resource is created according to the engine execution request, so that the reliability of calling the data processing engine can be increased.
In some embodiments of the present application, the data processing method further includes:
acquiring load information of at least one engine manager through a resource manager; and determining a target engine manager for receiving the engine execution request in the at least one engine manager according to the load information of the at least one engine manager.
Accordingly, creating a program for invoking a data processing engine with at least one engine manager may include: a program for invoking a data processing engine is created with a target engine manager.
In the embodiment of the application, there may be one or more engine managers in the data computing middleware; the load information of each engine manager is used for indicating the load state of the engine manager in calling data processing engines, and when the load of an engine manager is heavy, the engine manager is not suitable for processing a new engine execution request.
It can be seen that, according to the embodiment of the application, the engine manager for calling the data processing engine can be reasonably selected based on the load information of at least one engine manager, so that the load balance of each engine manager is favorably realized.
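As an illustration of the selection idea only (the identifiers and load values are assumptions), the engine manager with the lowest reported load can be chosen as the target engine manager:
def pick_target_engine_manager(load_info: dict) -> str:
    # load_info maps an engine manager identifier to its current load
    return min(load_info, key=load_info.get)

loads = {"engine-manager-1": 0.72, "engine-manager-2": 0.31, "engine-manager-3": 0.55}
target = pick_target_engine_manager(loads)   # -> "engine-manager-2"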
In some embodiments of the present application, the data computing middleware further comprises a Gateway (Gateway) and at least one ingress node; illustratively, the portal node may be a SPARK portal (Entrance) node.
Accordingly, sending an engine execution request to a target interface of the data computing middleware based on the Magic function, comprising: sending an engine execution request to a gateway of the data computing middleware through a target interface based on a Magic function;
The implementation manner of obtaining the load information of the at least one engine manager through the resource manager may be: when the gateway forwards the engine execution request to the corresponding entry node of the at least one entry node according to the identifier of the data processing engine carried by the engine execution request, the corresponding entry node sends a load information acquisition request to the resource manager to acquire the load information of the at least one engine manager.
It can be seen that, in the embodiment of the present application, load information of at least one engine manager can be obtained more easily through data interaction of the gateway, the ingress node, and the resource manager.
The overall flow of the Linkis magic module executing Pyspark code via Linkis is described in detail below with reference to the accompanying drawings.
The Linkis Magic module may execute Cell code via the following Magic function:
%%pyspark
spark.sql("show tables").show()
The encapsulation function module can transmit the Cell code to the Linkis client module, and the Linkis client module can format the code or add variable conversion code as needed to obtain preliminarily processed code; in some embodiments, the Linkis client module may encapsulate the preliminarily processed code into an HTTP request format and send the engine execution request, encapsulated in the HTTP request format, to the Linkis server on which the big data computing middleware Linkis is installed.
Referring to fig. 3 and 5, after receiving an engine execution request, a gateway of the big data computing middleware Linkis may forward the engine execution request to a corresponding entry node according to an identifier of a data processing engine carried in the engine execution request, where the identifier of the data processing engine may be recorded as an application name (executeApplicationName) to be executed.
After receiving the engine execution request, the entry node performs operations such as data persistence and parameter check on the engine execution request, and then the entry node may send a load information acquisition request to the resource manager, so as to acquire load information of each engine manager in the big data computing middleware Linkis, and the entry node may also forward the engine execution request to the corresponding engine manager according to the load information of each engine manager.
After receiving the engine execution request, the engine manager may send the engine execution request to the resource manager, and in a case that the resource manager allows creating a resource for the engine execution request, the engine manager creates a program for invoking the data processing engine, where the program created by the engine manager may be referred to as a SPARK engine (SparkEngine); the program is managed by the engine manager throughout its life cycle.
The program created by the engine manager is essentially a Java command program, and the SPARK Driver can run in the program created by the engine manager; in some embodiments, referring to fig. 5, the SPARK Driver may communicate with an Executor on the YARN cluster through Remote Procedure Call (RPC) requests to obtain the execution result.
It can be seen that, in the embodiment of the application, each module of the Linkis Magic module may interact with the big data computing middleware Linkis; after receiving an engine execution request from the Linkis Magic module, the big data computing middleware Linkis may forward the engine execution request to the corresponding entry node and then create a program for calling the engine, and different types of engine managers may create different Java command programs to run different engine services.
Based on the data processing process of the big data computing middleware Linkis, all engine execution requests can be executed through Linkis via the Linkis Magic module; connection resources can be managed and controlled by the resource manager, and idle resources can be released upon timeout. Meanwhile, the resource manager can also manage the resources to realize resource isolation, meeting the requirement of executing various types of data processing engines from IPython.
In some embodiments of the present application, synchronization of data to be processed and/or variables needs to be achieved between an interactive environment in Python language and the data computing middleware, where the data to be processed represents data that needs to be processed by a data processing engine, and the variables represent variables used by the data processing engine.
In some scenarios, if the data and/or variables to be processed on the data computing middleware need to be used in the interactive environment of the Python language, they need to be loaded in the interactive environment of the Python language through synchronization of the data and/or variables to be processed. In the related art, the Apache Livy service keeps the data processing process entirely on the SPARK cluster side and does not consider the user's need to compute, analyze, and visually inspect the data elsewhere, such as in the interactive environment of the Python language. If the data and/or variables to be processed on the data computing middleware need to be used in the interactive environment of the Python language, then on the basis of the scheme of connecting a computing engine or a storage engine through the Apache Livy service, the data and/or variables to be processed can be queried in the interactive environment of the Python language to obtain a query result. The query result is usually in the JSON format, which is commonly used as an object serialization means; this format is inconsistent with the initial format of the data and usually needs further conversion, and the result of a single HTTP request is limited, so data analysis and modeling on the side of the interactive environment of the Python language can only be performed by actively transferring files or transferring data multiple times, and array variables finally need to be reconstructed.
In other scenarios, if the data and/or variables to be processed in the interaction environment of Python language are required to be used in the data computing middleware, the data and/or variables to be processed need to be loaded in the data computing middleware through synchronization of the data and/or variables to be processed.
In some embodiments of the present application, synchronization of the to-be-processed data and/or the variables may be implemented between an interactive environment in Python language and a data computing middleware through a newly added file transmission interface.
Here, the newly added file transmission interface may be any interface for interaction between the interactive environment of the Python language and the data computing middleware, and the embodiments of the present application are not limited thereto.
It can be seen that, in the embodiment of the present application, synchronization of data and/or variables to be processed can be easily achieved between an interaction environment of Python language and data computing middleware by adding a new file transmission interface.
In some embodiments of the present application, for the scenarios in which data and/or variables to be processed need to be synchronized, a synchronization module may be provided in the interactive environment of the Python language;
correspondingly, the data processing method further comprises the following steps:
and mounting a storage system with the same synchronization module and the data calculation middleware, and realizing the synchronization of the data to be processed and/or the variable between the interaction environment of Python language and the data calculation middleware through data read-write operation of the storage system.
Here, the type of the storage system may be determined according to actual requirements. For example, the storage system may be the distributed file storage cluster Ceph; after the synchronization module and the data computing middleware both mount Ceph (which may be done with the mount command), the corresponding data and/or variables may be saved directly in Ceph, and when synchronization of the data and/or variables to be processed is required, it can be achieved by reading the data in Ceph. Ceph can support synchronization of data and/or variables with a large data volume, making synchronization easier to implement.
It can be seen that, in the embodiment of the present application, synchronization of data and/or variables to be processed can be easily achieved by mounting the storage system.
For an implementation manner of realizing synchronization of the data and/or variables to be processed between the interactive environment of the Python language and the data computing middleware through data read-write operations on the storage system, exemplarily, a first code for causing the data computing middleware to save data to the storage system can be acquired in the interactive environment of the Python language, the first code is sent to the data computing middleware, and the data and/or variables to be processed are saved to the storage system by the data computing middleware;
and in an interactive environment of Python language, loading the data and/or variables to be processed stored in the storage system by using a synchronization module.
It can be seen that, in the embodiment of the present application, through sending of the first code, the data computing middleware can store the data and/or variables to be processed in the storage system, thereby facilitating the interactive environment of the Python language in loading the data and/or variables to be processed and achieving synchronization of the data and/or variables to be processed.
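A minimal sketch of this direction of synchronization, assuming both sides have mounted the same storage path and that the variable fits the Pickle format (the mount point and names are illustrative; the description below also mentions a CSV path for non-Python engines):
import pickle

SHARED_PATH = "/mnt/ceph/shared/variable.pkl"   # assumed common mount point

# "first code" executed on the data computing middleware side: save the variable
def save_variable(variable, path=SHARED_PATH):
    with open(path, "wb") as f:
        pickle.dump(variable, f)

# synchronization module on the IPython side: load the saved variable
def load_variable(path=SHARED_PATH):
    with open(path, "rb") as f:
        return pickle.load(f)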
In some embodiments of the present application, the implementation manner of saving the data and/or variables to be processed to the storage system by using the data computing middleware may be: compressing the data and/or variables to be processed by using the data computing middleware to obtain compressed data; and saving the compressed data to the storage system by using data calculation middleware.
In some scenarios, on the side of the big data computing middleware Linkis, variables of the Pandas DataFrame and NumPy Array types can be used; these variables are usually represented with the maximum number of bits (64 bits for floating-point types, 32 bits for integer types), which occupies a large amount of memory. The interactive environment side of the Python language cannot support arrays that occupy such a large amount of memory, so before a variable is saved, the data containing the variable is compressed: the array is compressed according to the maximum bit width actually required by its values, and the array is then saved.
It can be seen that, by performing compression operation on the data and/or variables to be processed, the embodiment of the present application is beneficial to loading the data and/or variables to be processed on the interactive environment side of the Python language, and can reduce the loading and transmission costs.
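A minimal sketch of this kind of compression, assuming a Pandas DataFrame with numeric columns (the downcasting strategy shown is only one possible realization, not the middleware's exact rule):
import pandas as pd

def compress_numeric_frame(df: pd.DataFrame) -> pd.DataFrame:
    # downcast each numeric column to the smallest type that can hold its values
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")    # e.g. float64 -> float32
        elif pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")  # e.g. int64 -> int8/16/32
    return out

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
print(compress_numeric_frame(df).dtypes)   # smaller dtypes, smaller memory footprint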
For another implementation manner of realizing synchronization of the data and/or variables to be processed between the interactive environment of the Python language and the data computing middleware through data read-write operations on the storage system, illustratively, the data and/or variables to be processed can be saved to the storage system by the synchronization module in the interactive environment of the Python language; then, in the interactive environment of the Python language, a second code for causing the data computing middleware to read data from the storage system is acquired and sent to the data computing middleware, and the data and/or variables to be processed that are stored in the storage system are loaded by the data computing middleware.
It can be seen that, in the embodiment of the present application, the data and/or variables to be processed can be stored in the storage system by the synchronization module, and, through sending of the second code, the data computing middleware is facilitated in loading the data and/or variables to be processed, thereby achieving synchronization of the data and/or variables to be processed.
In this embodiment of the present application, the synchronization module may include a Code Parser module and a Variable synchronization (Variable Context) module. Referring to fig. 3 to 8, the Code Parser module is configured to generate the first code or the second code after receiving a synchronization instruction for the data and/or variables to be processed, and to send the first code or the second code to the data computing middleware; the variable synchronization module can collect the explicitly assigned variables of the interactive environment of the Python language and of the data computing middleware, and can interface with the data computing middleware to realize synchronization of the data and/or variables to be processed.
The variable synchronization process of the embodiment of the present application is exemplarily described below with reference to the drawings.
1) The scenario of synchronizing variables of the data computing middleware to IPython.
Referring to fig. 6, when the encapsulation function module of the Linkis Magic module executes a Cell code, the Cell code may be sent to the code analysis submodule of the code conversion module through the Linkis client module; through interaction with the code addition submodule of the code conversion module, the code analysis submodule converts the Cell code into a first code for causing the big data computing middleware Linkis to save data to Ceph, and sends the first code to an entry node of the big data computing middleware Linkis through the Linkis client module. In the big data computing middleware Linkis, the entry node sends the first code to the engine manager through interaction with the application manager, the resource manager, and the engine manager; the engine manager is connected with a Linkis Engine or a Pipeline Engine, where the Linkis Engine or the Pipeline Engine is the data processing engine that needs to be called.
Referring to fig. 6 and 7, the Linkis engine may store the variable as a Python serialized object (Pickle), that is, the variable may be stored in Ceph using the Pickle module of the Python language. The Pickle module is a standard module of the Python language that implements basic data serialization and deserialization; through the serialization operation of the Pickle module, the object information of a running program can be saved to a file, realizing persistent storage.
The pipeline engine may store variables in Comma-Separated Values (CSV) format into Ceph.
The big data computing middleware Linkis can send a variable loading instruction to the Linkis client module; the variable loading instruction can carry the address information of the variables. The variable synchronization module can load, from Ceph and according to the variable loading instruction, the variables declared in the engine, and the loaded variables are sent to the Linkis client module through the code conversion module.
In one example, the following code is executed in IPython:
Cell:
%%python -v data
data = pd.read_csv("test.csv")
Executing the above code indicates that the user needs to transfer the variable used in the engine to the local IPython environment for use.
Referring to fig. 6 and 7, in the Linkis Magic module, after receiving the statement that a variable needs to be synchronized, the encapsulation function module calls the code conversion module and the variable synchronization module through the Linkis client module, so as to obtain the first code for causing the big data computing middleware Linkis to save data to Ceph, and the first code is then sent to the big data computing middleware Linkis through the Linkis client module.
The Linkis client module executes different codes according to the engine type: if the data processing engine is of the Python type or the PySpark type, the data processing engine can compress the variables using the pickle module and then store the compressed variables into Ceph; if the data processing engine is neither the Python type nor the PySpark type, the pipeline engine of the big data computing middleware Linkis is called to convert the variables into a CSV-format file and save the CSV file into Ceph.
After the engine stores the variables to Ceph, the big data computing middleware Linkis can send a variable loading instruction to the Linkis client module. According to the variable loading instruction, the variable synchronization module can load, from Ceph, the variables declared in the engine, and assign the loaded values to the variables declared by the encapsulation function module; when a variable is in Pickle format, the variable synchronization module can load it directly from Ceph; when a variable is in CSV format, the variable synchronization module needs to convert it into the pandas DataFrame format after loading it from Ceph.
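The engine-type branch and the subsequent load-and-convert step described above can be sketched as follows; the storage object with put()/get() methods, the key naming and the helper function names are hypothetical stand-ins and do not reflect the actual Linkis or Ceph interfaces.

import io
import pickle

import pandas as pd


def engine_save_variable(engine_type, name, value, storage):
    # Python/PySpark engines persist the variable as a Pickle object; other
    # engines go through the pipeline engine and persist a CSV file.
    # `storage` with put()/get() is an assumed stand-in for the Ceph client.
    if engine_type in ("python", "pyspark"):
        storage.put(f"{name}.pkl", pickle.dumps(value))
        return f"{name}.pkl"
    csv_text = pd.DataFrame(value).to_csv(index=False)
    storage.put(f"{name}.csv", csv_text.encode("utf-8"))
    return f"{name}.csv"


def ipython_load_variable(storage, key):
    # Pickle-format variables are usable as-is; CSV-format variables are
    # converted into a pandas DataFrame after loading, as described above.
    payload = storage.get(key)
    if key.endswith(".pkl"):
        return pickle.loads(payload)
    return pd.read_csv(io.BytesIO(payload))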
2) Scenario of synchronizing variables from IPython to the data computing middleware.
In this scenario, an engine of the Python type may transmit variables through Pickle object serialization, while a non-Python-type engine may achieve synchronization of text-type variables.
Referring to fig. 6 and 8, in the Linkis magic module, after receiving a variable transmission statement, the encapsulation function module calls the variable synchronization module through the Linkis client module. When the data processing engine is a Python-type engine, the variable synchronization module can serialize the variables, convert them into the Pickle format, and store the Pickle-format variables in Ceph; when the data processing engine is not a Python-type engine, the variable synchronization module may serialize variables of common variable types into the JSON or CSV format and store the JSON- or CSV-format variables in Ceph.
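A minimal sketch of this serialization choice on the IPython side is given below, again assuming a hypothetical storage.put interface in place of the actual Ceph write path.

import json
import pickle


def serialize_for_engine(engine_type, name, value, storage):
    # Python-type engines can consume Pickle objects directly.
    if engine_type == "python":
        storage.put(f"{name}.pkl", pickle.dumps(value))
        return f"{name}.pkl"
    # Other engines receive a text representation; JSON covers the common
    # scalar/list/dict variable types, and CSV would be used for tabular data.
    storage.put(f"{name}.json", json.dumps(value).encode("utf-8"))
    return f"{name}.json"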
The Linkis client module can send a variable loading instruction to the big data computing middleware Linkis, which can execute different loading codes according to the engine type: if the data processing engine is of the Python type or the PySpark type, it can load the Pickle-format variable from Ceph and use it directly; if the data processing engine is neither the Python type nor the PySpark type, it may obtain a text-type assignment from the loaded variable. After loading, variables with the same names can be used in the Linkis engine.
On the basis of the data processing method provided by the foregoing embodiment, the embodiment of the present application further provides a data processing apparatus; fig. 9 is a schematic diagram of an alternative configuration of a data processing apparatus according to an embodiment of the present application, and as shown in fig. 9, the data processing apparatus 900 may include:
a first processing module 901, configured to determine a Magic function according to a data processing engine to be called in an interactive environment of a Python language, and send an engine execution request to a target interface of a data computing middleware based on the Magic function, where the data computing middleware is configured to implement calling of at least two data processing engines, where the data processing engines are computing engines or storage engines; the target interface is an interface determined according to the Magic function;
a second processing module 902, configured to invoke, through the data computing middleware, a data processing engine corresponding to the engine execution request; and performing data processing on the data to be processed acquired in advance based on the called data processing engine, and/or managing the data processing engine corresponding to the engine execution request.
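As an illustration of how such a Magic function could be exposed in the interactive environment, the following sketch registers a cell magic that forwards the cell content to a middleware gateway; the gateway URL, the payload fields and the engine name are assumptions made for this example and do not reflect the actual Linkis interface.

import requests
from IPython.core.magic import Magics, cell_magic, magics_class


@magics_class
class EngineMagics(Magics):
    # Sketch of a Magic function that forwards cell code to a middleware
    # gateway; URL and payload fields are assumptions for this example.

    GATEWAY_URL = "http://linkis-gateway.example.com/api/entrance/execute"  # assumed

    @cell_magic
    def spark(self, line, cell):
        # The engine to call is implied by the magic name; options such as
        # variables to synchronize could be parsed from `line`.
        request = {"engineType": "spark", "code": cell, "params": line}
        response = requests.post(self.GATEWAY_URL, json=request)
        return response.json()


def load_ipython_extension(ipython):
    # Registering the magics makes %%spark available in the session.
    ipython.register_magics(EngineMagics)

With the extension loaded, a cell beginning with %%spark would be routed to the gateway instead of being executed locally.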
In some embodiments of the present application, the data computing middleware comprises a resource manager;
the second processing module 902 is further configured to perform resource isolation management on resources required by the data processing engine according to a preset resource attribute by using the resource manager in a process of performing data processing on pre-acquired data to be processed and/or managing the data processing engine corresponding to the engine execution request.
In some embodiments of the present application, the predetermined resource attribute is a user to which the resource belongs or a source of the resource.
In some embodiments of the present application, the data computing middleware further comprises at least one engine manager;
the second processing module 902 is configured to invoke, through the data computing middleware, a data processing engine corresponding to the engine execution request, and includes:
in a case where the resource manager allows creation of a resource for the engine execution request, creating, with the at least one engine manager, a program for calling a data processing engine, and calling the data processing engine corresponding to the engine execution request based on the created program.
In some embodiments of the present application, the second processing module 902 is further configured to:
acquiring load information of the at least one engine manager through the resource manager; determining a target engine manager for receiving the engine execution request among the at least one engine manager according to the load information of the at least one engine manager;
said creating with said at least one engine manager a program for invoking a data processing engine, comprising:
creating, with the target engine manager, a program for invoking a data processing engine.
In some embodiments of the present application, the data computing middleware further comprises a gateway and at least one ingress node;
the first processing module 901 is configured to send an engine execution request to a target interface of the data computing middleware based on the Magic function, and includes: sending an engine execution request to the gateway of the data computing middleware based on the Magic function and through the target interface;
the second processing module 902 is configured to obtain load information of the at least one engine manager through the resource manager, and includes:
when the gateway forwards the engine execution request to a corresponding entry node of the at least one entry node according to the identifier of the data processing engine carried by the engine execution request, sending a load information acquisition request to the resource manager by using the corresponding entry node of the at least one entry node to acquire the load information of the at least one engine manager.
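Purely as an illustration of the load-based selection of a target engine manager described above (the data structure is a hypothetical assumption), the least-loaded manager could be chosen as follows:

def select_target_engine_manager(load_info):
    # `load_info` maps engine manager identifiers to a load metric; the
    # engine execution request is routed to the least-loaded manager.
    return min(load_info, key=load_info.get)


# Example: the engine execution request would be routed to "em-2".
print(select_target_engine_manager({"em-1": 0.7, "em-2": 0.3}))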
In some embodiments of the present application, the first processing module 901 is further configured to implement synchronization of the to-be-processed data and/or variables between the interaction environment of the Python language and the data computing middleware through an additional file transfer interface.
In some embodiments of the present application, the Python language interactive environment comprises a synchronization module;
the first processing module 901 is further configured to mount the same storage system on the synchronization module and on the data computing middleware, and to implement synchronization of the to-be-processed data and/or the variable between the interaction environment of the Python language and the data computing middleware through data read-write operations on the storage system.
In some embodiments of the present application, the first processing module 901 is configured to implement synchronization of the to-be-processed data and/or the variable between the interaction environment of the Python language and the data computing middleware through data read-write operation on the storage system, and includes:
in an interactive environment of the Python language, acquiring a first code for enabling the data computing middleware to store data to the storage system, sending the first code to the data computing middleware, and storing the data to be processed and/or the variable to the storage system by using the data computing middleware;
and in an interactive environment of Python language, loading the data to be processed and/or the variable stored in the storage system by using a synchronization module.
In some embodiments of the present application, the first processing module 901, configured to save the data to be processed and/or the variable to a storage system by using the data computing middleware, includes:
compressing the data to be processed and/or the variable by using the data computing middleware to obtain compressed data; and saving the compressed data to the storage system by using the data computing middleware.
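A minimal sketch of the compress-then-save step, assuming gzip compression of a pickled object and a hypothetical storage.put call standing in for the storage system:

import gzip
import pickle


def save_compressed(name, value, storage):
    # Serialize then compress before writing; smaller payloads reduce the
    # cost of moving variables through the shared storage system.
    storage.put(f"{name}.pkl.gz", gzip.compress(pickle.dumps(value)))


def load_compressed(name, storage):
    # Reverse the steps when the variable is loaded back.
    return pickle.loads(gzip.decompress(storage.get(f"{name}.pkl.gz")))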
In some embodiments of the present application, the first processing module 901 is configured to implement synchronization of the to-be-processed data and/or the variable between the interaction environment of the Python language and the data computing middleware through data read-write operation on the storage system, and includes:
in an interactive environment of the Python language, saving the data to be processed and/or the variable to the storage system by using the synchronization module;
in an interactive environment of Python language, acquiring a second code for enabling a data computing middleware to read data from the storage system, sending the second code to the data computing middleware, and loading the to-be-processed data and/or the variable stored in the storage system by using the data computing middleware.
In practical applications, the first processing module 901 and the second processing module 902 may be implemented by a processor of an electronic device, and the processor may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above-described processor function may be other electronic devices, and the embodiments of the present application are not limited thereto.
It should be noted that the above description of the embodiment of the apparatus, similar to the above description of the embodiment of the method, has similar beneficial effects as the embodiment of the method. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the data processing method is implemented in the form of a software functional module and sold or used as a standalone product, the data processing method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application further provides a computer program product, where the computer program product includes computer-executable instructions, and the computer-executable instructions are used to implement any one of the data processing methods provided in the embodiment of the present application.
Accordingly, an embodiment of the present application further provides a computer storage medium, where computer-executable instructions are stored on the computer storage medium, and the computer-executable instructions are used to implement any one of the data processing methods provided in the foregoing embodiments.
An embodiment of the present application further provides an electronic device, fig. 10 is an optional schematic structural diagram of the electronic device provided in the embodiment of the present application, and as shown in fig. 10, the electronic device 1000 includes:
a memory 1001 for storing executable instructions;
the processor 1002 is configured to implement any one of the data processing methods described above when executing the executable instructions stored in the memory 1001.
The processor 1002 may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, and a microprocessor.
The computer-readable storage medium/Memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic Random Access Memory (FRAM), a Flash Memory (Flash Memory), a magnetic surface Memory, an optical Disc, or a Compact Disc Read-Only Memory (CD-ROM), and the like; but may also be various terminals such as mobile phones, computers, tablet devices, personal digital assistants, etc., that include one or any combination of the above-mentioned memories.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment of the present application. Thus, the appearances of the phrase "in some embodiments" appearing in various places throughout the specification are not necessarily all referring to the same embodiments. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device to perform all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of data processing, the method comprising:
in an interactive environment of Python language, determining a Magic function according to a data processing engine needing to be called, and sending an engine execution request to a target interface of data computing middleware based on the Magic function, wherein the data computing middleware is used for realizing the calling of at least two data processing engines, and the data processing engines are computing engines or storage engines; the target interface is an interface determined according to the Magic function;
calling a data processing engine corresponding to the engine execution request through the data computing middleware; and performing data processing on the data to be processed acquired in advance based on the called data processing engine, and/or managing the data processing engine corresponding to the engine execution request.
2. The method of claim 1, wherein the data computing middleware comprises a resource manager;
the method further comprises the following steps: and in the process of carrying out data processing on the pre-acquired data to be processed and/or managing the data processing engine corresponding to the engine execution request, carrying out resource isolation management on resources required by the data processing engine according to preset resource attributes by using the resource manager.
3. The method of claim 2, wherein the predetermined resource attribute is a user to which the resource belongs or a source of the resource.
4. The method of claim 2, wherein the data computing middleware further comprises at least one engine manager;
the calling of the data processing engine corresponding to the engine execution request through the data computing middleware comprises:
in a case where the resource manager allows creation of a resource for the engine execution request, creating a program for calling a data processing engine with the at least one engine manager, calling the data processing engine corresponding to the engine execution request based on the created program.
5. The method of claim 4, further comprising:
acquiring load information of the at least one engine manager through the resource manager; determining a target engine manager for receiving the engine execution request among the at least one engine manager according to the load information of the at least one engine manager;
said creating with said at least one engine manager a program for invoking a data processing engine, comprising:
creating, with the target engine manager, a program for invoking a data processing engine.
6. The method of claim 5, wherein the data computing middleware further comprises a gateway and at least one ingress node;
the sending an engine execution request to a target interface of data computing middleware based on the Magic function comprises: sending an engine execution request to the gateway of the data computing middleware based on the Magic function and through the target interface;
the obtaining, by the resource manager, load information of the at least one engine manager includes:
when the gateway forwards the engine execution request to a corresponding entry node of the at least one entry node according to the identifier of the data processing engine carried by the engine execution request, sending a load information acquisition request to the resource manager by using the corresponding entry node of the at least one entry node to acquire the load information of the at least one engine manager.
7. The method of claim 1, further comprising:
and synchronizing the data to be processed and/or variables between the interactive environment of the Python language and the data computing middleware through a newly-added file transmission interface.
8. The method according to any one of claims 1-7, wherein the Python language interactive environment comprises a synchronization module; the method further comprises the following steps:
and mounting the same storage system on the synchronization module and the data calculation middleware, and realizing synchronization of the data to be processed and/or the variable between the interaction environment of the Python language and the data calculation middleware through data read-write operation on the storage system.
9. The method according to claim 8, wherein the synchronizing the to-be-processed data and/or the variable between the Python language interactive environment and the data computing middleware through data read-write operations on the storage system comprises:
in an interactive environment of Python language, acquiring a first code for enabling a data calculation middleware to store data to the storage system, sending the first code to the data calculation middleware, and storing the data to be processed and/or the variable to the storage system by using the data calculation middleware;
and in an interactive environment of Python language, loading the data to be processed and/or the variable stored in the storage system by using a synchronization module.
10. The method of claim 9, wherein saving the data to be processed and/or the variables to a storage system using the data computing middleware comprises:
compressing the data to be processed and/or the variable by using the data calculation middleware to obtain compressed data; and saving the compressed data to the storage system by using the data computing middleware.
11. The method according to claim 8, wherein the synchronizing the to-be-processed data and/or the variable between the Python language interactive environment and the data computing middleware through data read-write operations on the storage system comprises:
in an interactive environment of Python language, the data to be processed and/or the variable are/is saved in the storage system by utilizing the synchronization module;
in an interactive environment of Python language, acquiring a second code for enabling a data computing middleware to read data from the storage system, sending the second code to the data computing middleware, and loading the to-be-processed data and/or the variable stored in the storage system by using the data computing middleware.
12. A data processing apparatus, characterized in that the apparatus comprises:
the system comprises a first processing module, a second processing module and a third processing module, wherein the first processing module is used for determining a Magic function according to a data processing engine needing to be called in an interactive environment of Python language, and sending an engine execution request to a target interface of data computing middleware based on the Magic function, the data computing middleware is used for realizing the calling of at least two data processing engines, and the data processing engines are computing engines or storage engines; the target interface is an interface determined according to the Magic function;
the second processing module is used for calling a data processing engine corresponding to the engine execution request through the data computing middleware; and performing data processing on the data to be processed acquired in advance based on the called data processing engine, and/or managing the data processing engine corresponding to the engine execution request.
13. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the data processing method of any one of claims 1 to 11 when executing executable instructions stored in the memory.
14. A computer-readable storage medium storing executable instructions for implementing the data processing method of any one of claims 1 to 11 when executed by a processor.
CN202011314975.3A 2020-11-20 2020-11-20 Data processing method, device, equipment and computer storage medium Pending CN112306586A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011314975.3A CN112306586A (en) 2020-11-20 2020-11-20 Data processing method, device, equipment and computer storage medium
PCT/CN2021/130864 WO2022105736A1 (en) 2020-11-20 2021-11-16 Data processing method and apparatus, device, computer storage medium, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011314975.3A CN112306586A (en) 2020-11-20 2020-11-20 Data processing method, device, equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN112306586A true CN112306586A (en) 2021-02-02

Family

ID=74334345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011314975.3A Pending CN112306586A (en) 2020-11-20 2020-11-20 Data processing method, device, equipment and computer storage medium

Country Status (2)

Country Link
CN (1) CN112306586A (en)
WO (1) WO2022105736A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378037A (en) * 2019-07-23 2019-10-25 苏州浪潮智能科技有限公司 CFD emulation date storage method, device and server based on Ceph
CN113206832A (en) * 2021-03-31 2021-08-03 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN113486332A (en) * 2021-07-22 2021-10-08 华控清交信息科技(北京)有限公司 Computing node, privacy computing system and loading method of algorithm engine
WO2022105736A1 (en) * 2020-11-20 2022-05-27 深圳前海微众银行股份有限公司 Data processing method and apparatus, device, computer storage medium, and program
CN116579914A (en) * 2023-07-14 2023-08-11 南京砺算科技有限公司 Execution method and device of graphic processor engine, electronic equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115237630B (en) * 2022-07-25 2023-11-21 小米汽车科技有限公司 Data processing method, device, vehicle, storage medium and chip
CN114996362B (en) * 2022-08-04 2023-03-21 河南云帆电子科技有限公司 Data processing and storing method
CN116028238A (en) * 2022-10-31 2023-04-28 广东浪潮智慧计算技术有限公司 Computing engine communication method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160140228A1 (en) * 2013-02-27 2016-05-19 Google Inc. Reformatting queries for search engines and data repositories
CN109062965A (en) * 2018-06-28 2018-12-21 平安科技(深圳)有限公司 Big data analysis system, server, data processing method and storage medium
CN111414381A (en) * 2020-03-04 2020-07-14 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839454B2 (en) * 2018-03-13 2020-11-17 Bank Of America Corporation System and platform for execution of consolidated resource-based action
CN111221888A (en) * 2018-11-27 2020-06-02 北京奇虎科技有限公司 Big data analysis system and method
CN111221841A (en) * 2018-11-27 2020-06-02 北京奇虎科技有限公司 Real-time processing method and device based on big data
CN112306586A (en) * 2020-11-20 2021-02-02 深圳前海微众银行股份有限公司 Data processing method, device, equipment and computer storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160140228A1 (en) * 2013-02-27 2016-05-19 Google Inc. Reformatting queries for search engines and data repositories
CN109062965A (en) * 2018-06-28 2018-12-21 平安科技(深圳)有限公司 Big data analysis system, server, data processing method and storage medium
CN111414381A (en) * 2020-03-04 2020-07-14 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378037A (en) * 2019-07-23 2019-10-25 苏州浪潮智能科技有限公司 CFD emulation date storage method, device and server based on Ceph
CN110378037B (en) * 2019-07-23 2022-08-19 苏州浪潮智能科技有限公司 CFD simulation data storage method and device based on Ceph and server
WO2022105736A1 (en) * 2020-11-20 2022-05-27 深圳前海微众银行股份有限公司 Data processing method and apparatus, device, computer storage medium, and program
CN113206832A (en) * 2021-03-31 2021-08-03 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN113206832B (en) * 2021-03-31 2022-12-06 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN113486332A (en) * 2021-07-22 2021-10-08 华控清交信息科技(北京)有限公司 Computing node, privacy computing system and loading method of algorithm engine
CN116579914A (en) * 2023-07-14 2023-08-11 南京砺算科技有限公司 Execution method and device of graphic processor engine, electronic equipment and storage medium
CN116579914B (en) * 2023-07-14 2023-12-12 南京砺算科技有限公司 Execution method and device of graphic processor engine, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022105736A1 (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN112306586A (en) Data processing method, device, equipment and computer storage medium
CN108510082B (en) Method and device for processing machine learning model
CN109032706B (en) Intelligent contract execution method, device, equipment and storage medium
US8676848B2 (en) Configuring cloud resources
Kamburugamuve et al. A framework for real time processing of sensor data in the cloud
EP2798494B1 (en) Virtual channel for embedded process communication
US11948014B2 (en) Multi-tenant control plane management on computing platform
CN110083455B (en) Graph calculation processing method, graph calculation processing device, graph calculation processing medium and electronic equipment
US10574724B2 (en) Automatic discovery of management nodes and generation of CLI using HA module
US10397051B1 (en) Configuration and testing of network-based service platform resources using a service platform specific language
CN111290753A (en) Method, device, equipment and storage medium for building front-end development environment
CN111143054A (en) Heterogeneous domestic CPU resource fusion management method
WO2013092661A1 (en) Method, system and computer program product for providing composite web application
CN114281263B (en) Storage resource processing method, system and equipment of container cluster management system
CN112764875B (en) Intelligent calculation-oriented lightweight portal container microservice system and method
CN110471775A (en) A method of by modeling and simulating software encapsulation at privately owned cloud service
WO2015179509A1 (en) High-performance computing framework for cloud computing environments
CN115309562A (en) Operator calling system, operator generating method and electronic equipment
CN113703997A (en) Bidirectional asynchronous communication middleware system integrating multiple message agents and implementation method
CN115480753A (en) Application integration system and corresponding computer device and storage medium
CN111324395B (en) Calling method, device and computer readable storage medium
CN116932147A (en) Streaming job processing method and device, electronic equipment and medium
US20200293386A1 (en) Messaging abstraction layer for integration with message oriented middleware platforms
CN113626001A (en) API dynamic editing method and device based on script
US11392433B1 (en) Generation of asynchronous application programming interface specifications for messaging topics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination