CN114697228A - Data integration method and device and related equipment - Google Patents


Info

Publication number
CN114697228A
Authority
CN
China
Prior art keywords
node
data
area
duration
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110523117.8A
Other languages
Chinese (zh)
Inventor
韦启蒙
吕广林
蒋剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to PCT/CN2021/141843 (WO2022143583A1)
Publication of CN114697228A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/12 Network monitoring probes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/12 Shortest path evaluation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/30 Routing of multiclass traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of the present application provide a data integration method, a data integration apparatus, and related equipment, applied to a data integration system that comprises a plurality of area nodes. The method comprises: acquiring a data integration task; determining, based on the data integration task and transmission durations, a target path for sending source data from a source node to a target node; and executing the data integration task along the target path. The data integration task comprises an identifier of the source node, a source data type of the source data, and an identifier of the target node, where the source node and the target node are any two of the plurality of area nodes. The transmission durations include the durations for transmitting data of different types between any two area nodes. During cross-region data integration, the data integration system plans the path dynamically from the source node, the target node, and the transmission durations corresponding to each area node in the system, obtaining an efficient path and thereby improving the efficiency of data integration.

Description

Data integration method and device and related equipment
Technical Field
The present application relates to the field of data integration, and in particular, to a data integration method, apparatus and related device.
Background
In the field of data integration and openness, data is generally transmitted between regions over direct network connections. However, network transmission capabilities differ between regions, and some regions are separated by wide area networks. As a result, direct network connections lead to low transmission efficiency in open cross-region data integration scenarios (such as public cloud and hybrid cloud). Therefore, how to determine an efficient transmission path between two regions that require data integration is a technical problem to be solved urgently.
Disclosure of Invention
Embodiments of the present application disclose a data integration method, a data integration apparatus, and related equipment. When data is integrated across regions, an efficient path can be planned from acquired data such as the transmission durations of the area nodes of a data integration system, which can improve the efficiency of data integration.
In a first aspect, an embodiment of the present application provides a data integration method applied to a data integration system that includes a plurality of area nodes, each of which is connected to a data source. The data integration method includes: acquiring a data integration task; determining, based on the data integration task and transmission durations, a target path for sending source data from a source node to a target node; and executing the data integration task along the determined target path. The data integration task includes an identifier of the source node, a source data type of the source data, and an identifier of the target node, where the source node and the target node are any two of the plurality of area nodes. The transmission durations include the durations for transmitting data of different types between any two area nodes in the data integration system.
When cross-region data integration is needed, the data integration system uses the source node, the target node, and the acquired transmission durations corresponding to each area node in the system to plan an efficient path for transmitting the source data from the source node to the target node. The path is thus planned dynamically during cross-region data integration, which can improve the efficiency of data integration.
In a possible implementation manner, the transmission durations are obtained by each of the plurality of area nodes performing network probing. In network probing, each area node sends data of different data types to the other area nodes and, from the response messages they return, determines the transmission duration for one area node to send data of each data type to another area node.
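The probing described above can be illustrated with a minimal Python sketch. It is not taken from the application: the function name `probe_transmission_durations`, the `send_and_wait` callback (which is assumed to send one probe of the given data type and block until the response message arrives), and the injectable `clock` are all hypothetical. The text does not say whether the measured send-to-response time is used as is or adjusted, so the sketch simply records the elapsed time in milliseconds.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def probe_transmission_durations(
    local_region: str,
    peers: Iterable[str],
    data_types: Iterable[str],
    send_and_wait: Callable[[str, str], None],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[Tuple[str, str, str], float]:
    """Return {(local region, peer region, data type): elapsed time in ms}.

    send_and_wait(peer, data_type) is a hypothetical transport hook: it is
    assumed to send probe data of that type and return once the peer's
    response message has been received.
    """
    data_types = list(data_types)  # allow the iterable to be reused per peer
    durations: Dict[Tuple[str, str, str], float] = {}
    for peer in peers:
        for data_type in data_types:
            started = clock()               # time at which the data is sent
            send_and_wait(peer, data_type)  # returns when the response arrives
            elapsed_ms = (clock() - started) * 1000.0
            durations[(local_region, peer, data_type)] = elapsed_ms
    return durations
```

Passing the clock in as a parameter is only a convenience for exercising the sketch without a real network.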
In a possible implementation manner, before determining the target path for sending the source data from the source node to the target node based on the data integration task and the transmission durations, the method further includes: obtaining the conversion duration and the forwarding duration of each of the plurality of area nodes, where the conversion duration is the duration for the area node to convert the source data from the source data type into a target data type, and the forwarding duration is the duration for the area node to forward received data.
Determining the target path based on the data integration task and the transmission durations then includes: determining the target path for sending the source data from the source node to the target node based on the data integration task, the transmission durations, the conversion durations, and the forwarding durations.
Because the conversion duration and the forwarding duration of each area node are obtained, target path planning accounts for both the time an area node needs to convert the source data from the source data type to the target data type and the time each area node needs to forward data. This yields a more accurate duration for transmitting the data over each path, and therefore a more accurate target path.
In a possible implementation manner, the obtaining the conversion duration and the forwarding duration of each of the plurality of area nodes includes: receiving metadata sent by each regional node in a plurality of regional nodes, wherein the metadata comprises addresses of the regional nodes, conversion duration and forwarding duration.
In a possible implementation manner, before determining a target path for sending source data from a source node to a target node based on a data integration task, a transmission duration, a conversion duration, and a forwarding duration, the method further includes: transmitting an address of each regional node to the plurality of regional nodes; and receiving the transmission time length corresponding to each regional node uploaded by each regional node, wherein the transmission time length corresponding to each regional node is obtained by each regional node executing network detection based on the address of each regional node, and the transmission time length corresponding to each regional node comprises the transmission time length between each regional node and other regional nodes in the data integration system.
In a possible implementation manner, the data integration method is executed by a central node in the data integration system, and the central node is in communication connection with a plurality of area nodes.
The central node manages and synchronizes metadata, such as the node addresses of all area nodes in the data integration system, so that the area nodes can communicate with one another, perform network probing to obtain the corresponding transmission durations, and report those durations to the central node. When the central node receives a data integration task, it determines an efficient target path by combining the conversion duration, forwarding duration, transmission duration, and the like of each area node. Centralized management based on the central node is thereby achieved, and data integration efficiency is improved. In addition, when the area nodes perform network probing, network problems can be discovered promptly and alarms raised, which improves the reliability of the network.
In a possible implementation manner, obtaining the conversion duration and the forwarding duration of each of the plurality of area nodes includes: each of the plurality of area nodes sending its metadata to a central node of the data integration system, where the metadata includes the conversion duration and the forwarding duration; and receiving the metadata that the central node sends to each area node. Each area node in the data integration system can communicate with the central node: each area node sends its own metadata to the central node, and the central node synchronizes the metadata of each area node to the other area nodes, so that the area nodes can communicate with one another and then perform network probing.
In a possible implementation manner, the metadata includes an address of each regional node; before determining a target path for sending source data from a source node to a target node based on a data integration task and a transmission duration, the method further includes: each regional node performs network detection according to the address of each regional node to obtain transmission time corresponding to each regional node, wherein the transmission time corresponding to each regional node comprises the transmission time between each regional node and other regional nodes in the data integration system; and each regional node sends the transmission duration corresponding to each regional node to other regional nodes in the data integration system according to the address of each regional node.
In a possible implementation manner, the data integration method is performed by any one of the plurality of area nodes.
When any area node of the data integration system performs the data integration method, the central node manages and synchronizes metadata such as the node addresses, conversion durations, and forwarding durations of all area nodes in the data integration system, so that the area nodes can perform network probing among themselves to obtain the transmission durations corresponding to each area node. Each area node also obtains information such as the transmission durations of the other area nodes, either by broadcast or through the central node. When a data integration task exists, an area node determines an efficient target path by combining the conversion duration, forwarding duration, transmission duration, and the like of each area node. Distributed scheduling management based on the central node and the area nodes is thereby achieved, and the efficiency of data integration and opening is improved. In addition, when the area nodes perform network probing, network problems can be discovered promptly and alarms raised, which improves the reliability of the network.
In a possible implementation manner, determining the target path for sending the source data from the source node to the target node based on the data integration task and the transmission durations includes: determining a plurality of paths for transmitting the source data from the source node to the target node based on the data integration task, the transmission durations, the conversion durations, and the forwarding durations; determining, for each path, n durations for transmitting the source data from the source node to the target node over that path according to the transmission durations, the conversion durations, and the forwarding durations, where the duration differs depending on which area node in the path converts the source data from the source data type to the target data type, n is the number of area nodes in the path, and n is an integer greater than 1; determining the minimum of the n durations of each path; and taking the path whose minimum duration is the smallest among all paths as the target path.
In a possible implementation manner, the determining n durations for transmitting the source data from the source node to the target node through each path includes: determining a first transmission time length before an ith area node and a second transmission time length after the ith area node in each path, wherein the ith area node is an area node for converting source data from a source data type to a target data type, the first transmission time length is the time length for transmitting the source data from the source node to the ith area node in the source data type, and the second transmission time length is the time length for transmitting the source data from the ith area node to the target node in the target data type; and adding the conversion time length corresponding to the ith area node, the first transmission time length and the second transmission time length to obtain the time length corresponding to the ith area node, wherein the value of i is an integer from 1 to n.
The duration for sending the source data from the source node to the target node differs from path to path, and within one path it also differs depending on which area node converts the source data into the target data type. By calculating, for each path, the duration obtained when each area node in the path performs the conversion, determining the minimum duration of each path, comparing these minimum durations, and taking the path with the smallest value as the target path, an efficient way of transmitting the source data to the target node can be determined, which improves the efficiency of data integration.
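The selection just described can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the dictionary layouts, the function names `path_duration` and `choose_target_path`, and the rule that an intermediate node which only forwards (does not convert) adds its forwarding duration are all assumptions introduced here. An area node that has no conversion entry for the requested type pair is treated as unable to convert and is skipped.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table layouts; every duration is in milliseconds.
LinkTable = Dict[Tuple[str, str, str], float]     # (sender, receiver, data type)
ConvertTable = Dict[Tuple[str, str, str], float]  # (node, source type, target type)


def path_duration(path: List[str], convert_at: int, source_type: str,
                  target_type: str, transmission: LinkTable,
                  conversion: ConvertTable,
                  forwarding: Dict[str, float]) -> Optional[float]:
    """Duration of one path when the node at index convert_at converts the data."""
    convert_ms = conversion.get((path[convert_at], source_type, target_type))
    if convert_ms is None:
        return None  # this node cannot perform the conversion
    total = convert_ms
    for hop, (sender, receiver) in enumerate(zip(path, path[1:])):
        # Hops before the converting node carry the source data type
        # (first transmission duration); later hops carry the target type
        # (second transmission duration).
        data_type = source_type if hop < convert_at else target_type
        link_ms = transmission.get((sender, receiver, data_type))
        if link_ms is None:
            return None  # no probed duration for this link and data type
        total += link_ms
    # Assumption: intermediate nodes that only forward add their forwarding duration.
    for index in range(1, len(path) - 1):
        if index != convert_at:
            total += forwarding.get(path[index], 0.0)
    return total


def choose_target_path(paths: List[List[str]], source_type: str, target_type: str,
                       transmission: LinkTable, conversion: ConvertTable,
                       forwarding: Dict[str, float]
                       ) -> Optional[Tuple[float, List[str], str]]:
    """Return (duration, path, converting node) with the smallest duration."""
    best: Optional[Tuple[float, List[str], str]] = None
    for path in paths:
        for index in range(len(path)):  # try each of the n nodes as the converter
            duration = path_duration(path, index, source_type, target_type,
                                     transmission, conversion, forwarding)
            if duration is not None and (best is None or duration < best[0]):
                best = (duration, path, path[index])
    return best
```

Taking the minimum over all (path, converting node) pairs gives the same result as first taking the minimum per path and then the minimum across paths.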
In a possible implementation manner, the determining a target path for sending source data from a source node to a target node includes: determining, based on a source node and the target node, a plurality of paths for transmitting source data from the source node to the target node; determining the weight between any two adjacent region nodes according to the transmission time length between any two adjacent region nodes in each path and the reliability ratio between any two adjacent region nodes, wherein the reliability ratio indicates the probability of successful network connection between any two adjacent region nodes; determining the final weight corresponding to each path according to the weight between any two adjacent area nodes in each path; and taking the path corresponding to the minimum value in the final weights corresponding to the paths as a target path.
The reliability ratio measures the reliability of the network connection between two area nodes. Planning the target path from both the transmission duration and the reliability ratio between two area nodes enables dynamic scheduling of data and improves both the efficiency and the reliability of data integration and opening.
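The text above does not give a formula for combining transmission duration and reliability ratio. Purely as an illustration, the following Python sketch assumes one possible choice: the weight of a link is its transmission duration divided by its reliability ratio (so unreliable links are penalized), and the final weight of a path is the sum of its link weights. Both the formula and the names `link_weight`, `final_weight`, and `pick_target_path` are assumptions, not the application's definition.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (sending node, receiving node)


def link_weight(duration_ms: float, reliability_ratio: float) -> float:
    """Assumed weighting: duration divided by the probability of a successful connection."""
    if not 0.0 < reliability_ratio <= 1.0:
        raise ValueError("reliability ratio must be in (0, 1]")
    return duration_ms / reliability_ratio


def final_weight(path: List[str], duration_ms: Dict[Link, float],
                 reliability: Dict[Link, float]) -> float:
    """Assumed aggregation: sum of the weights of adjacent node pairs in the path."""
    return sum(link_weight(duration_ms[link], reliability[link])
               for link in zip(path, path[1:]))


def pick_target_path(paths: List[List[str]], duration_ms: Dict[Link, float],
                     reliability: Dict[Link, float]) -> List[str]:
    """Return the path with the smallest final weight."""
    return min(paths, key=lambda path: final_weight(path, duration_ms, reliability))
```

With this assumed weighting, a path that is nominally faster can lose to a slower path whose links connect more reliably.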
In a second aspect, an embodiment of the present application provides a data integration apparatus, where the data integration apparatus is applied to a data integration system, where the data integration system includes a plurality of area nodes, and the data integration apparatus includes a communication unit and a processing unit, and can be used to implement the method in the first aspect or any possible implementation manner of the first aspect.
In a third aspect, an embodiment of the present application provides a computing device, including a processor and a memory, where the memory is configured to store instructions, and the processor is configured to execute the instructions, and when the processor executes the instructions, the computing device performs the method according to the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the processor executes the method in the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program product, where the computer program product includes instructions, and when the computer program product is executed by a computer, the computer may execute the method described in the foregoing first aspect or any possible implementation manner of the foregoing first aspect.
Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a data integration system;
FIG. 2 is a schematic diagram of a data integration system provided by an embodiment of the present application;
FIG. 3 is a schematic interaction diagram of a centralized scheduling management initialization of a data integration system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a region node according to an embodiment of the present disclosure;
FIG. 5 is a schematic interaction diagram of a distributed scheduling management initialization of a data integration system according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of another area node provided in the embodiment of the present application;
FIG. 7 is a schematic flowchart of a data integration method provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a transmission path provided in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a data integration apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a computing device according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a computing system according to an embodiment of the present application.
Detailed Description
Data integration refers to storing data that differs in source, format, and characteristics into the same data source, so that comprehensive data sharing is provided for users. A data integration task reads source data from a source end through an integration system, organizes the source data, and stores it at a target end, which completes the data integration.
Fig. 1 is a schematic diagram of a data integration system that includes a plurality of regions connected to one another via a network. Each region includes an area node and a data source. The data source stores data of various data types, for example the oracle type and the mysql type. Each area node includes one or more computing devices, such as servers, that can read data from a data source, or convert received data and write it to a data source. The data source may be a storage resource pool or a memory in a computing device of the area node. Data types other than the oracle type and the mysql type also exist; the embodiments of the present application use the oracle type and the mysql type only as examples, which should not be construed as a limitation.
When source data in the data source of one region needs to be integrated into the data source of another region, for example when data of region 1 in Fig. 1 is integrated into region 4, the area node of region 1 reads the source data from the data source of region 1 and sends it to the area node of region 4. After receiving the complete source data in one or more transmissions, the area node of region 4 converts the data in the message queue that holds the source data into data of the target data type. For example, if the source data is oracle-type data and the data of region 4 is mysql-type data, the area node of region 4 converts the received data in the message queue into mysql-type data and then writes the converted data into the data source of region 4 for storage.
When the source data is transmitted from region 1 to region 4, the related art generally plans a shortest-distance path according to the address of the area node of region 4 and transmits the source data along that path; as shown in Fig. 1, the source data is transmitted from region 1 to region 4 via region 2. Because network transmission capabilities differ between regions, transmitting data to the target region over the shortest-distance path is not necessarily efficient. For example, the duration is not necessarily shorter than that of other paths, and some regions have dedicated transmission capabilities that may make a route through them more efficient than the shortest path. In addition, when the network between two regions is not directly reachable, a user has to configure integration tasks for multiple regions or add new proxies to connect the networks, which is inefficient and error-prone. Therefore, how to determine an efficient transmission path between two regions that require data integration is a technical problem to be solved urgently.
Fig. 2 is a schematic diagram of a data integration system provided in an embodiment of the present application. The data integration system includes a central node and a plurality of regions, and each region includes a data source and an area node. The area nodes of the regions are connected to one another, and to the central node, by a network. The central node and each area node each comprise one or more computing devices.
The data integration method provided by the present application is applied to the data integration system shown in Fig. 2. The data integration system supports two modes: centralized scheduling management and distributed scheduling management.
The system initialization process in centralized scheduling management is first described below.
Fig. 3 is a schematic interaction diagram of centralized scheduling management of a data integration system according to an embodiment of the present application. As shown in Fig. 3, in centralized scheduling management the central node includes a management module 310 and a scheduler 320. The management module 310 is configured to manage the metadata of each area node and to synchronize information with each area node. The metadata includes the address, the transmission key, and the integration capability of the area node. The transmission key is used to authenticate the area node: for example, when the area node of region 1 sends a message to the area node of region 2, the message must carry the transmission key of the area node of region 2; the area node of region 2 verifies the key after receiving the message and accepts the message only if the verification passes. The integration capability indicates the data types that the integration module 430 of the area node can convert and the corresponding conversion durations. The scheduler 320 is configured to generate a target path when a data integration task is received and to schedule the data to be transmitted along the target path, where the target path is an efficient path determined by the scheduler 320 for transmitting source data from one region to another.
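The transmission-key check described above can be sketched as follows. The message layout (a dictionary with a `transmission_key` field) and the function name `accept_message` are hypothetical, and the constant-time comparison via `hmac.compare_digest` is a choice made for this sketch; the application does not specify how the key is verified.

```python
import hmac


def accept_message(message: dict, own_transmission_key: str) -> bool:
    """Accept a message only if it carries the receiving node's transmission key.

    The "transmission_key" field name is an assumption made for illustration.
    """
    presented = str(message.get("transmission_key", ""))
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode("utf-8"),
                               own_transmission_key.encode("utf-8"))
```

For example, the area node of region 2 would call `accept_message(received, key_of_region_2)` and drop the message when the result is `False`.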
Fig. 4 is a schematic structural diagram of an area node according to an embodiment of the present application, namely an area node in centralized scheduling management. The area node of each region includes a management module 410, a queue module 420, an integration module 430, a proxy module 440, and an application programming interface (API) gateway 450. The management module 410 is used for managing metadata, performing network probing, synchronizing data with the central node, network broadcasting, and the like. The queue module 420 is configured to receive and store message queues. The integration module 430 is configured to convert the data in the message queue held by the queue module 420 and to write the converted data into a data source. The proxy module 440 is configured to forward received data to the area nodes of other regions. The API gateway 450 provides an interface for accessing data sources.
Specifically, the system initialization process in the centralized scheduling management in the embodiment of the present application includes S301 to S304.
S301, the area node of each region reports metadata to the central node.
In centralized scheduling management, the area node of each region reports its metadata to the management module 310 of the central node through its management module 410. The integration capability in the metadata indicates the data types that the integration module 430 of the area node can convert and the corresponding conversion durations.
Take the integration capability of region 1 as an example. Table 1 is the processing duration table of the area node of region 1. In Table 1, the source data type is the data type of the data received by the area node, and the target data type is the data type into which the area node converts data of the source data type. The processing duration includes a conversion duration and/or a forwarding duration: the conversion duration is the time the area node needs to convert received data of the source data type into data of the target data type, and the forwarding duration is the time the area node needs to forward received or converted data. For example, the area node of region 1 needs 100 milliseconds (ms) to convert mysql-type data into oracle-type data and 80 ms to convert oracle-type data into mysql-type data. An entry of type proxy means that the area node only forwards the received data of the source data type without converting its data type; in that case the processing duration is the forwarding duration of the area node. It should be understood that, for the same area node, the processing durations of received data of different data types may be the same or different, and that the processing durations of different area nodes for data of the same type may also be the same or different; the embodiments of the present application impose no particular limitation.
Table 1 Processing duration table of region 1
Region      Source data type    Target data type    Processing duration (ms)
Region 1    mysql               oracle              100
Region 1    oracle              mysql               80
Region 1    proxy               proxy               20
According to the integration capability in the metadata reported by each area node, the management module 310 can generate the processing duration table of the data integration system shown in table 2 below.
Table 2 Processing duration table of the data integration system
Region      Source data type    Target data type    Processing duration (ms)
Region 1    mysql               oracle              100
Region 1    oracle              mysql               80
Region 1    proxy               proxy               20
Region 2    mysql               oracle              90
Region 3    oracle              mysql               90
Region 3    proxy               proxy               20
Region 5    oracle              mysql               100
Region 5    proxy               proxy               20
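As an illustration only, the merging of the reported integration capabilities into a system-wide table can be sketched as follows. The function name build_processing_table and the dictionary layout are assumptions made for this sketch and do not appear in the application; the numeric values are copied from table 2.

```python
from typing import Dict, List, Tuple

# (source data type, target data type) -> processing duration in ms,
# as reported by one area node. The layout is an assumption of this sketch.
Capability = Dict[Tuple[str, str], int]


def build_processing_table(reports: Dict[str, Capability]) -> List[Tuple[str, str, str, int]]:
    """Flatten per-region reports into rows (region, source, target, duration_ms)."""
    rows = []
    for region, capability in reports.items():
        for (source_type, target_type), duration_ms in capability.items():
            rows.append((region, source_type, target_type, duration_ms))
    return rows


# Values copied from table 2.
reports = {
    "Region 1": {("mysql", "oracle"): 100, ("oracle", "mysql"): 80, ("proxy", "proxy"): 20},
    "Region 2": {("mysql", "oracle"): 90},
    "Region 3": {("oracle", "mysql"): 90, ("proxy", "proxy"): 20},
    "Region 5": {("oracle", "mysql"): 100, ("proxy", "proxy"): 20},
}
table_2 = build_processing_table(reports)
```

Each row of table_2 corresponds to one row of table 2, so a scheduler could look up the processing duration of any area node by region and data type pair.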
S302, the central node synchronizes metadata to the region nodes of each region.
After receiving the metadata reported by each area, the management module 310 of the central node synchronizes the addresses and transmission keys of the area nodes contained in the received metadata to the area node of each area, so that the area node of each area obtains the node addresses and transmission keys of the nodes of the other areas and the area nodes can communicate with each other. For example, fig. 3 includes five areas in total; after receiving the metadata reported by the area nodes of the five areas, the central node sends the address and the transmission key of each area node to the area nodes of the other four areas.
S303, the area nodes of each area perform network detection to obtain detection results.
The detection result includes the transmission durations required to transmit data of different data types over the network when one area node sends such data to the other area nodes. After the area node of each area receives the metadata of the other areas, an area node can send data of different data types to the other area nodes through its management module 410, and the other area nodes return response messages to that area node. From the time at which the data was sent and the time at which the response message was received, the area node calculates the transmission duration required to send data of each data type to each of the other area nodes, so that the area node of each area finally obtains the transmission durations for sending data of different data types to the other areas. For example, through network detection the area node of area 1 obtains the transmission durations required for area 1 to send data of different data types to the other four area nodes, as shown in table 3 below. Table 3 is the transmission duration table for data sent from area 1 to other areas; in table 3, when the area node of area 1 sends oracle data to the area node of area 2, the data needs 100 ms to be transmitted from area 1 to area 2. It should be noted that different application layer communication protocols are used for sending data of different data types; for example, the oracle communication protocol is used for sending oracle type data, and the mysql communication protocol is used for sending mysql type data.
Table 3 Transmission duration table of region 1
Source region    Destination region    Data type    Transmission duration (ms)
Region 1         Region 2              oracle       100
Region 1         Region 2              mysql        100
Region 1         Region 3              oracle       120
Region 1         Region 5              oracle       1000
Region 1         Region 5              mysql        950
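The timing step of the network detection can be sketched as below. The application only states that the duration is calculated from the time of sending the data and the time of receiving the response message; whether the full interval or a fraction of it is used is not stated, so this sketch uses the full interval as an assumption. The name probe_duration_ms and the send_and_wait callback are hypothetical.

```python
import time
from typing import Callable


def probe_duration_ms(send_and_wait: Callable[[], None],
                      clock: Callable[[], float] = time.monotonic) -> float:
    """Return the elapsed milliseconds between sending probe data and
    receiving the response message."""
    sent_at = clock()
    # Hypothetical callback: sends probe data of one data type to another
    # area node and blocks until the response message arrives.
    send_and_wait()
    received_at = clock()
    return (received_at - sent_at) * 1000.0
```

A monotonic clock is chosen so that adjustments of the system time during a probe do not distort the measured duration.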
S304, the area nodes of each area report the detection results to the central node.
The area node of each area obtains its transmission durations through network detection, as shown for area 1 in table 3, and reports the obtained detection result to the central node. The management module 310 collates the detection results reported by each area to generate the transmission durations of the whole data integration system, which include the transmission durations of data of different data types between every two area nodes of the data integration system, as shown in table 4 below. It should be understood that the duration for area A to send data to area B may be the same as or different from the duration for area B to send data of the same type to area A.
After the central node generates the data in tables 2 and 4, the transmission time data of the data integration system is obtained, that is, the transmission time data includes the processing time of each area node for the data of different data types and the transmission time of the data of different data types between two area nodes. When a user needs to integrate data of one area into another area, the scheduler 320 of the central node can plan an efficient path according to the transmission time data of the above tables 2 and 4.
Table 4 Transmission duration table of the data integration system
Source region    Destination region    Data type    Transmission duration (ms)
Region 1         Region 2              oracle       100
Region 1         Region 2              mysql        100
Region 1         Region 5              oracle       1000
Region 1         Region 5              mysql        900
Region 5         Region 1              oracle       950
Region 5         Region 1              mysql        900
Region 5         Region 4              oracle       100
Region 5         Region 4              mysql        70
It should be noted that, because of network jitter and because network performance differs at different times, the transmission duration between any two area nodes in table 4 may be updated at preset time intervals. For example, every half hour, the transmission duration for data of a given data type between two area nodes is recalculated from the number of times the two areas transmitted data of that type within the half hour and the duration of each transmission, and the recalculated value is then synchronized to the area nodes of the other areas.
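A minimal sketch of such a periodic update is given below. The application does not state how the new value is derived from the transmission count and the individual durations; the arithmetic mean used here is an assumption, as are the name refresh_durations and the key layout.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (source region, destination region, data type); layout assumed for this sketch.
Key = Tuple[str, str, str]


def refresh_durations(table: Dict[Key, float],
                      samples: Dict[Key, List[float]]) -> Dict[Key, float]:
    """Return a new table in which every key that has samples from the last
    interval holds their arithmetic mean (assumed update rule)."""
    refreshed = dict(table)
    for key, durations in samples.items():
        if durations:  # keep the previous value if nothing was transmitted
            refreshed[key] = mean(durations)
    return refreshed
```

For instance, if two mysql transmissions from region 1 to region 5 took 900 ms and 1000 ms within the interval, the refreshed entry would be 950 ms under this assumed rule.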
The central node manages and synchronizes information such as the addresses of all area nodes in the data integration system, so that the area nodes can perform network detection on one another to obtain their respective transmission durations. In one approach, the area nodes report their respective transmission durations to the central node, and when a data integration task needs to be executed, the central node performs dynamic scheduling, which realizes centralized scheduling management based on the central node. In the other approach, each area node acquires information such as the transmission durations of the other area nodes through broadcasting or through the central node, and when a data integration task exists, the area nodes perform dynamic scheduling, which realizes distributed scheduling management based on the central node and the area nodes. In addition, when the area nodes perform network detection, network problems can be found in time and alarms can be raised, which improves the reliability of the network.
The following describes a data transmission method in distributed scheduling management.
Fig. 5 is an interaction schematic diagram of initialization of distributed scheduling management of a data integration system according to an embodiment of the present application, as shown in fig. 5, in the distributed scheduling management, a central node includes a management module 510, and the management module 510 is configured to manage metadata of area nodes of each area and synchronize the metadata with the area nodes of each area, where a function of the management module 510 is the same as a function of the management module 310. As shown in fig. 6, fig. 6 is a schematic structural diagram of another area node provided in this embodiment of the present application, where the area node of each area includes a management module 610, a queue module 620, an integration module 630, an agent module 640, an API gateway 650, and a scheduler 660. The management module 610, the queue module 620, the integration module 630, the agent module 640, and the API gateway 650 have the same functions as the corresponding modules in fig. 4, and the scheduler 660 has the same functions as the scheduler 320.
The system initialization process in the distributed scheduling management of the embodiment of the present application includes S501 to S504.
S501, the area nodes of each area report their respective metadata to the central node.
S502, the central node synchronizes metadata to the region nodes of each region.
S503, the area nodes of each area perform network detection according to the metadata of the area nodes to obtain detection results.
The execution processes of S501 to S503 are the same as the execution processes of S301 to S303, respectively, and will not be described in detail here.
S504, the area nodes of each area synchronize their detection results.
The area node of each area acquires transmission durations such as those shown in table 3 through network detection, and the management module 610 of each area node broadcasts the acquired transmission durations to the area nodes of the other areas in the data integration system. After receiving the transmission durations sent by the other area nodes, the management module 610 of each area node obtains the network transmission delay of the data integration system and generates the data shown in table 2 and table 4 from the received transmission durations and integration capabilities of the other area nodes. When a user needs to integrate data of one area into another area, the scheduler 660 of the area node that receives the integration task can plan an efficient path according to the transmission time data of tables 2 and 4.
It should be noted that the area node of each area can first upload its integration capability to the central node, which then synchronizes it to the area nodes of the other areas; alternatively, after the central node synchronizes the node addresses and transmission keys of the area nodes to each area node, each area node synchronizes its own integration capability to the other area nodes. This is not limited in the embodiments of the present application.
After the initialization of the data integration system, a user can perform data integration through the data integration system. In the embodiment of the application, the data integration system can provide data integration services for users in a cloud service mode, after the cloud services are purchased by the users on the cloud service platform, the cloud service platform provides data integration cloud services for the users, and terminal equipment used by the users can upload the data integration tasks to the cloud service platform through an Application Programming Interface (API) or through a webpage interface provided by the cloud service platform. After the cloud service platform receives the data integration task, the cloud service platform sends the data integration task to a scheduling node with a scheduler. It should be understood that the scheduling node in the present application is a central node or any one of the regional nodes having the function of a scheduler. If the data integration system adopts centralized scheduling management as shown in fig. 3, the cloud service platform sends a data integration task to the central node, namely the central node is used as a scheduling node; if the data integration system adopts distributed scheduling management as shown in fig. 5, the cloud service platform may send the data integration task to a region node, for example, a region node of a region closest to the cloud service platform, in which case, the region node serves as a scheduling node.
The data scheduling method provided by the embodiment of the present application is described in detail below with reference to fig. 3 to 6. As shown in fig. 7, fig. 7 is a schematic flowchart of a data integration method provided in the embodiment of the present application. The method comprises the following steps:
S701, the cloud service platform receives the data integration task and sends the data integration task to the scheduling node.
The scheduling node that receives the data integration task may be the central node of the management center or the area node of any area. The data integration task includes an identifier (ID) of a source node, a source data type, an identifier of a target node, and a target data type. The identifier of the source node indicates the starting point of the source data, that is, the area node of the area where the source data is located, which reads the source data; the identifier of the target node indicates the area node of the area to which the source data needs to be transmitted; the source data type is the type of the source data stored in the data source; and the target data type is the data type into which the source data needs to be converted. The source data type and the target data type may be the same or different.
Specifically, the user may configure the data integration task through the user terminal, and after the user configures the data integration task, the user terminal imports the data integration task into the data integration system. For example, if the data integration task is to integrate the source data in the data source of region 1 into region 5, the source data of region 1 is of oracle type, and the data of region 5 is of mysql type, the data integration task may be in the following form:
[Figure: example form of the data integration task (image not reproduced)]
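Purely as an illustration of the four items that a data integration task carries, a task for the example above might be represented as follows. The key names and the identifier format are hypothetical and are not taken from the application, whose own task form is not reproduced in this text.

```python
import json

# Hypothetical representation of a data integration task. The key names and
# the "region-N" identifier format are assumptions made for this sketch.
task = {
    "source_node_id": "region-1",    # area node where the source data is located
    "source_data_type": "oracle",    # type of the source data in the data source
    "target_node_id": "region-5",    # area node to which the data is transmitted
    "target_data_type": "mysql",     # type into which the source data is converted
}

# Serialized form, for example for upload through an API.
payload = json.dumps(task)
```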
S702, the scheduling node determines a target path according to the data integration task and the transmission time data.
Because the conversion durations of different area nodes differ, and the transmission durations of data of different data types between the same two area nodes differ, the total duration for transmitting the source data from the source node to the target node also depends on which area node converts the source data into the target data type. After receiving the data integration task, the scheduling node determines, based on the transmission time data, an efficient target path for transmitting the source data from the source node to the target node, such as the path that takes the shortest time. Using the conversion durations of the different area nodes and the transmission durations of data of different data types between area nodes, the scheduling node determines the area node at which the source data is to be converted into the target data type, so that the duration for transmitting the source data from the source node to the target node is the shortest.
In one possible implementation, the scheduling node first determines, according to the source node and the target node, m paths for transmitting the source data from the source node to the target node. It then selects each of the m paths in turn and, according to the transmission time data, determines for each of the n area nodes on the path the duration of transmitting the source data from the source node to the target node when that area node converts the source data from the source data type to the target data type, thereby obtaining n transmission durations corresponding to the path. Specifically, let the ith node on the path be the area node that converts the source data from the source data type to the target data type. The scheduling node determines a first transmission duration before the ith node and a second transmission duration after the ith node, where the first transmission duration is the duration for transmitting the source data from the source node to the ith node in the source data type, and the second transmission duration is the duration for transmitting the source data from the ith node to the target node in the target data type. The scheduling node then adds the conversion duration of the ith node for converting the source data type into the target data type, the first transmission duration, and the second transmission duration to obtain the transmission duration corresponding to the ith node. Letting i take each value from 1 to n, so that each of the n area nodes on the path in turn performs the conversion, yields the n transmission durations corresponding to the path.
After determining, by the above method, the n transmission durations corresponding to each of the m paths, the scheduling node determines the minimum of the n transmission durations of each path, compares the minimum durations of the paths, and takes the path with the smallest minimum duration as the target path. It should be noted that the value of n may be the same or different for different paths.
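The enumeration described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the function names, the node labels R1, R3, and R5, and the dictionary layouts are hypothetical, and the sketch additionally charges the forwarding duration of every intermediate node that does not convert, which follows the path 1 example of this description (80 + 100 + 20 + 100 ms). One link value is marked hypothetical because the text does not give it, and the sketch does not attempt to reproduce the other figures of table 5.

```python
from typing import Dict, List, Optional, Tuple

Convert = Dict[Tuple[str, str, str], int]   # (node, source type, target type) -> ms
Forward = Dict[str, int]                    # node -> forwarding ms
Transmit = Dict[Tuple[str, str, str], int]  # (from node, to node, data type) -> ms


def path_duration(path: List[str], i: int, src: str, dst: str,
                  convert: Convert, forward: Forward,
                  transmit: Transmit) -> Optional[int]:
    """Total ms when path[i] converts src to dst; None if an entry is unknown."""
    total = convert.get((path[i], src, dst))
    if total is None:
        return None
    for hop, (a, b) in enumerate(zip(path, path[1:])):
        # Hops before the converting node carry the source type,
        # hops from the converting node onward carry the target type.
        link = transmit.get((a, b, src if hop < i else dst))
        if link is None:
            return None
        total += link
    for k in range(1, len(path) - 1):
        if k != i:  # assumption: non-converting intermediate nodes only forward
            if path[k] not in forward:
                return None
            total += forward[path[k]]
    return total


def select_target_path(paths, src, dst, convert, forward, transmit):
    """Return (duration_ms, path, converting_node) with the smallest duration."""
    best = None
    for path in paths:
        for i in range(len(path)):
            duration = path_duration(path, i, src, dst, convert, forward, transmit)
            if duration is not None and (best is None or duration < best[0]):
                best = (duration, path, path[i])
    return best


convert = {("R1", "oracle", "mysql"): 80, ("R3", "oracle", "mysql"): 90,
           ("R5", "oracle", "mysql"): 100}                      # table 2
forward = {"R1": 20, "R3": 20, "R5": 20}                        # table 2, proxy rows
transmit = {
    ("R1", "R3", "mysql"): 100, ("R3", "R5", "mysql"): 100,     # path 1 example
    ("R1", "R3", "oracle"): 120,                                # table 3
    ("R3", "R5", "oracle"): 100,                                # hypothetical value
    ("R1", "R5", "oracle"): 1000, ("R1", "R5", "mysql"): 900,   # table 4
}
paths = [["R1", "R3", "R5"], ["R1", "R5"]]
best = select_target_path(paths, "oracle", "mysql", convert, forward, transmit)
```

With these inputs, the shortest option is to convert at R1 and send the data over R1, R3, R5, for a total of 300 ms, which agrees with the figure given for path 1.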
For example, the data integration task is to integrate the data of the data source of area 1 into area 5, the source data of area 1 is of the oracle type, the data of area 5 is of the mysql type, and the area node of area 2 receives the data integration task. According to the source node (the area node of area 1) and the target node (the area node of area 5) of the data integration task and the transmission time data, the scheduler 660 of the area node of area 2 determines the target path with the shortest duration for transmitting the data from area 1 to area 5; the target path includes at least the area node of area 1 and the area node of area 5.
Fig. 8 is a schematic diagram of transmission paths provided in an embodiment of the present application. As shown in fig. 8, data may be transmitted from the data source of area 1 to the data source of area 5 through 3 paths; the 3 paths and the duration required by each path are shown in table 5 below. Take path 1 as an example. The time for the area node of area 1 to convert oracle type data into mysql type data is 80 ms, which is less than the conversion time of the area nodes of area 3 and area 5, and the transmission durations between area 1 and area 3 and between area 3 and area 5 are the same, so the total duration is shortest when the data is converted into the mysql type in area 1 and then transmitted to area 5. The integration module 630 of the area node of area 1 reads the source data from the data source and converts the oracle type data into the mysql type, which takes 80 ms, and then sends the data to area 3; the source data needs 100 ms to be transmitted from area 1 to area 3 over the network. After the queue module 620 of the area node of area 3 receives the source data, the agent module 640 forwards the source data, which takes 20 ms, and the source data needs 100 ms to be transmitted from area 3 to area 5. Integrating the source data from the data source of area 1 into the data source of area 5 therefore takes 80 + 100 + 20 + 100 = 300 ms. According to the data in table 5, the area node of area 2 takes the path with path number 1 in table 5 as the target path.
Table 5
Path number    Regions on the path                       Duration (ms)
1              Region 1, region 3, region 5              300
2              Region 1, region 5                        1080
3              Region 1, region 2, region 4, region 5    400
The transmission durations for transmitting data from one area to another through different paths differ, and the duration for transmitting the source data from the source node to the target node also differs depending on which area node converts the source data into the target data type. The scheduling node calculates, for each path, the duration of transmitting the data over the network when each area node on the path performs the conversion, determines the minimum duration corresponding to each path, compares the minimum durations of the paths, and takes the path with the smallest value as the target path. In this way the most efficient way of transmitting the source data to the target node can be determined, which improves the efficiency of data integration and opening.
In another possible implementation, the scheduler 660 may instead measure the distance between two adjacent area nodes on a path by a weight. The scheduler adds the weights along each path to obtain the final weight of that path and takes the path with the minimum final weight as the target path.
The weight equals the transmission duration divided by the reliability ratio, where the reliability ratio indicates the probability of a successful network connection between two areas and is a number greater than 0 and less than 1. For example, a reliability ratio of 50% indicates that a connection between the two areas succeeds with a probability of one half. It should be noted that the transmission duration and the reliability ratio may be updated at preset time intervals. For example, every half hour, the transmission duration between two areas is recalculated from the number of times the two areas transmitted data of the same data type within the half hour and the duration of each transmission, and the recalculated value is then synchronized to the area nodes of the other areas.
For example, the transmission duration, the reliability ratio, and the weight between the areas on the 3 paths shown in fig. 8 within half an hour are shown in table 6. Taking area 1 and area 3 as an example, within the half hour the transmission duration for the area node of area 1 to transmit oracle type data to the area node of area 3 is 100 ms and the reliability ratio of the network connection between area 1 and area 3 is 0.80, so the weight between area 1 and area 3 is 100/0.80 = 125. According to the data in table 6, if the final weight is smallest when oracle type data is transmitted over the path with path number 1, then, until the weights between the areas are next updated, oracle type data sent from area 1 to area 5 is transmitted over the path with path number 1.
TABLE 6
[Table 6 is provided as an image in the original publication and is not reproduced here]
The reliability of network connection between two regional nodes is measured through the reliability ratio, and a target path is planned by combining the transmission time length and the reliability ratio between the two regional nodes, so that the dynamic scheduling of data can be realized, the efficiency of data integration and opening is improved, and the reliability of data integration opening is improved.
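The weight-based selection can be sketched as follows. Only the link between area 1 and area 3 (100 ms, reliability ratio 0.80) is taken from the text; the other link values are hypothetical placeholders, because table 6 is not available in this text. The function names and node labels are likewise assumptions of this sketch.

```python
from typing import Dict, List, Tuple

# (from node, to node) -> (transmission duration in ms, reliability ratio)
Links = Dict[Tuple[str, str], Tuple[float, float]]


def link_weight(duration_ms: float, reliability: float) -> float:
    """Weight of one link: transmission duration divided by reliability ratio."""
    if not 0 < reliability < 1:
        raise ValueError("reliability ratio must be greater than 0 and less than 1")
    return duration_ms / reliability


def final_weight(path: List[str], links: Links) -> float:
    """Sum of the link weights between adjacent area nodes on the path."""
    return sum(link_weight(*links[(a, b)]) for a, b in zip(path, path[1:]))


links = {
    ("R1", "R3"): (100.0, 0.80),   # from the text: weight 100 / 0.80 = 125
    ("R3", "R5"): (100.0, 0.80),   # hypothetical value
    ("R1", "R5"): (1000.0, 0.50),  # hypothetical value
}
paths = [["R1", "R3", "R5"], ["R1", "R5"]]
target_path = min(paths, key=lambda p: final_weight(p, links))
```

Dividing by the reliability ratio penalizes unreliable links: halving the reliability of a link doubles its weight, so a fast but unstable link can lose to a slower, stable one.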
S703, the scheduling node divides the data integration task into a plurality of subtasks according to the target path and sends each subtask to the area node of the corresponding area.
The scheduler 660 of the area node that receives the data integration task divides the data integration task into n subtasks according to the number n of the areas included in the target path, the area node of each area in the n areas executes one of the n subtasks, and each subtask includes an operation that the area node of the corresponding area needs to execute. Generally, the area nodes through which the source data passes include three types, namely a source node, a forwarding node and a target node, and the subtask corresponding to the source node reads the source data from the data source of the area where the source node is located and sends the source data to the next area in the target path; the subtask corresponding to the forwarding node is to forward the received data to the next area in the target path; and the subtask corresponding to the target node writes the received data into the corresponding data source. If the source data type of the source data in the data integration task is different from the target data type, a region node is also needed to convert the source data into the data of the target data type. The scheduler 660 splits the data integration task into a plurality of subtasks, and then sends each subtask to a corresponding area node. 
Illustratively, the source data is transmitted from the area 1 to the area 5, the source data type is an oracle type, the target data type is mysql, the target path determined according to the method passes through the area 1, the area 3 and the area 5, and since the time for converting oracle into mysql by the area 1 is shortest, the subtasks of the area node of the area 1 include reading the source data from the data source, converting the source data from the source data type into the target data type, and then sending the converted data to the node of the area 3; the subtask of the area node of the area 3 is to forward the received data, and the area 3 directly forwards the received data to the area node of the area 5 after receiving the data of the area node of the area 1. The subtask of the zone node of zone 5 is to write the received data to the corresponding data source.
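The splitting of a task into per-node subtasks can be sketched as follows. The operation labels, the dictionary keys, and the function name split_into_subtasks are hypothetical; the sketch assumes a target path with at least two area nodes and a converting node that has already been chosen.

```python
from typing import Dict, List, Optional


def split_into_subtasks(path: List[str], converter: str,
                        src_type: str, dst_type: str) -> List[Dict[str, object]]:
    """Build one subtask per area node on the target path."""
    if len(path) < 2:
        raise ValueError("the target path needs a source node and a target node")
    last = len(path) - 1
    subtasks = []
    for index, node in enumerate(path):
        # The source node starts by reading the source data from its data source.
        operations = ["read"] if index == 0 else []
        if node == converter and src_type != dst_type:
            operations.append(f"convert {src_type}->{dst_type}")
        if index == 0:
            operations.append("send")      # source node sends to the next area
        elif index < last:
            operations.append("forward")   # forwarding node passes the data on
        else:
            operations.append("write")     # target node writes to its data source
        next_node: Optional[str] = path[index + 1] if index < last else None
        subtasks.append({"node": node, "operations": operations, "next": next_node})
    return subtasks


subtasks = split_into_subtasks(["R1", "R3", "R5"], "R1", "oracle", "mysql")
```

For the example of area 1, area 3, and area 5, this yields a read, convert, and send subtask for the first node, a forward subtask for the second, and a write subtask for the third.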
S704, the area nodes of each area perform the corresponding operations according to the received subtasks.
After receiving the subtasks, the area nodes of each area perform the corresponding operations according to their respective subtasks and complete the data integration scheduling. For example, the area nodes of area 1, area 3, and area 5 act on their respective subtasks as follows: the integration module 630 of the area node of area 1 reads the source data from the data source, converts the source data from the oracle type into the mysql type, and sends the converted data to area 3; after receiving the data, the queue module 620 of the area node of area 3 forwards it through the agent module 640; and after receiving the data, the queue module 620 of the area node of area 5 writes it into the data source of area 5, which completes the data integration task.
When cross-area data integration is required, the scheduling node of the data integration system plans, according to the source node, the target node, and the acquired transmission time data of each area node in the data integration system, an efficient path for transmitting the source data from the source node to the target node. Dynamic planning by the scheduling node during cross-area data integration can improve the efficiency of data integration and opening.
The data integration system and the data integration method based on the data integration system provided by the embodiment of the present application are described in detail above with reference to fig. 1 to 8, and the related apparatus and the computing device for performing data integration provided by the embodiment of the present application are described below with reference to fig. 9 to 11. Referring to fig. 9, fig. 9 is a schematic structural diagram of a data integration apparatus provided in an embodiment of the present application, where the data integration apparatus 900 includes: a communication unit 910 and a processing unit 920, wherein,
a communication unit 910, configured to obtain a data integration task, where the data integration task includes an identifier of a source node, a type of source data, and an identifier of a target node, the identifier of the source node indicates the source node from which the source data needs to be transmitted, the identifier of the target node indicates the target node to which the source data needs to be transmitted, and the source node and the target node are any two area nodes among the multiple area nodes. For the content specifically included in the data integration task, refer to the description in S701; details are not described herein again.
A processing unit 920, configured to determine a target path for sending source data from a source node to a target node based on the data integration task and a transmission duration, where the transmission duration includes a duration for transmitting different types of data between any two regional nodes; and executing the data integration task according to the target path.
Specifically, the method for implementing data scheduling by the communication unit 910 and the processing unit 920 in the data integration apparatus 900 may refer to the operation of data integration in the foregoing method embodiment, and is not described herein again.
The above units may perform data transmission through a communication path, and it should be understood that each unit included in the data integration apparatus 900 may be a software unit, a hardware unit, or a part of the software unit and a part of the hardware unit.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and each module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computing device provided in an embodiment of the present application, where the computing device 100 includes: one or more processors 110, a communication interface 120, and a memory 130, the processors 110, the communication interface 120, and the memory 130 being interconnected by a bus 140, wherein,
specific implementations of the processor 110 to perform various operations may refer to specific operations in the above-described method embodiments. For example, the processor 110 is configured to perform the operations of S702 and S703 in fig. 7, which are not described herein again.
The processor 110 may have various specific implementations. For example, the processor 110 may be a central processing unit (CPU) or a graphics processing unit (GPU), and the processor 110 may be a single-core processor or a multi-core processor. The processor 110 may be a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof. The processor 110 may also be implemented as a logic device with built-in processing logic, such as an FPGA or a digital signal processor (DSP).
The communication interface 120 may be a wired interface, such as an ethernet interface, a Local Interconnect Network (LIN), or the like, or a wireless interface, such as a cellular network interface or a wireless lan interface, for communicating with other modules or devices. In this embodiment of the application, the communication interface 120 may be specifically configured to perform the operations of acquiring the data integration task in S701 and receiving the metadata in the system initialization process.
The memory 130 may be a non-volatile memory, such as a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The memory 130 may also be a volatile memory, which may be a random access memory (RAM) that acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The memory 130 may be used to store program code and data, so that the processor 110 can call the program code stored in the memory 130 to perform the operation steps of initialization and data integration in the above-described method embodiments. Moreover, the computing device 100 may contain more or fewer components than shown in fig. 10, or have a different arrangement of components.
The bus 140 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
Optionally, the computing device 100 may further include an input/output interface 150, and the input/output interface 150 is connected with an input/output device for receiving input information and outputting an operation result.
For the specific implementation of the operations performed by the computing device 100, reference may be made to the operations performed by the scheduling node in fig. 7 in the foregoing method embodiment; details are not described here again.
Fig. 11 is a schematic structural diagram of a computing system according to an embodiment of the present application. The scheduling node provided herein may include one or more computing devices, and the modules of the scheduling node may be deployed in a distributed manner on multiple computing devices in the same environment or in different environments. The present application therefore also provides the computing system shown in fig. 11, which includes multiple computing devices 200. Each computing device 200 includes one or more processors 210, communication interfaces 220, and memories 230, which are connected to each other through a bus 240; the bus 240 may include a path for transmitting information among these components of the computing device 200. For the specific forms of the processor 210, the communication interface 220, and the memory 230, refer to the above descriptions of the processor 110, the communication interface 120, and the memory 130 in the computing device 100, respectively; details are not described here again.
Optionally, the computing device 200 may further include an input/output interface 250, and the input/output interface 250 is connected with an input/output device for receiving input information and outputting an operation result.
A communication path is established between the computing devices 200 through a communication network. Each computing device 200 runs any one or more of the modules of the scheduling node. For example, when the scheduling node is one of the plurality of regional nodes, the scheduler 660 is deployed in a first computing device, the management module 610 is deployed in a second computing device, and the queue module 620 and the integration module 630 are deployed in a third computing device. Any of the computing devices 200 may be a computer (e.g., a server) in a cloud service platform or a computer in an edge data center.
Embodiments of the present application further provide a computer-readable storage medium storing instructions. When the instructions are run on a processor, the method steps in the foregoing method embodiments can be implemented. For the specific implementation of these method steps by the processor, reference may be made to the operations of the foregoing method embodiments; details are not described here again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line) or a wireless manner (e.g., infrared or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium, or a semiconductor medium. The semiconductor medium may be a solid-state drive (SSD).
The steps in the method of the embodiment of the application can be sequentially adjusted, combined or deleted according to actual needs; the modules in the device of the embodiment of the application can be divided, combined or deleted according to actual needs.
The embodiments of the present application have been described in detail above to illustrate the principles and implementations of the present application; the description of the embodiments is provided only to help understand the method and core concept of the present application. A person skilled in the art may, based on the idea of the present application, vary the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (26)

1. A data integration method applied to a data integration system, the data integration system comprising a plurality of regional nodes, comprising:
acquiring a data integration task, wherein the data integration task comprises an identifier of a source node, a source data type of source data and an identifier of a target node, and the source node and the target node are any two area nodes in the plurality of area nodes;
determining a target path for sending the source data from the source node to the target node based on the data integration task and a transmission duration, wherein the transmission duration comprises the duration for transmitting different types of data between any two regional nodes in the data integration system;
and executing the data integration task according to the target path.
2. The method of claim 1, wherein the transmission duration is obtained by the plurality of regional nodes performing network probing.
3. The method of claim 1 or 2, wherein before the determining a target path for sending the source data from the source node to the target node based on the data integration task and the transmission duration, the method further comprises:
acquiring conversion duration and forwarding duration corresponding to an area node in the plurality of area nodes, wherein the conversion duration comprises the duration for converting the source data from the source data type into the target data type, and the forwarding duration comprises the duration for forwarding the received data by the area node;
the determining a target path for sending the source data from the source node to the target node based on the data integration task and the transmission duration comprises:
and determining a target path for sending the source data from the source node to the target node based on the data integration task, the transmission duration, the conversion duration and the forwarding duration.
4. The method of claim 3, wherein obtaining the conversion duration and the forwarding duration for each of the plurality of regional nodes comprises:
and receiving metadata sent by each regional node in the plurality of regional nodes, wherein the metadata comprises the address of the regional node, the conversion duration and the forwarding duration.
5. The method of claim 4, wherein before the determining a target path for sending the source data from the source node to the target node based on the data integration task, the transmission duration, the conversion duration, and the forwarding duration, the method further comprises:
transmitting the address of each regional node to the plurality of regional nodes;
receiving the transmission duration corresponding to each area node uploaded by each area node, wherein the transmission duration corresponding to each area node is obtained by the area node executing network detection based on the address of each area node, and the transmission duration corresponding to each area node comprises the transmission duration between each area node and other area nodes in the data integration system.
6. The method of any one of claims 1 to 5, wherein the data integration method is performed by a central node of a data integration system, the central node being communicatively coupled to the plurality of area nodes.
7. The method of claim 3, wherein obtaining the conversion duration and the forwarding duration for each of the plurality of regional nodes comprises:
each regional node in the plurality of regional nodes sends corresponding metadata to a central node of a data integration system, wherein the metadata comprises the conversion duration and the forwarding duration;
and receiving the metadata corresponding to each area node sent by the central node.
8. The method of claim 7, wherein the metadata further comprises an address of each of the regional nodes;
before determining a target path for sending the source data from the source node to the target node based on the data integration task and the transmission duration, the method further includes:
performing, by each area node, network detection according to the address of each area node to obtain the transmission duration corresponding to that area node, wherein the transmission duration corresponding to each area node comprises the transmission duration between that area node and the other area nodes in the data integration system;
and sending, by each area node, the transmission duration corresponding to that area node to the other area nodes in the data integration system according to the address of each area node.
9. The method according to any one of claims 1 to 3 and 7 to 8, wherein the data integration method is performed by any one of the plurality of area nodes.
10. The method of claim 6 or 9, wherein the data integration task further comprises a target data type,
the determining a target path for sending the source data from the source node to the target node based on the data integration task and the transmission duration comprises:
determining a plurality of paths for transmitting the source data from the source node to the target node based on the data integration task, the transmission duration, the conversion duration, and the forwarding duration;
determining n durations for transmitting the source data from the source node to the target node through each path according to the transmission duration, the conversion duration and the forwarding duration, wherein the durations for transmitting the source data from the source node to the target node are different when different regional nodes in each path convert the source data from the source data type to the target data type, n is the number of regional nodes corresponding to each path, and n is an integer greater than 1;
determining the minimum duration among the n durations of each path;
and taking the path corresponding to the minimum value among the minimum durations corresponding to the paths as the target path.
11. The method of claim 10, wherein determining n durations for transmitting the source data from the source node to the target node over each path comprises:
determining a first transmission duration before an ith area node and a second transmission duration after the ith area node in each path, wherein the ith area node is an area node for converting the source data from the source data type to the target data type, the first transmission duration is the duration for transmitting the source data from the source node to the ith area node in the source data type, and the second transmission duration is the duration for transmitting the source data from the ith area node to the target node in the target data type;
and adding the conversion time length corresponding to the ith area node, the first transmission time length and the second transmission time length to obtain the time length corresponding to the ith area node, wherein the value of i is an integer from 1 to n.
12. The method of claim 6 or 9, wherein the determining a target path for sending the source data from the source node to the target node comprises:
determining, based on the source node and the target node, a plurality of paths for transmitting the source data from the source node to the target node;
determining the weight between any two adjacent region nodes according to the transmission time length between any two adjacent region nodes in each path and the reliability ratio between any two adjacent region nodes, wherein the reliability ratio indicates the probability of successful network connection between any two adjacent region nodes;
determining a final weight corresponding to each path according to the weight between any two adjacent region nodes in each path;
and taking the path corresponding to the minimum value in the final weights corresponding to the paths as the target path.
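The weight-based selection recited in claim 12 can be illustrated with a minimal Python sketch. The claim does not state how the transmission duration and the reliability ratio are combined, nor how the per-link weights are aggregated, so the formulas used here (link weight = duration divided by reliability ratio, final weight = sum of link weights) are assumptions made purely for illustration. The function names and the sample data are likewise hypothetical.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (upstream node, downstream node)


def link_weight(duration: float, reliability: float) -> float:
    """Weight of one link between adjacent area nodes.

    ASSUMPTION: weight = duration / reliability ratio, so an unreliable
    link looks proportionally slower. The claim gives no formula.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability ratio must be in (0, 1]")
    return duration / reliability


def final_weight(path: List[str],
                 duration: Dict[Link, float],
                 reliability: Dict[Link, float]) -> float:
    """ASSUMPTION: the final weight of a path is the sum of its link weights."""
    return sum(link_weight(duration[(a, b)], reliability[(a, b)])
               for a, b in zip(path, path[1:]))


def select_by_weight(paths: List[List[str]],
                     duration: Dict[Link, float],
                     reliability: Dict[Link, float]) -> List[str]:
    """Return the candidate path with the smallest final weight."""
    return min(paths, key=lambda p: final_weight(p, duration, reliability))


# Hypothetical sample data: three area nodes A, B, C.
LINK_DURATION = {("A", "B"): 2.0, ("B", "C"): 2.0, ("A", "C"): 5.0}
LINK_RELIABILITY = {("A", "B"): 0.5, ("B", "C"): 1.0, ("A", "C"): 1.0}
CANDIDATES = [["A", "B", "C"], ["A", "C"]]

if __name__ == "__main__":
    print(select_by_weight(CANDIDATES, LINK_DURATION, LINK_RELIABILITY))
```

With this sample data the two-hop path has the shorter raw duration (4 versus 5), but the unreliable A-B link raises its final weight to 6, so the direct path A-C is selected.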
13. A data integration apparatus applied to a data integration system, the data integration system including a plurality of area nodes, comprising:
a communication unit, configured to acquire a data integration task, where the data integration task includes an identifier of a source node, a source data type of source data, and an identifier of a target node, and the source node and the target node are any two area nodes in the multiple area nodes;
a processing unit, configured to determine, based on the data integration task and a transmission duration, a target path for sending the source data from the source node to the target node, where the transmission duration includes a duration for transmitting different types of data between any two area nodes in the data integration system;
and executing the data integration task according to the target path.
14. The apparatus of claim 13, wherein the transmission duration is obtained by the plurality of regional nodes performing network probing.
15. The apparatus of claim 13 or 14,
the communication unit is further configured to: acquiring conversion duration and forwarding duration of area nodes in the plurality of area nodes, wherein the conversion duration comprises duration for converting the source data from the source data type into the target data type, and the forwarding duration comprises duration for forwarding the received data by the area nodes;
the processing unit is specifically configured to: determine a target path for sending the source data from the source node to the target node based on the data integration task, the transmission duration, the conversion duration and the forwarding duration.
16. The apparatus according to claim 15, wherein the obtaining, by the communication unit, the conversion duration and the forwarding duration of each of the plurality of area nodes specifically includes:
and receiving metadata sent by each regional node in the plurality of regional nodes, wherein the metadata comprises the address of the regional node, the conversion duration and the forwarding duration.
17. The apparatus of claim 16,
the communication unit is further configured to send the address of each area node to the plurality of area nodes;
receiving the transmission duration corresponding to each regional node uploaded by each regional node, wherein the transmission duration corresponding to each regional node is obtained by each regional node executing network detection based on the metadata, and the transmission duration corresponding to each regional node comprises the transmission duration between each regional node and other regional nodes in the data integration system.
18. The apparatus of any one of claims 13 to 17, wherein the data integration apparatus is located at a central node of a data integration system, the central node being communicatively coupled to the plurality of area nodes.
19. The apparatus according to claim 15, wherein the obtaining, by the communication unit, the conversion duration and the forwarding duration of each of the plurality of area nodes specifically includes:
sending metadata corresponding to the area node to a central node of a data integration system, wherein the metadata comprises the conversion duration and the forwarding duration;
and receiving the metadata corresponding to each area node sent by the central node.
20. The apparatus of claim 19, wherein the metadata further comprises an address of each of the regional nodes;
the processing unit is further configured to: perform network detection according to the address of each area node to obtain the transmission duration corresponding to the area node where the data integration device is located, wherein the transmission duration corresponding to the area node where the data integration device is located comprises the transmission duration between the area node where the data integration device is located and other area nodes in the data integration system;
the communication unit is further configured to: send the transmission duration corresponding to the area node where the data integration device is located to other area nodes in the data integration system according to the address of each area node.
21. The apparatus according to any one of claims 13 to 15 and 19 to 20, wherein the data integration apparatus is located in any one of the plurality of regional nodes.
22. The apparatus of claim 18 or 21, wherein the data integration task further comprises a target data type,
the processing unit determines, based on the data integration task and the transmission duration, a target path for sending the source data from the source node to the target node, and specifically includes:
determining a plurality of paths for transmitting the source data from the source node to the target node based on the data integration task, the transmission duration, the conversion duration, and the forwarding duration;
determining n durations for transmitting the source data from the source node to the target node through each path according to the transmission duration, the conversion duration and the forwarding duration, wherein the durations for transmitting the source data from the source node to the target node are different when different regional nodes in each path convert the source data from the source data type to the target data type, n is the number of regional nodes corresponding to each path, and n is an integer greater than 1;
determining the minimum duration among the n durations of each path;
and taking the path corresponding to the minimum value among the minimum durations corresponding to the paths as the target path.
23. The apparatus of claim 22, wherein the processing unit determines n durations for transmitting the source data from the source node to the target node via each path, specifically comprising:
determining a first transmission duration before an ith area node and a second transmission duration after the ith area node in each path, wherein the ith area node is an area node for converting the source data from the source data type to the target data type, the first transmission duration is the duration for transmitting the source data from the source node to the ith area node in the source data type, and the second transmission duration is the duration for transmitting the source data from the ith area node to the target node in the target data type;
and adding the conversion duration corresponding to the ith area node, the first transmission duration and the second transmission duration to obtain the duration corresponding to the ith area node, wherein the value of i is an integer from 1 to n.
24. The apparatus according to claim 18 or 21, wherein the processing unit determines a target path for sending the source data from the source node to the target node, and specifically comprises:
determining, based on the source node and the target node, a plurality of paths for transmitting the source data from the source node to the target node;
determining the weight between any two adjacent region nodes according to the transmission time length between any two adjacent region nodes in each path and the reliability ratio between any two adjacent region nodes, wherein the reliability ratio indicates the probability of successful network connection between any two adjacent region nodes;
determining a final weight corresponding to each path according to the weight between any two adjacent region nodes in each path;
and taking the path corresponding to the minimum value in the final weights corresponding to the paths as the target path.
25. A computing device comprising a processor and a memory, the memory for storing instructions, the processor for executing the instructions, the processor when executing the instructions performing the method of any of claims 1 to 12.
26. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, performs the method of any one of claims 1 to 12.
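The duration-based selection recited in claims 10 and 11 (and mirrored in claims 22 and 23) can be sketched in Python as follows. For each candidate path with n area nodes, the sketch tries each node in turn as the node that converts the data, sums the source-type transmission before that node, the conversion duration at that node, and the target-type transmission after it, then keeps the minimum per path and the minimum across paths. How the forwarding duration enters the sum is not spelled out in the claims; the sketch assumes that every intermediate node other than the converting node contributes its forwarding duration. All names and sample values are hypothetical.

```python
from typing import Dict, List, Tuple

Hop = Tuple[str, str, str]  # (from_node, to_node, data_type)


def candidate_durations(path: List[str], src_type: str, dst_type: str,
                        transmit: Dict[Hop, float],
                        convert: Dict[str, float],
                        forward: Dict[str, float]) -> List[float]:
    """Return n durations, one per choice of the converting node on the path."""
    n = len(path)
    result = []
    for i in range(n):
        # First transmission duration: hops before node i, in the source data type.
        first = sum(transmit[(path[k], path[k + 1], src_type)] for k in range(i))
        # Second transmission duration: hops after node i, in the target data type.
        second = sum(transmit[(path[k], path[k + 1], dst_type)]
                     for k in range(i, n - 1))
        # ASSUMPTION: intermediate nodes that do not convert add their forwarding duration.
        relay = sum(forward[path[k]] for k in range(1, n - 1) if k != i)
        result.append(first + convert[path[i]] + second + relay)
    return result


def choose_target_path(paths: List[List[str]], src_type: str, dst_type: str,
                       transmit: Dict[Hop, float],
                       convert: Dict[str, float],
                       forward: Dict[str, float]) -> Tuple[float, List[str], str]:
    """Return (duration, path, converting node) with the smallest duration overall."""
    best = None
    for path in paths:
        durations = candidate_durations(path, src_type, dst_type,
                                        transmit, convert, forward)
        i = min(range(len(durations)), key=durations.__getitem__)
        if best is None or durations[i] < best[0]:
            best = (durations[i], path, path[i])
    if best is None:
        raise ValueError("no candidate paths")
    return best


# Hypothetical sample data: three area nodes, two data types.
TRANSMIT = {
    ("A", "B", "csv"): 4, ("A", "B", "parquet"): 2,
    ("B", "C", "csv"): 6, ("B", "C", "parquet"): 3,
    ("A", "C", "csv"): 12, ("A", "C", "parquet"): 7,
}
CONVERT = {"A": 5, "B": 1, "C": 4}
FORWARD = {"A": 0, "B": 1, "C": 0}
PATHS = [["A", "B", "C"], ["A", "C"]]

if __name__ == "__main__":
    print(choose_target_path(PATHS, "csv", "parquet", TRANSMIT, CONVERT, FORWARD))
```

In this sample, converting at the intermediate node B on the path A, B, C gives the smallest total (8), so that path is selected with B as the converting node.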
CN202110523117.8A 2020-12-30 2021-05-13 Data integration method and device and related equipment Pending CN114697228A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/141843 WO2022143583A1 (en) 2020-12-30 2021-12-28 Data integration method, apparatus, and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011628743 2020-12-30
CN2020116287435 2020-12-30

Publications (1)

Publication Number Publication Date
CN114697228A true CN114697228A (en) 2022-07-01

Family

ID=82136316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110523117.8A Pending CN114697228A (en) 2020-12-30 2021-05-13 Data integration method and device and related equipment

Country Status (2)

Country Link
CN (1) CN114697228A (en)
WO (1) WO2022143583A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9367368B2 (en) * 2012-08-30 2016-06-14 Nec Corporation Event processing control device, node device, event processing system, and event processing control method
CN113395210B (en) * 2016-06-29 2022-09-16 华为技术有限公司 Method for calculating forwarding path and network equipment
CN110391982B (en) * 2018-04-20 2022-03-11 伊姆西Ip控股有限责任公司 Method, apparatus and computer program product for transmitting data
CN111211980B (en) * 2019-12-17 2022-06-03 中移(杭州)信息技术有限公司 Transmission link management method, transmission link management device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2022143583A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
CN108712332B (en) Communication method, system and device
JP2018508137A (en) System and method for SDT to work with NFV and SDN
CN110601978B (en) Flow distribution control method and device
CN112039796B (en) Data packet transmission method and device, storage medium and electronic equipment
CN111431803A (en) Routing method and device
CN114553760B (en) Path weight distribution method and device
US11394800B2 (en) Systems and methods for remote network topology discovery
CN111447143B (en) Business service data transmission method and device, computer equipment and storage medium
CN115208812A (en) Service processing method and device, equipment and computer readable storage medium
CN108347377B (en) Data forwarding method and device
CN111092925B (en) Block chain capacity expansion processing method, device and equipment
WO2022164732A1 (en) System and method for network and computation performance probing for edge computing
US11290575B2 (en) Connecting computer processing systems and transmitting data
US11357020B2 (en) Connecting computer processing systems and transmitting data
US11405766B2 (en) Connecting computer processing systems and transmitting data
CN114697228A (en) Data integration method and device and related equipment
CN115277504B (en) Network traffic monitoring method, device and system
US11706097B2 (en) Task processing method applied to network topology, electronic device and storage medium
US20180255157A1 (en) Network service chains using hardware logic devices in an information handling system
CN112256714A (en) Data synchronization method and device, electronic equipment and computer readable medium
CN111314457B (en) Method and device for setting virtual private cloud
US20230232434A1 (en) Apparatus and method for performing ai/ml job
CN113472565B (en) Method, apparatus, device and computer readable medium for expanding server function
CN112422613B (en) Data processing method, data processing platform and computer readable storage medium
US20230171180A1 (en) Data processing method, packet sending method, and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination