CN112905854A - Data processing method and device, computing equipment and storage medium - Google Patents

Data processing method and device, computing equipment and storage medium Download PDF

Info

Publication number
CN112905854A
CN112905854A (application CN202110245792.9A)
Authority
CN
China
Prior art keywords
data
partitions
database
graph database
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110245792.9A
Other languages
Chinese (zh)
Inventor
王海霖
张灵星
李嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongjing Huizhong Technology Co ltd
Original Assignee
Beijing Zhongjing Huizhong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongjing Huizhong Technology Co ltd filed Critical Beijing Zhongjing Huizhong Technology Co ltd
Priority to CN202110245792.9A priority Critical patent/CN112905854A/en
Publication of CN112905854A publication Critical patent/CN112905854A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/903 Querying
    • G06F 16/90335 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data processing method, an apparatus, a computing device and a storage medium are provided. The method can comprise the following steps: loading a plurality of data into a memory of a distributed computing engine; partitioning the plurality of data such that the plurality of data is distributed into a plurality of partitions of a memory of the distributed computing engine; establishing a connection to a graph database for each partition; and writing the plurality of data in the plurality of partitions into respective storage areas of the graph database according to the connection of each of the plurality of partitions to the graph database.

Description

Data processing method and device, computing equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, a computing device, and a storage medium.
Background
A Graph Database (GDB) is a database that uses graph structures for semantic queries, representing and storing data with vertices, edges, and attributes.
Disclosure of Invention
When data processing and data writing are performed on a graph database, especially in an application scenario where a large amount of data exists, processing efficiency is low, and it is difficult to meet business requirements. It would be advantageous to provide a mechanism that alleviates, mitigates or even eliminates one or more of the above-mentioned problems.
According to an aspect of the present disclosure, there is provided a data processing method including: loading a plurality of data into a memory of a distributed computing engine; partitioning the plurality of data such that the plurality of data is distributed into a plurality of partitions of a memory of the distributed computing engine; establishing a connection to a graph database for each partition; and writing the plurality of data in the plurality of partitions into respective storage areas of the graph database according to the connection of each of the plurality of partitions to the graph database.
According to another aspect of the present disclosure, there is provided a data processing apparatus including: a data loading unit configured to load a plurality of data into a memory of a distributed computing engine; a data partitioning unit configured to partition a plurality of data such that the plurality of data is distributed into a plurality of partitions of a memory of a distributed computing engine; a connection establishing unit configured to establish a connection to the graph database for each partition; and a data writing unit configured to write a plurality of data in the plurality of partitions into respective storage areas of the graph database according to a connection of each of the plurality of partitions to the graph database.
According to another aspect of the present disclosure, there is provided a computing device comprising: a memory, a processor and a computer program stored on the memory, the processor being configured to execute the computer program to implement the steps of the data processing method according to an embodiment of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a data processing method according to an embodiment of the present disclosure.
According to yet another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, realizes the steps of a data processing method according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, the writing efficiency of data to the graph database can be improved.
These and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram of an example system in which various methods described herein may be implemented, according to an example embodiment;
FIG. 2 is a flow chart diagram of a data processing method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a data flow diagram of a data processing method according to an exemplary embodiment of the present disclosure;
FIG. 4 is a flow chart diagram of a data processing method according to another exemplary embodiment of the present disclosure;
FIG. 5 is a data flow diagram of a data processing method according to another exemplary embodiment of the present disclosure;
fig. 6 is a flowchart of a configuration method of a data processing method according to an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic block diagram illustrating a data processing apparatus according to an example embodiment; and
FIG. 8 is a block diagram illustrating an exemplary computer device that can be applied to the exemplary embodiments.
Detailed Description
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, there may be one or more elements. As used herein, the term "plurality" means two or more, and the term "based on" should be interpreted as "based, at least in part, on". Further, the terms "and/or" and "at least one of" encompass any and all possible combinations of the listed items.
Exemplary embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram illustrating an example system 100 in which various methods described herein may be implemented, according to an example embodiment.
Referring to fig. 1, the system 100 includes a client device 110, a server 120, and a network 130 communicatively coupling the client device 110 and the server 120.
The client device 110 includes a display 114 and a client application (APP) 112 that can be displayed via the display 114. The client application 112 may be an application that needs to be downloaded and installed before running, or an applet (lite app), i.e., a lightweight application. In the case where the client application 112 is an application that needs to be downloaded and installed before running, the client application 112 may be pre-installed on the client device 110 and activated. In the case where the client application 112 is an applet, the user 102 can run the client application 112 directly on the client device 110, without installation, by searching for it in a host application (e.g., by the name of the client application 112) or by scanning a graphical code (e.g., a barcode or two-dimensional code) of the client application 112. In some embodiments, the client device 110 may be any type of mobile computer device, including a mobile computer, a mobile phone, a wearable computer device (e.g., a smart watch, or a head-mounted device including smart glasses), or another type of mobile device. In some embodiments, the client device 110 may alternatively be a stationary computer device, such as a desktop computer, a server computer, or another type of stationary computer device.
The server 120 is typically a server deployed by an Internet Service Provider (ISP) or Internet Content Provider (ICP). Server 120 may represent a single server, a cluster of multiple servers, a distributed system, or a cloud server providing an underlying cloud service (such as cloud database, cloud computing, cloud storage, cloud communications). It will be understood that although the server 120 is shown in fig. 1 as communicating with only one client device 110, the server 120 may provide background services for multiple client devices simultaneously.
Examples of network 130 include a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a combination of communication networks such as the Internet. The network 130 may be a wired or wireless network. In some embodiments, data exchanged over network 130 is processed using techniques and/or formats including hypertext markup language (HTML), extensible markup language (XML), and the like. In addition, all or some of the links may also be encrypted using encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), internet protocol security (IPsec), and so on. In some embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
For purposes of the disclosed embodiments, in the example of fig. 1, the client application 112 may be a data processing application that may provide various functions for data processing, such as data selection, command configuration, runtime environment configuration, display of runtime results, and so forth. In particular, the client application 112 may be a task management platform or a runtime platform, etc., as described below. Accordingly, server 120 may be a server for use with data processing applications. The server 120 may receive user instructions from the client application 112 running in the client device 110 and provide various data processing services, such as data reading, data writing, data synchronization, knowledge graph presentation, etc., to the client application 112 running in the client device 110. Alternatively, the server 120 may also provide data to the client device 110, providing processing services such as management of data processing tasks or presentation of execution results, etc., by the client application 112 running in the client device 110, in accordance with the data.
Fig. 2 is a flow chart illustrating a data processing method 200 according to an exemplary embodiment. Method 200 may be performed at a client device (e.g., client device 110 shown in fig. 1). In some embodiments, method 200 may be performed at a server (e.g., server 120 shown in fig. 1). In some embodiments, method 200 may be performed by a client device (e.g., client device 110) in combination with a server (e.g., server 120). Hereinafter, each step of the data processing method 200 is described in detail, taking the server 120 as the execution subject by way of example.
Referring to fig. 2, at step S201, a plurality of data are loaded into the memory of the distributed computing engine.
At step S202, the plurality of data is partitioned such that the plurality of data is distributed into a plurality of partitions of a memory of the distributed computing engine.
At step S203, a connection to the graph database is established for each partition.
At step S204, the plurality of data in the plurality of partitions is written into respective storage areas of the graph database according to the connection of each of the plurality of partitions to the graph database.
According to the method 200, data is partitioned in the memory of the distributed computing engine, and each partition separately establishes a connection and writes to the target database. Distributed writing of a plurality of data into the target database can thus be realized, improving the efficiency of data processing, particularly data writing. It is understood that the data processing method 200 may also be referred to as a data import method, a data storage method, a graph database warehousing method, a data synchronization method, or the like; the present disclosure is not limited thereto.
The graph database is adapted to store a knowledge graph maintained in the form of edges and vertices. A knowledge graph is a body of knowledge organized in graph form, as used in knowledge engineering; different types of entities form its nodes, and relationships form the edges connecting those nodes. In a related scenario, in order to perform business analysis on massive data through a knowledge graph, the massive data may need to be turned into a knowledge graph system as quickly as possible, completing the knowledge-graph-level construction of the raw data. For example, in certain application scenarios, knowledge graph construction needs to be completed within 24 hours for billions of node and relationship records. Distributed methods of processing and writing data are therefore particularly advantageous.
A modified example of the data processing method according to some other embodiments of the present disclosure is described below.
According to some embodiments, the plurality of data in the plurality of partitions may be written to the respective storage areas of the graph database in parallel. Parallel writing of the data thus exploits the advantages of the distributed computing engine, increasing data processing efficiency and saving the time required for data processing, particularly data storage.
According to some embodiments, the plurality of data may be data stored in structured form. A conversion from structured data to graph data can thus be achieved; for example, according to such embodiments, structured data in a data warehouse may be transformed into the unstructured data of a graph database. The structured data may be tables describing individual entities (which become vertices) and the relationships between entities (which become edges). For example, the vertices may be Zhang San, Li Si, Wang Wu, etc., and a relationship may indicate a friendship between Zhang San and Li Si, a lending relationship between Li Si and Wang Wu, and so on.
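As an illustrative sketch of the table-to-graph conversion described above (the table schema and field names here are assumptions for illustration, not from the patent):

```python
def table_to_graph(entity_rows, relation_rows):
    """Convert structured entity/relation table rows into vertex and edge records."""
    # Each entity row becomes a vertex.
    vertices = [{"id": r["id"], "label": r["name"]} for r in entity_rows]
    # Each relation row becomes an edge between two vertices.
    edges = [
        {"src": r["from_id"], "dst": r["to_id"], "label": r["relation"]}
        for r in relation_rows
    ]
    return vertices, edges

# Hypothetical structured data, mirroring the example in the text.
entities = [{"id": 1, "name": "Zhang San"},
            {"id": 2, "name": "Li Si"},
            {"id": 3, "name": "Wang Wu"}]
relations = [{"from_id": 1, "to_id": 2, "relation": "friend"},
             {"from_id": 2, "to_id": 3, "relation": "lends_to"}]

vertices, edges = table_to_graph(entities, relations)
```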
According to some embodiments, partitioning the plurality of data may include: hash-mapping each data item into a respective partition of the memory of the distributed computing engine based on the original storage address of that data item. Hash partitioning based on the original storage address amounts to controlling the mapping with the original storage address of the data, which is beneficial to the uniformity of the hash partitions. As one example, the distributed computing engine may be the Apache Spark computing engine, a memory-based distributed computing engine. In this case, partitioning the plurality of data may use, for example, Spark's internal hash partitioning algorithm.
According to some embodiments, the plurality of data may be loaded from a distributed storage database, and the original storage address of each data item includes the original partition address of that data item in the distributed storage database. Specifically, hash partitioning in the distributed computing engine based on the original storage shard address makes the data distribution across partitions more uniform and improves distribution performance.
For example, in the case where the original database from which the data is loaded is a distributed storage database (e.g., Hive), partitioning may be performed according to the data block address (i.e., the slice in Hive) where the data previously resided. For example, where the memory of the distributed computing engine is divided into 50 partitions, the previous data block address of each data item may be taken modulo 50, and the item thereby assigned to one of the memory partitions.
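A minimal sketch of this address-based assignment might look as follows. The addresses are hypothetical, and MD5 stands in for whatever hash the engine actually uses (Spark applies its own internal hash partitioner):

```python
import hashlib

def partition_index(block_address: str, num_partitions: int = 50) -> int:
    """Map an original data block (slice) address to a memory partition index.

    MD5 is used here only to get a deterministic hash; the real engine
    would apply its own partitioning function.
    """
    digest = hashlib.md5(block_address.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Hypothetical Hive slice addresses for 200 data blocks.
addresses = [f"hdfs://warehouse/table/part-{i:05d}" for i in range(200)]
assignment = [partition_index(a) for a in addresses]
```

Because the hash is driven by the original block address, blocks that were spread out in the source database tend to spread out across the in-memory partitions as well.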
According to some embodiments, establishing a connection to the graph database for each partition may include: establishing a connection instance to the graph database in each partition, wherein each connection instance is capable of invoking the computation engine of the graph database; and writing the plurality of data in the plurality of partitions into respective storage areas of the graph database may include: invoking a graph data processing operator of the graph database through the connection instance of each partition to convert the format of the data in each partition to a graph data format, and writing the converted data into a corresponding storage area of the graph database. Because each connection invokes the computation engine of the graph database to perform the write, the graph database can be written to in parallel. The graph database may, for example, support the TinkerPop Gremlin query language, which helps users quickly build applications based on highly connected datasets. In that case, the graph data processing operator may be a Gremlin operator.
According to some embodiments, the graph database may be HBase. HBase is a distributed, column-oriented open-source database. As a distributed storage database, HBase has the following mechanisms:
(1) RowKey: data in HBase is distributed by RowKey. The RowKey acts like a primary key in MySQL or Oracle, marking a unique row, and data in HBase is sorted according to the lexicographic order of the RowKey;
(2) Region: when data is written to HBase, it is written into different Regions (storage areas) according to HBase's columnar storage structure. When the data volume of a Region exceeds a certain size, the Region is automatically split into two different Regions; this is known as the HBase split mechanism.
In the case where the graph database is HBase, according to such an embodiment, the step of writing the plurality of data in the plurality of partitions into respective storage areas of the graph database may include: using the data identification of each data item as the RowKey for HBase, so that the data is mapped into the corresponding storage Region of HBase, i.e., different items map to different Regions. Using the original identification of the data as the RowKey on which HBase partitioning depends replaces the automatic RowKey distribution of existing HBase practice, so the partitioning behavior of HBase can be adjusted simply by controlling the data ID. The ID of the data may be globally unique. When data is written, HBase maps the ID of each piece of data to the corresponding RowKey and writes it to the corresponding Region. For example, the mapping of an ID to a RowKey may be accomplished by hashing the ID or taking it modulo some value, and determining which Region the remainder corresponds to.
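The modulo mapping from a globally unique data ID to a Region can be sketched as below; this is an illustration of the mechanism just described, not HBase's actual internal routing:

```python
def region_for_id(data_id: int, num_regions: int) -> int:
    """Map a data ID (used directly as the RowKey in this scheme) to a Region
    index by taking the remainder, as in the example above."""
    return data_id % num_regions

# With sequential globally unique IDs, the modulo mapping distributes
# data evenly across Regions.
num_regions = 4
counts = [0] * num_regions
for data_id in range(1000):  # hypothetical data IDs from the data warehouse
    counts[region_for_id(data_id, num_regions)] += 1
```

This illustrates why controlling the data ID controls the partitioning effect: an evenly spread ID space yields evenly loaded Regions.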
Since HBase partitioning depends on the RowKey mechanism, the uniformity of HBase partitions can be controlled by the choice of primary key. For example, by setting rules for the primary key of the data, the data can be prevented from becoming too scattered, and the data volume and file size of each partition after partitioning can be made as close to equal as possible.
For example, the data identification may remain unchanged from before the data is loaded into the distributed computing engine until it is written to the graph database. That is, the data identification, or data ID, may be the ID under which the data was originally stored, i.e., the ID the data had before being loaded into the distributed computing engine (e.g., as stored in an original database such as a data warehouse). The performance of the final partitioning can thus be controlled by preserving the data identification from the original database, making distributed storage more uniform and avoiding the computing resources and manual maintenance that computing an additional primary key would require. In other words, rules may be formulated so that the data ID in the data warehouse and the RowKey of HBase establish a correspondence, thereby completing the partitioning in HBase. It will be appreciated that other embodiments are possible; for example, the ID of the graph data may be generated by the graph database JanusGraph.
According to some embodiments, the initial number of storage areas of HBase may be determined based on the number of service nodes of HBase. Setting the initial Region count based on HBase's service nodes supplements HBase's own Region-splitting and data-balancing mechanism, allowing the number of storage partitions to be regulated even during early writes when the data volume is still small, thereby improving HBase's distributed storage performance. For example, the initial number of storage areas of HBase may be the maximum number of service nodes of HBase. Specifically, the initial number of Regions may be set to the number of Region servers of the HBase cluster. As one non-limiting example, in an HBase cluster of 50 machines, where 2 machines are masters and the other 48 are Region servers, the number of Regions may initially be set to 48, so as to take maximum advantage of HBase's distributed storage performance.
The final number of Regions in HBase may be related to the amount of data imported. As a non-limiting example, in the case where one Region is at most 10 GB and there is already 10 TB of data in the HBase table, the number of Regions is 10 × 1024 / 10 = 1024.
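The two sizing rules above reduce to simple arithmetic, using the example figures from the text:

```python
# Initial Region count: one Region per Region server.
# Example cluster: 50 machines, 2 of them masters, the other 48 Region servers.
machines, masters = 50, 2
initial_regions = machines - masters

# Final Region count: total data volume divided by the per-Region maximum.
total_data_gb = 10 * 1024  # 10 TB already in the HBase table
max_region_gb = 10         # each Region holds at most 10 GB
final_regions = total_data_gb // max_region_gb
```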
In addition, partitioning can be further optimized on top of the HBase partitioning mechanism. For example, when HBase already stores other data, the partitions are quite likely to be non-uniform; for instance, the data size of individual partitions may range from 100 MB to 1 GB. In this case, the data may be repartitioned by manually merging smaller files in HBase, for example merging all files below 500 MB into one large file, so that the data in HBase is more uniformly distributed.
According to some embodiments, the plurality of data may come from Hive, and Hive and HBase may be located in the same cluster, e.g., one consisting of multiple machines; in other words, Hive and HBase may be located on the same distributed file system. This facilitates hardware maintenance, and residing in the same distributed file system benefits distributed storage. For example, the distributed file system may be the Hadoop Distributed File System (HDFS). Hive is a data warehouse analysis system built on Hadoop; it provides rich SQL query capabilities for analyzing data stored in the Hadoop distributed file system, with good flexibility and extensibility for data operations. It will be appreciated that the original database and the target graph database may take other forms, as long as data can be loaded from the original database and written to the graph database.
According to some embodiments, the number of the plurality of partitions of the memory of the distributed computing engine may be determined according to a data amount of the plurality of data and a cluster computing capability of the distributed computing engine. Therefore, the partition number can be adjusted based on the data volume and the computing capacity, so that the optimal effect is achieved.
According to some embodiments, the number of partitions may be set equal to the parallelism of the distributed computing engine. Making the number of partitions the same as the parallelism maximizes the parallel computing power. An example follows: in a scenario where the data volume is 1 million records, the cluster has 25 machines, each machine has two cores (i.e., a parallelism of 50), and the cluster memory is 50 TB, a typical choice for the number of partitions is 50.
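The partition count in this example follows directly from the cluster figures:

```python
# Example cluster from the text: 25 machines with 2 cores each.
machines = 25
cores_per_machine = 2

# Degree of parallelism = total cores; match partitions to it so every
# core has exactly one partition to process at a time.
parallelism = machines * cores_per_machine
num_partitions = parallelism
```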
In other examples, the computations of the partitions may run in parallel first and then in series, for example where the number of partitions is greater than the parallelism; the disclosure is not limited in this respect. For example, a larger number of partitions may be chosen when the cluster has fewer machines, e.g., 10. The number of partitions might then be 10, 100, and so on. The purpose of adding partitions in this case is to keep the amount of data computed by each partition in a single pass small, reducing computation errors and making performance more stable.
According to some embodiments, the plurality of Data may come from a cleaned Data layer of a Data Warehouse (DW or DWH). The data to be processed is subjected to data cleaning, so that the accuracy of data writing is ensured. As already mentioned above, the data warehouse may employ a distributed storage architecture, which may be Hive, for example.
According to some embodiments, the data processing method may further include: establishing a connection to an index database for each partition; and writing the indexes of the plurality of data in the plurality of partitions into corresponding storage areas of the index database according to the connection of each of the plurality of partitions to the index database. Writing of the data index is thus also done per partition in the distributed computing engine, which can further increase computational efficiency. For example, the index database may be Elasticsearch (ES), a retrieval framework that supports distribution.
In the prior art, the process of synchronizing data from a data warehouse to a knowledge graph does not support distributed computation; it suffers from the single-node problem, and data storage efficiency falls far short of expectations. For scenarios with very large data volumes, such as entering three billion records, performance and speed are quite poor. Through the embodiments described above, the data storage performance of the graph database can be improved; in particular, the shortcomings of data synchronization under single-machine deployment with massive data can be overcome, especially for the requirement of rapidly loading billions of records into the graph database. According to the distributed data processing method of the embodiments of the present disclosure, a large amount of data, including data from databases of different sources, can be synchronized into one database.
Although the operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, nor that all illustrated operations be performed, to achieve desirable results. For example, the steps of the above-described embodiments may be performed in a different order or in parallel, or one or more of the steps may be omitted.
The data flow in a data processing method (e.g., method 200 or a variation thereof) according to an embodiment of the present disclosure is described below in conjunction with fig. 3. For example, the system may include a database (which may be referred to as the original database) 310, a distributed computing engine 320, and a graph database (alternatively referred to as the target database) 330. The distributed computing engine 320 reads the raw data from the original database 310 (e.g., corresponding to step S201). For example, the distributed computing engine 320 may be the Spark computing engine, and the original database 310 may be Hive, MySQL, or the like. Subsequently, the distributed computing engine 320 partitions the data into a plurality of memory partitions 321-325 (e.g., corresponding to step S202), for example using a hash algorithm or another data partitioning method. It is to be understood that although 5 memory partitions are shown here, this is merely an example and the disclosure is not limited thereto.
In each of the partitions 321-325, the relevant graph connection (i.e., a connection to the graph database) is obtained within the partition. This step may be accomplished, for example, by reading a configuration file of the target graph database, such as a configuration file already stored on HDFS. Thereafter, the write from the memory partition to the corresponding storage area of the graph database is completed in each partition. For example, as shown in FIG. 3, data in memory partitions 321 and 322 is written to a first storage area 331 in the graph database, and data in memory partitions 323-325 is written to a second storage area 332 in the graph database.
It will be appreciated that this number of storage areas and this storage-area allocation are merely examples. For example, the graph database may have more storage areas: data in memory partitions 321 and 323 might be written to a first storage area of the graph database, data in memory partitions 322 and 324 to a second storage area, and data in memory partition 325 to a third storage area (not shown), and so on. Alternatively, data in one memory partition may be written into storage areas of different graph databases; the present disclosure does not limit this. In addition, as described above, in the case where HBase is employed as the graph database, the splitting of storage area 332 into storage areas 3321 and 3322 may be performed by the graph database itself.
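The per-partition flow described above (obtain a graph connection inside each partition, then write that partition's data through it) can be outlined as follows. GraphClient and every name in this sketch are stand-ins, not an actual graph-database API; in Spark this shape corresponds to the function passed to foreachPartition:

```python
class GraphClient:
    """Stand-in for a per-partition connection to the graph database."""
    def __init__(self):
        self.written = []

    def write(self, record):
        # A real client would convert the record to graph format and
        # send it over the connection (e.g., via a graph operator).
        self.written.append(record)

    def close(self):
        pass  # a real client would release the connection here

def write_partition(records):
    """Executed once per memory partition: open one connection,
    write every record in the partition, then close the connection."""
    client = GraphClient()  # connection established inside the partition
    try:
        for rec in records:
            client.write(rec)
    finally:
        client.close()
    return len(client.written)

# Two hypothetical memory partitions of three records each.
partitions = [[{"id": i} for i in range(start, start + 3)] for start in (0, 3)]
written_counts = [write_partition(p) for p in partitions]
```

Opening the connection inside the partition function (rather than on the driver) is what lets each partition write independently, and hence in parallel.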
A data processing method 400 according to another embodiment of the present disclosure is described below in conjunction with fig. 4. The data processing method 400 may be viewed as generally consisting of three phases, namely partitioning the raw data, establishing a connection to the same graph database in each partition, and performing the associated data-insertion operations in each partition, as described in more detail below.
Referring to FIG. 4, at step S401, raw data to be imported into a graph database is loaded in a distributed computing engine. As described above, the distributed computing engine may be Spark. The raw data may be stored in Hive. Hive may be located in HDFS (distributed file system) and the target graph database HBase may also be in HDFS. In other words, Hive and HBase may be located within the same cluster, e.g., consisting of multiple machines.
For example, where the distributed computing engine is Spark, upon determining that the input source is a certain node-entity or relationship table in Hive, Spark reads the data (e.g., cleaned data) in Hive and saves it into its own memory. The data to be imported into the graph database is originally stored as structured data (e.g., in the form of a table). Data in the data warehouse is stored in layers; for example, data of an ODS layer (original data) is cleaned to form a detail layer. In some cases, the cleaned detail data layer in the data warehouse from which the knowledge graph is generated may be referred to simply as the detail layer.
At step S402, data (e.g., raw data) is partitioned such that the plurality of data is distributed into a plurality of partitions of the memory of the distributed computing engine. For example, the memory may be partitioned by using a hash algorithm of Spark. As already described above, the number of partitions may be set by a user, or the system may set it based on certain conditions (e.g., cluster computing power). For example, setting repartitionNum = 5 means dividing the data into 5 memory partitions. The data in each partition is distributed to a corresponding executor for execution.
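The hash-based partitioning of step S402 can be sketched in plain Python. This standalone sketch mirrors the spirit of Spark's HashPartitioner (a non-negative modulo of the key's hash); the record keys and the repartition value of 5 are illustrative assumptions matching the example above:

```python
# Standalone sketch of hash-based partitioning, similar in spirit to Spark's
# HashPartitioner. REPARTITION_NUM = 5 mirrors the repartitionNum = 5
# example above; the record keys are illustrative assumptions.

REPARTITION_NUM = 5

def partition_of(key, num_partitions=REPARTITION_NUM):
    """Pick the memory partition for a record key via hash modulo."""
    return hash(key) % num_partitions

def split_into_partitions(records, num_partitions=REPARTITION_NUM):
    """Distribute (key, value) records into num_partitions buckets."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return partitions

records = [(i, f"row-{i}") for i in range(20)]
parts = split_into_partitions(records)
assert sum(len(p) for p in parts) == len(records)  # every record lands in exactly one partition
```

In the actual method, each such bucket would then be handed to a separate executor for the per-partition steps that follow.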
At step S403, a connection to the graph database, e.g., to HBase, is established for each partition. The "connection" corresponds to a key that opens the graph database and may take the form of, for example, a Java object. Establishing a connection may also sometimes be referred to as opening a connection or initializing a connection instance, etc., and the disclosure is not limited in this respect. Each partition establishes a different connection instance, but the instances may point to the same graph database (e.g., HBase). HBase is particularly advantageous because it supports highly concurrent distributed operations. Of course, other databases are also suitable.
Establishing a connection includes obtaining input parameters set by a user, including, for example, a database identification (e.g., whether the type of the database is HBase or another database) and a database configuration (e.g., the IP address, port, initialization partitions, etc. of the target database). Next, a connection instance to the target graph database (e.g., HBase) is established (e.g., initialized). Each partition establishes a separate graph database connection instance; therefore, in the next step, different partitions can call the graph database for storage simultaneously or concurrently, and the storage results of different partitions can be distributed across different storage locations of the graph database.
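The per-partition connection pattern described above (each partition initializing its own connection instance to the same target database) can be sketched as follows. `GraphConnection` is a stub standing in for a real HBase/JanusGraph connection, and all names and the configuration values are hypothetical; in a real Spark job the body of `process_partition` would run inside `foreachPartition` on each executor:

```python
# Sketch of the "one connection instance per partition" pattern. All names
# are hypothetical; GraphConnection is a stub, not a real client library.

class GraphConnection:
    """Stub standing in for e.g. a JanusGraph/HBase connection instance."""
    _instances = 0

    def __init__(self, host, port):
        self.host, self.port = host, port
        GraphConnection._instances += 1  # each partition opens its own instance

    def write(self, record):
        # A real client would invoke the graph database's add operation here.
        return ("stored", record)

def process_partition(partition_records, db_config):
    conn = GraphConnection(**db_config)  # connection opened inside the partition
    return [conn.write(r) for r in partition_records]

config = {"host": "hbase.example.internal", "port": 2181}  # illustrative values
partitions = [["a", "b"], ["c"], ["d", "e"]]
results = [process_partition(p, config) for p in partitions]
assert GraphConnection._instances == 3  # one independent instance per partition
```

Because each instance is independent, the partitions can call the (same) target database concurrently, which is the point of this step.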
At step S404, a connection is established to an index database (e.g., ES) for each partition. The connection establishment procedure to the index database is similar to the connection establishment procedure to the graph database and will not be described in detail here. ES databases support highly concurrent distributed operations and are therefore particularly advantageous in distributed computing. Of course, other databases are also suitable.
At step S405, the plurality of data in the plurality of partitions is written into respective storage areas of the graph database according to the connection of each of the plurality of partitions to the graph database. After the connection instance for the graph database is successfully acquired in each partition (for example, after the HBase instance is initialized), the graph database is operated through the connection instance of the current partition (for example, by calling the JanusGraph server) to invoke the insert (add) operation of HBase (specifically, the data-insertion operation of JanusGraph). This conversion of structured data into unstructured data is performed by the graph database, in particular by a graph processing operator (e.g., a Gremlin operator). Different APIs may implement add, delete, query, and update operations on the graph; at this point, the add operation may be invoked to implement the write.
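The conversion of a structured (table) row into a graph-style record before the add operation might look like the following sketch; the field names and the output shape are illustrative assumptions, not the patent's actual schema:

```python
# Hedged sketch: convert a structured row into an unstructured, graph-style
# vertex record prior to the insert call. Field names are assumptions.

def row_to_vertex(row):
    """Turn a structured (table) row into a vertex record."""
    return {
        "id": row["id"],  # globally unique data ID, kept unchanged from the source
        "label": row.get("label", "entity"),
        "properties": {k: v for k, v in row.items() if k not in ("id", "label")},
    }

row = {"id": 42, "label": "person", "name": "Alice"}
vertex = row_to_vertex(row)
assert vertex["id"] == 42 and vertex["properties"] == {"name": "Alice"}
```

A real implementation would hand such a record to the graph database's add API rather than build a plain dictionary.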
The stored data can be identified by a data ID. For example, the data ID may be the globally unique ID that each piece of data previously had in Hive; the ID of the data remains unchanged in HBase, and when the data is queried later, the ID may be used as the primary index. In some embodiments, the data ID may be the partition primary key in a distributed graph database.
At step S406, the plurality of data in the plurality of partitions is written into respective storage areas of the index database according to the connection of each of the plurality of partitions to the index database. In each partition, the index-related fields are written to the index database ES. What needs to be stored in the ES may be a secondary index, e.g., fuzzy-query fields such as person name, company name, etc.
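Extracting only the index-related (secondary-index) fields for the ES write, while the full record goes to the graph database, can be sketched as follows; the field list is an illustrative assumption (fuzzy-query fields such as person or company name):

```python
# Sketch: only index-related fields are written to the ES-like index store;
# the field list INDEXED_FIELDS is an illustrative assumption.

INDEXED_FIELDS = ("person_name", "company_name")

def to_index_doc(record):
    """Build the secondary-index document for a stored record."""
    return {"id": record["id"],
            **{f: record[f] for f in INDEXED_FIELDS if f in record}}

record = {"id": 7, "person_name": "Alice", "balance": 100}
doc = to_index_doc(record)
assert doc == {"id": 7, "person_name": "Alice"}  # "balance" is not an indexed field
```

The `id` is kept in the index document so a fuzzy match in ES can be resolved back to the full record in the graph database by its primary index.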
After all partitions have finished running, the distributed computing engine Spark may be shut down and the task ends. At this point, the data has been written to the graph database, the corresponding index has been written to the index database, and both can be read or queried at a later time.
It will be appreciated that, although the operations are depicted in the drawings in a particular order, this should not be construed as requiring that such operations be performed in the particular order shown or in sequential order, nor that all illustrated operations must be performed to achieve desirable results. For example, step S404 may be performed before step S403, or concurrently with step S403. For another example, step S406 may be performed before step S405, or concurrently with step S405. As yet another example, steps S404 and S406 may even be omitted.
The data flow in a data processing method (e.g., method 400 or a variation thereof) according to an embodiment of the present disclosure is described below in conjunction with fig. 5. For example, the system may include a database (which may be referred to as a raw database) 510, a distributed computing engine 520, a graph database (alternatively referred to as a target database) 530, and an index database 540. The raw database 510, the distributed computing engine 520, and the target database 530 may be similar to the raw database 310, the distributed computing engine 320, and the graph database 330 described in conjunction with fig. 3, and similar descriptions will be omitted. It is to be appreciated that while 4 memory partitions 521-524 are shown here, this is merely an example and the disclosure is not so limited. Taking partition 521 as an example, in addition to establishing a connection to the graph database 530 in the partition, a connection to the index database 540 may be established. It will be readily appreciated that the order of establishing these two connections is not limited, and they may even be established in parallel. Each partition (including any greater number of partitions not shown in the figure) may establish both a connection to the graph database and a connection to the index database. Different partitions use different connection instances, but these connection instances may point to the same graph database or index database.
Table 1 gives the acceleration effect of a data processing method according to an embodiment of the present disclosure on graph-database import. The acceleration ratio is the ratio of the single-machine import time to the distributed import time, and is an intuitive reference value for comparing the effects.
TABLE 1

Case | Data volume | Single-machine import (hours) | Distributed import (hours) | Acceleration ratio
1 | 100w (1 million) | 1.2 | 0.05 | 24
2 | 1000w (10 million) | 12 | 0.1 | 120
3 | 1E (100 million) | 120 | 0.8 | 150
4 | 10E (1 billion) | 1200 | 7.5 | 160
In each of the above cases 1 to 4, Spark is used as the distributed computing engine, the number of Spark memory partitions is set to 100, 50 executors are allocated in the cluster, and each executor starts 2 CPU cores to perform the import. Thus the concurrency and the number of partitions are both 100; that is, with the number of executors numExecutors = 50 and the number of executor cores = 2, the concurrency is numExecutors × executorCores = 100.
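The concurrency and acceleration-ratio figures above can be checked with simple arithmetic; the numbers are taken from Table 1 and this paragraph:

```python
# Simple arithmetic check of the concurrency and acceleration figures;
# numbers come from Table 1 and the paragraph above.

num_executors, executor_cores = 50, 2
concurrency = num_executors * executor_cores
assert concurrency == 100  # matches the 100 Spark memory partitions

def acceleration_ratio(single_machine_hours, distributed_hours):
    """Acceleration ratio = single-machine import time / distributed import time."""
    return single_machine_hours / distributed_hours

# Cases 1-4 from Table 1 (rounded, since the inputs are decimal hours):
assert round(acceleration_ratio(1.2, 0.05)) == 24
assert round(acceleration_ratio(12, 0.1)) == 120
assert round(acceleration_ratio(120, 0.8)) == 150
assert round(acceleration_ratio(1200, 7.5)) == 160
```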
Before the method 200 or 400 is started, the operating environment and input parameters of the method may be configured. Such configuration may be accomplished, for example, on the client side (e.g., client 110 described in connection with fig. 1). For example, configuration inputs may include:
data source parameters: such as the Hive address, port, and corresponding data identification, etc. For example, a person-information table (for creating the entity vertices "person") or the like. For example, where the original database is MySQL, the data source may include a MySQL address, username, password, and the like. Alternatively, where the original database is Hive, the data source may include a Hive address or the like.
Target storage parameters: the target storage location may include a base configuration of a graph database (e.g., HBase) address, an index database (e.g., ES) address, and a corresponding database port, among other things.
The operating parameters: the operating parameters may include, for example, the number of Spark partitions and the degree of parallelism (how many executors are allocated and the number of cores per executor), and the like. For example, when a package corresponding to the method is run in a data factory (DataFactory), the operating parameters may be referred to as the DataFactory configuration.
These parameters may be entered in any suitable manner, for example using a file configuration manager. The settings may also be made in other ways, for example in the form of an uploaded configuration file.
For example, the method 200 or 400 may be performed by the server side after the above configuration is completed. Alternatively, the method 200 or 400 may be performed by a server in conjunction with a client, or the above-described configuration may occur at any suitable time node after the method 200 or 400 begins.
A method 600 of configuring and operating a data processing method according to an embodiment of the present disclosure is described below in conjunction with fig. 6. The method 600 may include a pre-configuration step before each run: the program files for the above steps may be packaged into a jar package (for example, referred to as a distributed data import jar package), the information (name, port, address, etc.) of the graph library to be created is written into a configuration file, the configuration file is then uploaded onto the file system HDFS, and a task is configured on the task management platform, thereby starting execution of the task. The method 600 may be performed on the client side, such as on the client application 112 shown in FIG. 1. Alternatively, the method 600 may be performed in part on the client side and in part on the server side.
At S601, database configuration information is acquired. Specifically, a configuration file may be generated by a user, generated automatically, or read from a pre-generated file (or from configuration information stored in other forms), in which the graph database base configuration to be used is stored. For example, the configuration file may include the following information:
storage.hostname stores the domain name address of the backend HBase;
storage.hbase.table stores the table name corresponding to the backend HBase;
index.search.hostname stores the domain name address of the backend ES; and
index.search.index-name stores the index name corresponding to the backend ES.
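Assuming the configuration follows the JanusGraph-style property-key convention suggested by the key names above, a hypothetical configuration file might look like the following; all values are placeholders, not the disclosed deployment:

```
# Hypothetical JanusGraph-style configuration; values are placeholders.
storage.backend=hbase
storage.hostname=hbase.example.internal
storage.hbase.table=example_graph
index.search.backend=elasticsearch
index.search.hostname=es.example.internal
index.search.index-name=example_graph_index
```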
At S602, configuration information and a package are loaded. For example, configuration files and packages (e.g., generated jar packages) may be uploaded onto the HDFS.
At S603, execution parameters of the task are configured. For example, the execution of the task may be configured in a runtime environment or on a task management platform. A task may be for performing the method 200 or 400 described above or a variant thereof. For example, the operating environment may be a data factory. At this step, the operating parameters as described above may be set. The operating parameters may include parameters for the distributed data import data source (e.g., corresponding parameters for the Hive data source, spark.querysql). The operating parameters may also include parameters required for the partitions in the distributed computing engine (e.g., the execution memory size per executor, the number of executor cores, and the number of executor processes numExecutors; from these parameters, the parallelism may be calculated).
At S604, the configured task is invoked to perform synchronization of data, including the steps of implementing the method 200 or 400 described above, or variations thereof, and so forth.
The method disclosed herein can be applied to knowledge-graph scenarios in various fields. For example, consider account nodes and fund relationships in the financial field as an example of a knowledge graph. The data volumes of account nodes and fund relationships account for the largest proportion of data in the financial field, and they are therefore among the most important entities and relationships in financial-field knowledge-graph analysis. Since a fund relationship depends on the accounts already being stored, an import sequence of first importing the account nodes and then importing the fund relationships may be used. It is to be understood that the present disclosure is not limited to such application scenarios.
Fig. 7 is a schematic block diagram illustrating a data processing apparatus 700 according to an example embodiment. The data processing apparatus 700 may include a data loading unit 701, a data partitioning unit 702, a connection establishing unit 703, and a data writing unit 704. The data loading unit 701 may be configured to load a plurality of data into the memory of the distributed computing engine. The data partitioning unit 702 may be configured to partition the plurality of data such that the plurality of data is distributed into a plurality of partitions of the memory of the distributed computing engine. The connection establishing unit 703 may be configured to establish a connection to the graph database for each partition. The data writing unit 704 may be configured to write a plurality of data in a plurality of partitions into respective storage areas of a graph database according to a connection of each of the plurality of partitions to the graph database.
It should be understood that the various modules of the apparatus 700 shown in fig. 7 may correspond to the various steps in the method 200 described with reference to fig. 2. Thus, the operations, features and advantages described above with respect to method 200 are equally applicable to apparatus 700 and the modules included therein. Certain operations, features and advantages may not be described in detail herein for the sake of brevity.
Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein may be divided into multiple modules and/or at least some of the functionality of multiple modules may be combined into a single module. Performing an action by a particular module discussed herein includes the particular module itself performing the action, or alternatively the particular module invoking or otherwise accessing another component or module that performs the action (or performs the action in conjunction with the particular module). Thus, a particular module that performs an action can include the particular module that performs the action itself and/or another module that the particular module invokes or otherwise accesses that performs the action. For example, different modules may be combined into a single module in some embodiments, or a single module may be split into different modules. As used herein, the phrase "entity a initiates action B" may refer to entity a issuing instructions to perform action B, but entity a itself does not necessarily perform that action B.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to fig. 7 may be implemented in hardware or in hardware in combination with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the data loading unit 701, the data partitioning unit 702, the connection establishing unit 703, and the data writing unit 704 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip (which includes one or more components of a Processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry), and may optionally execute received program code and/or include embedded firmware to perform functions.
According to an aspect of the disclosure, a computing device is provided that includes a memory, a processor, and a computer program stored on the memory. The processor is configured to execute the computer program to implement the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of any of the method embodiments described above.
Illustrative examples of such computer devices, non-transitory computer-readable storage media, and computer program products are described below in connection with FIG. 8.
Fig. 8 illustrates an example configuration of a computer device 800 that may be used to implement the methods described herein. For example, the server 120 and/or the client device 110 shown in fig. 1 may include an architecture similar to the computer device 800. The data processing device/apparatus described above may also be implemented in whole or at least in part by a computer device 800 or similar device or system.
Computer device 800 may be a variety of different types of devices, such as a server of a service provider, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computer device or computing system. Examples of computer device 800 include, but are not limited to: a desktop computer, a server computer, a notebook or netbook computer, a mobile device (e.g., a tablet, a cellular or other wireless telephone (e.g., a smartphone), a notepad computer, a mobile station), a wearable device (e.g., glasses, a watch), an entertainment device (e.g., an entertainment appliance, a set-top box communicatively coupled to a display device, a gaming console), a television or other display device, an automotive computer, and so forth. Thus, the computer device 800 may range from a full resource device with substantial memory and processor resources (e.g., personal computers, game consoles) to a low-resource device with limited memory and/or processing resources (e.g., traditional set-top boxes, hand-held game consoles).
The computer device 800 may include at least one processor 802, memory 804, communication interface(s) 806, display device 808, other input/output (I/O) devices 810, and one or more mass storage devices 812, which may be capable of communicating with each other, such as through a system bus 814 or other appropriate connection.
Processor 802 may be a single processing unit or multiple processing units, all of which may include single or multiple computing units or multiple cores. The processor 802 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitry, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 802 may be configured to retrieve and execute computer-readable instructions stored in the memory 804, mass storage device 812, or other computer-readable medium, such as program code for an operating system 816, program code for an application 818, program code for other programs 820, and so forth.
Memory 804 and mass storage device 812 are examples of computer-readable storage media for storing instructions that are executed by processor 802 to implement the various functions described above. By way of example, the memory 804 may generally include both volatile and non-volatile memory (e.g., RAM, ROM, etc.). In addition, mass storage device 812 may generally include a hard disk drive, solid state drive, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), storage arrays, network attached storage, storage area networks, and the like. Memory 804 and mass storage device 812 may both be referred to herein collectively as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by processor 802 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 812. These programs include an operating system 816, one or more application programs 818, other programs 820, and program data 822, and may be loaded into memory 804 for execution. Examples of such applications or program modules may include, for instance, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: client application 112, method 200, method 400, and/or method 600 (including any suitable steps of methods 200, 400, and 600), and/or further embodiments described herein.
Although illustrated in fig. 8 as being stored in memory 804 of computer device 800, modules 816, 818, 820, and 822, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computer device 800. As used herein, "computer-readable media" includes at least two types of computer-readable media, namely computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device.
In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Computer storage media, as defined herein, does not include communication media.
The computer device 800 may also include one or more communication interfaces 806 for exchanging data with other devices, such as over a network, direct connection, and so forth, as previously discussed. Such communication interfaces may be one or more of the following: any type of network interface (e.g., a network interface card (NIC)), a wired or wireless (such as IEEE 802.11 wireless LAN (WLAN)) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc. The communication interface 806 may facilitate communication within a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 806 may also provide for communication with external storage devices (not shown), such as in storage arrays, network attached storage, storage area networks, and the like.
In some examples, a display device 808, such as a monitor, may be included for displaying information and images to a user. Other I/O devices 810 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so forth.
While the disclosure has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative and exemplary and not restrictive; the present disclosure is not limited to the disclosed embodiments. Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps than those listed and the words "a" or "an" do not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (17)

1. A method of data processing, comprising:
loading a plurality of data into a memory of a distributed computing engine;
partitioning the plurality of data such that the plurality of data is distributed into a plurality of partitions of a memory of the distributed computing engine;
establishing a connection to a graph database for each partition; and
writing a plurality of data in the plurality of partitions into respective storage areas of the graph database according to a connection of each of the plurality of partitions to the graph database.
2. The method of claim 1, wherein the writing of the plurality of data in the plurality of partitions to the respective storage areas of the graph database is a parallel writing.
3. The method of claim 1, wherein the plurality of data is structured stored data.
4. The method of claim 1, wherein partitioning the plurality of data comprises: hashing each data item of the plurality of data into a respective partition of the memory of the distributed computing engine based on an original storage address of each data item.
5. The method of claim 4, wherein the plurality of data is loaded from a distributed storage database and the original storage address for each data comprises an original partition address for each data in the distributed storage database.
6. The method according to any of claims 1-5, wherein establishing a connection to a graph database for each partition comprises: establishing a connection instance to the graph database in each partition, wherein each connection instance is capable of invoking a compute engine of the graph database; and is
Wherein writing the plurality of data in the plurality of partitions into respective storage areas of the graph database comprises: invoking a graph data processing operator of the graph database through the connected instance of each partition to convert the format of the data in each partition to a graph data format, and writing the converted data into a corresponding storage area of the graph database.
7. The method according to any of claims 1-5, wherein the graph database is an HBase, and wherein writing the plurality of data in the plurality of partitions into respective storage areas of the graph database comprises:
the data identification of each data item is used as a row key of the HBase so that the data is mapped into a corresponding storage area of the HBase.
8. The method according to claim 7, wherein the initial number of storage areas of the HBase is determined based on the number of service nodes of the HBase.
9. The method of claim 7, wherein the plurality of data is from Hive and the HBase are located in the same cluster.
10. The method of any of claims 1-5, wherein the number of the plurality of partitions of the memory of the distributed computing engine is determined according to a data volume of the plurality of data and a cluster computing capability of the distributed computing engine.
11. The method of claim 10, wherein the number of partitions is set equal to a parallelism of the distributed computing engine.
12. The method of any of claims 1-5, wherein the plurality of data is from a cleaned data layer of a data warehouse.
13. The method of any of claims 1-5, further comprising:
establishing a connection to an index database for each partition; and
writing an index of the plurality of data in the plurality of partitions into a corresponding storage area of the index database according to the connection of each of the plurality of partitions to the index database.
14. A data processing apparatus comprising:
a data loading unit configured to load a plurality of data into a memory of a distributed computing engine;
a data partitioning unit configured to partition the plurality of data such that the plurality of data is distributed into a plurality of partitions of a memory of the distributed computing engine;
a connection establishing unit configured to establish a connection to the graph database for each partition; and
a data writing unit configured to write a plurality of data in the plurality of partitions into respective storage areas of the graph database according to a connection of each of the plurality of partitions to the graph database.
15. A computing device, comprising:
a memory, a processor, and a computer program stored on the memory,
wherein the processor is configured to execute the computer program to implement the steps of the method of any one of claims 1-13.
16. A non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the steps of the method of any of claims 1-13.
17. A computer program product comprising a computer program, wherein the computer program realizes the steps of the method of any one of claims 1-13 when executed by a processor.
CN202110245792.9A 2021-03-05 2021-03-05 Data processing method and device, computing equipment and storage medium Pending CN112905854A (en)

Publications (1)

Publication Number Publication Date
CN112905854A true CN112905854A (en) 2021-06-04

Family

ID=76108236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110245792.9A Pending CN112905854A (en) 2021-03-05 2021-03-05 Data processing method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112905854A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656369A * 2021-08-13 2021-11-16 辽宁华盾安全技术有限责任公司 Log distributed streaming acquisition and calculation method in big data scene
TWI805302B * 2021-09-29 2023-06-11 慧榮科技股份有限公司 Method and computer program product and apparatus for programming data into flash memory
US11860775B2 2021-09-29 2024-01-02 Silicon Motion, Inc. Method and apparatus for programming data into flash memory incorporating with dedicated acceleration hardware
US11966604B2 2021-09-29 2024-04-23 Silicon Motion, Inc. Method and apparatus for programming data arranged to undergo specific stages into flash memory based on virtual carriers
US11972150B2 2021-09-29 2024-04-30 Silicon Motion, Inc. Method and non-transitory computer-readable storage medium and apparatus for programming data into flash memory through dedicated acceleration hardware
CN114925123A * 2022-04-24 2022-08-19 杭州悦数科技有限公司 Data transmission method between distributed graph database and graph computing system
CN114925123B * 2022-04-24 2024-06-07 杭州悦数科技有限公司 Data transmission method between distributed graph database and graph computing system
CN117708384A * 2024-01-31 2024-03-15 中电云计算技术有限公司 Graph data storage method, device, equipment and storage medium based on JanusGraph

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391957A (en) * 2014-12-01 2015-03-04 浪潮电子信息产业股份有限公司 Data interaction analysis method for hybrid big data processing system
CN107526546A (en) * 2017-08-25 2017-12-29 深圳大学 A kind of Spark distributed computational datas processing method and system
CN107885779A (en) * 2017-10-12 2018-04-06 北京人大金仓信息技术股份有限公司 A kind of method of Spark concurrent accesses MPP databases
US20180150521A1 (en) * 2016-11-28 2018-05-31 Sap Se Distributed joins in a distributed database system
CN108334532A (en) * 2017-09-27 2018-07-27 华南师范大学 A kind of Eclat parallel methods, system and device based on Spark
CN110110108A (en) * 2019-04-09 2019-08-09 苏宁易购集团股份有限公司 A kind of data lead-in method and device of chart database
CN111090645A (en) * 2019-10-12 2020-05-01 平安科技(深圳)有限公司 Data transmission method and device based on cloud storage and computer equipment
CN111159235A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Data pre-partition method and device, electronic equipment and readable storage medium



Similar Documents

Publication Publication Date Title
CN112905854A (en) Data processing method and device, computing equipment and storage medium
US10664331B2 (en) Generating an application programming interface
US10169433B2 (en) Systems and methods for an SQL-driven distributed operating system
US9722873B2 (en) Zero-downtime, reversible, client-driven service migration
US8984516B2 (en) System and method for shared execution of mixed data flows
US10157214B1 (en) Process for data migration between document stores
US9426219B1 (en) Efficient multi-part upload for a data warehouse
US9459897B2 (en) System and method for providing data analysis service in cloud environment
US20120159479A1 (en) Providing a persona-based application experience
US9747314B2 (en) Normalized searchable cloud layer
US10909086B2 (en) File lookup in a distributed file system
US9298373B2 (en) Placement and movement of sub-units of a storage unit in a tiered storage environment
US9483493B2 (en) Method and system for accessing a distributed file system
US20140059094A1 (en) Making use of a file path to determine file locality for applications
US10417192B2 (en) File classification in a distributed file system
US11704327B2 (en) Querying distributed databases
CN112905596B (en) Data processing method, device, computer equipment and storage medium
US11704150B2 (en) Systems and methods for dynamic job performance in secure multiparty computation
US9779177B1 (en) Service generation based on profiled data objects
US10146791B2 (en) Open file rebalance
US11727022B2 (en) Generating a global delta in distributed databases
US20150134828A1 (en) Infrastructure migration tool
US11354312B2 (en) Access-plan-based querying for federated database-management systems
US11016946B1 (en) Method and apparatus for processing object metadata
US11520781B2 (en) Efficient bulk loading multiple rows or partitions for a single target table

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination