CN113868485A

CN113868485A - Data loading method and device

Info

Publication number: CN113868485A
Application number: CN202111126987.8A
Authority: CN
Inventors: 胡海波; 丛新法; 徐茂红; 毛聪; 王飞; 李团结; 晏学义; 张世波; 叶浩; 潘迎超
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2021-12-31

Abstract

The application provides a data loading method and a data loading device. The method comprises the following steps: acquiring an accurate query condition and a fuzzy query condition; determining a first query code corresponding to the accurate query condition and a second query code corresponding to the fuzzy query condition, wherein the first query code and the second query code are query codes corresponding to a preset database; performing query processing in the preset database according to the first query code, and performing query processing in the preset database according to the second query code, to obtain target data; and loading the target data to a preset flow table. Because data queries are supported through both the accurate query condition and the fuzzy query condition, the capability of the preset flow table to query stream data in the preset database is improved.

Description

Data loading method and device

Technical Field

The present application relates to computer technologies, and in particular, to a data loading method and apparatus.

Background

As a distributed data stream computing system, Flink can execute arbitrary data programs in a data-parallel and pipelined manner. Before data is processed with a Flink flow table, loading data from the data storage system Hbase into the Flink flow table is one of the important operations.

In the related art, Flink usually queries Hbase using an accurate query condition and loads the queried data into the Flink flow table. Hbase is adapted only to the accurate query condition; that is, only the data corresponding to the accurate query condition can be queried from Hbase.

However, when stream data is added from Hbase to a Flink flow table in the prior art, the stream processing system Flink can query the stream data in Hbase only in the single query mode of accurate query conditions, so the query capability for stream data is poor.

Disclosure of Invention

The application provides a data loading method and device, which are used for overcoming the problem that the range of data that can be loaded is limited.

In a first aspect, the present application provides a data loading method, applied to a server, including:

acquiring an accurate query condition and a fuzzy query condition;

determining a first query code corresponding to the accurate query condition and a second query code corresponding to the fuzzy query condition, wherein the first query code and the second query code are query codes corresponding to a preset database;

performing query processing in the preset database according to the first query code, and performing query processing in the preset database according to the second query code to obtain target data;

and loading the target data to a preset flow table.

In one possible design, the determining the first query code corresponding to the precise query condition includes:

determining a computational expression corresponding to the accurate query condition, wherein the computational expression is used for determining the position of the data to be queried in the preset database;

and converting the calculation expression into a code in a preset format to obtain the first query code.

In one possible design, the determining the computational expression corresponding to the precise query condition includes:

obtaining a syntax tree corresponding to the accurate query condition, wherein the syntax tree comprises a plurality of function nodes, and the function nodes are elements in the accurate query condition;

acquiring the type of each function node in the syntax tree;

and determining the calculation expression in the plurality of function nodes according to the type of each function node in the syntax tree and the number of the function parameters corresponding to each function node.

In one possible design, the determining the computational expression in the plurality of function nodes according to the type of each function node in the syntax tree and the number of function parameters corresponding to each function node includes:

traversing each function node in the syntax tree, executing a push operation on each function node of the first type and executing a pop operation on each function node of the second type until the traversal of each function node in the syntax tree is completed, and determining a sub-function node of the stack top element in the first stack as the computational expression;

wherein the push operation comprises: putting the function node of the first type into a first stack, and putting the number of function parameters corresponding to the function node of the first type into a second stack;

the pop operation comprises: acquiring the number N at the stack top of the second stack, and popping N function nodes from the first stack, wherein N is an integer greater than or equal to 1.

In one possible design, the converting the computational expression into a code in a preset format to obtain the first query code includes:

determining a conversion function of java codes;

and converting the calculation expression through the conversion function to obtain the first query code, wherein the first query code is java code.

In one possible design, the determining the second query code corresponding to the fuzzy query condition includes:

generating at least one column condition and query information corresponding to each column condition according to the fuzzy query condition, wherein the query information comprises column identifications in the preset database and column identifications in the preset flow table;

generating the second query code according to the at least one column condition.

In a possible design, the performing query processing in the preset database according to the first query code and performing query processing in the preset database according to the second query code to obtain target data includes:

determining the identification of the target data according to the first query code and the second query code;

and if the identification of the target data does not exist in the cache of the server, inquiring the target data in the preset database.
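The cache check in this design can be sketched as follows. This is a minimal illustration only: the dictionary used as the server cache, the function name `load_target_data`, and the callable `fetch_from_database` that stands in for the Hbase lookup are all hypothetical, and storing the result back into the cache is an assumption that the application does not state.

```python
from typing import Callable, Dict


def load_target_data(
    row_id: str,
    cache: Dict[str, dict],
    fetch_from_database: Callable[[str], dict],
) -> dict:
    """Return the data for row_id, querying the database only on a cache miss."""
    if row_id in cache:
        # The identifier is already cached, so no database round trip is needed.
        return cache[row_id]
    # The identifier is not in the server cache: query the preset database.
    data = fetch_from_database(row_id)
    # Assumption: the result is cached for later lookups.
    cache[row_id] = data
    return data
```

With this shape, repeated lookups of the same identifier reach the database once.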

In one possible design, the performing query processing in the preset database according to the first query code includes:

determining identifiers of a plurality of data to be queried according to the first query code, and storing the identifiers of the data to be queried to a queue to be queried, wherein the queue to be queried is also used for storing identifiers of other data to be queried;

and performing query processing in the preset database according to the identifier of the data to be queried in the queue to be queried.

In a possible design, the performing query processing in the preset database according to the identifier of the data to be queried in the queue to be queried includes:

if the number of the identifiers of the data to be queried in the queue to be queried is greater than or equal to M, acquiring the identifiers of M data to be queried in the queue to be queried, and querying in the preset database according to the identifiers of the M data to be queried;

if the number of the identifiers of the data to be queried in the queue to be queried is less than M, and the time difference between the current time and the last query time of the preset database is greater than or equal to a first threshold, acquiring all the identifiers of the data to be queried in the queue to be queried, and querying the preset database according to all the identifiers of the data to be queried.
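The two flush rules above (a full batch of M identifiers, or a partial batch once the first threshold has elapsed) can be sketched as follows. The class name `PendingQueryQueue`, its parameter names, and the injectable clock are hypothetical and chosen only for illustration; the sketch returns the identifiers to query and leaves the actual database call to the caller.

```python
import time
from collections import deque
from typing import Callable, Deque, List


class PendingQueryQueue:
    """Queue of identifiers waiting to be queried, flushed by size or by age."""

    def __init__(
        self,
        batch_size: int,
        max_wait_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.batch_size = batch_size              # M in the text
        self.max_wait_seconds = max_wait_seconds  # the "first threshold"
        self.clock = clock
        self.pending: Deque[str] = deque()
        self.last_query_time = clock()

    def add(self, row_id: str) -> None:
        self.pending.append(row_id)

    def next_batch(self) -> List[str]:
        """Return the identifiers to query now, or an empty list to keep waiting."""
        now = self.clock()
        if len(self.pending) >= self.batch_size:
            # At least M identifiers are queued: take exactly M of them.
            count = self.batch_size
        elif self.pending and now - self.last_query_time >= self.max_wait_seconds:
            # Fewer than M, but the last query is old enough: take all of them.
            count = len(self.pending)
        else:
            return []
        batch = [self.pending.popleft() for _ in range(count)]
        self.last_query_time = now
        return batch
```

Batching in this way trades a bounded delay (at most the first threshold) for fewer round trips to the database.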

In a possible design, the performing query processing in the preset database according to the second query code includes:

determining, according to the second query code, the identifier of the data to be queried that corresponds to the second query code in the preset database;

and performing query processing in the preset database according to the identifier of the data to be queried.

In one possible design, the preset flow table is a Flink flow table, and the preset database is an Hbase database.

In a second aspect, the present application provides a data loading apparatus, including:

the acquisition module is used for acquiring an accurate query condition and a fuzzy query condition;

the determining module is used for determining a first query code corresponding to the accurate query condition and a second query code corresponding to the fuzzy query condition, wherein the first query code and the second query code are query codes corresponding to a preset database;

the processing module is used for performing query processing in the preset database according to the first query code and performing query processing in the preset database according to the second query code to obtain target data;

and the storage module is used for loading the target data to a preset flow table.

In one possible design, the determining module is specifically configured to:

acquiring the type of each function node in the syntax tree;

In one possible design, the determining module is specifically configured to:

determining a conversion function of java codes;

In one possible design, the determining module is specifically configured to:

In one possible design, the processing module is specifically configured to:

In a third aspect, the present application provides a data loading device, including:

a memory for storing a program;

a processor for executing the program stored by the memory, the processor being adapted to perform the method as described above in the first aspect and any one of the various possible designs of the first aspect when the program is executed.

In a fourth aspect, the present application provides a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method as described above in the first aspect and any one of the various possible designs of the first aspect.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without inventive effort.

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;

Fig. 2 is a first flowchart of a data loading method according to an embodiment of the present application;

Fig. 3 is a second flowchart of a data loading method according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a data loading apparatus according to an embodiment of the present application;

Fig. 5 is a schematic hardware structure diagram of a data loading device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

To facilitate understanding of the technical solutions of the present application, concepts related to the present application are introduced first:

Apache Flink (hereinafter referred to as "Flink") is a big data processing framework. Its advantages include unified processing of stream data and batch data, multi-level Application Programming Interfaces (APIs), input and output consistency guarantees, flexible deployment, large-scale computation, low latency, and high throughput. Stream data is a sequence of data that arrives in order, in large volume, rapidly, and continuously. Batch data refers to a collection that contains a certain amount of data.

Apache Hbase (Hbase for short) is a column-oriented, scalable distributed storage system with advantages such as high reliability, high performance, and low latency. With Hbase, a large-scale storage cluster can be built on a variety of servers to store massive amounts of data and provide high-speed data queries. In Hbase, data is stored in a key-value format, and a row of data is uniquely determined by a key, which is generally called a row key (rowkey).

The open source parser generator ANTLR (written "antlr" below) automatically generates a syntax tree from input code and can display the syntax tree visually. It provides a framework for automatically constructing the processing of a custom language from a grammar description, with support for languages including Java, C++, and C#.

Next, the background art, the prior art and the problems of the prior art to which the present application relates will be described:

In the data processing of business applications, the data volume is very large but the information density is low, so data processing tends to focus on key business data. In practice, the key business data is often collected through a message framework such as Kafka and transmitted to the real-time stream processing system Flink for real-time data analysis. However, some of the attached attributes of the key business data are stored as profile data in the distributed storage system Hbase. Therefore, loading stream data from the distributed storage system Hbase into the Flink stream table is one of the important operations before data analysis is performed with the real-time stream processing system Flink.

Currently, in the prior art related to data loading, Flink generally performs a query in Hbase by using an accurate query condition, and loads a queried data stream into a Flink stream table. The Hbase is only adapted to the precise query condition, that is, only the data stream corresponding to the precise query condition can be queried from the Hbase according to the precise query condition.

Based on the existing problems, the application provides the following technical concept. The query condition includes an accurate query condition and a fuzzy query condition. Syntax analysis is performed on the accurate query condition and on the fuzzy query condition respectively, and java code is generated, yielding the java code corresponding to the accurate query condition and the java code corresponding to the fuzzy query condition. Query processing is performed in Hbase according to the java code corresponding to the accurate query condition and the java code corresponding to the fuzzy query condition, and the stream data obtained from the query processing is loaded into the Flink stream table. Because data queries are supported through both the accurate query condition and the fuzzy query condition, the capability of the stream processing system Flink to query stream data in Hbase is improved.

Next, an application scenario of the embodiment of the present application is described with reference to fig. 1.

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. Referring to fig. 1, the server 10 includes a preset flow table and a preset database.

The preset flow table is a distributed data flow processing framework or system. For example, Apache Flink (hereinafter referred to as "Flink") may be used. Flink executes arbitrary streaming data programs in a data parallel and pipelined manner.

The preset database is a distributed data storage system. For example, Apache Hbase (hereinafter abbreviated as Hbase) can be used. Hbase is a highly reliable, high-performance, column-oriented, scalable distributed storage system.

In the server, before data processing is performed on the preset flow table, the flow data is usually loaded from an external system. For example, the preset flow table loads the flow data from a preset database. The process of loading stream data from the preset database in the preset stream table comprises the following steps: the preset flow table queries data in the Hbase according to the query condition, and if the data can be queried in the Hbase, the data corresponding to the query condition is loaded into the Flink flow table.

The technical means shown in the present application will be described in detail below with reference to specific examples. It should be noted that the following embodiments may exist alone or in combination with each other, and description of the same or similar contents is not repeated in different embodiments.

Based on the technical concept described above, the data loading method provided by the present application is described in detail below with reference to fig. 2 and a specific embodiment, and fig. 2 is a first flowchart of the data loading method provided by the embodiment of the present application.

As shown in fig. 2, the method includes:

S201, obtaining an accurate query condition and a fuzzy query condition.

Before introducing the acquisition of the precise query condition and the fuzzy query condition, a configuration of a preset flow table for realizing connection with a preset database is explained.

In this embodiment, the configuration information required to connect the preset flow table with the preset database, such as query information, is recorded in a database table, and a configuration page is provided for maintenance personnel. On the one hand, if multiple items of query information need to be changed, they can be updated in batches through database operation statements; on the other hand, detailed description information and the like is provided on the configuration page to guide maintenance personnel through the configuration and to reduce the difficulty of connecting the preset flow table with the preset database. In addition, the connection password of the preset database can be stored in the preset flow table in encrypted form, which prevents leakage of the plaintext password.

In the following, taking the preset flow table as Flink and the preset database as Hbase as an example, the main configuration information for implementing the Flink connection Hbase is exemplarily described, for example, as shown in table 1:

TABLE 1

It should be noted that Zookeeper is an open source coordination service for distributed applications and is an important component of Hbase. Zookeeper provides consistency services for Hbase, and the functions provided include configuration maintenance, domain name service, distributed synchronization, group service, etc.

In addition, in this embodiment, to enable Flink to connect to different versions of Hbase, the Hbase client Java Archive (Jar) file of the corresponding version needs to be used and the Application Programming Interface (API) needs to be adapted automatically, which avoids problems such as API incompatibility, abnormal data parsing, and unstable connections. Specifically, when a Flink task is submitted, the Flink framework can automatically load the Jar of the corresponding Hbase version according to the configured Flink running environment.

The following describes the contents of the precise query condition and the fuzzy query condition.

This embodiment implements adding stream data from a preset database to a preset flow table. After the connection between the preset flow table and the preset database is established, the preset flow table sends the query condition to the preset database, and the data query is performed in the preset database according to the query condition.

In the following, taking the preset database being Hbase as an example, Hbase is introduced first. A row of data in Hbase is usually identified by its rowkey, and the quality of the rowkey design directly determines the read and write performance of Hbase. The data in Hbase is globally ordered by rowkey in ASCII lexicographic order. When two rowkeys are sorted, their first bytes are compared first; if the first bytes differ, the rowkey whose first byte comes earlier in ASCII order is placed first; if the first bytes are the same, the second bytes are compared, and so on. For example, given the 5 rowkeys "012", "0", "123", "234", "3", the result after sorting in ASCII lexicographic order is: "0", "012", "123", "234", "3".
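The ordering in this example can be reproduced with a short sketch. It only illustrates byte-wise lexicographic comparison in plain Python and does not call Hbase; the variable names are illustrative.

```python
rowkeys = ["012", "0", "123", "234", "3"]

# Encode each rowkey to bytes so that the comparison is byte by byte,
# which is how the lexicographic rowkey ordering is described in the text.
sorted_rowkeys = sorted(rowkeys, key=lambda key: key.encode("ascii"))

print(sorted_rowkeys)  # ['0', '012', '123', '234', '3']
```

Note that "0" sorts before "012" because a rowkey that is a prefix of another rowkey compares as smaller.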

Generally, data is queried in Hbase according to query conditions. The query conditions for querying Hbase are of two types: the accurate query condition and the fuzzy query condition. Specifically, the syntax of the accurate query condition is rowkey = 'rowkeyExpr', and the syntax of the fuzzy query condition is rowkey like 'rowkeyExpr' [and otherConditions]. Here rowkeyExpr is a rowkey computational expression that supports common string operations such as string concatenation, reversal, and hashing, and otherConditions refers to query conditions other than the rowkey condition (hereinafter referred to as non-rowkey query conditions). A non-rowkey query condition may be, for example: sex = 'man' and age = 20.

One possible implementation of obtaining the accurate query condition and the fuzzy query condition is described below.

In one possible implementation, the query condition is obtained from the configuration information used to connect Flink to Hbase. Whether the query condition in the configuration information is an accurate query condition or a fuzzy query condition is judged according to the respective syntactic forms of the two. If the query condition in the configuration information is an accurate query condition, the accurate query condition is obtained; if it is a fuzzy query condition, the fuzzy query condition is obtained.
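Distinguishing the two kinds of condition by their syntactic form can be sketched as follows. This is an illustration under stated assumptions, not the parser of the application: regular expressions are swapped in for the grammar-based analysis, rowkeyExpr is assumed to contain no single quotes, and the names `classify_condition`, `ACCURATE_PATTERN`, and `FUZZY_PATTERN` are hypothetical.

```python
import re

# Accurate form: rowkey = 'rowkeyExpr'
ACCURATE_PATTERN = re.compile(
    r"^\s*rowkey\s*=\s*'(?P<expr>[^']+)'\s*$", re.IGNORECASE
)
# Fuzzy form: rowkey like 'rowkeyExpr' [and otherConditions]
FUZZY_PATTERN = re.compile(
    r"^\s*rowkey\s+like\s+'(?P<expr>[^']+)'(?:\s+and\s+(?P<others>.+))?\s*$",
    re.IGNORECASE,
)


def classify_condition(condition: str):
    """Return (kind, rowkey_expr, other_conditions) for a query condition."""
    match = ACCURATE_PATTERN.match(condition)
    if match:
        return "accurate", match.group("expr"), None
    match = FUZZY_PATTERN.match(condition)
    if match:
        return "fuzzy", match.group("expr"), match.group("others")
    raise ValueError(f"unrecognized query condition: {condition!r}")


print(classify_condition("rowkey = 'reverse(id)'"))
print(classify_condition("rowkey like 'concat(id, city)' and sex = 'man' and age = 20"))
```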

S202, determining a first query code corresponding to the accurate query condition and a second query code corresponding to the fuzzy query condition, wherein the first query code and the second query code are query codes corresponding to a preset database.

Data is queried in the preset database according to a query code. The query code corresponding to the accurate query condition is the first query code, and the query code corresponding to the fuzzy query condition is the second query code.

It should be noted that data hot spots in Hbase are often avoided by means of reversal, concatenation, hashing, and the like. Therefore, rowkeyExpr in the query condition supports common string manipulation functions such as reverse, substring, and concat, where reverse is a function that reverses a string, substring is a function that extracts part of a string, and concat is a function that joins multiple strings into one string. To implement these functions, a function parsing class is established for each function supported by rowkeyExpr. These parsing classes all implement one function interface that includes two methods, one for setting the parameters of the function and the other for generating the java code. For example, the two methods may be setChildren and codeGen, respectively, where setChildren is used to set function parameters and codeGen is used to generate java code.
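The parsing classes described above can be sketched as follows. The method names setChildren and codeGen follow the text (and therefore keep their Java-style spelling); the class names `RowkeyFunction`, `Column`, `Literal`, `Reverse`, `Substring`, and `Concat`, as well as the exact Java text that each class emits, are assumptions made for illustration and are not taken from the application.

```python
from abc import ABC, abstractmethod
from typing import List, Union


class RowkeyFunction(ABC):
    """Common interface: set the parameters, then generate java code."""

    def __init__(self) -> None:
        self.children: List["RowkeyFunction"] = []

    def setChildren(self, children: List["RowkeyFunction"]) -> None:
        self.children = list(children)

    @abstractmethod
    def codeGen(self) -> str:
        """Return a java expression as text."""


class Column(RowkeyFunction):
    """Leaf: a stream column, emitted as a java variable name (assumed)."""

    def __init__(self, name: str) -> None:
        super().__init__()
        self.name = name

    def codeGen(self) -> str:
        return self.name


class Literal(RowkeyFunction):
    """Leaf: an integer or string constant."""

    def __init__(self, value: Union[int, str]) -> None:
        super().__init__()
        self.value = value

    def codeGen(self) -> str:
        if isinstance(self.value, int):
            return str(self.value)
        return '"' + self.value + '"'


class Reverse(RowkeyFunction):
    def codeGen(self) -> str:
        (arg,) = self.children
        return f"new StringBuilder({arg.codeGen()}).reverse().toString()"


class Substring(RowkeyFunction):
    def codeGen(self) -> str:
        target, begin, end = self.children
        return f"{target.codeGen()}.substring({begin.codeGen()}, {end.codeGen()})"


class Concat(RowkeyFunction):
    def codeGen(self) -> str:
        return "(" + " + ".join(child.codeGen() for child in self.children) + ")"


# Usage: concat(reverse(id), "_", city)
rev = Reverse()
rev.setChildren([Column("id")])
expr = Concat()
expr.setChildren([rev, Literal("_"), Column("city")])
print(expr.codeGen())
```

Separating setChildren from codeGen lets a function object be created before its parameters are known, which is what a traversal that sees a function before its arguments requires.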

In this embodiment, a first query code corresponding to the precise query condition is determined according to the precise query condition, and a second query code corresponding to the fuzzy query condition is determined according to the fuzzy query condition. The first query code and the second query code are query codes corresponding to a preset database.

It should be noted that the precise query condition only includes the rowkey query condition, and the fuzzy query condition includes both the rowkey query condition and the non-rowkey query condition.

Therefore, in the process of obtaining the first query code according to the accurate query condition, only the rowkeyExpr corresponding to the rowkey needs to be analyzed, and the code obtained after the analysis is the first query code; in the process of obtaining the second query code according to the fuzzy query condition, the rowkey query condition and the non-rowkey query condition in the fuzzy query condition need to be analyzed respectively, and the query code obtained by analyzing the rowkey query condition and the query code obtained by analyzing the non-rowkey query condition form the second query code.

Next, an exemplary description is given of a possible implementation of determining a query code corresponding to rowkeyExpr from rowkeyExpr.

In one possible implementation, a standard Abstract Syntax Tree (AST) is generated from the antlr grammar definition file and rowkeyExpr. The antlr grammar definition file describes the structure of each grammar component of the language. The generated abstract syntax tree comprises a plurality of functions, each of which may include at least one sub-function or no sub-function. Each sub-function comprises an enter-function node and at least one leave-function node. Next, two empty stacks are created, for example stack1 and stack2, where stack1 stores functions and stack2 stores the number of sub-functions corresponding to each function. The AST is then traversed in order. If the traversal enters a function node, the function is first pushed onto stack1, and the number of sub-functions of the function is then pushed onto stack2. If the traversal leaves a function node, stack2 is first popped to obtain the number n of sub-functions corresponding to the function, and n functions are then popped from stack1. Whether all function nodes of the AST have been traversed is then judged; if not, the traversal of the AST continues. If the traversal is completed, the function at the top of stack1 is obtained, which is the expression function of rowkeyExpr. Next, a conversion method of the stack top function is called to obtain the query code corresponding to rowkeyExpr. The conversion method converts the stack top function into the corresponding query code. For example, the conversion method may be a codeGen method, and the type of the query code may be Java code.
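The two-stack traversal can be sketched as follows. Several things here are assumptions made for illustration: nested tuples stand in for the antlr syntax tree, the hypothetical `walk` function imitates the enter and leave events of the traversal, leaves are treated as functions with zero sub-functions, and the n functions popped on leaving a node are attached as the parameters of the function that is then at the top of stack1 (the text states that they are popped but not how they are used).

```python
from typing import Callable, List, Tuple, Union

# A parsed expression: a leaf (column name) or (function name, [arguments]).
Tree = Union[str, Tuple[str, list]]


class Func:
    """A function held on stack1; its children are filled in on 'leave'."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children: List["Func"] = []

    def setChildren(self, children: List["Func"]) -> None:
        self.children = children

    def codeGen(self) -> str:
        if not self.children:
            return self.name  # leaf: emitted as a java variable name
        args = [child.codeGen() for child in self.children]
        if self.name == "reverse":
            return f"new StringBuilder({args[0]}).reverse().toString()"
        if self.name == "concat":
            return "(" + " + ".join(args) + ")"
        raise ValueError(f"unsupported function: {self.name}")


def walk(tree: Tree, on_enter: Callable[[str, int], None],
         on_leave: Callable[[], None]) -> None:
    """Depth-first traversal that fires enter and leave events for each node."""
    name, children = (tree, []) if isinstance(tree, str) else tree
    on_enter(name, len(children))
    for child in children:
        walk(child, on_enter, on_leave)
    on_leave()


def build_expression(tree: Tree) -> Func:
    stack1: List[Func] = []  # functions
    stack2: List[int] = []   # number of sub-functions of each function

    def on_enter(name: str, child_count: int) -> None:
        stack1.append(Func(name))
        stack2.append(child_count)

    def on_leave() -> None:
        n = stack2.pop()
        # The n most recently completed functions are the arguments of the
        # function being left; popping reverses them, so restore the order.
        children = [stack1.pop() for _ in range(n)][::-1]
        stack1[-1].setChildren(children)

    walk(tree, on_enter, on_leave)
    return stack1[-1]  # the expression function of rowkeyExpr


# Usage: concat(reverse(id), city)
root = build_expression(("concat", [("reverse", ["id"]), "city"]))
print(root.codeGen())
```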

Next, an exemplary description is given of a possible implementation manner of determining the query code corresponding to the non-rowkey query condition according to the non-rowkey query condition.

In one possible implementation, a standard Abstract Syntax Tree (AST) is generated from the antlr grammar definition file and the non-rowkey query condition. A List is created, and the nodes of the abstract syntax tree are then traversed. If a non-rowkey query condition node is reached, an analysis object corresponding to that node is constructed and added to the List, where the analysis object comprises a preset database column name, a comparison condition, and a Flink column name. The format of the analysis object corresponding to a non-rowkey query condition may be, for example: preset flow table column name, comparison condition, preset database column name. For example, a non-rowkey query condition may be stream.name1 = name, where name1 is a preset flow table column name, name is a preset database column name, and "=" is the comparison condition. Whether all non-rowkey query condition nodes of the AST have been traversed is then judged; if not, the traversal of the AST continues. If the traversal is completed, the query code corresponding to the non-rowkey query condition is generated according to the List. Specifically, each analysis object in the List is traversed in turn, the conversion method corresponding to each analysis object is called, and the query code corresponding to each analysis object is obtained from the conversion method, that is, the query code corresponding to each non-rowkey query condition is obtained. For example, the conversion method is codeGen, which is used to generate java code.
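The List of analysis objects and its conversion to code can be sketched as follows. This is a simplified stand-in: a regular expression replaces the antlr syntax tree traversal, and the class name `ColumnCondition`, the helper names, and the shape of the generated java text (including the hypothetical variables `streamRow` and `hbaseRow`) are assumptions, not the output format of the application.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnCondition:
    """Analysis object for one non-rowkey query condition."""

    stream_column: str    # column name in the preset flow table
    operator: str         # comparison condition
    database_column: str  # column name in the preset database

    def codeGen(self) -> str:
        # Assumed output shape: compare the two values with compareTo.
        java_op = "==" if self.operator == "=" else self.operator
        return (
            f'streamRow.get("{self.stream_column}")'
            f'.compareTo(hbaseRow.get("{self.database_column}")) {java_op} 0'
        )


CONDITION = re.compile(r"^\s*stream\.(\w+)\s*(<=|>=|!=|=|<|>)\s*(\w+)\s*$")


def parse_conditions(text: str) -> List[ColumnCondition]:
    """Build the List of analysis objects from 'a and b and ...'."""
    conditions = []
    for part in re.split(r"\s+and\s+", text.strip(), flags=re.IGNORECASE):
        match = CONDITION.match(part)
        if match is None:
            raise ValueError(f"unsupported condition: {part!r}")
        stream_column, operator, database_column = match.groups()
        conditions.append(ColumnCondition(stream_column, operator, database_column))
    return conditions


def generate_code(conditions: List[ColumnCondition]) -> str:
    """Call codeGen on each analysis object in turn and join the results."""
    return " && ".join(condition.codeGen() for condition in conditions)


conds = parse_conditions("stream.name1 = name and stream.age1 >= age")
print(generate_code(conds))
```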

S203, performing query processing in a preset database according to the first query code, and performing query processing in the preset database according to the second query code to obtain target data.

Next, a possible implementation of the query processing in the preset database according to the first query code will be exemplarily described.

In a possible implementation manner, the query identifiers corresponding to the first query code in the preset database are calculated according to the first query code, and each query identifier obtained by the calculation is added to the queue to be queried. Data query processing is then performed in the preset database according to the query identifiers in the queue to be queried, and the data obtained from the data query processing is the target data. For example, the rowkeys corresponding to the first query code in Hbase are calculated according to the first query code, and each rowkey is stored in the queue to be queried. Data query processing is then performed in Hbase according to the rowkeys in the queue to be queried, and the data obtained from the data query processing is the target data.

Next, a possible implementation of the query processing in the preset database according to the second query code will be exemplarily described.

In a possible implementation manner, the query information corresponding to the second query code in the preset database is calculated according to the second query code. The data corresponding to the query information, namely the target data, is then acquired from the preset database according to the query information. For example, the rowkey prefix and the non-rowkey query condition corresponding to the second query code are determined according to the second query code. The columns to be returned by the query are set according to the Hbase dimension list in the configuration table. Hbase is then queried in Scan mode according to the rowkey prefix, the non-rowkey query condition, and the Hbase dimension list, and the data obtained is the target data.
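The combination of a rowkey prefix, a non-rowkey condition, and a list of return columns can be illustrated with an in-memory stand-in for the Scan. Nothing here calls Hbase: the sorted list of (rowkey, row) pairs, the function name `prefix_scan`, and the predicate callable are all hypothetical and only model the behavior described above.

```python
from bisect import bisect_left
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def prefix_scan(
    table: List[Tuple[str, Row]],
    prefix: str,
    predicate: Callable[[Row], bool],
    return_columns: List[str],
) -> List[Tuple[str, Row]]:
    """Scan rows whose rowkey starts with prefix and that satisfy predicate.

    table must be sorted by rowkey, mirroring the global rowkey ordering.
    """
    keys = [rowkey for rowkey, _ in table]
    results = []
    # Jump to the first rowkey that is not smaller than the prefix.
    for index in range(bisect_left(keys, prefix), len(table)):
        rowkey, row = table[index]
        if not rowkey.startswith(prefix):
            break  # rows are sorted, so no later rowkey can match the prefix
        if predicate(row):
            # Keep only the configured return columns.
            projected = {c: row[c] for c in return_columns if c in row}
            results.append((rowkey, projected))
    return results


rows = [
    ("u100_a", {"sex": "man", "age": "20", "city": "x"}),
    ("u100_b", {"sex": "woman", "age": "20", "city": "y"}),
    ("u101_a", {"sex": "man", "age": "31", "city": "z"}),
    ("u200_a", {"sex": "man", "age": "20", "city": "w"}),
]
matches = prefix_scan(rows, "u10", lambda row: row["sex"] == "man", ["city"])
print(matches)
```

Because the rows are ordered by rowkey, the scan can stop at the first rowkey that no longer starts with the prefix instead of reading the whole table.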

And S204, loading the target data to a preset flow table.

After the target data is obtained based on the step S203, the target data is loaded into the preset flow table.

The data loading method provided by the embodiment of the application comprises the following steps: acquiring an accurate query condition and a fuzzy query condition; determining a first query code corresponding to the accurate query condition and a second query code corresponding to the fuzzy query condition, wherein the first query code and the second query code are query codes corresponding to a preset database; performing query processing in the preset database according to the first query code, and performing query processing in the preset database according to the second query code, to obtain target data; and loading the target data to a preset flow table. Because the query can be carried out in both an accurate query mode and a fuzzy query mode, data loading is not limited to a single query mode, and the data query capability is improved.

Based on the foregoing embodiment, the data loading method provided by the present application is further described below with reference to a specific embodiment, and is described with reference to fig. 3, where fig. 3 is a second flowchart of the data loading method provided by the embodiment of the present application.

As shown in fig. 3, the method includes:

S301, obtaining an accurate query condition and a fuzzy query condition.

The step S301 is similar to the step S201, and is not described herein again.

S302, a grammar tree corresponding to the accurate query condition is obtained, wherein the grammar tree comprises a plurality of function nodes, and the function nodes are elements in the accurate query condition.

The rowkey expression in the accurate query condition is denoted rowkeyExpr.

In this embodiment, a standard abstract syntax tree is generated according to the ANTLR grammar definition file and rowkeyExpr. The ANTLR grammar definition file describes the structure of each grammatical component of the language.

The syntax tree corresponding to the accurate query condition includes a plurality of function nodes, each of which is a key element in the accurate query condition.

S303, obtaining the type of each function node in the syntax tree.

S304, traversing each function node in the syntax tree, executing a stacking operation on the function node of the first type, and executing a popping operation on the function node of the second type until the traversal of each function node in the syntax tree is completed, and determining the sub-function node of the stack top element in the first stack as a computational expression.

Next, step S303 and step S304 will be explained together.

In this embodiment, the node types of the function nodes in the syntax tree are obtained, where the node types include a first type and a second type. When the type of a function node is the first type, a push operation is performed on the function node; correspondingly, a pop operation is performed on a function node of the second type.

In this embodiment, the function nodes in the syntax tree are traversed sequentially. When a function node of a first type is traversed, stacking the function node; and when the function node of the second type is traversed, the function node is popped.

And after traversing all function nodes of the syntax tree, determining sub-function nodes of the stack top element in the first stack as a computational expression.

Wherein the push operation comprises: and putting the function nodes of the first type into a first stack, and putting the number of the function parameters corresponding to the function nodes of the first type into a second stack.

The pop operation comprises the following steps: and acquiring a second number N at the stack top of the second stack, and popping the N function nodes in the first stack, wherein N is an integer greater than or equal to 1.
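The two-stack traversal can be sketched in Python as follows. This is one possible reading of the description, not the patent's implementation: it assumes that entering a node is the "first type" event and leaving a node is the "second type" event, that a sentinel root sits at the bottom of the first stack, and that leaf arguments are pushed with a parameter count of 0 (the text itself states N is at least 1). All names (`Node`, `build_expression`, the event tuples) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


# Each event is ("enter", name, n_params) or ("exit", name, 0).
Event = Tuple[str, str, int]


def _pop(node_stack: List[Node], count_stack: List[int]) -> None:
    """Pop N nodes from the first stack and attach them to the new stack top."""
    n = count_stack.pop()
    args = [node_stack.pop() for _ in range(n)][::-1]
    node_stack[-1].children = args


def build_expression(events: List[Event]) -> Node:
    node_stack: List[Node] = [Node("<root>")]  # the "first stack"
    count_stack: List[int] = [1]               # the "second stack"; root has one child
    for kind, name, n_params in events:
        if kind == "enter":                    # first type: push node and its count
            node_stack.append(Node(name))
            count_stack.append(n_params)
        else:                                  # second type: pop operation
            _pop(node_stack, count_stack)
    _pop(node_stack, count_stack)              # close the sentinel root
    # The child of the stack-top element is the computational expression.
    return node_stack[-1].children[0]


def to_text(node: Node) -> str:
    if not node.children:
        return node.name
    return f"{node.name}({', '.join(to_text(c) for c in node.children)})"


# Traversal events for the hypothetical expression concat(md5(id), name).
events = [
    ("enter", "concat", 2),
    ("enter", "md5", 1),
    ("enter", "id", 0),
    ("exit", "id", 0),
    ("exit", "md5", 0),
    ("enter", "name", 0),
    ("exit", "name", 0),
    ("exit", "concat", 0),
]
expression = build_expression(events)
```

Here `to_text(expression)` rebuilds the nested call as text, which makes it easy to check that the arguments were attached in their original order.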

S305, determining a conversion function of the java code.

S306, converting the calculation expression through the conversion function to obtain a first query code, wherein the first query code is java code.

Next, step S305 and step S306 will be collectively explained.

In this embodiment, the conversion function of the java code is determined first. The conversion function is used to convert the computational expression into query code; for example, the conversion function may be codeGen. After the conversion function is determined, the computational expression is converted by the conversion function to obtain the first query code, which is the java code corresponding to the computational expression.
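As an illustration of what such a conversion function might do, the Python sketch below renders a nested expression as a Java source string. The source only names codeGen; the `Expr` type, the `RowkeyFunctions` class and the `row.get(...)` accessor in the generated text are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expr:
    name: str
    args: List["Expr"] = field(default_factory=list)
    is_column: bool = False  # True when the node refers to a flow table column


def code_gen(expr: Expr) -> str:
    """Render an expression tree as a Java source fragment (hypothetical target API)."""
    if expr.is_column:
        return f'row.get("{expr.name}")'
    if not expr.args:
        return f'"{expr.name}"'  # treat a bare leaf as a string literal
    rendered = ", ".join(code_gen(a) for a in expr.args)
    return f"RowkeyFunctions.{expr.name}({rendered})"


rowkey_expr = Expr("concat", [Expr("md5", [Expr("id", is_column=True)]), Expr("_")])
java_source = code_gen(rowkey_expr)
```

The generated string would still have to be compiled and invoked by the host system; that step is outside the scope of this sketch.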

S307, generating at least one column condition and query information corresponding to each column condition according to the fuzzy query condition, wherein the query information comprises column identifications in a preset database and column identifications in a preset flow table.

In this embodiment, when the query condition is a fuzzy query condition, a possible implementation of generating query information according to the fuzzy query condition is described below. The query information includes column identifiers in the preset database and column identifiers in the preset flow table.

It should be noted that the fuzzy query conditions include rowkeyExpr and non-rowkey query conditions.

The specific implementation method for generating the java code corresponding to rowkeyExpr according to the conversion function and rowkeyExpr is similar to the specific implementation method from step S302 to step S306, and is not described herein again.

In this embodiment, a standard abstract syntax tree is generated according to the ANTLR grammar definition file and the non-rowkey query condition. The ANTLR grammar definition file describes the structure of each grammatical component of the language.

Next, a possible implementation method of generating query information based on a syntax tree corresponding to a non-rowkey query condition will be described.

In one possible implementation, a List is created. The syntax tree corresponding to the non-rowkey query condition is then traversed sequentially, and when a traversed node is a non-rowkey query condition node, at least one column condition and the query information corresponding to each column condition are generated. For example, when a traversed node is a non-rowkey query condition node, a parsing object corresponding to that node is constructed and added to the List, where the parsing object includes a preset database column name, a comparison condition and a Flink column name. The format of the parsing object may be, for example: preset flow table column name, comparison condition, preset database column name. For example, a non-rowkey query condition may be stream.name1 = name, where name1 is the preset flow table column name, name is the preset database column name, and "=" is the comparison condition.
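The construction of the list of parsing objects can be sketched in Python. This sketch replaces the ANTLR syntax tree traversal with a simple regular-expression split, purely for illustration; the `ColumnCondition` type, the supported operators and the `and` separator are assumptions, not part of the described method.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnCondition:
    flow_column: str  # column name in the preset flow table
    operator: str     # comparison condition
    db_column: str    # column name in the preset database


# Matches conditions of the form "stream.<flow column> <op> <database column>".
_PATTERN = re.compile(r"^\s*stream\.(\w+)\s*(>=|<=|!=|=|>|<)\s*(\w+)\s*$")


def parse_column_conditions(condition: str) -> List[ColumnCondition]:
    """Split a non-rowkey condition on 'and' and build one object per column condition."""
    result: List[ColumnCondition] = []
    for part in re.split(r"\s+and\s+", condition, flags=re.IGNORECASE):
        match = _PATTERN.match(part)
        if match is None:
            raise ValueError(f"unsupported column condition: {part!r}")
        result.append(ColumnCondition(match.group(1), match.group(2), match.group(3)))
    return result


conditions = parse_column_conditions("stream.name1 = name and stream.age1 >= age")
```

Each resulting object carries both column identifiers, which is the information the later Scan filter needs to compare a flow table value against a database column.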

S308, generating a second inquiry code according to at least one column condition.

The specific implementation manner of step S308 is similar to that of step S307, and is not described here again.

S309, determining the identification of the target data according to the first query code and the second query code.

In this embodiment, a possible implementation manner of determining the identifier of the target data according to the first query code and the second query code is first described.

In a possible implementation, the identifier corresponding to the target data to be queried is calculated according to the first query code and the second query code. The identifier may be the rowkey of Hbase. For example, if the query condition is rowkey like stream.name and stream.age, the second query code is generated according to the query condition, and the corresponding flow table data is determined according to the generated second query code. If the flow table data determined in this way has name zhangsan and age 10, the rowkey corresponding to the query condition is [zhangsan, 10].

And S310, if the identification of the target data does not exist in the cache of the server, inquiring the target data in a preset database.

In this embodiment, whether the identifier of the target data exists in the cache of the server is determined according to the identifier of the target data determined by the first query code and the second query code.

And if the identification of the target data exists in the cache of the server, inquiring the target data in the cache of the server according to the identification of the target data.

And if the identification of the target data does not exist in the cache of the server, inquiring the target data in a preset database according to the identification of the target data.
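The cache check can be sketched in Python as follows. The class name `CachedLookup` and the `query_database` callable are hypothetical, and storing the database result back into the cache is an assumption of this sketch; the text above only states where the lookup is directed.

```python
from typing import Callable, Dict, Optional

Row = Dict[str, str]


class CachedLookup:
    """Look up target data by identifier, consulting the server cache first."""

    def __init__(self, query_database: Callable[[str], Optional[Row]]):
        self._query_database = query_database
        self._cache: Dict[str, Row] = {}
        self.database_queries = 0  # counts how often the database was consulted

    def get(self, rowkey: str) -> Optional[Row]:
        if rowkey in self._cache:
            return self._cache[rowkey]       # identifier found in the cache
        self.database_queries += 1
        row = self._query_database(rowkey)   # identifier absent: query the database
        if row is not None:
            self._cache[rowkey] = row        # assumption: remember the result
        return row


# A dictionary stands in for the preset database.
database = {"zhangsan_10": {"name": "zhangsan", "age": "10"}}
lookup = CachedLookup(database.get)
first = lookup.get("zhangsan_10")
second = lookup.get("zhangsan_10")
```

With this arrangement the second lookup for the same identifier is served from the cache and does not reach the database.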

Next, when the identifier of the target data does not exist in the cache of the server, the specific method for querying the target data in the preset database according to the identifier of the target data may refer to steps S311 to S315.

Steps S311 to S313 are specific implementation methods for querying the target data according to the first query code, and steps S314 to S315 are specific implementation methods for querying the target data according to the second query code.

S311, determining the identifiers of a plurality of data to be queried according to the first query code, and storing the identifiers of the data to be queried to a queue to be queried, wherein the queue to be queried is also used for storing the identifiers of other data to be queried.

Based on the above steps S302 to S306, a first query code corresponding to the accurate query condition is obtained, and then, in this embodiment, according to the first query code, identifiers of a plurality of data to be queried are determined, and the identifiers of the data to be queried are stored in a queue to be queried, where the queue to be queried is further used for storing identifiers of other data to be queried.

For example, the Hbase rowkeys to be queried are calculated according to the first query code and stored in the queue to be queried; the system separately maintains a daemon thread that consumes the queue data and schedules the Hbase query tasks.

S312, if the number of the identifiers of the data to be queried in the queue to be queried is larger than or equal to M, acquiring the identifiers of M data to be queried in the queue to be queried, and querying in a preset database according to the identifiers of the M data to be queried.

Wherein M is a positive integer greater than or equal to 1.

In a possible implementation, whether to issue a query is decided according to the number of identifiers of data to be queried in the queue to be queried. Specifically, when that number is smaller than M, no query is issued for the time being; when that number is greater than or equal to M, the identifiers of M data to be queried are taken from the queue and queried as one batch. Batching in this way helps improve the query throughput and query efficiency of the system.

And carrying out query processing in a preset database according to the identifiers of the M data to be queried.

The value of M may be set according to a requirement, or the size of M may be determined according to the maximum asynchronous concurrency amount of the system, for example, refer to formula one.

Where parsint () is a rounding function, poolSize is the maximum asynchronous concurrency of the system, and β is a number representing the magnitude of the concurrency.

S313, if the number of the identifiers of the data to be queried in the queue to be queried is less than M and the time difference between the current time and the last query time of the preset database is greater than or equal to a first threshold, acquiring all the identifiers of the data to be queried in the queue to be queried, and querying in the preset database according to all the identifiers of the data to be queried.

The first threshold is a numerical value representing time.

In a possible implementation, whether to issue a query is decided according to both the number of identifiers of data to be queried in the queue to be queried and the time difference between the current time and the last query time of the preset database. Specifically, when the number of identifiers in the queue to be queried is less than M and the time difference between the current time and the last query time of the preset database is greater than or equal to the first threshold, all the identifiers of the data to be queried in the queue are acquired, and query processing is performed in the preset database according to all of them.

The size of the first threshold value can be set according to requirements.
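The batching rule of steps S312 and S313 can be sketched in Python. The class name `BatchingQueue`, its parameters, and the injectable `clock` are hypothetical; the sketch only decides which identifiers to query next and leaves the actual database call to the caller.

```python
import time
from collections import deque
from typing import Callable, Deque, List


class BatchingQueue:
    """Release identifiers in batches of M, or all of them once the wait threshold passes."""

    def __init__(self, batch_size: int, max_wait_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        if batch_size < 1:
            raise ValueError("batch_size (M) must be at least 1")
        self._pending: Deque[str] = deque()
        self._batch_size = batch_size        # M
        self._max_wait = max_wait_seconds    # the first threshold
        self._clock = clock
        self._last_query_time = clock()

    def put(self, rowkey: str) -> None:
        self._pending.append(rowkey)

    def next_batch(self) -> List[str]:
        now = self._clock()
        if len(self._pending) >= self._batch_size:
            count = self._batch_size         # S312: at least M pending, take M
        elif self._pending and now - self._last_query_time >= self._max_wait:
            count = len(self._pending)       # S313: fewer than M, but waited long enough
        else:
            return []                        # keep waiting
        self._last_query_time = now
        return [self._pending.popleft() for _ in range(count)]


# Usage with a fake clock so the timing is deterministic.
now = [0.0]
queue = BatchingQueue(batch_size=3, max_wait_seconds=5.0, clock=lambda: now[0])
queue.put("a")
queue.put("b")
first = queue.next_batch()   # fewer than M and threshold not reached
queue.put("c")
queue.put("d")
second = queue.next_batch()  # at least M pending
now[0] = 5.0
third = queue.next_batch()   # fewer than M, threshold reached
```

The time-based branch prevents a small number of identifiers from waiting indefinitely when traffic is too low to fill a batch of M.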

And S314, determining the identifier of the second query code corresponding to the data to be queried in the preset database according to the second query code.

In this embodiment, the identifier in the preset database is calculated according to the second query code corresponding to the fuzzy query condition. For example, the rowkey prefix and the non-rowkey query condition corresponding to the second query code in Hbase are calculated according to the second query code. In addition, the queried data is filtered according to the preconfigured Hbase query return columns.

And S315, performing query processing in a preset database according to the identifier of the data to be queried.

In one possible implementation, the preset database is queried in scan mode according to the identifier of the data to be queried, and the target data is obtained from the preset database. The scan mode is a query mode of the preset database and is used for querying data in the preset database.

And S316, loading the target data to a preset flow table.

In one possible implementation, the target data is parsed (for example, compressed), stored in a preset data format, and loaded into the fields of the preset flow table. Taking Hbase as the preset database, the data formats supported by Hbase generally fall into two major types: the Hbase multi-column data format and the Hbase single-column data format. The single-column data format includes csv, avro, protobuf and the like. The format for storing the target data may be selected as required and is not limited here. Taking Flink as the preset flow table and csv as the storage format, the target data of Hbase is compressed, the processed data is stored in csv format, and the stored data is loaded into the fields of the Flink flow table.
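For the single-column csv case, the mapping from a stored value to flow table fields can be sketched in Python. The function name `load_csv_cell` and the field names are hypothetical, and the sketch covers only the parsing and field assignment, not the compression step or the Flink side.

```python
import csv
import io
from typing import Dict, List


def load_csv_cell(cell: str, field_names: List[str]) -> Dict[str, str]:
    """Parse one csv-encoded single-column value into named flow table fields."""
    values = next(csv.reader(io.StringIO(cell)), [])
    if len(values) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} values, got {len(values)}"
        )
    return dict(zip(field_names, values))


# A quoted value may itself contain the delimiter.
record = load_csv_cell('zhangsan,10,"Jinan, Shandong"', ["name", "age", "city"])
```

Using a csv parser rather than splitting on commas keeps quoted values that contain the delimiter intact.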

The data loading method provided by the embodiment of the application comprises the following steps: acquiring an accurate query condition and a fuzzy query condition; acquiring a syntax tree corresponding to the accurate query condition, wherein the syntax tree comprises a plurality of function nodes, and the function nodes are elements in the accurate query condition; acquiring the type of each function node in the syntax tree; traversing each function node in the syntax tree, executing a push operation on function nodes of the first type and a pop operation on function nodes of the second type until the traversal of each function node in the syntax tree is completed, and determining the sub-function node of the stack top element in the first stack as a computational expression; determining a conversion function of the java code; converting the computational expression through the conversion function to obtain a first query code, wherein the first query code is java code; generating at least one column condition and query information corresponding to each column condition according to the fuzzy query condition, wherein the query information comprises column identifiers in a preset database and column identifiers in a preset flow table; generating a second query code according to the at least one column condition; if the identifier of the target data does not exist in the cache of the server, querying the target data in the preset database; determining the identifiers of a plurality of data to be queried according to the first query code, and storing the identifiers of the data to be queried to a queue to be queried, wherein the queue to be queried is also used for storing the identifiers of other data to be queried; if the number of identifiers of data to be queried in the queue to be queried is greater than or equal to M, acquiring the identifiers of M data to be queried in the queue to be queried, and querying in the preset database according to the identifiers of the M data to be queried; if the number of identifiers of data to be queried in the queue to be queried is less than M, and the time difference between the current time and the last query time of the preset database is greater than or equal to a first threshold, acquiring all the identifiers of the data to be queried in the queue to be queried, and querying in the preset database according to all of them; determining, according to the second query code, the identifier of the data to be queried corresponding to the second query code in the preset database; performing query processing in the preset database according to the identifier of the data to be queried; and loading the target data to the preset flow table.

Fig. 4 is a schematic structural diagram of a data loading apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus 400 includes: an obtaining module 401, a determining module 402, a processing module 403 and a storage module 404.

An obtaining module 401, configured to obtain an accurate query condition and a fuzzy query condition;

a determining module 402, configured to determine a first query code corresponding to the precise query condition and a second query code corresponding to the fuzzy query condition, where the first query code and the second query code are query codes corresponding to a preset database;

a processing module 403, configured to perform query processing in the preset database according to the first query code, and perform query processing in the preset database according to the second query code, so as to obtain target data;

a storage module 404, configured to load the target data into a preset flow table.

In one possible design, the determining module 402 is specifically configured to:

acquiring the type of each function node in the syntax tree;

determining a conversion function of java codes;

In one possible design, the processing module 403 is specifically configured to:

The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.

Fig. 5 is a schematic diagram of a hardware structure of a data loading device according to an embodiment of the present application, and as shown in fig. 5, the data loading device 500 according to the embodiment includes: a processor 501 and a memory 502; wherein

A memory 502 for storing computer-executable instructions;

the processor 501 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the data loading method in the foregoing embodiments. Reference may be made in particular to the description relating to the method embodiments described above.

Alternatively, the memory 502 may be separate or integrated with the processor 501.

When the memory 502 is provided separately, the data loading apparatus further includes a bus 503 for connecting the memory 502 and the processor 501.

An embodiment of the present application provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the data loading method executed by the data loading apparatus is implemented.

An embodiment of the present application further provides a computer program product, where the program product includes: a computer program, stored in a readable storage medium, from which at least one processor of the electronic device can read the computer program, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any of the embodiments described above.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.

It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (NVM), such as at least one disk memory; it may also be a USB disk, a removable hard disk, a read-only memory, a magnetic disk or an optical disk.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.

The storage medium may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A data loading method is applied to a server, and the method comprises the following steps:

acquiring an accurate query condition and a fuzzy query condition;

and loading the target data to a preset flow table.

2. The method of claim 1, wherein determining the first query code corresponding to the precise query condition comprises:

3. The method of claim 2, wherein the determining the computational expression corresponding to the precise query condition comprises:

acquiring the type of each function node in the syntax tree;

4. The method of claim 3, wherein determining the computational expression among the plurality of function nodes according to the type of each function node in the syntax tree and the number of function parameters corresponding to each function node comprises:

5. The method according to any one of claims 2 to 4, wherein the converting the computational expression into a code in a preset format to obtain the first query code comprises:

determining a conversion function of java codes;

6. The method of claim 2, wherein the determining the second query code corresponding to the ambiguous query comprises:

7. The method according to any one of claims 1 to 4, wherein the performing query processing in the preset database according to the first query code and performing query processing in the preset database according to the second query code to obtain target data comprises:

8. The method according to any one of claims 1 to 4, wherein the performing query processing in the preset database according to the first query code comprises:

9. The method according to claim 8, wherein the performing query processing in the preset database according to the identifier of the data to be queried in the queue to be queried includes:

10. The method according to any one of claims 1 to 4, wherein the performing query processing in the preset database according to the second query code comprises:

11. The method according to any one of claims 1 to 4, wherein the preset flow table is a Flink flow table, and the Hbase database is included in the preset database.

12. A data loading device is applied to a server, and the device comprises:

13. A data loading apparatus, comprising:

a memory for storing a program;

a processor for executing the program stored by the memory, the processor being configured to perform the method of any of claims 1 to 11 when the program is executed.

14. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 11.