CN115917532A - DAG aware indexing for data stream execution engines


Info

Publication number
CN115917532A
Authority
CN
China
Prior art keywords
dag
data
plan
index
specified
Legal status
Pending
Application number
CN202080101558.2A
Other languages
Chinese (zh)
Inventor
Theodoros Gkountouvas
Ning Wu
Yong Wang
Hongliang Tang
Zhihao Tang
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Publication of CN115917532A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

By implementing Directed Acyclic Graph (DAG)-aware indexing, an efficient architecture and method for data analysis is provided. In DAG-aware indexing, an index is maintained that corresponds to the data used by the execution pipeline of a DAG plan. The index is propagated through the DAG plan in the data stream. An index cache may be implemented to store the index for use with a plurality of queries associated with the DAG plan. Variations of DAG-aware indexing may be implemented to determine whether to cache or delete an index, or to combine DAG-aware indexing with DAG-aware caching. Variations of DAG-aware indexing may include converting an original DAG plan into a DAG plan with fewer operations relative to the initial query that generated the one or more indexes stored in the index cache. DAG-aware indexing may be implemented to avoid unnecessary overhead in computation, memory, and input/output (I/O).

Description

DAG aware indexing for data stream execution engines
Technical Field
The present invention relates to data analysis, and more particularly, to a method and apparatus for a data stream execution engine.
Background
Data analysis queries typically run over large data sets with millions or billions of records. However, the results may depend on only a small portion of the original records. For example, a data set comprising records acquired by temperature sensors over a large area at a particular sampling frequency can be very large, yet the result of a query may be affected by only a small portion of the records. A typical example of such a query is as follows:
context.Load("*.csv").Project(…).Filter(lambda x: x.temp >= 100).Count()
A comma-separated values (CSV) file is a delimited text file that uses commas to separate values. The projection defines a mapping of the data to be acquired; for example, a given projection may define the fields in a data structure to be considered for a given query. Only records that include the fields specified by the given projection operation are retained for further processing in the execution pipeline of the query. Filtering limits the data to defined values or characteristics. The count is the final operation of the query performed on the data set; for example, the count operation may return a numerical count of the data set records deemed to satisfy the conditions established by the projection and the filtering. In some cases, the count operation may return both a numerical count and an identification of the record or records that satisfy the query. A data flow execution engine may use some operations (such as, but not limited to, filtering operations) to prune the initially provided or collected data by a large factor.
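As a toy illustration of these three stages, the following plain-Python sketch (the record layout and values are assumptions for the example, not the engine's actual API) shows what each operation keeps:

rows = [
    {"temp": 71.2, "site": "A1"},
    {"temp": 104.5, "site": "B7"},
    {"temp": 98.6, "site": "C3"},
]
projected = [{"temp": r["temp"]} for r in rows]        # Project: keep only the temp field
filtered = [r for r in projected if r["temp"] >= 100]  # Filter: drop records below 100
result = len(filtered)                                 # Count: 1 record survives
print(result)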
For example, the above query filters out all records with temperatures below 100°F. Only records from areas with warm climates will affect the final count, and the filtering operation may eliminate most of the initial records. Ideally, only the relevant data would be loaded, not the complete data set: the ideal solution is to fetch from the storage layer only the CSV files that affect the result. This would result in minimal storage input/output (I/O), minimal network I/O between the storage and data analysis/computation layers, and minimal computation/memory overhead.
Many data stream execution engines (e.g., Spark™) take a naive approach to handling workloads such as the exemplary query above: the compute node takes all the records and then, for each query, filters most of them out. This strategy can result in unnecessarily heavy computation, network I/O overhead, and storage I/O overhead. Some data stream execution engines attempt to reduce unnecessary computation and increase utilization of network and storage resources through Directed Acyclic Graph (DAG)-aware caching techniques. A DAG is a finite directed graph without directed cycles: it consists of a finite number of vertices and edges, where each edge points from one vertex to another, such that no sequence of edges starts at a given vertex and loops back to that vertex. A DAG may equivalently be viewed as a directed graph having a topological order, i.e., a sequence of its vertices such that each edge points from an earlier vertex to a later vertex in the sequence.
In a DAG-aware caching approach, a user manually defines the operations whose data results are to be cached. These cached results may be used by another query sharing a common DAG plan, starting from the operation whose data results were cached, thus eliminating some operations for the other query. By caching the intermediate results, the data stream execution engine avoids not only I/O overhead but also computational overhead; in this example, the projection and filtering operations are not performed a second time. While DAG-aware caching provides many advantages for tasks that run on top of the same data, the approach has related problems. Caching multiple intermediate results may be inefficient due to limited storage space. In typical DAG-aware caching, users are part of the process, manually annotating the data to be cached, and while some methods attempt to automate this process, they have drawbacks. Multiple queries can exploit the DAG-aware cache only when they work over the same common sub-path. For example, if the projections of two queries are slightly different, the DAG-aware caching approach will generally not work, even though both results are affected only by a given data block. As the data in data analysis queries increases, the functionality of the data stream execution engine should be enhanced to allow data analysis techniques to advance.
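For concreteness, a minimal sketch of manual DAG-aware caching in a Spark-style API might look as follows; the file path and column names are assumptions, and cache() marks the intermediate result that the user chose to keep:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-cache-sketch").getOrCreate()

filtered = (spark.read.option("header", True).option("inferSchema", True)
            .csv("data/*.csv")
            .select("temp", "site")      # projection
            .filter("temp >= 100"))      # filtering
filtered.cache()                         # user manually annotates the cut point

total = filtered.agg({"temp": "avg"}).collect()  # first query: aggregation
n = filtered.count()                             # second query reuses the cached result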
Disclosure of Invention
It is an object of various embodiments to provide an efficient architecture and method for data analysis. DAG plans, processed with a DAG-aware indexing procedure, provide such an architecture and method. In DAG-aware indexing, an index is maintained that corresponds to the data used by the execution pipeline of a DAG plan, rather than caching the data itself. The index is propagated through the DAG plan in the data stream. An index cache is implemented to store the index for use with a plurality of queries associated with the DAG plan. Variations of DAG-aware indexing may be implemented by a driver module that decides whether to cache or delete an index, or that combines DAG-aware indexing with DAG-aware caching. Variations of DAG-aware indexing may include converting an original DAG plan into a DAG plan with fewer operations relative to the initial query that generated the one or more indexes stored in the index cache. DAG-aware indexing can be implemented to avoid unnecessary overhead in computation, memory, and I/O; it involves no user beyond the initiation of a data query; nor does it use large blocks of memory as in other data analysis processes. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to a first aspect of the invention, a computer-implemented method of executing a data stream is provided. The method comprises the following steps: searching an index cache for a specified Directed Acyclic Graph (DAG) plan; identifying an index stored in the index cache for the specified DAG plan; loading data of a partition, wherein the partition is identified by the index; and performing the operations of the specified DAG plan on the loaded data.
In a first implementation form of the method according to the first aspect, the method further comprises: executing the query operation on the result.
In a second implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the executing the query includes: performing aggregation on data in the result or performing counting on records in the result.
In a third implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, prior to searching the index cache: a plurality of data partitions is loaded according to another query associated with the specified DAG plan; the partition is identified as having data corresponding to the specified DAG plan; and the index of the partition associated with the specified DAG plan is stored in the index cache.
In a fourth implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the identification of the partition is performed in accordance with performing an operation that eliminates records of the plurality of data partitions.
In a fifth implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, performing the operations of the specified DAG plan on the loaded data comprises: restricting execution of the operations of the specified DAG plan to those operations that affect the index space, in the index cache, associated with the specified DAG plan.
In a sixth implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, a plurality of data partitions is loaded according to another query having an associated DAG plan; the plurality of data partitions is operated on according to the associated DAG plan; data obtained by operating on the plurality of partitions is determined to satisfy a condition for storing the data in the index cache; and the data obtained by operating on the plurality of partitions is stored, together with the associated DAG plan, in the index cache in accordance with the condition being satisfied.
In an eighth implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the data resulting from the operations on the plurality of partitions and the associated DAG plan are stored in the index cache without storing an index corresponding to one or more of the plurality of data partitions.
According to a second aspect of the invention, there is provided a system for executing a data stream, the system comprising a memory storing instructions and one or more processors in communication with the memory. The one or more processors execute the instructions to: search an index cache for a specified Directed Acyclic Graph (DAG) plan; identify an index stored in the index cache for the specified DAG plan; load data of a partition, wherein the partition is identified by the index; perform the operations of the specified DAG plan on the loaded data; and provide a result of performing the operations.
In a first implementation of the system according to the second aspect, the one or more processors execute the instructions to perform the operation of the query, including performing aggregation on data in the results or performing a count of records in the results.
In a second implementation of the system according to the second aspect as such or any of the preceding implementations of the second aspect, the one or more processors execute the instructions to, prior to searching the index cache: load a plurality of data partitions in accordance with another query associated with the specified DAG plan; identify the partition as having data corresponding to the specified DAG plan; and store the index of the partition associated with the specified DAG plan in the index cache.
In a third implementation of the system according to the second aspect as such or any of the preceding implementations of the second aspect, the one or more processors execute the instructions to: perform the identification of the partition in accordance with an operation that eliminates records of the plurality of data partitions.
In a fourth implementation of the system according to the second aspect as such or any of the preceding implementations of the second aspect, the one or more processors execute the instructions to: restrict execution of the operations of the specified DAG plan to those operations that affect the index space, in the index cache, associated with the specified DAG plan.
In a fifth implementation form of the system according to the second aspect as such or any of the preceding implementation forms of the second aspect, the one or more processors execute the instructions to: load a plurality of data partitions according to another query having an associated DAG plan; operate on the plurality of data partitions according to the associated DAG plan; determine that data obtained by operating on the plurality of partitions satisfies a condition for storing the data in the index cache; and store the data obtained by operating on the plurality of partitions, together with the associated DAG plan, in the index cache in accordance with the condition being satisfied.
In a sixth implementation of the system according to the second aspect as such or any of the preceding implementations of the second aspect, the one or more processors execute the instructions to: storing data resulting from operating on the plurality of partitions with the associated DAG plan in the index cache without storing an index corresponding to one or more of the plurality of data partitions.
According to a third aspect of the invention, there is provided a non-transitory computer-readable medium storing instructions for executing a data stream, which, when executed by one or more processors, cause the one or more processors to perform operations. The operations include: searching an index cache for a specified Directed Acyclic Graph (DAG) plan; identifying an index stored in the index cache for the specified DAG plan; loading data of a partition, wherein the partition is identified by the index; performing the operations of the specified DAG plan on the loaded data; and providing results of performing the operations of the specified DAG plan.
In a first implementation form of the non-transitory computer readable medium according to the third aspect, the instructions, when executed, cause the one or more processors to: perform aggregation on data in the result or perform a count of records in the result.
In a second implementation of the non-transitory computer readable medium according to the third aspect, the instructions, when executed, cause the one or more processors to: load a plurality of data partitions in accordance with another query associated with the specified DAG plan prior to searching the index cache. The operations include: identifying the partition as having data corresponding to the specified DAG plan. The operations further include: storing the index of the partition associated with the specified DAG plan in the index cache.
In a third implementation of the non-transitory computer readable medium according to the third aspect, the instructions cause the one or more processors to: execute the operations of the specified DAG plan on the loaded data, including limiting execution of the operations of the specified DAG plan to those operations that affect the index space, in the index cache, associated with the specified DAG plan.
In a fourth implementation of the non-transitory computer readable medium according to the third aspect, the instructions cause the one or more processors to: load the plurality of data partitions according to another query having an associated DAG plan. The operations include: operating on the plurality of data partitions according to the associated DAG plan. The operations further include: determining that data obtained by operating on the plurality of partitions satisfies a condition for storing the data in the index cache; and storing the data obtained by operating on the plurality of partitions, together with the associated DAG plan, in the index cache in accordance with the condition being satisfied.
Any of the foregoing examples may be combined with any one or more of the other foregoing examples to produce new embodiments in accordance with the present invention.
Drawings
The drawings illustrate generally, by way of example, and not by way of limitation, various embodiments described herein.
FIG. 1 illustrates an example of a naive approach in which a directed acyclic graph is constructed for two different submitted queries, discussed in connection with various embodiments.
FIG. 2 illustrates an example of a directed acyclic graph-aware caching process, provided by various embodiments, using a query example similar to FIG. 1.
FIG. 3 illustrates an exemplary directed acyclic graph-aware indexing process provided by an exemplary embodiment.
FIG. 4 illustrates an example of a directed acyclic graph-aware indexing process provided by an example embodiment, in which multiple queries may use directed acyclic graph-aware indexing results without performing the same operations.
FIG. 5 is a flow diagram of features of an exemplary method for data flow of a data flow execution engine provided in an exemplary embodiment.
FIG. 6 is a flow diagram depicting features of an exemplary method for generating an index for use by a data stream execution engine, according to an exemplary embodiment.
FIG. 7 is a flowchart illustrating features of an exemplary method for a data stream execution engine to utilize index caching in accordance with an illustrative embodiment.
FIG. 8 is a block diagram of system components that implement an algorithm and perform a method for directed acyclic graph-aware indexing, according to an example embodiment.
Detailed Description
In the following description, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the embodiments, and it is to be understood that other embodiments may be utilized and that structural, logical, mechanical and electrical changes may be made. The following description of example embodiments is, therefore, not to be taken in a limiting sense.
The functions or algorithms described herein may be implemented in software in the embodiments. The software may include computer-executable instructions stored in a computer-readable medium or computer-readable storage device, such as one or more non-transitory memories or other types of hardware-based local or network storage devices. Further, the functions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Various functions may be performed in one or more modules as desired, and the described embodiments are merely illustrative. The software may be executed on a digital signal processor, an application-specific integrated circuit (ASIC), a microprocessor, or other type of processor running on a computer system, such as a personal computer, server, or other computer system, to thereby turn such computer system into a specifically programmed machine.
Non-transitory computer readable media include all types of computer readable media, including magnetic storage media, optical storage media, and solid state storage media, particularly excluding signals. It should be understood that the software may be installed in and sold with a device that processes DAG-aware indexes for a data stream execution engine as taught herein. Alternatively, the software may be acquired and loaded into such devices, including acquiring the software through optical disk media or from any form of network or distribution system, including, for example, acquiring the software from a server owned by the software author or from a server not owned but used by the software author. For example, the software may be stored on a server for distribution over a network.
FIG. 1 illustrates an example of a naive method 100 in which a DAG is constructed for submitting two different queries: one with an operation of aggregation 102 and another with a count 104. The operations referenced in FIG. 1 and other figures herein may be performed using electronic tools, such as, but not limited to, one or more processing devices and one or more storage devices storing associated instructions to perform the respective operations. The two queries of FIG. 1 have a common portion in the projection 105-2 and the filter 105-3 after loading 105-1 from sources such as filesets 101-1, 101-2, 101-3, and 101-4. Although four data sources are shown, the number of data sources may be more or less than four. Furthermore, the source may be an electronic source distributed over a network or stored locally. The source may be a partition. Partitioning is the division of a logical database, or its constituent elements, into distinct, independent parts. A database is an organized collection of data that can be stored and accessed electronically from a computer system, which can be implemented as a distributed processing system. Each partition may be distributed across multiple nodes, where local transactions are performed on the partitions. One example of a partition is a file system. In the example of FIG. 1, the partitions are the four filesets 101-1, 101-2, 101-3, and 101-4 of one or more files, where each file may have multiple records.
The steps taken to accomplish these two queries in the naive method 100 are shown in FIG. 1. The first query, with aggregation 102, proceeds along the operational path specified by steps (1), (2), (3), (4), (5), (6), and (7): a data load 105-1 of each of the filesets 101-1, 101-2, 101-3, and 101-4, a projection 105-2 of the data, a filtering 105-3 of the mapped data, and the aggregation 102 on the filtered data 106. The second query, with count 104, proceeds along the operational path specified by steps (8), (9), (10), (11), (12), (13), and (14): another data load 105-1 of each of the filesets 101-1, 101-2, 101-3, and 101-4, another projection 105-2 of the data, another filtering 105-3 of the mapped data, and the count 104 of the re-filtered data 106. For each query, load 105-1 provides data sets 103-1, 103-2, 103-3, and 103-4 from filesets 101-1, 101-2, 101-3, and 101-4, respectively. Projection 105-2 provides data sets 107-1, 107-2, 107-3, and 107-4 mapped from data sets 103-1, 103-2, 103-3, and 103-4, respectively. The data sets 107-1, 107-2, 107-3, and 107-4 are provided to the filter 105-3, which generates the filtered data 106, the filtered data 106 being the object first operated on by the aggregation 102 and then by the count 104. Since these two queries have a common portion that is duplicated in the computation, data flow, and data storage, much of the computation and use of network and storage resources is clearly unnecessary.
FIG. 2 illustrates an example of a DAG-aware caching method 200 using a query example similar to FIG. 1. In the method of FIG. 2, a user defines two queries that run on top of the same dataset. As in FIG. 1, the exemplary final operations are an aggregation for one query and a count for the second of the two queries. The user manually defines the operation results to be cached; in this example, the user selects the results of the filtering 205-3 to be cached in the data cache 210. The first time a task is performed, assuming for this example that the query has an aggregation 202, the result data 206 of the filtering operation and the DAG plan are cached in the data cache 210. The steps taken to complete the query with aggregation 202 are represented in this figure: the first query, with aggregation 202, proceeds from data load 205-1 along the operational path specified by steps (1), (2), (3), (4), (5), (6), and (7). The load 205-1 provides the data sets 203-1, 203-2, 203-3, and 203-4 from the filesets 201-1, 201-2, 201-3, and 201-4, respectively, where each fileset may have multiple records. Although four data source filesets are shown, the number of data sources may be more or less than four. The source of the load 205-1 is not limited to a set of files and may be one or more partitions of data objects. Projection 205-2 provides data sets 207-1, 207-2, 207-3, and 207-4 mapped from data sets 203-1, 203-2, 203-3, and 203-4, respectively. The data sets 207-1, 207-2, 207-3, and 207-4 are provided to the filter 205-3, which generates the filtered data 206, the filtered data 206 being the object operated on by the aggregation 202. When the filter result is computed for the first time in step 6, the data block 206 is stored in the data cache 210 along with the filter operation and its lineage (DAG plan) in step 7, and the aggregation 202 is performed in step 8. The next query finds the results of the filtering operation 205-3 in the data cache 210 (step 9). When executing the query with count 204, the data execution engine loads the cached data of the common sub-path (the cached data generated by the query with aggregation 202) and performs only the non-common path, i.e., the count operation. By caching such intermediate results, the data stream execution engine avoids not only I/O overhead but also computational overhead; in this example, the projection and filtering operations are not performed a second time. However, caching multiple intermediate results may be inefficient due to limited storage space. Furthermore, DAG-aware caching techniques typically involve users as part of the process of manually annotating the data to be cached, and multiple queries benefit from the DAG-aware cache only when they work over the same common sub-path.
In various embodiments, DAG-aware indexing techniques of a data stream execution engine may be implemented for data analysis. Index caching is implemented in the data stream execution engine rather than caching the data itself. The DAG-aware index maintains an index corresponding to the data used for the computation. Using DAG-aware indexes in data analysis provides a technique to prune large amounts of data from queries, which may improve the performance and cost of data analysis. DAG-aware indexing avoids unnecessary computation overhead, memory overhead, and I/O overhead. Beyond providing the query, the user is not involved in data analysis performed by a data stream execution engine using DAG-aware indexing. In addition, DAG-aware indexing may be implemented without using large blocks of memory, as are used in DAG-aware caching techniques. The DAG-aware indexing solution for improving data analysis performance and cost may be applied to any data flow execution platform, such as, but not limited to, Dryad, TensorFlow™, Spark™, and the like.
Implementing a DAG-aware index may include a number of operations. The DAG-aware index may include computing an index for each record when executing a load command. For example, if a query is executed against one or more partitions, the data stream execution engine may maintain or generate an identification of each object in the partition. Each partition may be distributed over a plurality of nodes, with local transactions being performed on the partition at each node; alternatively, multiple nodes may be physically co-located. One example of a partition is a file system. In another example, if the query is executed over a file system, the execution engine may retain the filename of a file in the file system as an index to the file.
The DAG-aware indexing may include computing an index corresponding to each intermediate result created by the execution pipeline of a DAG plan. The index is propagated through the DAG plan in the form of index tags in the data stream. This computation may be performed by dynamically tracking which initial files affect each intermediate result. If an operation (e.g., filtering) significantly reduces the overall index space from its input to its output, e.g., by more than a threshold percentage, the output index of that operation is stored in the index cache. If another query performs an operation for which the index cache contains the output index space corresponding to that query's DAG plan, then only the data in the sources corresponding to that output index space is loaded, and all operations from the beginning are simply redone on that reduced data. At the beginning of execution of another query, the index cache is first accessed to determine whether it contains an index of partition elements resulting from executing the DAG plan up to the final operation of the other query. If so, the operation of the data stream execution engine begins with the elements of the partition identified by the index accessed in the index cache.
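A minimal sketch of this bookkeeping follows; all names and the threshold value are assumptions for illustration rather than a definitive implementation. Each record carries the identity of the source file it was loaded from, and a filtering stage stores its output index space when it prunes enough sources:

from dataclasses import dataclass
from typing import Any, Callable

INDEX_CACHE: dict[str, set[str]] = {}  # DAG-plan lineage key -> surviving source indexes
REDUCTION_THRESHOLD = 0.5              # hypothetical: cache when half the sources are pruned

@dataclass
class Record:
    source: str   # index tag, e.g., the file the record was loaded from
    payload: Any

def filter_stage(lineage_key: str, keep: Callable[[Any], bool],
                 records: list[Record]) -> list[Record]:
    # Run one filtering operation and store its output index space in the
    # index cache when it significantly shrinks the set of source files.
    out = [r for r in records if keep(r.payload)]
    in_space = {r.source for r in records}
    out_space = {r.source for r in out}
    if len(out_space) <= (1 - REDUCTION_THRESHOLD) * len(in_space):
        INDEX_CACHE[lineage_key] = out_space  # cache the index, not the data
    return out

A later query whose lineage matches lineage_key would then load only the files named in the cached index before redoing its operations.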
FIG. 3 illustrates an embodiment of an exemplary DAG-aware indexing process 300. In this non-limiting example, two queries similar to those of FIGS. 1 and 2, with aggregation 302 and count 304, are considered, where the partition source is again taken to be a collection of one or more files. When executing the aggregation query, immediately prior to the aggregation operation, one or more files are identified based on the results of the data stream execution engine operating on the DAG plan, and their filenames are stored as an index in the index cache 315 along with the filter operation and its lineage (DAG plan). Subsequent queries associated with the DAG plan may use the stored index.
The steps taken to complete the query with aggregation 302 are represented in FIG. 3: the first query, with aggregation 302, proceeds from data load 305-11 along the operational path specified by steps (1), (2), (3), (4), (5), (6), (7), and (8). The load 305-11 provides the data sets 303-1, 303-2, 303-3, and 303-4 from the filesets 301-1, 301-2, 301-3, and 301-4, respectively, each of which may have multiple records. Although four data source filesets are shown, the number of data sources may be more or less than four. The source of the load 305-11 is not limited to a set of files and may be one or more partitions of data objects. The projection 305-21 provides data sets 307-1, 307-2, 307-3, and 307-4 mapped from data sets 303-1, 303-2, 303-3, and 303-4, respectively. The data sets 307-1, 307-2, 307-3, and 307-4 are provided to the filter 305-31, which generates the filtered data 306, the filtered data 306 being the object operated on by the aggregation 302. When the filtering result is computed for the first time in step 6, the filename of the determined file is stored as its index in the index cache 315, together with the filtering operation and its lineage (DAG plan), in step 7, and the aggregation 302 is performed in step 8.
For a query received at the data stream execution engine from some entity, such as a user entity, the data stream execution engine may execute stored instructions to access the index cache 315. If the query is received after execution of an initial query having a DAG plan associated with the later query, accessing the index cache 315 may retrieve from memory the index of the elements of the one or more partitions produced by the operations of the initial query. For the example of FIG. 3, the aggregation query is the initial query and the count query is the subsequent query, where the elements of the one or more partitions are files in a set of files whose filenames are the stored indices. After finding the stored index or indices, the data stream execution engine loads data, in step 9, only from the file or files having the stored index or indices, respectively. In the example of FIG. 3, in steps 10 to 16, all subsequent operations relate only to the data in the one or more files having the one or more stored indices.
The steps taken to complete the query with count 304 are represented in FIG. 3: the subsequent query, with count 304, proceeds from data load 305-12 along the operational path specified by steps (9), (10), (11), (12), (13), (14), (15), and (16). In the example of FIG. 3, initial execution of the operations of the aggregation query results in a single index identifying a single file. In step 9, shown for the count query, the index is found in the index cache 315 and the single file is identified as file 311. The load 305-12 provides a data set 313 from the file 311. The projection 305-22 provides a data set 317 mapped from data set 313. The data set 317 is provided to the filter 305-32, which generates the filtered data 316, the filtered data 316 being the object operated on by the count 304. The load 305-12, projection 305-22, and filter 305-32 may be implemented as the load 305-11, projection 305-21, and filter 305-31 operating on a reduced data set identified by an index stored in the index cache 315.
When comparing the DAG-aware cache of FIG. 2 with the DAG-aware index of FIG. 3, the DAG-aware index does not completely eliminate disk and network I/O overhead on hits. DAG-aware indexing also does not eliminate the computation of the projection and filtering operations, which are performed again in subsequent queries, although over significantly less data than in the approach without DAG-aware caching. The index cache of the DAG-aware index does not store data; it holds only the index. Effectively, the DAG-aware index may therefore have a greater hit rate than a DAG-aware cache of similar size. Moreover, the DAG-aware index automatically determines the index to cache and does not use user annotations or intervention.
Using DAG-aware indexing may reduce disk and network I/O overhead as well as computational and memory resources, providing lower cost compared to the naive approach. Using a DAG-aware index may reduce the execution time of queries involving previously computed operations that significantly reduced the index space associated with the DAG-aware index. Furthermore, DAG-aware indexing does not involve the user beyond initiating a query. In addition, DAG-aware indexing does not rely on a large amount of memory resources for caching, in contrast to DAG-aware caching.
Features that may be used to compare data analysis techniques include network I/O, disk I/O, computation, memory, and user input, where the user input considered is that required to perform the data analysis technique rather than to request the data. The naive method consumes relatively much network I/O, disk I/O, and computation, but can be implemented with little or no memory usage or user input. The DAG-aware caching method consumes relatively more network I/O, disk I/O, and computation when an operation results in a miss, but relatively less when an operation results in a hit; however, implementing the DAG-aware caching method results in relatively large memory usage and user input. The DAG-aware indexing method likewise consumes relatively more network I/O, disk I/O, and computation on a miss and relatively less on a hit, while its implementation results in relatively little memory usage and no user input.
In various embodiments, DAG-aware indexing may be combined with DAG-aware caching. A driver module, having stored instructions executable by a processor, may be implemented that combines DAG-aware indexing with DAG-aware caching. This combination may provide better results in many data analysis instances. If the data size is relatively small, the data produced by executing some DAG plan may be cached directly, rather than caching the index. The benefit of this approach in such an instance is that disk I/O, network I/O, and computational overhead can be eliminated altogether or significantly reduced. If, at some point during execution of the query pipeline, i.e., the DAG plan, the total index space is greater than a threshold percentage of the total size of the data the index refers to, the index may be discarded and the data cached directly. A driver module having stored instructions executable by a processor may be implemented to decide whether to cache or discard an index. The driver module may have instructions to analyze the data sizes involved and adjust the threshold percentage used. This approach combines the DAG-aware indexing and DAG-aware caching optimization approaches.
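A sketch of such a driver decision follows; the function name and the default threshold are assumptions for illustration:

def choose_cache_entry(index_size: int, data_size: int,
                       threshold_pct: float = 0.10) -> str:
    # Return "data" when the index space has grown past threshold_pct of the
    # data it refers to (cache the data directly); otherwise return "index".
    if index_size > threshold_pct * data_size:
        return "data"   # large index: discard it and cache the data itself
    return "index"      # small index: keep the index, reload data on demand

print(choose_cache_entry(12_000, 1_000_000))   # -> "index"
print(choose_cache_entry(400_000, 1_000_000))  # -> "data"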
FIG. 4 illustrates an embodiment of an example of a DAG-aware indexing process 400 in which multiple queries may use DAG-aware indexing results without performing the same operations. There are operations that alter the data but do not have any effect on the index maintained by the index cache, such as, but not limited to, projection and classification. Thus, the index cache may maintain a reduced lineage for operations, and two seemingly different queries may benefit from the same cached index, as the sketch following this paragraph illustrates. In the example of FIG. 4, the projection operation is eliminated from the count query, which is a subsequent query to the initial aggregation query. Elimination of the projection operation does not alter the result. After eliminating the projection operation, the count query does not benefit from DAG-aware caching, but it can still benefit from DAG-aware indexing.
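The following sketch shows one way such lineage reduction could work; the operation names and plan encoding are assumptions for illustration. Operations that cannot change which source partitions survive are dropped before the lineage is used as a cache key:

INDEX_NEUTRAL_OPS = {"project", "classify"}  # alter data, never the index space

def reduced_lineage(plan: list[tuple[str, str]]) -> str:
    # plan is a list of (operation, arguments) pairs; only operations that
    # can change which source partitions survive contribute to the key.
    return " -> ".join(f"{op}({args})" for op, args in plan
                       if op not in INDEX_NEUTRAL_OPS)

# The aggregation and count plans below reduce to the same key, so the count
# query can reuse the index stored by the aggregation query:
agg_plan = [("load", "*.csv"), ("project", "temp"), ("filter", "temp >= 100")]
cnt_plan = [("load", "*.csv"), ("filter", "temp >= 100")]
assert reduced_lineage(agg_plan) == reduced_lineage(cnt_plan)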
The steps taken to complete the first query, with the aggregation 402, are represented in FIG. 4: the first query, with aggregation 402, proceeds from data load 405-11 along the operational path specified by steps (1), (2), (3), (4), (5), (6), (7), and (8). The load 405-11 provides data sets 403-1, 403-2, 403-3, and 403-4 from filesets 401-1, 401-2, 401-3, and 401-4, respectively, each of which may have multiple records. Although four data source filesets are shown, the number of data sources may be more or less than four. The source of the load 405-11 is not limited to a fileset and may be one or more partitions of data objects. The projection 405-21 provides data sets 407-1, 407-2, 407-3, and 407-4 mapped from data sets 403-1, 403-2, 403-3, and 403-4, respectively. Data sets 407-1, 407-2, 407-3, and 407-4 are provided to the filter 405-31, which generates the filtered data 406, the filtered data 406 being the object operated on by the aggregation 402. When the filtering result is computed for the first time in step 6, the filename of the determined file is stored as its index in the index cache 415, together with the filtering operation and its lineage (DAG plan), in step 7, and the aggregation 402 is performed in step 8.
For the example of FIG. 4, the aggregation query is the initial query and the count query is a subsequent query. After finding the stored index or indices, the data stream execution engine loads data, in step 9, only from the file or files having the stored index or indices, respectively. In the example of FIG. 4, in steps 10 to 14, all subsequent operations relate only to the data in the one or more files having the one or more stored indices.
The steps taken to complete the query with count 404 are represented in FIG. 4: the subsequent query, with count 404, proceeds from data load 405-12 along the operational path specified by steps (9), (10), (11), (12), (13), and (14). In the example of FIG. 4, initial execution of the operations of the aggregation query results in a single index identifying a single file. In step 9 of the count query, the index is accessed from the index cache 415 and the single file is identified as file 411. The load 405-12 provides a data set 413 from the file 411. Since no projection operation is used in the count query, which is a subsequent query to the initial query, the data set 413, identified via file 411, is operated on directly by the filter 405-32, which generates the filtered data 416, the filtered data 416 being the object operated on by the count 404. The load 405-12 and filter 405-32 may be implemented as the load 405-11 and filter 405-31 operating on a reduced data set identified by an index stored in the index cache 415. As described above, this approach can remove operations on the data without affecting the result, making DAG-aware indexing suitable for more situations than DAG-aware caching. This approach provides for converting the original DAG plan into a simplified DAG plan for DAG-aware indexing.
FIG. 5 is a flow diagram of features of an embodiment of an exemplary method 500 of data flow for a data flow execution engine. Method 500 may be implemented by one or more processors executing instructions stored in one or more memory storage devices. At 510, an index cache is searched for a specified DAG plan. At 520, an index stored in the index cache is identified for the specified DAG plan. At 530, data of a partition is loaded, wherein the partition is identified by the index. At 540, the operations of the specified DAG plan are performed on the loaded data. At 550, the results of performing the operations are provided. Based on the provided results, the data stream execution engine may perform a query operation on the results. Performing a query operation may include performing aggregation on data in the results or performing a count of records in the results.
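One way to picture the control flow of method 500 is sketched below; every helper is passed in as an assumed callable, since the specific interfaces are not fixed here:

def execute_with_index(plan, final_op, index_cache, all_partitions,
                       load_partition, apply_op):
    key = str(plan)                                    # lineage key for lookup
    sources = index_cache.get(key, all_partitions())   # 510/520: search, identify
    records = [r for s in sources for r in load_partition(s)]  # 530: load
    for op in plan:                                    # 540: redo the DAG plan
        records = apply_op(op, records)
    return final_op(records)                           # 550: provide the result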
The method 500, or variations similar to the method 500, may include many different embodiments that may be combined depending on the application of such methods and/or the architecture of the system implementing such methods. These methods may include: prior to searching the index cache, loading a plurality of data partitions according to another query associated with the specified DAG plan; identifying the partition as having data corresponding to the specified DAG plan; and storing the index of the partition associated with the specified DAG plan in the index cache. The identification of the partition is performed in accordance with performing an operation that eliminates records of the plurality of data partitions. The stored index may serve as a starting point for executing other requested data streams corresponding to the specified DAG plan. Differences between requests using the same stored index may be realized through differences in the final operation that operates on the results provided by executing the specified DAG plan prior to the final operation of the request.
Variations of method 500 or methods similar to method 500 may include: performing the operations of the specified DAG plan on the loaded data identified by the index, limiting performance of the operations of the specified DAG plan to those operations affecting the index space, in the index cache, associated with the specified DAG plan. This may reduce the number of operations needed to execute the data stream using the identified index.
Variations of method 500 or methods similar to method 500 may include: loading a plurality of data partitions in accordance with another query having an associated DAG plan; operating on the plurality of data partitions according to the associated DAG plan; determining that data obtained by operating on the plurality of partitions satisfies a condition for storing the data in an index cache; and storing the data obtained by operating on the plurality of partitions, together with the associated DAG plan, in the index cache in accordance with the condition being satisfied. The condition may be a threshold amount on the data size corresponding to operating on the multiple partitions; for example, the condition for storing the data may be that the data resulting from operating on the plurality of partitions is less than the threshold amount. The data satisfying the condition may be stored with the associated DAG plan in the index cache without storing an index corresponding to one or more of the plurality of data partitions. Upon a hit in the index cache for the associated DAG plan, subsequent execution of a request may return the stored data without using an index. Variations of the method 500 may include features of other methods and processes taught herein.
In various embodiments, a non-transitory machine-readable storage device, such as a computer-readable non-transitory medium, may include instructions stored thereon, which when executed by a machine, cause the machine to perform operations including one or more features similar or identical to features of the methods and techniques described with respect to method 500 and variations thereof, and/or features of other methods taught herein in connection with fig. 1-8, among others. The physical structure of these instructions may be operated on by one or more processors. Execution of these physical structures may cause a machine to perform operations. A non-transitory computer-readable medium storing computer instructions for executing a data stream, the instructions, when executed by one or more processors, cause the one or more processors to perform operations comprising: searching an index cache for a specified Directed Acyclic Graph (DAG) plan; identifying an index stored in an index cache for the specified DAG plan; loading data of a partition, wherein the partition is identified by the index; performing an operation specifying a DAG plan on the loaded data; the result of the execution operation is provided. The instructions may include a number of operations, such as performing the operations listed in the query after a DAG plan included in the query. The operations may include performing an aggregation of data in the results of the execution of the operations that specify the DAG plan, or performing a count of records in the results of the execution of the operations that specify the DAG plan. Other operations performed may include join operations or other operations that may be performed on a set of records.
Prior to searching the index cache, the instructions may cause the one or more processors to: loading a plurality of data partitions according to another query associated with the specified DAG plan; identifying that the partition has data corresponding to the specified DAG plan; the index of the partition associated with the specified DAG plan is stored in an index cache.
Variations of the instructions may include many different embodiments, which may be combined depending on the application of the instructions and/or the architecture of the system implementing the instructions. The instructions may direct the one or more processors to perform the operations of the specified DAG plan on the loaded data, including: limiting the execution of the operations of the specified DAG plan to those operations that affect the index space, in the index cache, associated with the specified DAG plan.
Variations of the instructions may include: loading a plurality of data partitions according to another query having an associated DAG plan; operating the plurality of data partitions according to the associated DAG plan; determining that data obtained by operating a plurality of partitions meets the condition of storing the data in an index cache; and storing the data obtained by operating the plurality of partitions and the associated DAG plan in an index cache according to the satisfied condition.
FIG. 6 is a flow diagram of features of an embodiment of an exemplary method 600 of generating an index for use by a data stream execution engine. Method 600 may be implemented by one or more processors executing instructions stored in one or more memory storage devices. At 610, a plurality of data partitions is loaded according to a query associated with a specified DAG plan. At 620, a partition is identified as having data corresponding to the specified DAG plan. At 630, the index of the partition associated with the specified DAG plan is stored in an index cache. The index associated with the specified DAG plan may be used for subsequent requests associated with the specified DAG plan, such as discussed with respect to method 500.
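A compact sketch of method 600's flow follows; the helper callables and names are assumptions:

def build_index(plan_key, partitions, load_partition, qualifies, index_cache):
    # 610: load each partition; 620: identify the partitions whose records
    # survive the plan's pruning; 630: store their identifiers under the plan.
    surviving = {p for p in partitions
                 if any(qualifies(rec) for rec in load_partition(p))}
    index_cache[plan_key] = surviving
    return surviving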
In various embodiments, a non-transitory machine-readable storage device, such as a computer-readable non-transitory medium, may include instructions stored thereon, which when executed by a machine, cause the machine to perform operations including one or more features similar or identical to features of the methods and techniques described with respect to method 600 and variations thereof, and/or other methods taught herein in connection with fig. 1-8. The physical structure of these instructions may be operated on by one or more processors. Execution of these physical structures may cause a machine to perform operations. A non-transitory computer-readable medium storing computer instructions for generating an index, the instructions, when executed by one or more processors, cause the one or more processors to perform operations comprising: loading a plurality of data partitions according to a query associated with a specified DAG plan; identifying that the partition has data corresponding to the specified DAG plan; the index of the partition associated with the specified DAG plan is stored in an index cache. The index associated with the specified DAG plan may be used for subsequent requests associated with the specified DAG plan, such as discussed with respect to method 500 or instructions for performing operations associated with method 500.
FIG. 7 is a flow diagram of features of an embodiment of an exemplary method 700 for a data stream execution engine to use index caching. Method 700 may be implemented by one or more processors executing instructions stored in one or more memory storage devices. At 710, a plurality of data partitions are loaded according to a query having an associated DAG plan. At 720, the plurality of data partitions are operated on according to the associated DAG plan. At 730, it is determined that data resulting from operating on the plurality of partitions satisfies a condition for storing the data in the index cache. At 740, the data resulting from operating on the plurality of partitions and the associated DAG plan are stored in an index cache according to a condition being satisfied. Upon reaching the index cache of the associated DAG plan, subsequent execution of the request may return the stored data without using the index.
In various embodiments, a non-transitory machine-readable storage device, such as a computer-readable non-transitory medium, may include instructions stored thereon that when executed by a machine, cause the machine to perform operations that include one or more features that are similar or identical to features of the methods and techniques described with respect to method 700 and variations thereof, and/or features of other methods as taught herein in connection with fig. 1-8, among others. The physical structure of these instructions may be operated on by one or more processors. Execution of these physical structures may cause a machine to perform operations. A non-transitory computer-readable medium storing computer instructions for using index caching in data stream execution, which when executed by one or more processors, cause the one or more processors to: loading a plurality of data partitions according to a query having an associated DAG plan; operating the plurality of data partitions according to the associated DAG plan; determining that data obtained by operating a plurality of partitions meets the condition of storing the data in an index cache; and storing the data obtained by operating the plurality of partitions and the associated DAG plan in an index cache according to the satisfied condition. Upon reaching the index cache of the associated DAG plan, subsequent execution of the request may return the stored data without using the index.
FIG. 8 is a block diagram of components of a system 800 that implements algorithms and performs methods structured for data stream execution using DAG-aware indexing. System 800 may include one or more processors 870 that may execute stored instructions to perform DAG-aware indexing in data analysis from one or more partitions 885. The DAG-aware indexing may be implemented by a data stream engine, which may be implemented as one or more modules using index cache 880. System 800 may perform DAG-aware indexing as taught herein.
The system 800, with one or more storage devices, may operate as a standalone system or may be networked to other systems. In a networked deployment, the system 800 may operate in the capacity of a server machine, a client machine, or both, in server-client network environments. In one example, the system 800 may act as a peer machine in a peer-to-peer (P2P) (or other distributed) network environment. System 800 may be a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile phone, a network appliance, an IoT device, an automotive system, or any system capable of executing instructions (sequential or otherwise) that specify actions to be taken by that system. Further, while only a single system is illustrated, the term "system" shall also be taken to include any collection of systems, such as cloud computing, software as a service (SaaS), or other computer cluster configurations, that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. Exemplary system 800 may be arranged to operate with one or more storage devices to implement DAG-aware indexing as taught herein.
Examples as described herein may include, or may operate by, logic, components, devices, packages, or mechanisms. Circuitry is a collection (e.g., a set) of circuits implemented in tangible entities that include hardware (e.g., simple circuits, gates, logic, etc.). Circuitry membership may be flexible over time and with underlying hardware variability. Circuitries include members that, alone or in combination, may perform specified tasks when operating. In one example, the hardware of the circuitry may be immutably designed to carry out a specific operation (e.g., hardwired). In one example, the hardware of the circuitry may include variably connected physical components (e.g., execution units, transistors, simple circuits, etc.) including a computer-readable medium physically modified (e.g., magnetically, electrically, by movable placement of invariant massed particles, etc.) to encode instructions of a specific operation. In connecting the physical components, the underlying electrical properties of a hardware constituent are changed, for example, from an insulator to a conductor or vice versa. The instructions enable participating hardware (e.g., execution units or a loading mechanism) to create members of the circuitry in hardware via the variable connections to carry out portions of specific tasks when in operation. Accordingly, the computer-readable medium is communicatively coupled to the other components of the circuitry when the device is operating. In one example, any of the physical components may be used in more than one member of more than one circuitry. For example, under operation, an execution unit may be used in a first circuit of a first circuitry at one point in time and reused by a second circuit in the first circuitry, or by a third circuit in a second circuitry, at a different time.
System (e.g., computer system or distributed computing system) 800 may include one or more hardware processors 870 (e.g., CPUs, graphics processing units (GPUs), hardware processor cores, or any combination thereof), a main memory 873, and a static memory 875, some or all of which may communicate with each other via a communication link 879. Communication link (e.g., bus) 879 may be implemented as a bus, a local link, a network, another communication path, or a combination thereof. The system 800 may also include a display device 881, an alphanumeric input device 882 (e.g., a keyboard), and a user interface (UI) navigation device 883 (e.g., a mouse). In one example, the display device 881, the alphanumeric input device 882, and the UI navigation device 883 may be a touch screen display. The system 800 may include an output controller 884, such as a serial (e.g., universal serial bus (USB)), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection to communicate with or control one or more peripheral devices (e.g., a printer, card reader, etc.).
System 800 may include a machine-readable medium 877 on which are stored one or more sets of data structures or instructions 878 (e.g., software or data) embodying or used by system 800 to perform any one or more of the techniques or functions for which system 800 is designed, including DAG-aware indexing. Instructions 878 or other data stored on the machine-readable medium 877 may be accessed by the main memory 873 for use by the one or more processors 870. The instructions 878 can also reside, completely or at least partially, as instructions 874 within the main memory 873, as instructions 876 within the static memory 875, or as instructions 872 within the one or more hardware processors 870.
While the machine-readable medium 877 is shown as a single medium, the term "machine-readable medium" can include one or more media (e.g., a centralized or distributed database, or associated caches and servers) that store the instructions 878 or data. The term "machine-readable medium" can include any medium that can store, encode, or carry instructions for execution by system 800 and that cause system 800 to perform any one or more of the techniques for which system 800 is designed, or that can store, encode, or carry data structures used by or associated with such instructions. Non-limiting examples of machine-readable media may include solid-state memories and optical and magnetic media. In one example, a mass machine-readable medium comprises a machine-readable medium having a plurality of particles with invariant (e.g., rest) mass. Accordingly, mass machine-readable media are not transitory propagating signals. Specific examples of mass machine-readable media may include: non-volatile memory, such as semiconductor memory devices (e.g., EPROM, EEPROM) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and compact disc read-only memory (CD-ROM) and digital versatile disc read-only memory (DVD-ROM) disks.
Data from or stored in the machine-readable medium 877 or the main memory 873 may be transmitted or received over a communication network using a transmission medium via the network interface device 890, which utilizes any one of a number of transfer protocols (e.g., frame relay, internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), hypertext transfer protocol (HTTP), etc.). Exemplary communication networks can include local area networks (LANs), wide area networks (WANs), packet data networks (e.g., the Internet), mobile telephone networks (e.g., cellular networks), plain old telephone service (POTS) networks, and wireless data networks (e.g., the Institute of Electrical and Electronics Engineers (IEEE) 802.11 series of standards known as Wi-Fi®, the IEEE 802.16 series of standards known as WiMax®, the IEEE 802.15.4 series of standards, peer-to-peer (P2P) networks, among others). In one example, the network interface device 890 may include one or more physical jacks (e.g., Ethernet, coaxial, or telephone jacks) or one or more antennas to connect to a communication network. In one example, the network interface device 890 may include multiple antennas to communicate wirelessly using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques. The term "transmission medium" shall be taken to include any intangible medium that is capable of carrying instructions for execution by system 800, and includes digital or analog communications signals or other intangible media to facilitate communication of such software.
Index cache 880 provides a data store for DAG aware indexing according to various embodiments discussed herein. Index cache 880 may be located in the server's allocated memory. The contents of index cache 880 in the server may be accessed by a remote server, for example, using communication link 879 and network interface device 890. Index cache 880 may be distributed as a memory allocation in machine-readable medium 877, main memory 873, or other data store of system 800. The components of system 800 may be distributed in a similar manner.
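As a concrete illustration of one way such a cache could be organized, the following minimal sketch keys cached partition indexes by a canonical fingerprint of a DAG plan, so that a later query compiling to an equivalent plan finds the same entry. All names here (IndexCache, plan_key, the JSON-serializable plan representation) are illustrative assumptions, not elements prescribed by the embodiments described herein:

import hashlib
import json
from typing import Optional

class IndexCache:
    """Maps a canonical DAG-plan fingerprint to the partition ids it touches."""

    def __init__(self) -> None:
        self._entries: dict[str, set[int]] = {}

    @staticmethod
    def plan_key(dag_plan: list) -> str:
        # Deterministic serialization so equivalent plans from later
        # queries hash to the same cache key (assumes the plan is
        # representable as JSON-serializable operation descriptions).
        canonical = json.dumps(dag_plan, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def lookup(self, dag_plan: list) -> Optional[set]:
        return self._entries.get(self.plan_key(dag_plan))

    def store(self, dag_plan: list, partition_ids: set) -> None:
        self._entries[self.plan_key(dag_plan)] = partition_ids

Keying by a deterministic serialization of the plan is one design choice; any scheme that maps equivalent DAG plans to the same cache entry would serve the same purpose.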
The components of the illustrative apparatus, systems, and methods used in accordance with the described embodiments may be implemented at least partially in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. For example, these components may be implemented as a computer program product (e.g., a computer program, program code, or computer instructions) tangibly embodied in a machine-readable storage device for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers).
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but, in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other similar configuration.
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The elements of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including, by way of example, semiconductor memory devices, such as electrically programmable read-only memory (EPROM), electrically erasable programmable ROM (EEPROM), and flash memory devices, and data storage disks (e.g., magnetic disks, such as internal hard disks or removable disks, magneto-optical disks, and CD-ROM and DVD-ROM disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
As used herein, a "machine-readable medium" (or "computer-readable medium") refers to a device capable of storing instructions and data, either temporarily or permanently, and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of memory (e.g., electrically erasable programmable read-only memory (EEPROM)), and/or any suitable combination thereof. The term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) that are capable of storing the processor instructions. The term "machine-readable medium" shall also be taken to include any medium (or combination of media) that is capable of storing instructions for execution by one or more processors and that, when executed by the one or more processors, cause performance of any one or more of the methodologies described herein. Accordingly, "machine-readable medium" refers to a single storage apparatus or device, as well as "cloud"-based storage systems or storage networks comprising a plurality of storage apparatuses or devices. The term "machine-readable medium" as used herein does not include a signal per se.
In various embodiments, a system may be implemented to operate on a data stream using DAG-aware indexes. Such a system may include a memory storing instructions and one or more processors in communication with the memory. The one or more processors may execute the instructions to: search an index cache for a specified DAG plan; identify an index stored in the index cache for the specified DAG plan; load data of a partition, where the partition is identified by the index; perform the operations of the specified DAG plan on the loaded data; and provide a result of performing the operations. The one or more processors may execute the instructions to perform operations of a query, including performing an aggregation on data in the result or performing a count of records in the result. Prior to searching the index cache, the one or more processors may execute the instructions to load a plurality of data partitions according to another query associated with the specified DAG plan. The one or more processors can execute the instructions to identify that a partition has data corresponding to the specified DAG plan and to store an index of the partition, related to the specified DAG plan, in the index cache. The one or more processors may perform the identification of the partition in accordance with operations that eliminate records from the plurality of data partitions.
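The lookup path just described may be sketched as follows, under the assumption that the plan is a sequence of operations and that load_partition and apply_operation are helpers supplied by the execution engine; this is an illustrative sketch, not a definitive implementation of the embodiments:

def execute_with_index(dag_plan, index_cache, load_partition, apply_operation):
    # Search the index cache for the specified DAG plan.
    partition_ids = index_cache.lookup(dag_plan)
    if partition_ids is None:
        # No index yet for this plan; a full scan (not shown) would run
        # instead and could populate the cache for later queries.
        raise LookupError("no index cached for this DAG plan")
    # Load only the data of the partitions identified by the index.
    records = []
    for pid in sorted(partition_ids):
        records.extend(load_partition(pid))
    # Perform the operations of the specified DAG plan on the loaded data.
    for op in dag_plan:
        records = apply_operation(op, records)
    # Provide the result of performing the operations.
    return records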
Variations of such systems or similar systems may include a number of different embodiments, which may be combined depending on the application of such systems and/or the architecture in which such systems are implemented. Such a system may include one or more processors that execute instructions to restrict execution of the operations of the specified DAG plan to those operations that affect the index space, in the index cache, associated with the specified DAG plan.
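A minimal sketch of such a restriction, assuming each operation carries an affects_index marker (a convention of this sketch only, not of the embodiments), could filter the plan before execution:

def reduce_plan(dag_plan):
    # Keep only the operations marked as touching the index space; the
    # "affects_index" flag is an illustrative assumption.
    return [op for op in dag_plan if op.get("affects_index", False)]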
Variations of such a system may include one or more processors executing instructions to: load a plurality of data partitions according to another query having an associated DAG plan; operate on the plurality of data partitions according to the associated DAG plan; determine that the data resulting from operating on the plurality of partitions satisfies a condition for storing the data in the index cache; and store the data resulting from operating on the plurality of partitions, together with the associated DAG plan, in the index cache in accordance with the satisfied condition. The one or more processors may execute the instructions to store the data resulting from operating on the plurality of partitions, with the associated DAG plan, in the index cache without storing an index corresponding to one or more of the plurality of data partitions.
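The storage condition may be sketched as a simple admission test, here assuming the condition is a size budget and reusing the plan fingerprint from the earlier IndexCache sketch; the threshold value and the size_of helper are illustrative assumptions:

RESULT_SIZE_THRESHOLD = 64 * 1024 * 1024  # hypothetical budget, in bytes

def maybe_cache_result(dag_plan, result_records, result_entries, size_of):
    # result_entries: plain dict of plan fingerprint -> materialized result;
    # size_of: assumed helper estimating the result's footprint in bytes.
    if size_of(result_records) < RESULT_SIZE_THRESHOLD:
        # Condition met: store the result with its plan, without storing
        # per-partition indexes.
        result_entries[IndexCache.plan_key(dag_plan)] = result_records
        return True
    return False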
In various embodiments, a system may be implemented to perform data stream execution. Such a system may include: a module to search an index cache for a specified DAG plan; a module to identify an index stored in the index cache for the specified DAG plan; a module to load data of a partition, where the partition is identified by the index; a module to perform the operations of the specified DAG plan on the loaded data; and a module to provide results of performing the operations. Such a system may include a module to perform operations of a query, including performing an aggregation on data in the results or performing a count of records in the results. Other operations listed in the query may be performed by the module that performs the query operations.
Such a system may include a module that generates an index. The module that generates the index may be operable to: load a plurality of data partitions according to another query associated with the specified DAG plan, prior to searching the index cache; identify that a partition has data corresponding to the specified DAG plan; and store the index of the partition, associated with the specified DAG plan, in the index cache. The module that generates the index may be configured to perform the identification of the partition based on operations that eliminate records from the plurality of data partitions.
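Index generation under an initial query may be sketched as follows, assuming the record-eliminating operations of the plan (e.g., filters) are applied partition by partition and any partition with surviving records is recorded in the index; the partitions mapping and helper names are illustrative assumptions:

def build_index(dag_plan, partitions, index_cache, apply_operation):
    # partitions: assumed mapping of partition id -> list of records.
    surviving = set()
    for pid, records in partitions.items():
        for op in dag_plan:  # record-eliminating operations, e.g., filters
            records = apply_operation(op, records)
        if records:  # the partition has data corresponding to the plan
            surviving.add(pid)
    # Store the index of the contributing partitions for the specified plan.
    index_cache.store(dag_plan, surviving)
    return surviving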
Such a system can include a module that restricts execution of the operations of the specified DAG plan to those operations that affect the index space, in the index cache, associated with the specified DAG plan.
Such a system may include a module that controls the storage of entities in the index cache. The module that controls storage in the index cache may be configured to: load a plurality of data partitions according to another query having an associated DAG plan; operate on the plurality of data partitions according to the associated DAG plan; determine that the data resulting from operating on the plurality of partitions satisfies a condition for storing the data in the index cache; and store the data resulting from operating on the plurality of partitions, together with the associated DAG plan, in the index cache in accordance with the satisfied condition. The data satisfying the condition may be stored with the associated DAG plan in the index cache without storing an index corresponding to one or more of the plurality of data partitions. The condition may include the data size of the data resulting from operating on the plurality of partitions being less than a threshold amount of data storable in the index cache.
Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. Various embodiments use permutations and/or combinations of the embodiments described herein. It is to be understood that the above description is intended to be illustrative, and not restrictive, and that the phraseology or terminology employed herein is for the purpose of description. Combinations of the above embodiments and other embodiments will be apparent to those of skill in the art upon studying the above description.

Claims (20)

1. A computer-implemented method for executing a data stream, the computer-implemented method comprising:
searching an index cache for a specified Directed Acyclic Graph (DAG) plan;
identifying an index stored in the index cache for the specified DAG plan;
loading data of a partition, wherein the partition is identified by the index;
performing the operations of the specified DAG plan on the loaded data; and
providing a result of performing the operation.
2. The computer-implemented method of claim 1, comprising performing operations of a query on the result.
3. The computer-implemented method of claim 2, wherein performing the operations of the query comprises: performing an aggregation on data in the result or performing a count of records in the result.
4. The computer-implemented method of any of claims 1 to 3, wherein prior to searching the index cache, the method comprises:
loading a plurality of data partitions in accordance with another query associated with the specified DAG plan;
identifying that the partition has data corresponding to the specified DAG plan;
storing the index of the partition associated with the specified DAG plan in the index cache.
5. The computer-implemented method of claim 4, wherein the method comprises: performing the identification of the partition in accordance with an operation that eliminates records of the plurality of data partitions.
6. The computer-implemented method of any of claims 1 to 5, wherein performing the operations of the specified DAG plan on the loaded data comprises: restricting execution of the operations of the specified DAG plan to those operations that affect an index space, in the index cache, associated with the specified DAG plan.
7. The computer-implemented method of any of claims 1 to 3, wherein the method comprises:
loading a plurality of data partitions according to another query having an associated DAG plan;
operating on the plurality of data partitions according to the associated DAG plan;
determining that data obtained by operating on the plurality of partitions meets a condition for storing the data in the index cache; and
storing the data obtained by operating on the plurality of partitions, together with the associated DAG plan, in the index cache in accordance with the met condition.
8. The computer-implemented method of claim 7, wherein the method comprises: storing the data resulting from operating on the plurality of partitions, together with the associated DAG plan, in the index cache without storing an index corresponding to one or more of the plurality of data partitions.
9. A system for executing a data stream, the system comprising:
a memory to store instructions;
one or more processors in communication with the memory, wherein the one or more processors execute the instructions to:
searching an index cache for a specified Directed Acyclic Graph (DAG) plan;
identifying an index stored in the index cache for the specified DAG plan;
loading data of a partition, wherein the partition is identified by the index;
performing the operations of the specified DAG plan on the loaded data; and
providing a result of performing the operation.
10. The system of claim 9, wherein the one or more processors execute the instructions to perform operations of a query, including performing an aggregation on data in the results or performing a count of records in the results.
11. The system of claim 9 or 10, wherein prior to searching the index cache, the one or more processors execute the instructions to:
loading a plurality of data partitions in accordance with another query associated with the specified DAG plan;
identifying that the partition has data corresponding to the specified DAG plan;
storing the index of the partition associated with the specified DAG plan in the index cache.
12. The system of claim 11, wherein the one or more processors execute the instructions to: perform the identification of the partition in accordance with an operation that eliminates records of the plurality of data partitions.
13. The system of any one of claims 9 to 12, wherein the one or more processors execute the instructions to: limit execution of the operations of the specified DAG plan to those operations that affect an index space, in the index cache, associated with the specified DAG plan.
14. The system of claim 9 or 10, wherein the one or more processors execute the instructions to:
loading a plurality of data partitions in accordance with another query having an associated DAG plan;
operating on the plurality of data partitions according to the associated DAG plan;
determining that data obtained by operating on the plurality of partitions meets a condition for storing the data in the index cache; and
storing the data obtained by operating on the plurality of partitions, together with the associated DAG plan, in the index cache in accordance with the met condition.
15. The system of claim 14, wherein the one or more processors execute the instructions to: store the data resulting from operating on the plurality of partitions, together with the associated DAG plan, in the index cache without storing an index corresponding to one or more of the plurality of data partitions.
16. A computer-readable medium storing computer instructions for executing a data stream, wherein the instructions, when executed by one or more processors, cause the one or more processors to:
searching an index cache for a specified Directed Acyclic Graph (DAG) plan;
identifying an index stored in the index cache for the specified DAG plan;
loading data of a partition, wherein the partition is identified by the index;
performing the operations of the specified DAG plan on the loaded data; and
providing results of performing the operations of the specified DAG plan.
17. The computer-readable medium of claim 16, wherein the operations comprise performing an aggregation on data in the result or performing a count of records in the result.
18. The computer-readable medium of claim 16 or 17, wherein prior to searching the index cache, the instructions cause the one or more processors to:
loading a plurality of data partitions according to another query associated with the specified DAG plan;
identifying that the partition has data corresponding to the specified DAG plan;
storing the index of the partition associated with the specified DAG plan in the index cache.
19. The computer-readable medium of any of claims 16 to 18, wherein performing the operations of the specified DAG plan on the loaded data comprises: limiting execution of the operations of the specified DAG plan to those operations that affect an index space, in the index cache, associated with the specified DAG plan.
20. The computer-readable medium of claim 16 or 17, wherein the instructions cause the one or more processors to:
loading a plurality of data partitions according to another query having an associated DAG plan;
operating on the plurality of data partitions according to the associated DAG plan;
determining that data obtained by operating on the plurality of partitions meets a condition for storing the data in the index cache; and
storing the data obtained by operating on the plurality of partitions, together with the associated DAG plan, in the index cache in accordance with the met condition.
CN202080101558.2A 2020-06-11 2020-06-11 DAG aware indexing for data stream execution engines Pending CN115917532A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2020/037329 WO2021251978A1 (en) 2020-06-11 2020-06-11 Dag-aware indexing for data flow execution engines

Publications (1)

Publication Number Publication Date
CN115917532A true CN115917532A (en) 2023-04-04

Family

ID=71950751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080101558.2A Pending CN115917532A (en) 2020-06-11 2020-06-11 DAG aware indexing for data stream execution engines

Country Status (2)

Country Link
CN (1) CN115917532A (en)
WO (1) WO2021251978A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9031933B2 (en) * 2013-04-03 2015-05-12 International Business Machines Corporation Method and apparatus for optimizing the evaluation of semantic web queries
US11281706B2 (en) * 2016-09-26 2022-03-22 Splunk Inc. Multi-layer partition allocation for query execution
US10445321B2 (en) * 2017-02-21 2019-10-15 Microsoft Technology Licensing, Llc Multi-tenant distribution of graph database caches

Also Published As

Publication number Publication date
WO2021251978A1 (en) 2021-12-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination