SE1951327A1

SE1951327A1 - System and method for relational database query answering and optimization based on natural algebra of k-relations

Info

Publication number: SE1951327A1
Application number: SE1951327A
Authority: SE
Inventors: Andrey Andreev; Daniil Skatov; Henrik Sjöström
Original assignee: Starcounter Ab
Priority date: 2019-11-01
Filing date: 2019-11-18
Publication date: 2021-05-02

Abstract

Embodiments presented herein offer the advantage of representing, transforming, and answering all kinds of queries against relational databases (such as those expressed in LINQ, SQL, and their object/graph oriented extensions) as algebraic expressions, by providing formal infrastructure for representing relational data model and the operations over it. All such operations are derived from the two basic ones: natural join and natural union, so their algebraic properties are also derived. This allows a uniform representation of database queries, and more general and powerful query optimization, among other things.This disclosure presents the system and method for representing, transforming, and answering, database queries based on the said formalisms for k-relations and operations on them, and suggests implementations of these formalisms in the embodiments of the invention. In particular, the examples presented in this disclosure encompass the translation of the full spectrum of features and functionality of the relational database queries (SQL examples) as well as the essential functionality of the object-relational and graph database queries (Starcounter QL examples). The algebraic expressions resulting from the said translation can be further rewritten using the transformation rules, and eventually mapped to an executable form for any given embodiment.

Description

SYSTEM AND |\/IETHOD FOR RELATIONAL DATABASE QUERY ANSWERING ANDOPTIMIZATION BASED ON NATURAL ALGEBRA OF K-RELATIONS TECHNICAL FIELD The present disclosure relates to a device, system and method for answering relationaldatabase queries, object-relational database queries, and graph database queries.

BACKGROUND Relational model is historically and practically the most common way of expressing allkinds of data which are stored and accessed by software applications. A number ofquery languages exist, most notably different versions and dialects of SQL, and LINQ.Key-value stores and RDF stores can be regarded as particular cases of the relationalmodel, and can be queried with certain useful (albeit not necessary) extensions of thesequery languages. General-purpose object stores and graph databases also exhibit manyfeatures of the relational model, and can be mapped to it.

The theoretical foundation of the relational model is the Relational Algebra. lt wasoriginally formulated by Edgar F. Codd in 1970-es, defining relations as sets of tuples,and employing a number of basic operators, such as unions, joins, projections, etc. Animportant problem, however, is that most software applications prefer to deal withmultisets - often because it better reflects the application semantics (duplicate tuples domatter), and anyway because removing duplicates is expensive, and so better avoidedunless really needed.

SUMMARY Embodiments presented herein offer the advantage of representing, transforming, andanswering all kinds of queries against relational databases (such as those expressed inLINQ, SQL, and their object/graph oriented extensions) as algebraic expressions, byproviding formal infrastructure for representing relational data model and the operationsover it, and representing different query optimization techniques as sets of transformation rules applicable to the said algebraic expressions.

According to the first aspect, there is provided a system (100) for representing, 1 optimizing, and answering queries against relational databases, or object-oriented databases, or graph databases.

The system comprises specification of tuples and k-relations (111), where rows, objectinstances, or graph nodes and edges are represented by tuples, and stored collectionsof rows, object instances, or graph nodes and edges are represented by k-relations,wherein the collections represented are either sets or multisets. Besides being the basisof the system (100), this additional allows common data representation for multi-modeldatabases.

The system further comprises specification of two basic algebra operators: natural join(113) and natural union (114), which allow deriving one or more other database algebraoperators as a composition of these two basic operators over the stored k-relations(111), empty k-relations, and generated sets (112), which are infinite k-relations finitelycomputable in the context of the query being processed. This allows reasoning about thealgebraic properties of the derived operators in a uniform way, defining transformationswhich are known to produce equivalent forms, and optionally defining the saidtransformations only in terms of the two basic operators.

The system further comprises means for representing database queries as naturalalgebra expressions, which are expressions comprising one or more of a natural joinoperator (113), a natural union operator (114), and other derived operators (120) overstored or generated k-relations. This algebraic representation is general enough toexpress queries translated from a number of different query languages, includingrelational, object-relational, and graph query languages, and further allows for differentforms of query interoperability and multi-query optimization across languages.

The system further comprises means for transforming the said expressions bysystematically applying transformation rules, including specification of these rules, andsets of rules for separate query processing tasks such as simplification, normalization,logical and physical optimization. This constitues a framework for more general andpowerful query optimization than what is available in database management systemsbased on algebras involving greater number of independently defined operators, and insystems which do not employ algebraic representations of database queries.

According to a further aspect, there is provided a method (200) for relational, object-relational, or graph query answering based on a sequence of algebraic representations (221,241,242,243) The method comprises translating (211 - 219) a database query (201 - 209), to a NaturalAlgebra representation (221), thus obtaining a general and language-independentequivalent of the said query, which allows reasoning about its properties, and serves asa basis for the next step.

The method further comprises systematically applying (230) transformation rules (231-233) in order to transform the Natural Algebra representation (241) of the query beingprocessed to a logically optimized representation (242), and eventually to a physicallyoptimized executable representation (243). This unifies a number of query optimizationtechniques within a single framework, where each step or approach is expresseddeclaratively as a set of transformation rules.

The method further comprises evaluating (280) the physically optimized executablerepresentations against a database instance (270) in order to obtain a query answer(290). Evaluating an executable Natural Algebra expression, independent of the querylanguage, further allows having a single execution mechanism for a number of high-leveldata models and query languages, which are translatable to Natural Algebra. lt further allows cross-query runtime optimization across data models.

The method (200) assumes that at least the final algebraic query representation (243) isexecutable in each particular embodiment. For this purpose, an embodiment shouldinclude an execution engine (280) which accepts an algebraic query representationwhere all operators are physical operators. The execution engine is still free to performcertain transformations (such as cost-based join ordering). Though the particular designof the execution engine (280) is out of the scope of this disclosure, the method (200)dictates that the database data (270) can only be accessed at the query answering stage(280) when the executable NA translation (243) is already obtained. This allowsdecoupling the query execution part of query answering from query optimization, thelatter being facilitated by algebraization (210) and transformations (230). ln contrast,database schema (250), index catalog and statistics (260) are available during the queryoptimization stages.

According to a further aspect there is provided a computer program loadable into amemory communicatively connected or coupled to at least one data processor, comprising software for executing the method according to any of the embodiments presented herein when the program is run on the at least one data processor.

According to yet another aspect there is provided processor-readable medium, having aprogram recorded thereon, where the program is to make at least one data processorexecute the method according to of any of the embodiments presented herein when theprogram is |oaded into the at least one data processor.

BRIEF DESCRIPTION OF THE DRAWINGS The invention is now to be explained more closely by means of preferred embodiments,which are disclosed as examples, and with reference to the attached drawings.

Figure 1 shows a schematic overview of a system according to one or more embodiments; Figure 2 is a flow chart of a method according to one or more embodiments.

DETAILED DESCRIPTIONIntroduction The present disclosure describes a system and method for answering all kinds ofqueries against relational databases, or any databases which expose (are mapped to)the relational data model. This is done by specifying algorithms data structures forrepresenting relational data model and the operations over it, and by translating anySQL, LINQ, and other relationaltransformable and executable within this infrastructure. queries to algebraic expressions, which are We start by describing natural algebra in terms of its types (data structures) andoperators, and then cover the related practical aspects of query processing, such asalgebraization and execution. We then continue with an extended set of examples inSection 2 each of these examples is designed to demonstrate one of the basic featuresof the practical database queries, and their handling within our system (100), by applyingthe steps of the method (200).

A more step-by-step overview of the algebraization method is given in Section 3 usingSQL as an example language. The framework for rule-based query optimization, ispresented in Section 4. Section 5 provides more examples in application to the object/graph query language Starcounter QL. Section 6 discusses different options forthe execution environment, where Natural Algebra expressions are used to obtain queryresults. 1. Natural Algebra 1.1 Tuples and k-relations ln this formal infrastructure, the queriable data is represented by k-relations, which are multisets of tuples of named values. Tuples map named attributes to the values of therespective domains, for example, t=(name ->'John',age->32) (despite the angularbrackets notation, the order of the mappings does not matter), and k-relations map a setof tuples to numeric-valued multiplicities, for example, A = {z * 1, u * 3}. The multiplicity ofa tuple t in a k-relation A is denoted as kA(z) We say that a k-relation 'contains' a tuple, te A if and only if its multiplicity is non-zero: kA (1): 0.

An equivalent way of defining k-relations would be to attribute multiplicities k(t) to tuplesper se, thus defining k-tuples, (a -> a(t), b -> b(t), k(t)>, and k-relations would be sets of such k-tuples. All the definitions given below can be re-reformulated based on suchformalism as well, but theorem proofs would become more complex, since they will have to deal with multiple equivalent set representations of the same k-relation.

The header H(z) of a tuple is the set of all attributes which it maps to values, but the header H(A) of a k-relation is always explicitly given as part of its definition. For example, if a k-relation represents a table in a relational database, the header of such k-relation is the set off its column names and the associated data types. This also allows empty k-re/ations øX with a specified headerX. A tuple may also be empty, as denoted by , and a k-relation containing only an empty tuple with multiplicity n is called a sca/ark-relation: * n ln some embodiments the 'stored' or 'basic' k-relations may be required to contain onlynon-negative integral mutiplicities of tuples, which covers the needs of all themainstream relational databases, and all datasets which can be mapped to such databases (relational databases, object stores, RDF, graph databases, etc.) However,this is not a requirement for the Natural Algebra, it merely makes certain traditional queryoptimization techniques readily applicable. Other embodiments might representprobabilistic data, fuzzy sets, or other continuous measures of existence by utilizing real- valued multiplicities.

Also, as demonstrated in Example 11 (set difference) and Example 15 (antijoin),negative multiplicities are useful, and k-relations containing negative multiplicities oftuples may be created as intermediate results, or used to represent updates to the database. 1.2 Natural join One basic operation on k-relations is natural join M which produces a k-relationconsisting of joined pairs of the compatible tuples from the argument k-relations. Two tuples are compatible t<->u if and only if they map the same attributes to the same values, for example, t = (name ->'J0hn', age -> 32) and v=(name->'John',ssn->1234). By joining compatible tuples we obtain a tuple containing union of the mappings: t-v = (name ->'J0hn', age -> 32, ssn ->1234>. ln particular, two tuples with non-overlapping headers are always compatible, and theirjoinis always defined. Also, since mappings in the tuple are not ordered, tuple join is commutative: t-v = v - t.

Natural join operation of two k-relations joins all pairs of the compatible tuples from itsleft and right arguments, and multiplies their respective multiplicities: De*A MB = {t-u*kA(t)-kß(u): te AAu e BAt <-> u}The header of the result of a natural join is defined as union of headers of its operands: H(A> Natural join operation is commutative and associative, and uses scalar k-relation as the identity element. Therefore Natural Algebra is a commutative monoid with respect to the natural join operation.

Different embodiments may include different implementations of the natural joinoperation, depending on the internal representation of k-relations, available accesspaths to tuples with specified values (i.e. database indexes), and other technicalchoices. The most common examples are nested-loop join (either using scans or index lookups), hash join, and sort-merge join, More optimized implementations of join operation may be included for the specializedcases, such as joining under many-to-one guarantee, which is typically provided by non-nullable foreign keys in the relational databases, or for joining sets of nodes / objectswhich directly link one to another, as is usually the case in graph databases / object stores. 1.3 Natural union Another basic operation is natural union + producing a k-relation containing the'common' parts of all tuples in the argument k-relations, with the multiplicities of the matching tuples summed up.DefA+B I whereXis the common set of attributes of the operand k-relations, which is also the header of the result of the natural union: X = H(A + Ef: H(A)m H(B) For example, the natural union of k-relations each containing a single instance of tuple t and v from the previous example, would contain two instances of tuple mapping the only common attribute: {t *1}+ {v *1}= kname ->'J0hn'> * 2} Natural union operator is commutative and associative, and uses empty k-relation øgwith the universal header Q as the identity element. Therefore Natural Algebra is acommutative monoid with respect to the natural union operation. We do not claim theinversibility property of the natural union operator (which would make Natural Algebra an abelian group w.r.t. this operator). Even though for each k-relation A there exists an 'inverse' k-relation -AD=ef{()*-1}>< A, the result of natural union A+(- A)=®H(A) depends on the header ofA, and does not produce the aforementioned identity element øQ.

The domain D0m(X) ie the (petentieiiy infinite) set ef en possible tuples with heederX.

Though the definition above implies running t through an infinite domain, anyembodiment will have to include a finite-time and finite-space implementation of thenatural union. This is certainly possible, since parts of only those tuples which arecontained in at least one of the operands will get a non-zero multiplicity in the result of the natural union.

The above notation assumes that attributes are identified by name and domain, i.e. therecannot be attributes with same names but different domains in headers of k-relations.However, this is not the case in the classical relational databases, as one can definetables with homonymic columns of different types, and then try to compute a relationalunion of such tables. Embodiments which are designed to be compatible with existingdatabase management systems supporting SQL need a specific implementation ofnatural union, which would emulate the behavior of SQL union operation in this aspect-either throwing an error if the domains are different, or trying to construct common widerdomain for the respective attribute in X. 1.4 Relationship between natural join and natural union Natural join and natural union are symmetrical in a certain way. Natural join collects onlymatching tuples, while widening header, and natural union collects all the tuples, butnarrows the header.

The distributivity of natural join across natural union requires that the set of attributesparticipating in the natural join remains the same: vAvBvC; H(B)mH(A)=H(C)mH(A) 3 A><(B+C)=A> The conditions for the distributivity of natural union across natural join are much morerestrictive - it is required that all headers are the same, and the k-relation A is flat (i.e.contains each distinct tuple with multiplicity 1) and disjoint from the others (i.e. produces empty natural join with either of them): vAvBvC; H(A)=H(B)=H(C) AAMA =A ^A><(B+C) = øHwà 8 :>A+B><(A+C) Since the distributivity laws do not hold in general, we do not claim Natural Algebra beingany sort of a ring-like algebraic structure. 2. Example of translating SQL queries to Natural Algebra The examples 0-21 below show practical usage of Natural Algebra for representingqueries in standard SQL. Further examples, featuring a query language extended forquerying graph databases and object stores are provided in Section 5. Every NaturalAlgebra expression is executable in any embodiment, implying that every operator iseither defined in terms of other operators (ultimately, in terms of natural join and natural union), or has at least one implementation in that embodiment.

Also, any Natural Algebra expression is easily transformable to equivalent expressionsusing transformation rules. Some embodiments may define physical (executable)operators which correspond to complex patterns of Natural Algebra subexpressions, such as Rangeàoaxlﬂß) in the note after Example 7, together with transformation rules which introduce such physical operators into the natural algebra expression. This kind ofrule-based rewrites is an essential part of query optimization. ln the preferred embodiment any Natural Algebra expression is stored and processed inthe machine memory as an executable S-expression consisting of operators and their arguments. Such representations are provided in every example below.

Example 0. Header tracking ln order to accommodate different kinds of equi-joins, theta-joins, and subqueries inSQL, every table / class listed in the FROM clause of an SQL query is translated toNatural algebra along with its alias in the query (or the table name itself acting as thedefault alias). A simplest SQL query SELECT * FROM A is translated to Natural Algebra as Q I PA. (A) where pR is the prefixing operator, which transforms the header of its argument relation, by prefixing each attribute name with the specified prefix P. We can also express the header of the result of this query, by prefixing the header expression:H(Q) = A_H(A). meaning that for every attribute a, e H(A) there will be a corresponding attribute Au, in the result.

The formal definition of pp, which ultimately builds on natural join and natural union is provided in the Example 4 below. However, since this operator does not touch the actualdata, any embodiment is likely to have a simple implementation for it. An expression pA_(A) is also a good target for partial computation, since it only depends on the table alias (coming from the query) and table columns (coming from database metadata),which are available before query execution.

A preferred embodiment, representing Natural Algebra expressions as executable LispS-expression, thus providing full compositionality, also defines PREFIX operator for thispurpose. The S-expression equivalent of the above translation would be (PREFIX A "A") As we will see later, theta-joins, equi-joins and Cartesian products, which always need todistinguish between the attributes coming from left and right operands, provide anotational facility/ 'syntactic sugar' for prefixing.

Example 1. Selection Most of the work on the relational algebra so far is only concerned with so-called SPJqueries, i.e. queries which contain only select, project, and join operators of therelational algebra. The following SQL query is an isolated example of the selectionoperation: SELECT * FROM A WHERE x > 0 Here a predicate P(t)=(x(t)>0), is applied to every tuple teA. We represent such query with a Natural Algebra expression Q I /ÛA.(A) M{A°x > 0}or an executable S-expression in the preferred embodiment: (N-JOIN (PREFIX A "A") (FILTER (NP.> A.X O))) ln this representation, {A.x> 0} is a generated set- a possibly infinite k-relation with header {A.x} and a tuple with each possible value forx such that x > 0 is true. ln the preferred embodiment, the constructor macro (FILTER (NP.> A.x 0)) creates such a generated k-relation which, receiving the known values of x through thenatural join, either produces a compatible tuple (Ax) or not. The scalar null- propagatating function NP. > will not raise an error if x is Null, returning Null as a result instead, which in this case would be effectively same asfalse.

Example 2. Projection Another typical SQL query is a selection of some of the columns from each tuple in arelation, known as the projection operation: SELECT X FROM A The projection operator H in Natural Algebra can be expressed as a natural union withan empty k-relation with the specified header, hence two options of the representation Q I H{A.x}(IÛA.(A)) I ø{A.x} + 10A. (A) ln the preferred embodiment there are both 'operator-rich' and 'operator-homogenic' equivalent configurations of the same expressions, so the equivalent representations are (PROJECT (PREFIX A "A") #(A.x))and(N-UNIoN (EMPTY #(A.x)) (PREFIX A "A")) Note on transformation rules Any embodiment supporting alternative configurations of the equivalent Natural Algebraexpressions can be equipped with transformation rules to perform such rewrites in at 11 least one of the directions, for example H X (A):>® X + A which is a context-free rule with symbols A and X. A reverse transformation, øX + A _>HX(A) is also context-free, however, it needs to handle associative cases of the n-ary naturalunion + operator.

The whole rule-based transformation framework is part of the system and method foroptimizing Natural Algebra representations of the relational database queries, describedin Section 4 below and is part of the patent claims.

Example 3. Extension SQL allows giving output columns specific names. The simplest case is renaming a projected column: SELECT x AS z FROM A A direct Natural Algebra representation would include an extension with a generated set:Q = H{Z}(PA.(A)W {Z I AWD or, in the preferred embodiment, as an executable S-expression (PROJECT (N-JOIN (PREFIX A "A") (EXTEND Z A.X)) #(Z)) Here the constructor macro (EXTEND z A.x) creates such a generated k-re/ation which, receiving the known values of x through the natural join, always produces a compatible tuple (x, z) such that the value for z is provided by the scalar expression A.x.

Any valid scalar expression that would appear before AS in the SELECT clause of anSQL query can be translated to an S-expression which comes as the last argument to EXTEND.

Note on multidirectional generated sets 12 ln the most general case, a generated set G = having the header Z, based onpredicate P(Z) which is dependent on the set of attributes Z, consists of a number ofcomputable directions d, =(Z,,p,,g,). Here each Z, g Z is a set of bound variablesrequired for computation, predicate p,(Z,) defines whether a tuple with given values forZ, should be present in the generated set, and if so, function g,(Z,) generates a wholetuple with header Z, directly copying the values for the attributes in Z, and computing the values for the remaining attributes in Z\Z,.

By default, a generated set constructed with (FILTER PGO) has single directiond = (X, P(X), IX), and a generated set constructed with (ExTEND y f(X)) has a singledirection d =(Xtrue,tX -(y->f(X))). ln this notation, input tuple IX with headerXisextended with an extra mapping (y -> f(X)).

However, in certain important cases more directions can be included. For example, if afiiter predicate has a form P(X)=(y= f(X')), where X'=X\{y}, ah additiehaidirection d'=(X',true,tX.-(y->f(X'))> is included into the generated set, whicheffectively means that a 'filter' is always true. By transforming the scalar expressionsinside the predicate, more equivalent forms of P(X) can be derived, with differentvariables singled out on one side of the equality - hence more computable directions.

A natural join A M G of a finite k-relation A with a generated set G = {P(Z)} is finitely eerhputahie if ahd ehiy if there exists a direetieh (z,, 19,, å.) e G sueh that z, g H(A).

For example, consider a Natural Algebra expression A M B M {x=y}, where xeH(A)\H(B) and yeH(B)\H(A). lf the generated set were produced by a FILTER constructor (N-JOIN A B (FILTER (EQUAL X y) )) it would by default have the only direction with the set of bound variables {x,y}, so the joins can only be ordered as (A M B) M {x=y}. However, if a direction true, (y -> x» is added, then the join order (A M {x = y}) M B, is also possible, 13 and is likely to be preferred by the query optimizer, since it offers one more join attribute for B, which means betterjoin selectivity and more opportunities for index utilization.

Example 4. Equi-join ln English, the following query could be phrased as "return all columns from tables A andB and all rows where value in column x in A matches the value in column y in B". Thisquery involves an equi-join operation, which is very common in queries of themainstream relational database management systems, but is not fully supported by theexisting algebraic approaches so far.

SELECT * FROM A JOIN B ON A.x = B.y This query translates to Natural Algebra using the corresponding equi-join operator:Q I pA.(A)(A.x)M(B.y)1ÛQ.(B) or, in the preferred embodiment, as a computable S-expression with E-JoIN: (E-JOIN (PREFIX A "A") (PREFIX B "B") #(A.X) #(B.y)) As was mentioned earlier, we have only two basic operators in Natural Algebra: naturaljoin M and natural union +. An equi-join, in the general case, is defined for the aligned vectors X= v” participate in the equality requirement: i:1 Def nAXMY-B = A M B w {/\a,=b,} and is only defined if H(A)OH(B)= ø, so that the result of an equi-join contains allattributes from both of its arguments. This condition is typically provided by prefixing operators inserted into the translation.

The above query can be rewritten, according to the definitions of equi-join, as:Q = pA.(A)><{A-X = B-y} or, as an executable S-expression in the preferred embodiment: (N-JOIN (PREFIX A "A") (PREFIX B "B") (FILTER (EQUAL A.X B.y))) 14 Here {A.x = B.y} is a generated set with columns A.x and B.y and all possible values for A.x and B.y from the respective domain(s), such that the equality holds.

The renaming operator p (corresponding to the PREFIX function in the S-expression) is also ultimately defined via natural join, natural union (as projection), and a filtering generated set: Def/ÛP.(A) I HP.H(A)(AN{ aeQ/UP-a I 6%) The query from this example can be expressed in terms of basic operators andgenerated sets as: x BJ» = b}))><{,4_x = B_y} beHUe) A A.a = a}))M(®B_H(B) + (B><{ aeH(A) Q ï (ø/LHM) + (A M{ Such verbose representation, however, is not required - neither for internal representation nor for computation of the query as Natural Algebra expression.

Example 5. Natural join Not the most common, but the most basic (i.e. without any explicit conditions) way to jointwo tables in SQL is to join them on all matching attributes -the so-called natural join: SELECT * FROM A NATURAL JOIN B We do not map this to the natural join operator in the Natural Algebra, since we still wantto track the provenance of the columns in the result of this query. lnstead, we translate SQL natural join as a particular case of an equi-join:Q I IÛA.(A)A.XMB.X 10313) where X = H(A)mH(A) is the set of common attributes of the tables joined, and vectorX contains these attributes in a certain arbitrary order.

A corresponding executable S-expression in the preferred embodiment is, naturally: (E-JoIN (PREFIx A "A" (PREFIx B "B")(H-PREFIx (H-INTERsEcT (H A) (H 13)) "A")(H-PREFIx (H-INTERsEcT (H A) (H 13)) "B")) lt should be noted that the natural join operation is rarely directly useful in queries. Themain reason is that the result of a query containing natural join operation is sensitive toschema changes, so the same query would return different sets of tuples (albeit thesame header) if a column with a matching name is added to or removed from one of theoperand tables. ln some SQL dialects, such as Microsoft T-SQL, natural join operation is not available at all.

Example 6. Cartesian product Though discouraged in practice, Cartesian product operation is allowed by most querylanguages, as an extreme form of a join, where no join predicate is supplied or equivalently, if the join predicate is constant true: SELECT * FROM A, B This query translates to Natural Algebra using the Cartesian product operator:Q=pÅAWpA3) or, in the preferred embodiment, as a computable S-expression with E-JoIN, and emptyvectors of matching attributes: (E-JOIN (PREFIX A "A") (PREFIX B "B") #() #()) As can be seen from the latter representation, Cartesian product is the particular case ofan equi-join, where vectors X and Y are empty, i.e. no equalities are required: Def A> o oB=AWB which is only defined if H(A)mH(B)=®, which is typically provided by inserting rename operators into the translation of an SQL query.

Example 7. Theta-join ln SQL, two tables can be joined on any logical condition, not only equality. Consider a slight modification of the query from Example 4: SELECT * FROM A JOIN B ON A.X > B.y 16 which translates to an invocation of the theta-join operator: Q I IÛA.(A)M A.x>B.y1ÛB.(B) ln the preferred embodiment, this corresponds to an executable S-expression joining aCartesian product with a filtering generated set: (N-JOIN (E-JOIN (PREFIX A "A") (PREFIX B "B") #() #()) (FILTER (NP.> A.X B.Y))) ln Natural Algebra, theta-join with condition 6l(P.X,Q.Y), dependent on attributesXfrom the left operand and attributes Yfrom the right operand, is defined as: DefA m QUQÛB = (A >< BM {9(X, Y)}where {6?(X,Y)} is a generated set corresponding to the join predicate. This definitioninherits the Cartesian product's requirement of disjoint headers H(A)mH(B)=®, which is typically provided by inserting rename operators into the translation of an SQL query. Technically Note on join predicate rewrites ln the preferred embodiment, equi-join which is not a Cartesian product has its ownimplementation, since it can avoid the exhaustive iteration across its both operands: byutilizing the indexes: using the equality filter, it can 'bind' the variables for the generatedsets, as shown in note to Example 3.

This is not the case for theta-join that, in general, relies on the Cartesian product withjoin predicate applied to filter its results. However, different embodiments might involve arange of rewrites and other optimizations, in order for theta-join to utilize, for example, range indexes.

Query from this example, represented in Natural Algebra as a natural join of two tableswith disjoint headers, and one generated set: Q = pA.(A)><{A-X > B-y} cannot join pA_(A) with {A.x > B.y} first, since the generated set has a single direction 17 where both A.x and B.y need to be bound. However, some embodiments may implementa physical operator Rangeåwyxﬂß), based on a range index for attributey in k-relation B, so that for any given value ofx it selects all tuples from B such that y < x. Algebraically, this query would be rewritten toQ I pA.(A)N/ÛB.(Rangeåoa/LXÄB» Formally, the expression Rangef ~oo,A.x )(B) is a generated set which only requires {A.x} as a set of bound attributes. Such a rewrite avoids the computation of the Cartesian product pA_(A)MpB_(B), and, if the range index lookup happens to be selective, drastically reduces the time complexity of answering the query.

Example 8. Grouping and aggregationA simple example ofgrouping and aggregation query in SQL SELECT y, sUM(x) As z FROM A GROUP BY y translates to an invocation of groping and aggregation operator E : Q = 5{A.y}(SUm{{:fx}»P/1.(A))ln the preferred embodiment this corresponds to an executable S-expression (AGGREGATE (PREFIX A "A") (COMBINE-AGG (SUM A.x Z)) #(A.y)) ln Natural Algebra we define grouping and aggregation as a natural union over tuplesacross the domain of the grouping sub-header: aG(F,§,A)Dí :på (A M f) -fyzçA W) teD0m(G) where the scalar function I(A M t) returns 0 if the argument k-relation is empty, and 1otherwise. The dot operator - inside the braces denotes tuple multiplication by a tuplewith non-overlapping header. The same operator outside the braces is scaling of a k-relation with 0 or 1 multiplier. Aggregate function Få produces a tuple with header Zfor each group A M t and depends only on the values for attributes Xin the group. 18 While the domain D0m(G) is typically infinite, for any finite k-relation A there is a finite number of non-empty groups A M t, so the operator is always computable. ln some ofthe embodiments, the implementation of AGGREGATE maintains a hash-table, where sub-tuples t serve as keys, and incrementally computed values of aggregate functionF; as values. Tuple t is iterated across the distinct set of HG(A), thus resulting in only non-empty groups A M t.

Query from this example can be expressed in as Q = Eæumljlçfl M f) -fyzçA w f) t6D0m({y}) so that for a k-relation uølo to k #112 Uàlvﬁä the domain D0m({y}) might be infinite, depending on the type of the column y, but theonly tuples from it, which would produce non-empty groups are (y -> 2) and (y -> 3).

Aggregate function Sumﬁš, is defined as: Def 2x(”)'kf1(”)» A i øﬂpl) = z -> ”GANl/lll, A I øHM) which would produce resulting tuples (y -> 2, z -> 3) and (y -> 3, z -> 6).An important particular case is aggregation without grouping. ln SQL such query wouldbe expressed as SELECT SUM(x) AS Z FROM A However, it's not the same thing as grouping on no attributes Eø, since the above SQLquery is guaranteed to produce at least one tuple, whereas Eø operator would return anempty result on empty input. lnstead, the latter query would be translated to Natural Algebra as a direct call to the 19 aggregate function:Q' = Sumšíåxﬂpilfﬂ) where the definition of Sumljš takes care of the case where input is empty, emulating the standard SQL behavior.

Example 9. Multi-column aggregation Another example of queries which typically lack formalization in the re|ationa| databasetheory is aggregation applied to multiple columns, also expressions under the aggregatefunction calls and across the results of aggregate functions. The following query has allthese features: SELECT y, SUM(x*z) AS u, MAX(x) - MIN(x) AS v FROM A GROUP BY y lt translates to the Natural Algebra as Q = H{A_Y,M,V}(E{A_Y}(SumšfíaÜ -Maxå ä -MinÉÄÄÉÄ pA_(A)MLaa1 = A.x - AZDM ><{ = _a2 -_a3}) where the first argument to E is a vertical composition of the standard single-column aggregate functions. The combined aggregate function Sumšfían-Maxlg_fcš}-Minš¿_xë} produces a tuple with header {u,_a2,_a3}, using the attributes Laa1,A.x} from a k- relation which is the second argument to E _ Any embodiment performing the translation of such query to Natural Algebra will need tointroduce the intermediate computed attributes, both inside the argument k-relation, andoutside the E operator. The algebraizer in the preferred embodiment uses generatedidentifiers starting with underscore and 'a' for the results of the aggregate functions orwith 'aa' for the arguments to aggregate functions. The projection to the set of columnsexplicitly mentioned in SELECT clause is added on top of the expression - otherwise _a2 and _a3 would also appear in the result.

The executable S-expression in the preferred embodiment demonstrates the usage ofCOMBINE-AGG macro, utilizing the built-in aggregate functions such as SUM, MAX, and MIN by supplying them with argument and result attributes, and combining them into a single multi-column aggregate function: (PROJECT (N-JoIN (AGGREGATE (N-JoIN (PREFIX A "A")(EXTEND _aa1 (NP.* A.x A.z)))(COMBINE-AGG (SUM _aa1 u))(MAX A.x _a2)(MIN A.x _a3))#(A-y))(EXTEND v (NP.- _a2 _a3)))#(A-y U v)) Example 10. Removal of duplicates Removal of duplicates (the distinct operator) is common in relational databases, especially in situations where multisets might be produced, e.g. due to a projection: SELECT DISTINCT y FROM A This translates to Natural Algebra as Q I "H{A.y}(/ÛA. which is represented as an executable S-expression in the preferred embodiment: (DISTINCT (PROJECT (PREFIX A "A") #(A.y))) The distinct operator is a particular case of grouping and aggregation, where grouping is performed on all attributes of the argument k-relation (as specified by the projection in our example), and the aggregate function is the constant empty tuple : 04||A|| = EH(A>(<)=A) Depending on the internal implementation of k-relation, the implementation of distinctoperator might be different in different embodiments. lf, for example, k-relations arealways internally represented as sets of unique tuples with multiplicity value attached toeach of them, the distinct operator would simply clone a k-relation and set multiplicitiesto 1. Other embodiments might use hash tables, or hash-table based stream windows to accommodate distinct over streaming k-relations, etc. 21 Example 11. Set operations Most query languages incorporate both set and multiset semantics, switching betweenthem in a number of explicit and implicit ways. One explicit way to switch to the setsemantics is the distinct operator in SQL. Some operations, such as UNION,INTERSECT, and EXCEPT are designed to operate on sets of tuples with matchingheaders, and produce sets of tuples. The translation of such operations to Naturalalgebra involves (a) inserting distinct operator around the arguments or result, and (b)checking that the headers of the argument relations are union-compatible.

The notion of union-compatible does not depend on the column names, but depends onthe implicit order of the attributes in each header, and their data types. We discuss theorder in the following notes, so far we assume that we try to 'cast' one k-relation to theheader of another.

Operator Cast,,-,(A,(B) would assume there are ordered headers I-'I(A)=(a,,...,a,,) and I-'I(B)= (b,,...,b,,) with the same length n, and the domains of the respective attributes either agree Vi : D0m(a,) = D0m(b,) or (optionally, in some embodiments), there exists a wider domain Ö, containing theoriginal ones: Vi : ÉIÖ,(DOm(a,) g A D0m(b,) g lf these conditions hold, the result of Casz,-,(A,(B)operator is k-relation B with headersubstituted to l-'I(A): DefCasfnclﬂß) I Pmmsaplﬂß) lf wider domains Ö, were derived for some of the attributes a, e the domain of such attribute in the resulting header would be set to Ö, .

Now, consider an SQL query 22 SELECT * FROM A UNION SELECT * FROM B which translates to Natural Algebra as Q I "PA (A) + CastA.H(A)(B"| or to the following executable S-expression in the preferred embodiment: (DISTINCT (N-UNION (PREFIX A "A") (CAST B (H-PREFIX (H A) "A")))) The multiset-based SQL counterpart of it would use UNION ALL instead: SELECT * FROM A UNION ALL SELECT * FROM B whose translation would not contain the distinct operator:Q I PA. (A) + CÜSIA.H(A)(B) or to the following executable S-expression in the preferred embodiment: (N-UNION (PREFIX A "A") (CAST B (H-PREFIX (H A) "A"))) Similarly, set-based INTERSECT operator in SQL, as in SELECT * FROM A INTERSECT SELECT * FROM B would translate to Natural Algebra as a natural join of the distinct views of the corresponding k-relations: Q I "PA (A)"M "CastA.H(A)(B"| I "PA VON CastA.H(A)(B"| or, to an executable S-expression in the preferred embodiment: (DISTINCT (N-JOIN (PREFIX A "A") (CAST B (H-PREFIX (H A) "A")))) Finally, EXCEPT operator in SQL denotes the set difference of the results of two querieswith matching headers: SELECT * FROM A EXCEPT SELECT * FROM B The set difference is Natural algebra is defined on the distinct views as a subtraction (i.e.natural union with the negated second argument) of the left k-relation and the set intersection of both: Q I "PA _ "PA (A)M CastA.H(A)(B"| 23 This corresponds to an executable S-expression in the preferred embodiment (N-UNION (DISTINCT (PREFIX A "A")(N* -l(DISTINCT (N-JOIN (PREFIX A "A") (CAST B (H-PREFIX (H A) "A")))))) where N* operator denotes a scalar multiplication of a k-relation.

Note on the original order of attributes So far we were regarding headers of k-relations as unordered sets. However, thefollowing simple rules provide an order of attributes, which is compatible with commonSQL ordering for the queries translated so far: 1) Any stored k-relation A has a user-specified ordered header H (A). Def . .2) H(AD query): H(HX(A)) = X5) The order of vertical composition of aggregate functions is specified by a query: . . . Def . .H(F,Z -K,ï)= Z |V6) The grouping attributes precede the attributes resulting from the aggregate function: . . Def . .Heinz) = G i 2 Note that these rules effectively break the commutativity of tuple join, natural join, andnatural union operations, that's why in Natural Algebra we treat headers as unorderedsets of attributes. All transformation rules for Natural Algebra expressions and any cost-based query optimizer (whose task is to re-order the join operands) should still treat these operations as commutative. 24 ln orderfor set operations from Example 11 to be both SQL-compliant and independenton the join order chosen by optimizer, the header of each operand to a set operationneeds to be computed first based on the 'most direct' translation, preserving the user specified join and projection order.For example, an SQL query SELECT * FROM A UNION ALL SELECT * FROM B JOIN C ON B.X = C.y would translate to:Q I /ÛA.(A)+ CÛSIA.H(A)(IÛB.(B)(B.X) M (y) pc.

Before making any further transformations, we memorize the user-specified orderedheader of the right operand to UNION ALL: X=H<,>»«,,,p,>=B.H<ß>ic.H lf any expression transformations change the join order, we will still use the original order of attributes specified byX for the purpose of Cast operator.

Example 12. Subquery as a table Most query languages provide some degree of composability, allowing a query to refer tothe result of another query. ln SQL, this is done either by view definitions and usage CREATE VIEW C AS SELECT * FROM B WHERE y > 0SELECT * FROM A JOIN C ON A.x > C.y or with a subquery in FROM clause: SELECT * FROM A JOIN (SELECT * FROM B WHERE y > 0) AS CON A.x > C.y Both cases are equivalent, and would translate to Natural Algebra as Q I /ÛAA W A.x>c.ß.y /Ûc.(1ÛB.(B)M{B-y > 0D or, as an executable S-expression in the preferred embedding: (N-JOIN (E-JOIN (PREFIX A "A") (PREFIX (N-JOIN (PREFIX B "B") (FILTER (NP.> B.y O))) "C") #0 #0) (FILTER (NP.> A.X C.B.y))) Here the expression pB_(B)><{B.y > 0} is the translation of the subquery, which in the host query receives alias C (translated to prefix C.), and participates in a theta-join.

Natural Algebra is ultimately composable, allowing any kind of k-relation expressions ineach place where a k-relation is expected. For this reason, the attribute prefixes are allowed to 'stack' one before the other.

Example 13. Correlated (lateral) subqueries Relational query languages, such as SQL, allow greater interconnectedness among thequery parts than simple context-free composability. Correlated subqueries rely on thevariable bindings of the host query - much like the generated sets in Natural Algebra relyon the variable bindings of the finite k-relations they are joined with. Correlatedsubqueries are common in filters (as shown in the following examples), but somedialects of SQL, such as PostgreSQL, allow correlated table-valued subqueries directlyin the FROM clause - so called LATERAL subqueries: SELECT * FROM A, LATERAL (SELECT * FROM B WHERE A.X > y) AS C Translating this query to Natural Algebra involves a predicate lift/ng technique. TheCartesian product from in the host query effectively gets joined with the predicate moved from the subquery, and becomes a theta-join: Q = pAÅfÛX/ßcl/FÛW MX > C-y} or, as an executable S-expression in the preferred embodiment (N-JOIN (E-JOIN (PREFIX A "A") (PREFIX B "C") #() #()) (FILTER (NP.> A.X C.y))) which is quite similar to the theta-join from the Example 7.

Example 14. Correlated aggregate subqueries 26 A different technique is needed to translate the following correlated aggregate subquery SELECT * FROM A JOIN LATERAL (SELECT SUM(y) AS Z FROM BWHERE A.x > y) AS C WHERE A.w > 0 lf the condition in the subquery were dependent on a grouping variable, we could haveused the same predicate lift/ng technique as in the previous example. However, herevariable y cannot be accessible outside the subquery. So in order to evaluate thecondition dependent both on A.x and y, we need to instead bind A.x inside the subquery.

This is achieved by the scaffo/ding technique: Q I (IÛA.(A)N {A-W > 0})XC' E'{A.x}(S”m{{1äy}nÛB.(B)M"H{A.x}(,Û/1. {A-x > B-YDM M {A.x = C.A.x} Here the subquery is joined with the so-called ﬁnite scaffold ||H{A_x}(pA_(A))|| of the remaining part of the outer query that produces a superset of all the bindings which A.xwould receive. The scaffold variable A.x also becomes a grouping variable, and aseparate group is generated for each such binding. Then, after the Cartesian product is formed, only those bindings which correspond to the actual bindings of A.x in the host query are retained.

The above is the 'compositional' role of the finite scaffold, which makes it possible tocompute every finite Natural Algebra subexpression in those embodiments which allow only context-free compositionality. ln different embodiments, the implementations of the grouping and aggregation Eoperator might be different. ln particular, if a grouping variable is bound by a join wherethe result of E participates, the grouping may be limited to only those bound values, effectively binding the grouping variable inside the under-aggregate expression. Since the ﬁnite scaffold ||H{A_x}(p guarantees a superset of such bindings, it can be safely omitted from the under-aggregate expression. The simplified equivalentQ I (IÛA.(A)M {A°W > O»M/ÛC.

This, however, is not a compositionally computable Natural Algebra expression, andwould only work in those embodiments which offer the said optimization for E. 27 Moreover, they should formally allow a variable which is not bound by any finite k-relation under the aggregate operator to be used as a grouping attribute.

Example 15. Existence subqueries Subqueries can also be found in conditions, used either with quantifiers, or returning asingle value (Example 19 below). The simplest case is the existence quantifier, acting as a logical expression: SELECT * FROM A WHERE EXISTS (SELECT * FROM B WHERE A.x > y) The check of whether the result of the correlated subquery is empty acts as a filter for the tuples from A: Q = pAAM {H(/>B_(B)N {A-x > B-y})} The logical predicate 5 is true, if the Natural Algebra expression under it is non-empty, and false otherwise. The whole term {H(pB_(B)N{A.x> B.y})} is a generated set with header {A.x}, computable only if {A.x} is bound. The preferred embodiment defines the sQ-EXISTS form to directly reflect the existence quantifier and its dependencies: (N-JOIN (PREFIX A "A") (FILTER (SQ-EXISTS (A.X) (N-JOIN (PREFIX B "B") (FILTER (NP.> A.X B.y)))))) with the first argument to SQ-EXISTS being the list of the attributes which need to bebound for evaluating it, and the second argument is the Natural Algebra expression under the quantifier.

This is, however, just the most direct translation. Taken separately, a filter with existencequantifier corresponds to the semijoin operator in relational algebra, and a filter withnegated existence quantifier corresponds to antijoin. Natural Algebra defines sem/join M and ant/join Q operators:DefA w B = fm ||H,,(A)(B)||04A Q B = A - A MB 28 and the query from this example can be rewritten as Q :p/i.(A)b<(/ÛB.(B)M{A-x > B-YD Similarly to the previous example, such Natural Algebra expression requires that theimplementation of semijoin M effectively passes the bindings for A.x from the leftoperand into the right, just like the natural join M does. Otherwise, if the implementationof semijoin requires context-free computability of both arguments, or if semijoin isrewritten according to its definition to be executed via projection and distinct, the right argument (pBB ><{A.x > B.y}) is effectively infinite and cannot be computed.

To resolve this problem, a similar scaffo/ding technique can be applied:Q I /ÛA.(A) M (IÛB. (Bwq "H{A.x}(,Û/1. M {A-x > B-YD The ﬁniie seaffe/d ||H{A_x}(p,,_(A))|| preduees eii pessibie bindings direeiiy ferAx. As iiiis is the only common attribute among the left and right parts of semijoin, those bindings that pass the test A.x > B.y with at least one of the tuples from B appear in the result of the semijoin. Hence the equivalence to the original translation.

Example 16. Subqueries with ANY quantifierLet's consider a subquery with ANY quantifier, as inSELECT * FROM A WHERE A.x > ANY (SELECT y FROM B) We do not need another quantifier to represent ANY in the Natural Algebra, as this query literally means "select those rows from table A where A.x is greater than at least one yfrom table B". Hence the translation, based on existence (non-emptiness) of a certain selection from B: Q = pAKAM {H(/>B_(B)>< {A-x > B-y})} which repeats the Natural Algebra translation from the previous example. Those queries are equivalent indeed, and the rewrites from the previous example are applicable. The key feature here is that the otherwise 'uncorrelated' subquery H¿B_y}(p becomes correlated, once extended by the filter {A.x > B.y} translating the condition around the 29 ANY quantifier. We also do not keep projection under the existence quantifier - the onlyrole of the projection in the SQL query was to point to the particular column to comparewith.

The full list of Natural Algebra operators which can be safely dropped directly underexistence quantifiers includes projection, extension, any join with many-to-at-least-oneguarantee, renaming and preﬁxing, Cartesian product if the other argument is known tobe non-empty, also left, right and full outer join (as defined below in Example 20),distinct, aggregation and grouping.

SQL also supports a negated version of ANY: SELECT * FROM A WHERE NOT A.x > ANY (SELECT y FROM B) meaning, literally "select those rows from table A where there is no such row in B thatA.x is greater than B.y", or equivalently but with a greater risk of ambiguity in English: "select those rows from table A where A.x is not greater than anyy from table B". Such query is translated with a negation of the existence quantifier:Q' = pAiAM íaïpßißw lA-x > B-y})} or, as an S-expression in the preferred embodiment, (N-JOIN (PREFIX A "A") (FILTER (NOT (SQ-EXISTS (A.x) (N-JOIN (PREFIX B "B") (FILTER (NP.> A.x B-y))))))) This translation can be rewritten to antijoin, as shown in the previous example:Q' I PA. (log (IÛB. (Bbq "H{A.x}(,Û/1. {A-x > B-YDor, more briefly: Q' = pA.(AW(/>B_(B)><{A-x > B-y}) if the ﬁnite scaffo/d ||H{A_x}(pA_ is not required by the particular embodiment.

Example 17. Subqueries with IN quantifier Most query languages, including SQL, allow providing explicit bindings for a certain variable, for exampleSELECT * FROM A WHERE A.x IN (1, 3, 5) which translates to Natural Algebra expression with a constant k-relation:Q = pA_(A)M{1>, 3>, S» or to an S-expression in the preferred embodiment, using COLUMN constructor: (N-JOIN (PREFIX A "A") (COLUMN X (l 3 5))) However IN can be also used to introduce a single-column subquery:SELECT * FROM A WHERE A.x IN (SELECT y FROM B) This is equivalent to using ANY quantifier with equality: SELECT * FROM A WHERE A.x = ANY (SELECT y FROM B) which translates to Natural Algebra similarly to what was shown in the previous example:Q = pAlAM {H(/>B.(B)>< lA-x = B-y})} or, to an executable S-expression in the preferred embodiment: (N-JOIN (PREFIX A "A") (FILTER (SQ-EXISTS (A.x) (N-JOIN (PREFIX B "B") (FILTER (EQUAL A.x B.y)))))) This translation can also be rewritten to semijoin.

Example 18. Subqueries with ALL quantifier The following SQL query can be phrased in English as "select rows from A where A.x is not greater than all y values in B":SELECT * FROM A WHERE NOT A.x > ALL (SELECT y FROM B) Though given English phrasing may sound ambiguous, the SQL standard defines this as equivalent to "select rows from A where A.x is not greater than at least oney value in B": SELECT * FROM A WHERE A.x <= ANY (SELECT y FROM B) 31 which is now using ANY quantifier with the negated condition, and translates to NaturalAlgebra accordingly: Q = pAAM {H(/>B_(B)>< {A-x S B-y})} which can be further rewritten to semijoin.

An equivalent S-expression in the preferred embodiment would be(N-JOIN (PREFIX A "A")(FILTER (SQ-EXISTS (A.x) (N-JOIN (PREFIX B "B") (FILTER (NP.<= A.x B.y)))))) Similarly, a non-negated case of ALL quantifier, "select rows from A where A.x is greaterthan all y values from B": SELECT * FROM A WHERE A.x > ALL (SELECT y FROM B) is defined in SQL as equivalent to the negated case with ANY quantifier, and thenegated predicate - "select rows from A for which there is no such y value in B that A.x is not greater than y":SELECT * FROM A WHERE NOT A.x <= ANY (SELECT y FROM B) which also translates to Natural Algebra, as shown in the Example 16.

Q = pAAM {ﬁH(/>B_(B)>< {A-x S B-y})} and can be accordingly rewritten to antijoin.

An equivalent S-expression in the preferred embodiment would be(N-JOIN (PREFIX A "A")(FILTER (NOT (SQ-EXISTS (A.x) (N-JOIN (PREFIX B "B")(FILTER (NP.<= A.x B-y)))))) Example 19. Single-value subqueries Sometimes the sole purpose of a subquery is to return a single value (i.e. exactly one result tuple mapping a single attribute). For example, a query "select all rows from A 32 where x is greater than the maximum y value in B" can be formulated in SQL using a non-correlated single value subquery: SELECT * FROM A WHERE A.x > (SELECT MAX(y) AS Z FROM B) Here the single-tuple result is guaranteed by a direct call to the aggregate function. TheNatural Algebra translation would have to construct a relation from a single tuple, hence the braces around the call to Max: Q I /ÛA.(A) M íMaxšäyﬂpß. ÜM {A-x > Z}ln SQL there is no concept of - oo, so MAX aggregate function would return Null if table B is empty. This behavior is also emulated by the definition of Maxfš in Natural Algebra /-~ (compare with definition of Sumšjš in Example 8): MaxšíﬂA) = ”GA 04 max(x(u)), A i øHM)Z-9NUll, A I øHM) A different situation is when single-tuple result is not guaranteed by the subquery expression, for example: SELECT * FROM A WHERE A.x > (SELECT y FROM B WHERE w > 10) Here the existence and uniqueness of the result from the subquery H{Y}(BM{W >10}) is a kind of external knowledge or promise which needs to be verified at run time. Allrelational databases which support SQL would throw a run-time error if the subquery inthis example returns more than one tuple, and would effectively compare A.x to Null if the subquery result is empty.

The Natural Algebra translation of the above SQL query, in the most general case, is:Q' : p,_(A)ï<'(,,,_(B)><{B_W > wymplx > En The key feature in this translation is a natural join :<- with many-to-at-most-oneguarantee. The same guarantee takes place when, for example, joining on a foreign key.

Embodiments which aim to fully simulate SQL semantics with respect to single-valuesubqueries will need to validate this guarantee on the actual data, and throw a run-timeerror if this guarantee is not fulfilled. 33 The preferred embodiment implements this run-time check with a function ASSERT-Ol, which returns the argument relation if its count is 0 or 1, and throws an error othen/vise: (N-JOIN (PREFIX A "A")(ASSERT-Ol (N-JOIN (PREFIX B "B")(FILTER (NP.> B.W l0))))(FILTER (NP.> A.X B.y))) Note on join cardinality guarantees A stronger version of the above is many-to-one guarantee, a relationship between two k-relations which we denote using a graphical primitive > from 'crow foot' notation in therelational database modeling. With respect to join this takes place, for example, whenjoining on a non-nu//ab/e foreign key. Other cases are proven, including - extension; VA VX g H(A) Vf(X) Vy Q H(A); A > {y = f(X)}- aggregation and group/ng: VA VX VFYZ : A > EÅFYZ ,A)- project/on and dfsﬁncf; VA VX; A > ||HX(A)|| Naturally, this guarantee includes the weaker 'many-to-at-most-one' guarantee used in the above example.

Different embodiments may use these kinds of guarantees for a number of purposes.For example, the corresponding physical implementation of the join operator might usesuch guarantee to avoid iterating on the respective operand past the first result tuple. Atthe same time, cost-based optimizer might find such guarantee useful for more accurateestimation of cardinalities of intermediate results (the stronger many-to-one guarantee offers more accurate estimates).

The many-to-one guarantee also offers a number of simplification opportunities forNatural Algebra expressions, such as VA VB; AÉB=A> >VA VB; AmB=A><||B||>VA VB VX;H(A); HX(A> which are useful for optimization and for proving query equivalences. 34 Example 20. Outerjoins Standard relational algebra left, right, and full outer join operators are directly expressed in SQL, for example: SELECT * FROM A LEFT JOIN B ON A.X = B.y The result of the query would contain all columns from A and B, and all tuples from Alisted at least once. lf for certain tuple re A there are join-compatible tuples u eB, allsuch tuple joins t-u will be listed, othen/vise, tuple zwill be extended with null values for attributes from the header of B.

There are simple 'outerjoin' analogs to natural join in Natural Algebra.

Def- left join; A MB = A w B + (A wa) MNuIIHQÜgKA)04 _' join: A Ö Def' OUtef jOin: A I A DQB "l" DQNUZZH(B)\H(A)+ ÜQNUZZH(A)\H(B) where NullX is a single-tuple k-relation mapping all attributes in Xto Null value.

However, these operators cannot be directly used to translate an SQL query, since thereis no such thing as a 'natural outer join' in SQL. ln practice, outer join analogs to equi- join and theta-join are useful, for example:| | | - left equl-jeln; AXM YB = AX MYB+(A-||HH(A)(AX w Y.B)||) mlvullwDef- left tllete-jeln; AM QB = A >A§-B+(A-||HH(A)(A w ,B)||) mlvullm)with other kinds of outerjoins being defined via the inner versions of theta- and equi-joinin the same manner, under the requirement of disjoint headers: H(A)m H(B) = ø _The query from this example would then simply translate to Natural Algebra as:Q I PA.(A)pB.(B) or, as an executable S-expression in the preferred embodiment: (E-LEFT-JOIN (PREFIX A "A") (PREFIX B "B") #(X) #(y)) where E-LEFT-JoIN form is defined as shown above, and is syntactically similar to E-JOIN.

Example 21. Ordered k-relations Ordered k-relations can be modeled in Natural Algebra as a k-relation extended with order attribute i. An SQL query returning a sorted relationSELECT * FROM A ORDER BY x Asc, y DEsC Translates to natural algebra using a sort operator lP:Ql = \P(pA_(A), Here is a sort-vector - a vector of attributes, where each attribute isannotated with either ascending A or descending v flag. The sort operator *P producesan ordered k-re/ation as specified by this sort-vector.

An ordered k-relation Å is a k-relation A where every tuple teA contains a special order attribute i(t). A sort-vector, formally, is either an empty vector, or any sort-vector Y concatenated with an annotated attribute: (i Def X: I>|aí^fia? Header of sort-vector H (X ) is a set of all attributes in X : H(( ))D=efø, H(Y | (11%): H(Y | aíV)D=efH(Y)o {al.} The sort operator W can be defined as a join: wuyzfíffiw{f(s1,_..,sk>} where H(X)= {s1,...,sk}, and the function i(s1,...,sk) is any such function that satisfiesthe precedence condition š defined by X on the set of tuples:1-(s1(1),___,sk(i))§ l-(sl(u),___,sk(u))e> få, Tuple zprecedes tuple u according to the sort-vector given H(z) = H(u) Q H(X): 36 tšu :i X=<)V[X=Y|a,^Atšu/\[u:t:>a,(t)Sa,(u)nv v X = Y | a; Atíu Aíuít :> al.(u)í alttﬂthis requires í relationship to be defined within every domain D0m(a,), a, e For an empty sort-vector, thee precedence relationship holds for any pair of tuples, and order function can be any constant. However empty sort-vector does not allow eliminating the sort operator: »i A, a special order attribute is still attached to A. 3. Translating SQL queries to Natural Algebra As the above examples have demonstrated, SQL functionality is totally covered byNatural Algebra, and every feature in SQL query has at least one Natural Algebrarepresentation. For many features we have defined corresponding derived operators,based on natural join and/or natural union, hence the equivalent forms using only basicoperators, or some intermediate derived ones. ln this section we outline the process ofobtaining a most direct Natural Algebra translation of an SQL query, which can be furthertransformed, as described in Section 4.

Though the translation methodology is presented here using SQL as an input querylanguage, a person skilled in the art should be able to define a similar algebraizationmethod for any other query language based on relational, object-relational, or graphsemantics. Section 3.9 below gives an example of translating LINQ queries in a similar way. 3.1 Translating FROM clause We start building a Natural Algebra expression with translation QFROM of the FROMclause, treating different kinds of FROM-expressions, according to the following rules:1) lf a FROM-expression is an aliased table name, such as A As A1, the translation is /ÛA1.(A) 2) lf a FROM-expression is an unaliased table name, such as A, the translation still 37 involves renaming: PA.(A) 3) lf a FROM-expression is any kind ofjoin of two FROM-expressions with join conditionspecified, the translation will introduce the theta-join version of the corresponding joinoperator (inner, left, right, full). For example, A LEFT JOIN B ON A.x = B.ywould translate to PA. (A) M A.x:ß.y PB. (B) Transforming a theta-join into equi-join is best expressed as a rule-based transformation: AaÅFbkB vk; ak@H(A)^bk@H(B) Ak Ao 4) lf a FROM-expression is a natural join, such as A NATURAL JOIN B, the translation would feature an equi-join with (ordered) set of common attributes on both sides: PA. M X PB.(B) where X = H(A)ñH(B) is a vector of common attributes, built from the orderedheaders using a non-commutative vector-based version of set intersection:. - .04 .XmY = X \ (XW)ln a schema-aware translation, header expression X is a good candidate for partial evaluation during the query translation phase. ) lf FROM clause contains a list of comma-separated FROM-expressions, thetranslation of the FROM clause is the Cartesian product of translations of theseexpressions. For example, FROM A, Bwill be translated to QFROM I PA.(A)X PB.(B) 3.2 Accommodating query parameters attributes distinct m lf a query makes use of parameters, they are enumerated as p,,...,p from any other attributes in the query, and are represented by a special k-relation Par, 38 H(Par) = {p1,...,pm}, which is guaranteed to contain single tuple and its values will beavailable at run-time just as values of any other k-relations. Par is joined with QFROM in order to obtain the full set of attributes available in the query:QPAR I Par MQFROM lf there are no parameters used in the query, QPAR = QFROM 3.3 Translating WHERE clause ln SQL queries WHERE clause contains a single Boolean expression P(X), dependent on attributes of the expression in FROM clause and query parameters. We can alwaysdirectly map it to a filtering generated set joined with the translation of the FROM clause.

QWHERE I QPAR N {P(Xand if there is no WHERE clause in the query, QWHERE = QPAR However, conjunctive WHERE conditions are quite common, and it is sometimes usefulto rewrite them into a set join of translations of conjunctive parts: {P(X) ^ Q(X)}~>{P(X)}>< {Q(X>}For example, a query containing FROM A, B WHERE A.x = B.y AND A.Z = ? would at this stage of translation be represented by a fragmentQWHERE I Par MIÛA.(A)NIÛB.(B)W {A-x I B-y}M{A-Z I p1} and using also the transformation rule in section 3.1 this would be equivalent to a nested equi-join: QWHERE I Pm” N/ÛAKAXX) WPB.(B)N {A-Z I P1} I ParW(PA.(A)PB.(B)) 3.4 Translating SELECT clause ln SQL queries SELECT clause corresponds to the projection operator in Natural 39 Algebra. However, any scalar expression effectively produces new attributes, so suchexpressions are translated to the generated sets. The following translation rules apply, inorder to obtain the translation QSELECT: 1) lf SELECT clause only contains a star, e.g.

SELECT * FROM A the whole header of QFROM translation is used in the projection: QSELECT I HA.H(A)(IÛA.(A)) lf the Natural Algebra expression has no other attributes than those listed in theprojection, the projection can removed, as specified by the following simp/ifying transformation rule: HXQQEA 2) lf SELECT clause contains a non-aliased attribute name, for example,SELECT A.x FROM A this attribute is used in the projection on top of the translation so far, for example,QSELECT I H{A.X}(PA.(A))- 3) lf SELECT clause contains an aliased expression, such as SELECT A.x + 1 As y FROM A a generated set is joined with the translation obtained so far, in this exampleQWHERE w {y = AX +1} This also applies to aliased expressions which consist of a single attribute. 4) lf SELECT clause contains a non-aliased expression, the default unique alias iscreated, and the translation is then performed according to case (3) above. Differentembodiments may have their own policies for generating default unique aliases. Apractice which would be best compatible with modern relational DBMSs is to allowarbitrary strings as attribute (column) names, and use the original expression string,“A.x + 1“ as an alias. ) lf SELECT clause contains DlSTlNCT keyword, the Distinct operator is added on top of the projection, for example, SELECT DISTINCT A.x, A.y FROM A is translated to |FHA%AﬂÄÅA» 3.5 Unqualified attribute resolution ln the SQL examples so far we always used attribute names qualified with table aliasesor names (which are their implicit aliases), such as A.x, B.y. ln a query where only asingle table is referred in the FROM clause, this is obviously redundant, and forming anequivalent translation of a query with unqualified attributes is a trivial task. We would stilltranslate a query SELECT x, y FROM Aas H{A_x,A_y}(pA_(A)) for consistency and composability reasons. However, a query SELECT X, y FROM A, B requires a disambiguation step which needs to look at the schema, and either report anambiguity, or transform unqualified attributes to the qualified ones, for exampleproducing SELECT A.x, B.y FROM A, B before any actual translation takes place. 3.6 Translating set operations Example 11 from the previous section covers the technique that covers the setoperations. At this point we have translation of at least two operand queries of a set operation, QÅELECT and QgELEcT, each having a projection operator on top, and we obtainQSETOP by header-casting and combining these projections. lf there are no set operations, QSETOP = QSELECT. 41 3.7 Translating ORDER BY clause Example 21 from the previous section illustrates the translation of the ORDER BY clause, by inserting the sort operator W and forming the order-vector X in all elementary cases. Here we show how does it compose with other SQL query features. ln a query with no set operations, the sort operator is inserted just under the projectionwhich is introduced by SELECT clause, but above anyjoins with generated sets, whichare introduced there also. This way sort operator W has access to both aliases and anyaliased attributes, including those not listed in SELECT clause. For example, the following two queries are equivalent: SELECT a AS X FROM A ORDER BY a and SELECT a AS X FROM A ORDER BY X while the result of either query will only contain column x: QORDER I H{X}(\P(1ÛA.(A)N {x I A°a}ﬂ)) I H{X}(\P(IÛA.(A)N {x I A°a}>)) However, in a query with set operations, projections are needed to form the compatible headers. For example, SELECT a AS X, b AS y FROM A UNION SELECT a, b FROM B ORDER BY X is a valid SQL query, while the same query ordering by a would not be valid, sinceORDER BY has no access to the projected-out attributes of either UNION branch.

This fact agrees with our technique for translating set operations. The translation of the above query would simply put the sort operator on top of the translation QSETOP: »<>f^>l QORDER I H{x,y}(1ÛA.(A) M {x I A-CÜM b] I A-bD-l- Casípßy) (H{B.a,B.b}(pB.(B))) 3.7 Translating aggregate queries An aggregate query in SQL is a query which contains at least one call to an aggregatefunction, either in SELECT, HAVING, or ORDER BY clauses. The set of aggregatefunctions is either limited to the built-in ones, and includes SUM, COUNT, MIN, MAX, AVG, or relies on an extensibility mechanism. An aggregate query may or may not have 42 GROUP BY and/or HAVING clauses.To illustrate the translation steps, we will follow an example, similar to Example 9 in the previous section, but featuring HAVING and ORDER BY clauses: SELECT y, SUM(X*z) AS u FROM A GROUP BY y HAVING MIN(x) > 5 ORDER BY MAx(z) DESCStep 1. Building list of aggregates ln order to determine if a given SQL query is an aggregate query, the followingtechnique is used in our system: each scalar expression in SELECT, HAVING, andORDER BY clauses is traversed in search for an aggregate function call. lf such call is found, the expression under it is put in a list of aggregate calls, where each record (F,e,a) consists of such an expression e, the aggregate function F being called, and anexplicit or generated alias a for this call. Our example query would produce the followinglist: (Sum, A.x - A.z, u), (Min, A.x, _ a2>, (Max, A.z, _ a3> _ The aggregate expressions in the query expressions are rewritten to the aliases fromthis list, so here is our example query after rewrite: SELECT y, u FROM A GROUP BY y HAVING _a2 > 5 ORDER BY _a3 DESC lf the list of aggregates is empty, given SQL query is not an aggregate query, and istranslated normally, as described in parts 3.1 - 3.6 of this section.

Step 2. Translating inner query The translation of an aggregate query includes inner and outer Natural Algebraexpressions, each having their own set of available attributes. The inner expression isformed by translating the query with original FROM and WHERE clauses, and whereSELECT clause lists all the distinct under-aggregate expressions listed in the first step,with generated aliases added whenever the expression is not a single attribute. The inner query corresponds to our example: SELECT x*z AS _aa1, x, z FROM A and translated to Natural Algebra as pA_(A)M{_aa1= A.x - A.z}. The translation of the SELECT clause of the inner query introduces the generated sets for any expressions 43 and aliases, but skips the projection, leaving all attributes from QFROM available in the query result.

With generated aliases for under-aggregate expressions are introduced in this step, theexpressions in the list of aggregates are rewritten to these aliases, so the first record in the list changes:(Sum, _ aal, u), (Min, A.x, _ a2>, (Max, A.z, _ aß) Step 3. Building a compound aggregate function No matter how many aggregate functions are involved in an SQL query, theircomputation needs to be done in one pass over the results of the inner query. Thecomputation itself can thus be a compound aggregate function. lt is easily built from thelist of aggregates, which is rewritten during the previous step.

The results of single-column aggregate functions are single attribute tuples with uniqueheaders, so a tuple join effectively concatenates them. Each aggregate function has aspecification of its input and output attribute in the list of aggregates, so our example we LI.Z Obtain SumšﬁaÜ -Mín{{¿_:ï} -Maxšg ä.Step 4. Translating outer query The outer query consists of the original SELECT clause, rewritten at step 1, withHAVING condition in place of the WHERE clause, and any original ORDER BY clause.As a replacement for the FROM clause we use either a grouping and aggregationoperator E if there is a GROUP BY clause, or a direct application of our compoundaggregate function to the result of the inner query. Such query is translated normally, asdescribed in sections 3.1 - 3.6, and the complete translation is Q = H¿A_y,u}(\I'(E¿A_y}(Sumšfía1}-Mínšä ï} -Maxš/*LSÄ pA_(A)D<{_aa1= A.x - A.z})>< M {_a2 > 5}),<_ a3v>» lf our example query had no GROUP BY clause (and consequently could not use yattribute in SELECT clause), SELECT SUM(x*z) AS u FROM A HAVING MIN (x) > 5ORDER BY MAX(z) DESC 44 the translation would contain such function application (effectively, a single-tuple k-relation), joined with the generated sets remaining in the rewritten SELECT clause of theouter query, sorted, and projected to the specified attributes: Q = H¿u}(\P(Sumšfía1} -Minšäfš} -MaxÉÃ:3}}(pA_(A)M {_ aa1= A.x - A.z}b< Mgaz > 5}),<_aw>)) Note that the call to aggregate function produces exactly one tuple, and the expressionunder the sort operator *P is either single-tuple or empty k-relation, so the sorting can besafely removed. This query is equivalent to one without ORDER BY clause. 3.8 Translating subqueries The spectrum of applicable techniques is illustrated in much detail in the examples 12-19 above, so here we provide a short summary. 1) ln SQL subqueries may appear- in FROM clause, in which case they may feed multiple attributes into the host query; - in conditions with ANY/ SOME / IN quantifiers, in which case they return a single column;- in scalar expressions, returning a single value; - as Boolean conditions with EXISTS / NOT EXSTS quantifiers, in which case only the emptiness or non-emptiness of their result matters. 2) There are correlated and uncorrelated subqueries. While the uncorrelated subqueriesare straightforward to translate, thanks to the compositionality of Natural Algebra,correlated subqueries depend on the bindings of attributes in the host query, and requireeither certain rewrites during the translation (predicate lifting, finite scaffoldingtechniques shown in the examples), or an execution framework will allows passing thecontext down into the subexpressions being evaluated. We describe such an executionframework in Section 6. 3.9. Other algebraizersThe steps described above are easily generalized to other query languages, for example, SPARQL and LINQ. The following example of LINQ query has obviouscounterparts of FROM, WHERE, ORDER BY, and SELECT clauses, which are described above: var result = (from p in Personwhere p.firstName == "Joe"orderby p . age select new { p.firstName, p.lastName, p.age }) Using the same approach, this query is translated to H{p.ﬁrstNamegp.lastNamegpJ/zge}(\P(pp. (Person {p'ﬁrsINan/le :y J0e'}? Section 6 provides more examples of translating queries with object and graphsemantics, featuring yet another input language - Starcounter QL, an object-orientedextension of SQL.

Natural Algebra is a common representation of database queries. Once a query istranslated to Natural Algebra, it can be validated, optimized, and executed using the system and method defined in this disclosure. 4. Natural Algebra transformation rules Most of the transformations of the Natural Algebra expressions including simplification,normalization, transformations between operator-rich and operator-homogenic forms,logical and physical optimization, can be expressed in a context-free way astransformation rules, and are the preferred way of transforming Natural Algebra expressions.

A transformation rule consists for input pattern, output pattern, applicability function, andvalue transformerfunction. lf omitted, applicability function is constant true, and value transformer function is identity function. 4.1 Operator definitions as transformation rules Example 2 mentions a simple transformation rule R,: HX(A):>®X +A 46 with input pattern HX (A) and output pattern øX +A corresponding to the definition of the projection operator. Any derived operator defined in this disclosure has at least onesuch rule, and the exhaustive application of this class of rules to any Natural Algebraexpression results in operator-homogenic form of that expression, consisting only ofnatural join and natural union operators.

A reverse rule R, øX +14* _>HX(A) would transform an expression towards an operator-rich form, by changing some of thenatural union occurrences to projections. The star next to the symbol A means 'match all the remaining arguments of the commutative and associative operator'. Otherwise, acombinatorial number of matches are possible, while only the fullest one makes practical SGHSG.

Applying Rz against an expression A1+ A2 +®M would transform it into H{X}(A, +A2), and how the input pattern is matched is summarized by the following table: X A*{x} A14" A2 4.2 Simplification rules An example of a simplification rule R3I øXÖQA* møXUHM) has the same feature, since natural join is also commutative and associative, and anysingle empty operand would make the result of the whole natural join empty.

Note that this kind of transformations can be performed both in schema-aware andschema-ignorant setting, i.e. whether header expressions are immediately evaluated tothe sets of attributes, or retained in the symbolic form. 4.3 Logical-to-physical rewrites and index utilization The note on join predicate rewrites after Example 7 given an example of logical-to-physical transformation, which can be expressed as a rule 47 R4: AWQBDQx > y} AD where the applicability condition checks that the attributes used in inequality conditionbelong to the respective headers. Additionally, while A* pattern can match any'remaining' part of the n-ary natural join, this rule only makes sense if B exposes a rangeindex on attribute y, i.e. it is either a stored k-relation with such index defined, or is aprojection/rename/union of k-relations having this property.

The set of physical indexes Idx(A) available for any subexpression A is propagated through these operators, according to the following principles: 1) lfA is a stored k-relation, Idx(A) is a list of indexes physically defined on it 2) For any index i” on an attribute a: if i” e Idx(A) then i” e Idx(pP_(A)) 3) Idx(HX(A)) = Idx(A), i.e. it is always possible to index a k-relation even if the respective attributes are later projected-out. 4) Idx(A + B) = Idx(A)mIdx(B), i.e. a natural union exposes a set of indexes whichpresent in both if its arguments.

These principles can be formulated as transformation rules, in order to bring the physicalaccess operator Rangey closer to the stored k-relation, while transforming the query in Example 7: Hu” Q:p,,»«p,_<ß>>«{f1.x>ßvfïplvøwRange:,:,,,,> RSI IÛA. (A) W pB. where rule RS is formulated after the principle (2) above w.r.t. the Rangey index and physical access operator: R5= Rangeäj,,ym>(ßp.(fl))ä>pp_ (Range<í,,,,y,,,,>(f1)) 4.4. Recursive computation of sub-expression traits The set of exposed physical indexes Idx(A) discussed above is an example of the 48 physical trait which can be recursively computed from leaf to root. Other such examples would include expression's header H(A), positiveness A > 0 (i.e. all tuple multiplicitiesare positive), distinctness guarantee (A = single-tuple guarantee (|A|=1), etc. ln our framework this is an open set of attributes, which can be extended in conjunction to adding new transformation rules into the system.

Another class of traits are the contextual traits which needs to be propagated from rootto leaves. Examples are a set of bound attributes lf Bound , or whether distinct operator is applied on top i For example, the »L contextual trait is useful for transforming a disjunction of filters into a natural union. Under normal circumstances, where multiplicities of tuples do matter,such transformation includes a second-order term, which impacts the rule's usability, since it provides little simplification:Ra {f(X)v g(y)}_>{f(X)}+ {g(Y)}~ {f(X)}>< {g(Y)}- However, under the contextual trait of i we have a true normalization/simplification rule:R» {f(X)v g(X)}Ä>{f(X)}+ {g(X)} The same effect can be achieved by without introducing the i contextual trait. An alternative set of rules would be Re ||{f(X)v g(X)H|_>||{f(X)}+ {g(X)}|| and a number of bi-directional rules for propagating the distinct operator down and up, se tnat tne input pattern ef R, ean be eventuaiiy matened; Re IIA W B||<_>||A|| WIIBII Riv ||A + Ilßllllmllfl +B|| Riv IIHX(A)II<Ä>||HX(IIAIIÅ| Having tne condition in R, fermuiated ae a eentextuai trait net eniy reduees tne number 49 or rules which need to be added into the system, but more importantly, the amount oftransformations needed to achieve the simplification expressed by the rule.

Transforming a query SELECT DIsTINcT * FROM A WHERE x = 1 OR y = 5originally translated as Q = ||pA_(A)M {A.x =1vA.y = 5}|| into QÜ) I ||IÛA.(A)M GA-x I 1}{A.x,A.y} + {A-y I 5}{A.x,A.y}]| would either take a single R, transformation, or a sequence of Ã9,R8,Ã9transformations. The practical gain of doing this transformation is applying thedistributivity rule next: R,2:A><(B+C) A> which would transform the Natural Algebra expression intoQl” = IIMAM lA-x =1}+ pAlAM My = 5}|| which offers the potential for betted index utilization. 4.5 Rules framework Previously in this section only few examples of different transformation rules werepresented, together with particular features of the rule application framework. Thisdisclosure does not aim to present a full and exhaustive list of transformation rules,instead, it presents a framework were rules can be easily formulated, managed, andexecuted. A transformation rule is a machine-readable and executable interpretation of amathematical fact about Natural Algebra, the properties of its basic and derivedoperators. ln any embodiment of the present invention a system (framework) for applying transformation rules would include the following components: 1) A system for computing certain traits of Natural Algebra subexpressions, optionallycaching them on the expression subtrees, and invalidating this cache when the corresponding leaf-wise or root-wise nodes are transformed.As an optimization, each trait may have its own invalidation logic, while headers H(A) are invariant under all valid Natural Algebra transformations, the inherent sorting order depends on the physical access operator (index) already chosen during the logical-to-physical transformation of the subexpressions. Similarly with contextual traits, while l« is stable, the set of bound attributes J» Bouna' depends on the position of the current subtree in each n-aryjoin root-wise, and the attributes bound by the previousarguments to that n-aryjoin. 2) A representation of transformation rules, including their input and output patterns,applicability conditions and value transformation functions. Optionally, capability to nameand group the rules into named classes. 3) An expression tree traversal algorithm, whether making complete traversal after eachtransformation, or, as optimization, marking the rules and parts of the expression treewhich did not match and do not need to be examined again. 4) A mechanism to perform the transformation of expressions as specified by a givenrule, optionally with logging, provenance recording, and backtracking capabilities.

) A set of strategies or heuristics to guide the transformation process towards a desiredgoal. This is especially important in presence of bi-directional transformation rules, orwhen utilizing backtracking capabilities to navigate the graph of possible equivalentforms of a Natural Algebra expression.

The rules framework is where most of the query optimization takes place. Certainalgebraization tasks also make use of this framework, as illustrated by the examples inthe next section.

. Starcounter QL examples Example 22. Object references Starcounter is multi-paradigm database, which allows querying stored classes using anextension of SQL language, here referred to as Starcounter QL. ln the view of Starcounter QL tables are classes, i.e. collections of instances (objects)with the same schema (set of fields). While one can select all field values from a certainclass, using the SQL star syntax: SELECT * FROM A one can also select objects directly from a class with a Starcounter QL query SELECT A FROM A While the first query selects tuples mapping each attribute of class A to its instancevalues, the second query returns tuples of width 1, containing objects. ln order to accommodate this dichotomy, we assume that every stored table (class) hasspecial attribute with a reserved name oid, besides all other attributes (fields) defined by the user.

For convenience of translating SQL star-syntax queries, we define a start-headerfunction H*(A). For a particular stored k-relation A, it would contain all its stored attributes except for the reserved oid attribute:H * (A) = H(A)\ {0zd} lf an argument k-relation is the result of a Natural Algebra operation, it might contain no oid attributes, or might contain multiple oid attributes from different tables. The header H*(AD This approach allows translating the first query from this example as Qi = HA.H*(A)(PA.(A)), and the second query as Qz I H{A.@zd}(/ÛA.

Similarly, pairs of objects form A and B can be returned from a query with a natural join:SELECT A, B FROM A NATURAL JOIN B would translate to Natural Algebra as Qa I H{A.Ûfd,B.Ûzd}(A NB) Example 23. Property chains The domain of the oid attributes introduced in the previous example is formally defined as a set of objects in the respective class: DefD0m(A.0id) = A While A as a k-relation may grow or shrink during the database lifetime, the respectivedomain will be changing accordingly. Other attributes in the same or other k-relationsmight have the same domain, for example a class B might have a field y, referring to objects of class A:D0m(B.y) = A _ A Starcounter QL query may be formulated to retrieve all objects of class B, togetherwith objects of class A which they reference with y field: SELECT B, y FROM B which would translate to Qz I H{B.Ûid,B.y}(IÛB.(B)) and the domain of the result of such query would be a Cartesian product of two classes as domains:D0m(Q2) = B >< A However, a query may instead retrieve values of x field of objects which are referred by y field of objects of class B:SELECT y.x FROM B which would translate to an equi-join using an oid attribute: Q3 I H{B.y.x}(pB. Dä pB.y.

Note that a prefix for the table A, representing the domain D0m(B.y) is chosen as thefully-prefixed name of that attribute.ln order to perform this translation, one needs to know the column type definition D0m(B.y)=A, which is part of the database schema. ln Starcounter QL, one can define a table using a name of another table as a column type: CREATE TABLE B (y A)A star syntax can be used to extract all non-oid fields of y objects:SELECT y.* FROM B Which would translate to Natural Algebra asQ4 I HB.y.H*(A)(IÛB.(B)(B.y) M (BymdpÛBy.

Longer property chains, as in SELECT y.x.z FROM B where D0m(x) = C and z e H(C), will translate to a superposition ofjoins: QS I H{B.y.x.z} Dä pB.y. Dä pB.y.x Note on null-tolerance in property chains There is an ongoing discussion regarding the semantics of property chain with respect tothe NULL values in object-referencing attributes. The queries Qg-QS from the aboveexample, given the Natural Algebra translations provided, will return no rows for those B.y values which are NULL.

Some applications might instead prefer as many result rows as there are bindings forB.y, whether NULL or not. This alternative semantics can be easily accommodated, by using left equi-joins instead of the default (inner) equi-joins: Q3 I H{B.y.x}(pB. M pB.y. Q4 I HB.y.H*(A)(IÛB. (B)(B.y) M (BymdpÛBy. QS I H{B.y.x.z}«/ÛB. (B)(B.y) M (Byßzfq/Ûßy.(A)) According to the definition of the left join, these Natural Algebra expressions with returnNULL values in those result rows where there is no object of class A with oid matching B.y attribute (or no object of class C with oid attribute matching the respective B.y.x, for the case of QS ).

Note on intermediate translations For the embodiments that perform a translation to the Natural Algebra in a schema-ignorant way (which was possible in all the examples so far), translating a property chaindirectly to a join is not possible. An intermediate translation can be formed instead: The last two queries from the Example 23 can be first translated asQ4 = H{B.y[»«]}(PB.(BQS = Ü{B.y[x][Z]}(/9B. (B These translations use a property dereference operator applied to a certain propertyin a projection. Now we can formulate a transformation rule translating every occurrence of to an equi-join, thus making a derived operator: R1: H{a[*]}oX (B) HQ.H*(A)UX (Bal) M (a.uzd)pa.(A)) RZ: HQILUHUX LIGHUÛAHA:A:DÛm(a)^peH(A) >H{a.p}UX (B Dä pa.(A)) Where B is any Natural Algebra expression, Xis any header expression, p is a propertyand a is an attribute expression which may involve another instance of operator. ln case a is not a simple property, but has a form b[q], the condition D0m(a)eA is defined as follows: D0m(b[q]) = A :f EIC : C = D0m(b)/\ q e H(C)/\ EIA : D0m(q) which effectively means that attribute q has A as its domain, and also belongs to thedomain of b. This definition works recursively, since b is not required to be a single attribute either.The first application of the rule RZ to QS with the following match: symb0l| a |p|X| B |Amatch|B.y[x]|z|®|pB_B)|C would transform it to: QSU) I H{B.y[x].z} (pB. Dä pB.y[x]. I = H{B.y[x].Z}(/3B.(B)W Paypqlc) W Üš-J/[X] = B-J/[xl-Ûídlndeed, the dereference operator can appear as part of any property used in a scalarexpression defining a generated set, so we will need another rewrite rule for this case: aeH(B)A¶A: A:D0m(a )ApeH(A) C(a|:p].c~>a.p.c)D R3: H{a[p].b}UX Dä ,H{a.p.b}UX ((BDQpa.(A)) where a and b are property expressions, p is a single property, and Xis the set ofother attributes in the projection, B is a Natural Algebra expression with header containing attribute denoted by a, and C is the rest of the expression, which will be bound to the join identity element if there are no more terms in the join.

The value transformation function C(a[p].c -> a.p.c)D obtains a value for rule variableD by exhaustively applying the given attribute expression transformation as a sub-rule tothe binding of the variable C in the input pattern. Note that c bindings might be different in each application of this sub-rule, whereas a andp are bound by the host rule.

Applying rule R3 to the latest form of QS with the following match: symb0l| a |p|b|match |B.y|x|z| we obtain Qšz) which is the translation of QS in the Example 23:QSQ) I H{B.y.x.z} Example 24. Path expressions: reverse properties Starcounter QL supports a full set of path expressions, borrowed from W3C SPARQL 1.1 Standard. The key semantic feature is the ability to return zero, one, or many values for a property. ln Example 23 we were assuming that B.y attribute acts as a foreign key,i.e. it points to at most one existing object in Domóø). However, such constraint is notrequired for the property chain syntax to work.

One example of multi-value property is a reverse property. A Starcounter QL querySELECT p.employer.^employer FROM Person p WHERE p.name = 'Joe'given the schema CREATE TABLE Company (name TEXT) CREATE TABLE Person (name TEXT, employer Company) would return objects of class Person, one object for each employee of a companywhere person named 'Joe' is employed. While p.employer has the type Company,p.employer.^employer property chain results in the class having employerproperty of type Company, that is, Person class (it is currently being discussed how to handle cases where multiple classes match this criterion).

The Natural Algebra translation of this query, given the presented schema, is Q I H{Pers0n .0íd} (Person ) Dä (Personßmployer) pPerson. (Person {p'name :y 'IOeV which corresponds to an executable S-expression in the preferred embodiment (PROJECT (N-JOIN (E-JOIN (PREFIX Person "p")(PREFIX Person "Person")#(employer) #(employer))(FILTER (EQUAL p.name "Joe") )) # (Person . oid)) Table (class) Company does not participate in the expression, although it is the domain of the employer attribute used in the equi-join.

As in the previous case, this translation can be obtained in two steps, (1) astraightforward schema-ignorant translation using the reverse property dereference operator:Q I Hbzmemployer^[empl0yer]}(pp.(Pers0n)>Q {p'name IVJOeV» and (2) rule-based transformation for the found among the projection attributes: R4: H{a^[p]}oX (B) 3A:A:DÛm(a):DÛm(p)AHC:pGH(C) >H{C_Ûzd}oX (B91) M (c_p)pc. with the following match symbol | a | p |X | B | A | Cmatch | pemployer | employer | ø | pp_(Person) | Company | Person resulting in the translation of the query based on the equi-join, as presented above.

Example 25. Path expressions: alternative paths Another case of a path expression producing 0-to-many results for each object instanceis the syntax of alternative paths. Consider a slightly modified Person table CREATE TABLE Person (name TEXT, employer Company, lastEmployer Company) And a query returning all people who have company named 'Starcounter' as a current or previous (last) employer: SELECT p FROM Person p WHERE p.(employer|lastEmployer).name = 'Starcounter' ls translated to a theta-join, having a disjunction in the join predicate: Q = Hlpm-d; ((15. (Ferm) W ßcümpw. (Cßmpﬂﬂyﬁm p.employerïCompanyoidvp.laszEmployer:CompanyoídM {Company.name ='Starcountaf'}) which corresponds to an executable S-expression in the preferred embodiment (note thetheta-join is 'flattened' to a natural join and a filter): (PROJECT (N-JOIN (PREFIX Person "p") (PREFIX Company "Company")(FILTER (OR (EQUAL p.employer Company.oid)(EQUAL p.lastEmployer Company.oid) ))(FILTER (EQUAL Company.name "Starcounter")))# (p.oid) ) Again, in order to arrive to this translation, the type of the properties employer andlastEmployer needs to be known. An intermediate translation can be constructed in a schema-ignorant way first, using the alternative v operator inside the property reference: Q = H { pm. d}( p I, (Person) N {p[empl0yer v lastEmployel/maame] ='Starc0unter'}) A transformation rule, aware of the types of alternative property paths would then simplytranslate the occurrences of the property references with v operator inside to the theta-joins (or, depending on the chosen property chain semantics, to the left theta-joins) with disjunction-based join criterion.

Future versions of Starcounter QL may allow different object types of the alternativeproperty paths, in which case the type of the property chain, or the handling of furtherproperties along the property chain (as name in this case), if they only belong to some but not all alternative types, needs to be defined at that point. 6. Query execution environment The Natural Algebra expressions are capable of fully encoding database queriestranslated from a wide range of query languages. They can also be transformed todifferent forms using the transformation rules framework, including the introduction of physical operators into the expression.

Any embodiment should include an executable form of each valid Natural Algebraexpression, and this form should be reachable with transformation rules available in thesystem. For example, the preferred embodiment based on S-expressions, as shown inexamples in Section 2, provides a fully composable context-free execution of eachNatural Algebra operator, hence any expression form is executable there.

A different embodiment might define a dedicated set of physical operators, partially orcompletely disjoint with the set of basic and derived Natural Algebra operatorsintroduced so far. A Natural Algebra expression, once fully translated to use this set ofoperators, may be set to a compiler or interpreter subsystem, where, given access to thedata in the database, computation of the result will take place.

Any such execution environment will either feature context-free execution (and thusrequire correlated subquery rewriting techniques such as predicate lifting and finitescaffolds) or facilitate passing of execution context to subexpressions in the form ofbound attributes/variables. 6.1 Compositional (context-free) execution model For example, let's return to the query from Example 7, which was partially translated to use physical operator Rangey in Section 4.3: Q I p/LQON/ÛB.(Rangeëwßyﬂuﬁ)More decisions on its execution may be taken by choosing an access method for A, forexample Scan(A), and a particular implementation of the natural join, for example nested-loop join MW: Qi” = /JAKSCCIHUIW”iJpß_(Rangeåw,,tx>(ß)) Even in context-free setting, a nested-loop join operator allows local passing of theattributes bound by the left operand into the right operand, and a the renaming operator p, which deals only with headers, is transparent to this context passing, hence A.x is 'visible' when executing Rangey _ 6.2 Execution model with full context passing ln Example 14 we translated an SQL query with correlated aggregate subquery into aNatural Algebra expression without any finite scaffolds or lifted predicates: Qi I (IÛA.(A)M {A-W > 0D MPC.

A number of logical-to-physical transformations would further produce a query usingrange indexes on both stored k-relations, and a nested-loop join: Qg) I pA.,+w)(A))NNLJ/Ûc.

Executing this query will require passing the A.x bindings all the way into thesubexpressions in the right operand of the nested-loop join. ln particular, executing the vw grouping and aggregate operator n with grouping variable bound would effectively produce zero or one group (result tuple) per binding.One of the more straightforward ways to implement such execution model would be to 1) Allocate an 'empty' tuple, capable of containing all attributes used anywhere in the Natural Algebra expression, whether they are projected-out or returned in the result. 2) Recursively execute the physical Natural Algebra expression starting from the rootallowing the operators to iterate on the results of their arguments' subexpressions. Everyreturn from any such operator provides values for some of the attributes in the result tuple. This result tuple is visible at all levels. 3) Each time the top operator returns a row, the result tuple is complete, and returned as part of the result set. 4) The subsequent recursive calls through the operator tree rewrite the attribute bindings. 61

Claims

CLA||\/IS

1.

2. A system (100) for representing, optimizing, and answering queries against relational databases, or object-oriented databases, or graph databases, comprising: specification of tuples and k-relations (111), where rows, object instances, orgraph nodes and edges are represented by tuples, and stored collections ofrows, object instances, or graph nodes and edges are represented by k-relations,wherein the collections represented are either sets or multisets; specification of two basic algebra operators: natural join (113) and natural union(114), which allow deriving one or more other database algebra operators as acomposition of these two basic operators over the stored k-relations (111), emptyk-relations, and generated sets (112), which are infinite k-relations finitelycomputable in the context of the query being processed; means for representing database queries as natural algebra expressions, whichare expressions comprising one or more of a natural join operator (113), a naturalunion operator (114), and other derived operators (120) over stored or generatedk-relations; and means for transforming the said expressions by systematically applyingtransformation rules, including specification of these rules, and sets of rules forseparate query processing tasks such as simplification, normalization, logical andphysical optimization.

3. The system of claim 1, wherein the means for representing database queries as natural algebra expressions are further configured for processing database queries which employ any combinations of the following features: selection, projection, inner and outerjoins, set operations such as union, intersection, difference,extension with computed attributes, grouping and aggregation, quantified subqueries, and property path expressions, by means of expressing these queries either in terms of the two basic operators 62 (113) and (114), or in terms of any derived Operators (120).

4. _ The system of claim 1, wherein the means for transforming the said expressions are are further configured for optimizing database queries represented in terms of naturaljoin (113), natural union (114), and, optionally, any derived operators (120), byapplying directed transformation rules, where the transformation rules are formulatedas pairs of input and output patterns, which are natural algebra expressions wheresome k-relations are substituted by variables, the said pairs optionally extended withan applicability condition, which is a Boolean function over rule variables and afunction for transforming the values bound to the rule variables when the rule isapplied.

5. _ The system (100) of claim 1, further comprising object-oriented data structures for representing database queries as natural algebra expressions, representing sets oftransformation rules, and allowing compilation of the query optimization and execution processes into a machine-executable program.

6. _ A method (200) for relational, object-relational, or graph query answering based on a sequence of algebraic representations (221, 241, 242, 243), comprising: ø translating (211 - 219) a database query (201 - 209), to a Natural Algebra representation (221 ), v systematically applying (230) transformation rules (231-233) in order to transformthe Natural Algebra representation (241) of the query being processed to alogically optimized representation (242), and eventually to a physically optimizedexecutable representation (243), ø evaluating (280) the physically optimized executable representation against adatabase instance (270) in order to obtain a query answer (290).

7. _ The method of claim 5, further comprising translating queries in other supported query languages, based on the alternative algebraizers (210) for each supportedquery language, into a Natural Algebra representation, (221).

8. _ The method of claim 5, further comprising translating queries with path expressions, such as property chains, reverse property chains, and alternatives of pathsubexpressions, to their Natural Algebra representation, by introducing additionalderived operators to represent these path expressions, and then rewriting them to 63

9. natural joins and natural unions in a rule-based translation step using the rules (231). A computer program loadable into a memory communicatively connected or coupledto at least one data processor, comprising software for executing the methodaccording any of the method claims 5-7 when the program is run on the at least onedata processor. A processor-readable medium, having a program recorded thereon, where theprogram is to make at least one data processor execute the method according to ofany of the method claims 5-7 when the program is loaded into the at least one dataprocessor. 64