WO2015148739A1

WO2015148739A1 - System and methods for data integration in n-dimensional space

Info

Publication number: WO2015148739A1
Application number: PCT/US2015/022595
Authority: WO
Inventors: Spyro Mousses; Christopher YOO; Toni R. FARLEY
Original assignee: Systems Imagination, Inc.
Priority date: 2014-03-26
Filing date: 2015-03-26
Publication date: 2015-10-01

Abstract

A system comprising a general framework for perceiving data as points in n-dimensional space, and methods to map data in specialized structures to the framework. The framework provides flexibility and scale in integrating data for system interoperability in a non-intrusive manner that does not impose a standard on the external data resources. The framework comprises an abstraction layer that forms a network, allowing disparate data elements to converge in n-dimensional space, and methods to map existing data structures to the abstraction layer.

Description

SYSTEM AND METHODS FOR DATA INTEGRATION IN N-DIMENSIONAL SPACE

FIELD OF THE INVENTION

[0001] The present invention relates to data integration and in particular to a system and methods for integrating disparate data in n-dimensional space using a generalized framework.

BACKGROUND OF THE INVENTION

[0002] Data is stored and transported in disparate formats, such as spreadsheets (flat file databases), structured databases (e.g. relational, object, object-relational, hierarchical, network, triplestore or document), graph structures (e.g. binary graph, graph, or hypergraph), etc. These disparate formats inhibit system interoperability. Portable data formats aim to solve this problem by introducing intermediate data structures to support the transport of data (import/export) between systems, such as Extensible Markup Language (XML) and specializations thereof, such as Graph Markup Language (GraphML) and Resource Description Framework XML (RDF/XML). Existing portability formats are either cumbersome to marshal data in and out of, such as XML, or too restrictive in form, such as RDF/XML. Many portability solutions support the "aggregation" of data, with limited support for "linking" aggregate data in a common format. Typically, linking is supported by the addition of extraneous data, such as metadata and tags, which exacerbates the complexity and spatial footprint of the original data.

[0003] Both aggregation and linking are necessary to "integrate" data. What is needed is a generalized solution that views all data formats as specializations of the general form. The system should be non-intrusive to external data resources, and not require a standard be adopted for structuring or sharing data.

[0004] The present disclosure provides a system comprising a general framework for perceiving data as points in n-dimensional space, and methods to map data in specialized structures to the framework. The framework provides flexibility and scale in integrating data for system interoperability in a non-intrusive manner that does not impose a standard on the external data resources. The framework comprises an abstraction layer that forms a network, allowing disparate data elements to converge in n-dimensional space, and methods to map existing data structures to the abstraction layer.

SUMMARY OF THE INVENTION

[0005] To overcome challenges associated with data stored in different locations and in disparate formats, the present invention describes a system and methods for integrating data in an n-dimensional framework, comprising steps to aggregate and relate data across a plurality of formats.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:

[0007] FIG. 1 is two relational database schema models for a lab;

[0008] FIG. 2 is an XML schema for a clinic;

[0009] FIG. 3 is part of an XML schema from DrugBank;

[0010] FIG. 4 is part of a spreadsheet from NCBI Gene;

[0011] FIG. 5 is a relational database schema model for a store; and

[0012] FIG. 6 is an XML schema for inventory management.

DETAILED DESCRIPTION OF THE INVENTION

[0013] The present disclosure provides a system for managing disparate locations and formats of data, and methods for data mapping that assimilate/integrate data in a framework that allows the data to be linked without adding additional data as annotations (meta-data, keywords, tags, etc.) on the data elements. The system uses a framework for unsupervised data assimilation, and methods illustrate how to map data from disparate formats to the general framework.

[0014] Aggregating and linking rules in the framework are related to the format of the original data "input". Some data formats include:

1. Relational database schema, where a database may contain a plurality of tables. Each row in a table represents an entity (record), columns (fields) hold keys, and cells hold values. Special keys are assigned to some columns as primary key/foreign key to form relationships.

2. Flat file or spreadsheet, where a file contains a single spreadsheet with row, column, and cell semantics similar to a singular table in a relational database. Entities in the same spreadsheet are related.

3. Extensible markup language (XML), where a document contains entities, which are elements identified by a pair of tags. Attributes are stored on elements as key/value pairs, or as simple "child" elements. A root element that contains other elements may exist. An element's "child" elements may also refer to other elements in the document to form relationships.

4. Document database, where the database contains documents as entities, and key/value pairs are stored on the entities. Collections of documents may also exist within a single database. Relationships may be formed by one document referencing or embedding other documents.

5. General graph, where the graph structure may be stored in a file using a GraphML, adjacency list, or other format. Nodes in the graph represent entities, and may be annotated with attributes as key/value pairs. Relationships are formed by edges connecting nodes.

6. Generalized hypergraph (described below), where each element is an entity, and key/value pairs are stored on entities. Relationships are present in an element's internal/external sets.

7. C⁴ model (described below), where each element is an entity, which may be annotated with attributes as key/value pairs. Relationships are present in the sublists of each entity.

[0015] Equivalent types of components for some formats are shown in Table 1, where the last row denotes a specific data element.

Table 1 Equivalent types of components in different input types

[0016] The present invention may also be applied to data modeled in new ways, such as the generalized hypergraph and C⁴ models described below, with components mapped in a similar manner as Table 1.

[0017] Generalized Hypergraph

[0018] From patent application number US 13/463,603: A graph is defined as G(V, E), where V is a set of vertices (nodes) on the graph and E is a set of edges (links) between two vertices. A hypergraph is a generalization of a graph where an edge can connect any number of vertices, and we have E = {e_i | e_i ⊆ V}. We refer to the nodes and edges of the graph as elements, and define this set ε = V ∪ E. We extend the hypergraph model to allow links among edges, then E = {e_i | e_i ⊆ ε}. An element in this model may be defined as consisting of the components:

1. Unique Identifier (UID)

2. Attribute Set: key/value pairs

3. Internal Element Set: UIDs of elements contained in this element

4. External Element Set: UIDs of elements that contain this element
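By way of illustration only, the four components of an element listed above may be sketched in Python as follows. The names Element, contain, and store, and the sample UIDs, are assumptions made for this sketch; it simply keeps the internal and external element sets reciprocal when one element is placed inside another.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One node or edge of the generalized hypergraph (hypothetical layout)."""
    uid: str
    attributes: dict = field(default_factory=dict)  # key/value pairs
    internal: set = field(default_factory=set)      # UIDs contained in this element
    external: set = field(default_factory=set)      # UIDs that contain this element


def contain(store: dict, outer_uid: str, inner_uid: str) -> None:
    """Record that `outer` contains `inner`, updating both sets together."""
    store[outer_uid].internal.add(inner_uid)
    store[inner_uid].external.add(outer_uid)


# Illustrative data: two vertices and two edges.
store = {uid: Element(uid) for uid in ("v1", "v2", "e1", "e2")}
contain(store, "e1", "v1")  # edge e1 connects vertices v1 and v2
contain(store, "e1", "v2")
contain(store, "e2", "e1")  # edge e2 links to another edge, as the extended model allows
store["v1"].attributes["label"] = "example"
```

Because an edge is itself an element, the same two sets describe vertex membership and links among edges without a separate structure.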

[0019] C⁴ Model

[0020] Knowledge may be defined by a collection of related entities. An entity may be defined by a unique identifier, and a pair of ordered lists comprising a number of sublists each. For example, the following four sublists represent four ways in which an entity relates to another entity, as described in Table 2.

Table 2 Four sublists of an entity (C⁴).

[0021] In Table 2 the pair of lists are denoted α and β, and the four sublists are collectively referred to as C⁴. These lists, along with a unique identifier (UID), define an entity as:

UID, α, β (1)

[0022] Additionally, we may add a set of key/value pairs to an entity to capture additional attributes and/or annotations on the entity, resulting in the definition of an entity as:

UID, Attribute_Set, α, β (2)

[0023] The lists α and β in Table 2 have a reciprocal relationship. For instance, given an entity, x, the sublists in x(α) contain the UIDs of other entities that relate to this entity by the semantic meaning given in Table 2, as:

1. composed of (has-a): x is an entity that is made up of the entities in this sublist

2. includes: x is a general classification of the specific entities in this sublist

3. derived from: x is a concept or entity that is derived from the combination of the entities in this sublist

4. caused by: x is an effect that is caused by the entities in this sublist

[0024] The sublists in x(β) contain the UIDs of other entities that relate to this entity by the semantic meaning given in Table 2, as:

1. part of: x is a part of the entities in this sublist

2. member of (is-a): x is a member of the classification entities in this sublist

3. contributes to: x is one of the plurality of interacting entities that contribute to the derived entity or concept in this sublist

4. effects: x is a cause that results in the effects in this sublist

[0025] By these definitions, the first sublist allows for abstractions, where an entity x can be viewed as a singular entity, or expanded and viewed by the sum of its parts using α(Composition).

[0026] Further, a sublist may begin with a binary digit specifying whether or not an ordering is imposed on the items in the list, where 0 denotes unordered and 1 denotes ordered.
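As a non-limiting sketch, an entity of the form in (2) and the reciprocal relationship between the α and β lists may be expressed in Python as follows. The sublist names are taken from the semantic meanings listed above; pairing them by position, and the names Entity, relate, and store, are assumptions of this sketch. The optional ordering flag of paragraph [0026] is omitted.

```python
from dataclasses import dataclass, field

# alpha[i] is treated as the reciprocal of beta[i] (an assumption of this sketch).
ALPHA = ("composed_of", "includes", "derived_from", "caused_by")
BETA = ("part_of", "member_of", "contributes_to", "effects")


def _empty(names):
    return {name: [] for name in names}


@dataclass
class Entity:
    uid: str
    attributes: dict = field(default_factory=dict)
    alpha: dict = field(default_factory=lambda: _empty(ALPHA))
    beta: dict = field(default_factory=lambda: _empty(BETA))


def relate(store, x_uid, alpha_name, y_uid):
    """Add y to x's alpha sublist and x to y's reciprocal beta sublist."""
    beta_name = BETA[ALPHA.index(alpha_name)]
    store[x_uid].alpha[alpha_name].append(y_uid)
    store[y_uid].beta[beta_name].append(x_uid)


# Illustrative entities only.
store = {uid: Entity(uid) for uid in ("car", "engine", "vehicle")}
relate(store, "car", "composed_of", "engine")  # car has-a engine
relate(store, "vehicle", "includes", "car")    # car is-a vehicle
```

Writing both directions in a single call keeps the two lists consistent, so either side of a relationship can be traversed from the entity alone.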

[0027] System Framework

[0028] Points in two-dimensional space may be defined by an (x, y) coordinate system. In the data formats shown in Table 1, both a relational database table and a spreadsheet store data in rows and columns, which intersect at "cells" that can be referenced by (row, column) as a two-dimensional coordinate. Points in three-dimensional space may be defined by an (x, y, z) coordinate system. A relational database potentially comprises many tables, and we may reference a cell in a table by (row, column, table). Finally, we may have more than one database, and following this method, we may reference a cell using a four-dimensional coordinate system as (row, column, table, database).

[0029] To facilitate discussion, in the present system, we may want to transpose the coordinate system to refer to a point as (database, table, column, row), which maps the last three coordinates as (z, y, x) in 3-dimensional space. This provides a framework similar to Internet Protocol (IP) addressing, having a form shown in (3), where α refers to a location, which may be an IP address, and β, γ, δ, ε refer to points in a four-dimensional coordinate system.

α, β, γ, δ, ε (3)

[0030] The structure in (3) is minimally sufficient to locate a piece of data, but can be expanded to additional dimensions. Looking again at Table 1, the first four rows have a direct correspondence to β, γ, ε, δ.

[0031] In addition to defining coordinates to locate data, the data itself may be fetched by the system and stored with the coordinate using another dimension ω, then we have (4). Data may be fetched on demand, or pre-fetched and shared (exported/imported) with other systems along with the coordinate system.

α, β, γ, δ, ε, ω (4)

[0032] We may specify iteration over all columns Y and all rows X using a "mask" as shown in (5) and (6).

α, β, γ, Y, X (5)

<location>.<database>.<table>.Y.X (6)

[0033] Then, (5) refers to all of the complete data records (all columns and rows) stored in table <table> of <database> stored at <location>. The text references in (6), and the text values for each y ∈ Y, may map to numbers generated by and known to the system. System servers may share these map tables, and mapped values, similar to how domain name servers (DNS) operate on the Internet.
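The coordinate, mask, and map table described above may be illustrated with the following minimal Python sketch. The class MapTable, the wildcard ALL (standing in for the masks Y and X), the numbering scheme, and the sample cell values are all assumptions made for illustration; the text does not prescribe how numbers are assigned.

```python
class MapTable:
    """Assigns a stable number to each text label, per dimension."""

    def __init__(self):
        self._numbers = {}

    def number(self, dimension, text):
        labels = self._numbers.setdefault(dimension, {})
        # The first label seen in a dimension gets 1, the next 2, and so on.
        return labels.setdefault(text, len(labels) + 1)


ALL = "*"  # stands in for the masks Y (all columns) and X (all rows)


def matches(mask, coord):
    """True when every non-masked component of `mask` equals `coord`."""
    return all(m == ALL or m == c for m, c in zip(mask, coord))


table = MapTable()
loc = table.number("location", "<lab URL>")
db = table.number("database", "Patient")
tbl = table.number("table", "patient")
age = table.number("column", "age")

# (location, database, table, column, row) -> value; the value plays the role of omega in (4).
cells = {
    (loc, db, tbl, age, 1): 42,
    (loc, db, tbl, age, 2): 57,
    (loc, db, 99, age, 1): 10,  # a cell in some other table
}

mask = (loc, db, tbl, ALL, ALL)
selected = {c: v for c, v in cells.items() if matches(mask, c)}
```

Here the mask selects every column and row of one table while leaving cells of other tables out, which is the behaviour described for (5) and (6).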

[0034] Once all of the desired data is fully mapped in n-dimensional space, the next step is to align the data so that related data intersects. This may be handled by maintaining an offset to entities known to the system. This offset for an entity may be stored in the map table. Computing the offset from one data store to the next allows them to overlap spatially to recover related data. An offset computed on one dimension provides one degree of freedom to each coordinate stored in the system. In some embodiments, k ≤ n dimensions may be offset to provide k degrees of freedom. An offset, Δ, may be represented in the coordinate system associated with its coordinate dimension as shown in (7).

α.β.γ.δ, Δ.ε.ω (7)

[0035] As the system is introduced to new shared columns, new offsets are created and stored in the map table. This may be a semi-automated process, wherein upon accessing a new data set to integrate, the system:

1. for each δ component,

2. look up value (including mapped "synonyms") in map table,

3. if exists, check that values are of similar type and format, and if there is overlap with existing data,

4. if success, add offset,

5. if name not found, but a similar value set exists under another name, log a query to later be addressed by a human operator.

[0036] Step 5 of the method asserts that when column names do not match, the system will look for columns with similar types/formats of values, and log a hypothesis that any found column names refer to the same entity. Then, a system administrator can check this hypothesis, and instruct the system to either accept or reject it.
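The five steps above may be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function integrate, the layout of map_table and synonyms, the "similar type" test (same Python types), the "overlap" test (shared values), and the offset being computed as the entity's known number minus the column's local index are all hypothetical choices, and the sample values are made up.

```python
def integrate(columns, map_table, synonyms, pending):
    """columns: list of (local_index, name, values) for a new data set.

    map_table: canonical name -> {"number": int, "values": set}
    synonyms:  alternative name -> canonical name
    pending:   list collecting hypotheses for a human operator (step 5)
    Returns {local_index: offset} for the columns that could be aligned.
    """
    offsets = {}
    for local_index, name, values in columns:        # step 1
        values = set(values)
        canonical = synonyms.get(name, name)          # step 2
        entry = map_table.get(canonical)
        if entry is not None:                         # step 3
            same_type = {type(v) for v in values} == {type(v) for v in entry["values"]}
            overlap = bool(values & entry["values"])
            if same_type and overlap:                 # step 4
                offsets[local_index] = entry["number"] - local_index
            continue
        for other, candidate in map_table.items():    # step 5
            if values & candidate["values"]:
                pending.append((name, other))
    return offsets


map_table = {"gene": {"number": 7, "values": {"BRCA1", "TP53"}}}
synonyms = {"gene-name": "gene"}
pending = []
columns = [
    (2, "gene-name", ["TP53", "KRAS"]),  # known synonym with overlapping values
    (3, "symbol", ["BRCA1"]),            # unknown name, similar value set
    (4, "notes", ["n/a"]),               # unrelated column
]
offsets = integrate(columns, map_table, synonyms, pending)
```

In this sketch the "symbol" column is not aligned automatically; it is only logged in pending as a hypothesis for an operator to accept or reject.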

[0037] In some embodiments, a geospatial, spatiotemporal, or any other coordinate may be included in the coordinate system.

[0038] In some embodiments, the system may behave in an unsupervised manner, by including learning methods to automatically test hypotheses and accept or reject them without human user intervention.

[0039] In some embodiments, dimensions may be offset and mapped using a transposed form of the data, where rows and columns both represent entities, and each entity may map to entities in other data sets.

[0040] Rows and columns in a relational database or spreadsheet are ordered, while their counterparts in other formats (e.g. elements, documents, nodes, etc.) may not be. However, an order may be imposed. For an XML document, the order of elements may be based on the order in which they appear in the document schema. Documents in a document database may be ordered by the document's UID. Nodes in a graph may be ordered by UID, or the order in which they appear in a specific graph representation (e.g. adjacency lists). Child elements in XML, attributes, keys, and annotations may be numbered by the order in which they appear.
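As one possible illustration of imposing an order, the following Python sketch numbers the entities of a small XML document by position and numbers their child elements by first appearance, yielding (row, column) style coordinates. The sample document and the names key_numbers and cells are assumptions; the sketch uses order of appearance in the document itself rather than in a schema.

```python
import xml.etree.ElementTree as ET

# Made-up document for illustration.
doc = """
<patients>
  <patient><id>p1</id><name>A</name><birthdate>1970-01-01</birthdate></patient>
  <patient><id>p2</id><name>B</name><birthdate>1980-06-15</birthdate></patient>
</patients>
"""

root = ET.fromstring(doc)
key_numbers = {}  # child tag -> column-like number, by first appearance
cells = {}        # (row, column) -> text

for row, entity in enumerate(root, start=1):
    for child in entity:
        column = key_numbers.setdefault(child.tag, len(key_numbers) + 1)
        cells[(row, column)] = child.text
```

Once numbered this way, the XML content can be addressed with the same coordinate form used for relational tables.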

[0041] The present illustration maps relational databases. Similar methods can be applied to other data formats. In some embodiments, existing methods may be applied to convert other data formats, such as XML and graph, to a relational database format, and then apply the illustrated method.

[0042] Semantic Column Mapping by Abstraction

[0043] In some embodiments, feature generation techniques may be used to align columns with the same semantic meaning, but different types of data, or data measured in different ways. Disparate types and measures may be defined as:

[0044] disparate data types: numeric (integer, floating point), string (alphanumeric), etc.

[0045] disparate measures: quantitative vs qualitative, count vs percentage, age in days vs age in years, etc.

[0046] Feature Generation

[0047] An exemplary feature generation technique takes as input descriptors defined by key/value pairs, and applies a set of rules to generate features.

[0048] The Descriptors may be represented by a set of tuples, where each tuple is one of:

(name, numeric_value, measure) (8)

(name, string_value[, measure]) (9)

[0049] The elements of the tuples in (8) and (9) may be defined as:

[0050] name: a string that identifies the descriptor (e.g. age, color, price, a particular gene name, etc.)

[0051] numeric_value: a numeric value

[0052] string_value: a string of alphanumeric characters

[0053] measure: a particular measure (quantity) on the value (e.g. years, Euros, PPM, kg/mol)

[0054] The name in (8) and (9) may be any attribute or measure that represents an aspect of an entity. The values in (8) and (9) may be any type of data, including numeric, alpha, alphanumeric, or a reference to a location that contains a media file (e.g. image, audio, video, etc.). In some embodiments, the location referenced by a value may refer to a location in an external database or web server. This model may map to a relational database model, as (field, value, "table"), when a row in a table relates to an entity, and different tables are used for different measures with the same semantic, on the same entities.

[0055] Defining descriptors in this way is optimally concise and provides consistency across data sets. Being optimally concise provides the maximum flexibility to support disparate data, and the minimum space to support scalability and portability. Having this consistent format across all data sets reduces the necessary complexity of a rule set that operates on the descriptors.

[0056] The Rule Set may include equivalence rules for translating Descriptors to Features. Rules may be defined based on the domain of the input data (e.g. equivalence rules for molecular data). Rules may form associated equivalencies for disparate types and measures. A rule may exist for all string_values, and all measures (quantities), wherein the first requirement handles disparate data types (e.g. associating string values to equivalent numeric values - equivalent values), and the second requirement handles disparate measures with the same semantic meaning (e.g. associating numeric values with the same semantic meaning that were measured in a different way - equivalent quantities).

[0057] As an example of a rule, descriptors that have discrete values may translate directly to features. As a further example of a rule, descriptors that have discrete values may translate into features that represent finite ranges or sets of discrete values. As a further example of a rule, descriptors that have continuous values may have associated rules that apply statistical techniques to categorize the values into discrete ranges or sets, where each set is a unique feature on the descriptor.

[0058] Examples of some rules are shown in Table 3.

[0059] The first two rules in Table 3 equate a numeric value to a semantically equivalent string value for a given quantity (count). The next two rules equate a string value to another semantically equivalent string value. The last two rules equate disparate quantities to a semantically equivalent feature (age). In the last case, more concise rules may exist to create age ranges.

Table 3 Example equivalence rules

[0060] The rules in Table 3 include statistically derived values that may be computed as:

[0061] mean: a statistical mean

[0062] std: one standard deviation

[0063] The simplicity of a rule set defined in this way is permitted by the preferred method of representing descriptors using the tuples (8) and (9). The tuples and the rule set combine to form a generalized format for providing input to a Processor for feature generation. The generalization provides maximum flexibility as it captures descriptors and rules in a straightforward and concise fashion. Any method that does not rely on generalizing the inputs would require significantly more complex rule sets to capture the same semantics of feature generation, resulting in a system that is not flexible and does not scale. The present approach overcomes the complexities associated with harmonizing across disparate data sets, thereby alleviating the need to perform data pre-processing steps required by other approaches.
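For illustration, descriptors of the form (8) and two kinds of rules may be sketched in Python as follows. The rule table UNIT_RULES, the conversion factor, the low/mid/high labels, the use of the population standard deviation, and the sample descriptors are assumptions of this sketch and are not taken from Table 3.

```python
from statistics import mean, pstdev

# Hypothetical equivalence rules: (name, measure) -> function giving a common quantity.
UNIT_RULES = {
    ("age", "years"): lambda v: v,
    ("age", "days"): lambda v: v / 365.25,
}


def normalize(descriptor):
    """Translate (name, numeric_value, measure) into (name, value in a common measure)."""
    name, value, measure = descriptor
    return name, UNIT_RULES[(name, measure)](value)


def categorize(values):
    """Label each value low / mid / high using one standard deviation around the mean."""
    m, s = mean(values), pstdev(values)
    labels = []
    for v in values:
        if v < m - s:
            labels.append("low")
        elif v > m + s:
            labels.append("high")
        else:
            labels.append("mid")
    return labels


# Made-up descriptors measured in different ways.
descriptors = [("age", 20, "years"), ("age", 14610, "days"), ("age", 80, "years")]
ages = [normalize(d)[1] for d in descriptors]
features = list(zip((d[0] for d in descriptors), categorize(ages)))
```

The first rule handles disparate measures with the same semantic meaning; the second turns a continuous value into a discrete feature using the mean and standard deviation.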

[0064] Following feature generation, features may be stored as an abstraction using the data models described in sections "Generalized Hypergraph" and "C⁴ Model". Then, those abstracted feature columns may be aligned using the methods of the present invention.

[0065] Exemplary Uses

[0066] Embodiments of the present invention may be used in integrating internet user data (for example from previous search history on search engines like Google, BING, and Yahoo, and social interaction and preferences data from social network services like Facebook). The system in the present disclosure can be used to create a very rich situational context to personalize the search experience when consumers are using online resources to find content (movies, books, music, consumer products, services, or any content or resource on the Internet). Using the present invention, it is feasible to link consumer preferences, and even an entire online footprint (including profiles from Facebook, etc.), to an extremely wide variety of online consumer product and service resources.

[0067] Embodiments of the present invention may be used in military intelligence and law enforcement surveillance of threats, achieved by integrating knowledge from disparate data content distributed throughout the Internet, relating to individuals, organizations, and states. For example, there may be thousands of databases across the Internet that hold disparate data about a single terrorist suspect. Some of the content may not be linked to the suspect's name, but may have contextual patterns and semantic concepts that, when integrated into a coherent context, can implicate that suspect in a particular situation. Intelligently linking to those disparate data elements so they can be analyzed requires formulating such a context, and using it to recover relevant content across the Internet, which may be achieved by using the present invention.

[0068] Example 1

[0069] Embodiments of the present invention may be used in intelligently linking and federating medical and scientific databases as a higher order multiscalar knowledge network. Instead of metadata tags, mapping of semantically meaningful data structures to enable content to be linked at the abstracted level provides true integration of content across a federated network. For example, a genomic data database, a pharmacogenomics knowledge base, and a clinical trials information database, all may have content related to a particular drug, such as Gemcitabine. The genomic database has gene mutations that affect a pathway, the pharmacogenomic database has linked that pathway to regulating response to Gemcitabine, and the trial database contains information about a clinical trial to evaluate Gemcitabine. Traditionally, meta-data tags would need to be matched between the genomic and clinical trial database in order to link content. In this example, there is no direct link between the two. Using the present invention, content may be intelligently linked through the network, without having to create hundreds of thousands of tags to describe the content in each of the three networked databases. The present invention may integrate content at an abstracted overlay network level, and link disparate data formats within each content source in a generalized n-dimensional framework. The framework may then be used to find a clinical trial for a patient given only the genomic context stored in the genomic database, without the need to match metadata tags.

[0070] For example, consider the exemplary relational database model in FIG. 1, the XML schemas in FIG. 2 and FIG. 3, and the spreadsheet in FIG. 4.

[0071] The XML schema in FIG. 3 is part of the DrugBank schema. (DrugBank: a comprehensive resource for in silico drug discovery and exploration. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D668-72.)

[0072] The spreadsheet in FIG. 4 is part of the National Center for Biotechnology Information (NCBI) Gene database (http://www.ncbi.nlm.nih.gov/gene).

[0073] For ease of exposition, the Location (α) of each of these datasets is listed in Table 4, and referenced by Label (Lx, where x = [1..4]) in Table 5.

[0074] The coordinate system is derived from the models in FIG. 1, the schemas in FIG. 2 and FIG. 3, and the spreadsheet header row in FIG. 4. Coordinates may be defined as shown in Table 5.

Table 4 Location for example datasets

Label  Location
L1     <lab URL>
L2     <clinic URL>
L3     http://www.drugbank.ca/system/downloads/current/
L4

[0075] The datasets at L1, L2, L3 and L4 are related on fields representing patients, genes and drugs. These relationships are shown in Table 6.

[0076] Additionally, relations exist within a dataset, and may be aligned within the dataset. For example, L1.DrugTarget. A maps to L1.DrugTarget.2A.

[0077] In Table 5, two additional fields (age and birthdate) are related, but do not contain the same types of data:

L1.Patient.1.3 (10)

L2.clinic.xml.1.4 (11)

[0078] The feature generation method described in section "Semantic Column Mapping by Abstraction" may be applied by creating a rule to generate a common feature for (10) and (11). The feature may be stored as an abstraction on each dataset using one of the data models described in sections "Generalized Hypergraph" and "C⁴ Model".
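A rule of this kind may be sketched in Python as follows, assuming the common feature is a decade-wide age range. The function names, the range width, the reference date, and the sample values are assumptions made for illustration only.

```python
from datetime import date


def age_from_birthdate(birthdate: date, as_of: date) -> int:
    """Whole years between `birthdate` and `as_of`."""
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)


def age_feature(years: int) -> str:
    """Map an age in years to a decade-wide range (an illustrative choice of feature)."""
    low = (years // 10) * 10
    return f"age_{low}_{low + 9}"


as_of = date(2015, 3, 26)  # arbitrary reference date for the example

# A value from an age column and a value from a birthdate column (both made up).
lab_feature = age_feature(45)
clinic_feature = age_feature(age_from_birthdate(date(1969, 8, 2), as_of))
```

Both fields resolve to the same feature, so the two columns can be aligned even though one stores a number and the other a date.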

[0079] We can link the data in Table 6, and other related fields, by using an offset to intersect entities in the 6th dimension. Assuming a map table has been previously generated that identifies fields with different names as having the same semantic meaning (using methods to harmonize data across different nomenclatures and ontologies), and provides an offset Δ for this entity, the data sets in Table 6 intersect when shifted in the 6th dimension by Δ. Then, data may be identified as:

α.β.γ.Γ, Δ. (12)

[0080] Once the coordinate system is offset using the map file, the adjusted coordinates may be used to relate data within and across respective data stores.

Table 5 Coordinates for example datasets

Ref     α   Database, Document   Table, Root Element,   γ   Column or Key     δ
            or File (β)          or Spreadsheet
FIG. 1  L1  Patient              patient                1   idpatient         1
                                                            gender            2
                                                            age               3
                                 expression             2   idexpression      1
                                                            gene              2
                                                            RPKM              3
                                                            idpatient         4
        L1  DrugTarget           gene                   1   idgene            1
                                                            symbol            2
                                 drug                   2   iddrug            1
                                                            name              2
                                 target                 3   drug_iddrug       1
                                                            gene_idgene       2
                                                            action            3
FIG. 2  L2  clinic.xml           patients               1   id                1
                                                            name              2
                                                            address           3
                                                            birthdate         4
                                                            gender            5
FIG. 3  L3  drugbank.xml         drugs                  1   drugbank-id       1
                                                            name              2
                                 partners               2   id                1
                                                            name              2
                                                            gene-name         3
FIG. 4  L4  Homo_sapiens.        sheet 1                1   tax_id            1
            gene_info                                       GeneID            2
                                                            symbol            3
                                                            chromosome        4
                                                            map localization  5

Table 6 Related fields in example datasets.

[0081] Example 2

[0082] Entities linked via the Internet may have extraneous data, such as metadata and semantic tags added to achieve some superficial level of situational awareness and content awareness. This approach is very limiting in many ways, and does not scale well for rich contextualization of data sets. Extraneous "tagging" exacerbates the complexity and spatial footprint of the original data. This invention revolutionizes how data is linked across large scale networks that integrate very disparate types of entities, and can be applied to achieve unprecedented situational and semantic content awareness within the Internet, and networks beyond. Conventionally, the Internet of Things (IoT) uses uniquely identifiable objects and maps these to virtual representations in an Internet-like structure, making it possible to connect a wide variety of devices (and the data they generate) through the Internet. Similarly, the Internet of Everything (IoE) goes further to use meta-data tags to connect people, processes, data, and other things to make networked connections more relevant and valuable. Given that the goal for the IoE is to turn information into actions that create new capabilities, richer experiences, and unprecedented economic opportunity for businesses, individuals, and countries, the present invention may be directly applied for the scalable capturing of much richer situational awareness of data content without adding complexity. Using the system and methods described in this disclosure may create unprecedented interoperability that can scale to any size network, along with the ability to go beyond superficially described connections by truly integrating data entities. The present invention therefore empowers the IoT and IoE by enabling a completely novel way of integrating disparate types of information content. For example, in some embodiments, the present invention may enable deep content awareness and situational knowledge to be captured in an overlay network, which may then support intelligent search, recovery, and interchange of content that is not feasible at a large scale using conventional systems.

[0083] As an illustrative example, consider the online store database schema model in FIG. 5, comprising three entity tables (customer, order, item) and one join table representing a many-to-many, or HasAndBelongsToMany (HABTM), relationship between entities in the order and item tables. The example XML schema for products in FIG. 6 contains product and distributor entities that are linked in a HABTM relationship by the root element products. Table 7 illustrates the related fields in this example.

Table 7 Related fields in example datasets.

[0084] Generating an offset for the barcode entity aligns these two datasets, effectively mapping customers to products to distributors, enabling analysis such as identifying customers with a tendency to purchase products originating from the same distributor.
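The kind of analysis described above may be sketched in Python once the two datasets are aligned on barcode. The rows below are made-up stand-ins for the store database of FIG. 5 and the product document of FIG. 6, and the variable names are assumptions; the sketch joins directly on the shared barcode value rather than computing an offset.

```python
from collections import defaultdict

# Made-up rows: (customer, item barcode) from the store, barcode -> distributors from the XML.
orders = [("alice", "0001"), ("alice", "0002"), ("bob", "0003")]
products = {"0001": ["acme"], "0002": ["acme"], "0003": ["globex"]}

purchases = defaultdict(lambda: defaultdict(int))  # customer -> distributor -> count
for customer, barcode in orders:
    for distributor in products.get(barcode, []):
        purchases[customer][distributor] += 1

# Customers who bought more than one product from the same distributor.
repeat = {c: d for c, counts in purchases.items() for d, n in counts.items() if n > 1}
```

With the barcode as the shared entity, customers are connected to distributors even though neither dataset records that relationship directly.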

Claims

CLAIMS: What is claimed:

1. A system to integrate data in a common framework in n-dimensional space comprising:

a) A coordinate system for identifying data locale comprising a map file for translating textual descriptions to numbers

b) Methods to map data formats to the coordinate system

c) Methods to integrate data using the coordinate system comprising:

i. Identifying the physical location of the data.

ii. Identifying the format of the data.

iii. Identifying the components of the input (e.g. Table 1).

iv. Generating a mask for the data.

v. Generating offsets for dimensions of the data.

vi. Adding offsets to the mask.

2. The method of claim 1 further comprising a plurality of data sets, wherein the data sets are persisted in different geographic locations.

3. The method of claim 1 further comprising a plurality of data sets, wherein the data sets are persisted in different formats.