WO2015148739A1 - System and methods for data integration in n-dimensional space - Google Patents

System and methods for data integration in n-dimensional space

Info

Publication number
WO2015148739A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
methods
entity
dimensional space
entities
Prior art date
Application number
PCT/US2015/022595
Other languages
French (fr)
Inventor
Spyro Mousses
Christopher YOO
Toni R. FARLEY
Original Assignee
Systems Imagination, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Systems Imagination, Inc.
Publication of WO2015148739A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G06F16/94 Hypermedia

Definitions

  • the present invention relates to data integration and in particular to a system and methods for integrating disparate data in n-dimensional space using a generalized framework.
  • Data is stored and transported in disparate formats, such as spreadsheets (flat file databases), structured databases (e.g. relational, object, object-relational, hierarchical, network, triplestore or document), graph structures (e.g. binary graph, graph, or hypergraph), etc.
  • Portable data formats aim to solve this problem by introducing intermediate data structures to support the transport of data (import/export) between systems, such as Extensible Markup Language (XML) and specializations thereof, such as Graph Markup Language (GraphML) and Resource Description Framework XML (RDF/XML).
  • the present disclosure provides a system comprising a general framework for perceiving data as points in n-dimensional space, and methods to map data in specialized structures to the framework.
  • the framework provides flexibility and scale in integrating data for system interoperability in a non-intrusive manner that does not impose a standard on the external data resources.
  • the framework comprises an abstraction layer that forms a network, allowing disparate data elements to converge in n-dimensional space, and methods to map existing data structures to the abstraction layer.
  • the present invention describes a system and methods for integrating data in an n-dimensional framework, comprising steps to aggregate and relate data across a plurality of formats.
  • FIG. 1 is two relational database schema models for a lab
  • FIG. 2 is an XML schema for a clinic
  • FIG. 3 is part of an XML schema from DrugBank
  • FIG. 4 is part of a spreadsheet from NCBI Gene
  • FIG. 5 is a relational database schema model for a store
  • FIG. 6 is an XML schema for inventory management.
  • the present disclosure provides a system for managing disparate locations and formats of data, and methods for data mapping that assimilate/integrate data in a framework that allows the data to be linked without adding additional data as annotations (meta-data, keywords, tags, etc.) on the data elements.
  • the system uses a framework for unsupervised data assimilation, and methods illustrate how to map data from disparate formats to the general framework.
  • Relational database schema where a database may contain a plurality of tables. Each row in a table represents an entity (record), columns (fields) hold keys, and cells hold values. Special keys are assigned to some columns as primary key/foreign key to form relationships.
  • Extensible markup language where a document contains entities, which are elements identified by a pair of tags. Attributes are stored on elements as key/value pairs, or as simple "child" elements. A root element that contains other elements may exist. An element's "child" elements may also refer to other elements in the document to form relationships.
  • Document database where the database contains documents as entities, and key/value pairs are stored on the entities. Collections of documents may also exist within a single database. Relationships may be formed by one document referencing or embedding other documents.
  • the present invention may also be applied to data modeled in new ways, such as the generalized hypergraph and C4 models described below, with components mapped in a similar manner as Table 1.
  • a graph is defined as G(V, E), where V is a set of vertices (nodes) on the graph and E is a set of edges (links) between two vertices.
  • We refer to the nodes and edges of the graph as elements, and define this set ε = V ∪ E.
  • We extend the hypergraph model to allow links among edges; then E = {e_i | e_i ⊆ ε}.
  • An element in this model may be defined as consisting of the components:
  • Knowledge may be defined by a collection of related entities.
  • An entity may be defined by a unique identifier, and a pair of ordered lists comprising a number of sublists each. For example, the following four sublists represent four ways in which an entity relates to another entity, as described in Table 2.
  • Table 2 Four sublists of an entity (C4).
  • x is a general classification of the specific entities in this sublist
  • derived from: x is a concept or entity that is derived from the combination of the entities in this sublist
  • x is one of the plurality of interacting entities that contribute to the derived entity or concept in this sublist
  • the first sublist allows for abstractions, where an entity x can be viewed as a singular entity, or expanded and viewed by the sum of its parts using α(Composition).
  • a sublist may begin with a binary digit specifying whether or not an ordering is imposed on the items in the list, where 0 denotes unordered and 1 denotes ordered.
  • Points in two dimensional space may be defined by an (x, y) coordinate system.
  • both a relational database table and spreadsheet store data in rows and columns, which intersect at "cells" that can be referenced by (row, column) as a two dimensional coordinate.
  • Points in three dimensional space may be defined by an (x, y, z) coordinate system.
  • the relational database potentially comprises many tables, and we may reference a cell in a table by (row, column, table).
  • the structure in (3) is minimally sufficient to locate a piece of data, but can be expanded to additional dimensions. Looking again at Table 1, the first four rows have a direct correspondence to β, γ, ε, δ. In addition to defining coordinates to locate data, the data itself may be fetched by the system and stored with the coordinate using another dimension ω, then we have (4). Data may be fetched on demand, or pre-fetched and shared (exported/imported) with other systems along with the coordinate system.
  • (5) refers to all of the complete data records (all columns and rows) stored in table <table> of <database> stored at <location>.
  • the text references in (6), and text values for each y ∈ Y, may map to numbers generated by and known to the system. System servers may share these map tables and mapped values, similar to how domain name servers (DNS) operate on the Internet.
  • the next step is to align the data so that related data intersects. This may be handled by maintaining an offset to entities known to the system. This offset for an entity may be stored in the map table. Computing the offset from one data store to the next allows them to overlap spatially to recover related data.
  • An offset computed on one dimension provides one degree of freedom to each coordinate stored in the system.
  • k ≤ n dimensions may be offset to provide k degrees of freedom.
  • An offset, Δ, may be represented in the coordinate system associated with its coordinate dimension as shown in (7).
  • Step 5 of the method asserts that when column names do not match, the system will look for columns with similar types/formats of values, and log a hypothesis that any found column names refer to the same entity. Then, a system administrator can check this hypothesis, and instruct the system to either accept or reject it.
  • a geospatial, spatiotemporal, or any other coordinate may be included in the coordinate system.
  • the system may behave in an unsupervised manner, by including learning methods to automatically test hypotheses and accept or reject them without human user intervention.
  • dimensions may be offset and mapped using a transposed form of the data, where rows and columns both represent entities, and each entity may map to entities in other data sets.
  • Rows and columns in a relational database or spreadsheet are ordered, while their counterparts in other formats (e.g. elements, documents, nodes, etc.) may not be.
  • an order may be imposed.
  • the order of elements may be based on the order in which they appear in the document schema.
  • Documents in a document database may be ordered by the document's UID.
  • Nodes in a graph may be ordered by UID, or the order in which they appear in a specific graph representation (e.g. adjacency lists).
  • Child elements in XML, attributes, keys, and annotations may be numbered by the order in which they appear.
  • the present illustration maps relational databases. Similar methods can be applied to other data formats. In some embodiments, existing methods may be applied to convert other data formats, such as XML and graph, to a relational database format, and then apply the illustrated method.
  • feature generation techniques may be used to align columns with the same semantic meaning, but different types of data, or data measured in different ways. Disparate types and measures may be defined as:
  • An exemplary feature generation technique takes as input descriptors defined by key/value pairs, and applies a set of rules to generate features.
  • the Descriptors may be represented by a set of tuples, where each tuple is one of:
  • name a string that identifies the descriptor (e.g. age, color, price, a particular gene name, etc.)
  • numeric_value a numeric value
  • string_value a string of alphanumeric characters
  • measure a particular measure (quantity) on the value (e.g. years, Euros, PPM, kg/mol)
  • the name in (8) and (9) may be any attribute or measure that represents an aspect of an entity.
  • the values in (8) and (9) may be any type of data, including numeric, alpha, alphanumeric, or a reference to a location that contains a media file (e.g. image, audio, video, etc.)
  • the location referenced by a value may refer to a location in an external database or web server.
  • This model may map to a relational database model, as (field, value, "table"), when a row in a table relates to an entity, and different tables are used for different measures with the same semantic, on the same entities.
  • the Rule Set may include equivalence rules for translating Descriptors to Features .
  • Rules may be defined based on the domain of the input data (e.g. equivalence rules for molecular data). Rules may form associated equivalencies for disparate types and measures.
  • a rule may exist for all string_values, and all measures (quantities), wherein the first requirement handles disparate data types (e.g. associating string values to equivalent numeric values - equivalent values), and the second requirement handles disparate measures with the same semantic meaning (e.g. associating numeric values with the same semantic meaning that were measured in a different way - equivalent quantities).
  • descriptors that have discrete values may translate directly to features.
  • descriptors that have discrete values may translate into features that represent finite ranges or sets of discrete values.
  • descriptors that have continuous values may have associated rules that apply statistical techniques to categorize the values into discrete ranges or sets, where each set is a unique feature on the descriptor.
  • Examples of some rules are shown in Table 3.
  • the first two rules in Table 3 equate a numeric value to a semantically equivalent string value for a given quantity (count).
  • the next two rules equate a string value to another semantically equivalent string value.
  • the last two rules equate disparate quantities to a semantically equivalent feature (age). In the last case, more concise rules may exist to create age ranges.
  • the rules in Table 3 include statistically derived values that may be computed as:
  • features may be stored as an abstraction using the data models described in sections "Generalized Hypergraph" and "C4 Model". Then, those abstracted feature columns may be aligned using the methods of the present invention.
  • Embodiments of the present invention may be used in integrating internet user data (for example from previous search history on search engines like Google, BING, and Yahoo, and social interaction and preferences data from social network services like Facebook).
  • the system in the present disclosure can be used to create a very rich situational context to personalize the search experience when consumers are using online resources to find content (movies, books, music, consumer products, services, or any content or resource on the Internet).
  • Embodiments of the present invention may be used in military intelligence and law enforcement surveillance of threats, achieved by integrating knowledge from disparate data content distributed throughout the Internet, relating to individuals, organizations, and states. For example, there may be thousands of databases across the Internet that hold disparate data about a single terrorist suspect. Some of the content may not be linked to the suspect's name, but may have contextual patterns and semantic concepts that, when integrated into a coherent context, can implicate that suspect in a particular situation. Intelligently linking to those disparate data elements so they can be analyzed requires formulating such a context, and using it to recover relevant content across the Internet, which may be achieved by using the present invention.
  • Embodiments of the present invention may be used in intelligently linking and federating medical and scientific databases as a higher order multiscalar knowledge network. Instead of metadata tags, mapping of semantically meaningful data structures to enable content to be linked at the abstracted level provides true integration of content across a federated network.
  • a genomic data database, a pharmacogenomics knowledge base, and a clinical trials information database all may have content related to a particular drug, such as Gemcitabine.
  • the genomic database has gene mutations that affect a pathway
  • the pharmacogenomic database has linked that pathway to regulating response to Gemcitabine
  • the trial database contains information about a clinical trial to evaluate Gemcitabine.
  • meta-data tags would need to be matched between the genomic and clinical trial database in order to link content.
  • content may be intelligently linked through the network, without having to create hundreds of thousands of tags to describe the content in each of the three networked databases.
  • the present invention may integrate content at an abstracted overlay network level, and link disparate data formats within each content source in a generalized n-dimensional framework. The framework may then be used to find a clinical trial for a patient given only the genomic context stored in the genomic database, without the need to match metadata tags.
  • the XML schema in FIG. 3 is part of the DrugBank schema.
  • DrugBank a comprehensive resource for in silico drug discovery and exploration.
  • Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D668-72.
  • NCBI Gene database (http://www.ncbi.nlm.nih.gov/gene).
  • the coordinate system is derived from the models in FIG. 1, the schemas in FIG. 2 and FIG. 3, and the spreadsheet header row in FIG. 4. Coordinates may be defined as shown in Table 5.
  • Li.DrugTarget. A maps to Li.DrugTarget.2A .
  • the feature generation method described in section "Semantic Column Mapping by Abstraction” may be applied by creating a rule to generate a common feature for (10) and (11).
  • the feature may be stored as an abstraction on each dataset using one of the data models described in sections "Generalized Hypergraph" and "C4 Model".
  • the adjusted coordinates may be used to relate data within and across respective data stores.
  • Entities linked via the Internet may have extraneous data, such as metadata and semantic tags added to achieve some superficial level of situational awareness and content awareness.
  • This approach is very limiting in many ways, and does not scale well for rich contextualization of data sets.
  • Extraneous "tagging" exacerbates the complexity and spatial footprint of the original data.
  • This invention revolutionizes how data is linked across large scale networks that integrate very disparate types of entities, and can be applied to achieve unprecedented situational and semantic content awareness within the Internet, and networks beyond.
  • the Internet of Things uses uniquely identifiable objects and maps these to virtual representations in an Internet-like structure, making it possible to connect a wide variety of devices (and the data they generate) through the Internet.
  • the Internet of Everything goes further to use meta-data tags to connect people, processes, data, and other things to make networked connections more relevant and valuable.
  • the goal for the IoE is to turn information into actions that create new capabilities, richer experiences, and unprecedented economic opportunity for businesses, individuals, and countries
  • the present invention may be directly applied for the scalable capturing of much richer situational awareness of data content without adding complexity.
  • Using the system and methods described in this disclosure may create unprecedented interoperability that can scale to any size network, along with the ability to go beyond superficially described connections by truly integrating data entities.
  • the present invention therefore empowers the IoT and IoE by enabling a completely novel way of integrating disparate types of information content.
  • the present invention may enable deep content awareness and situational knowledge to be captured in an overlay network, which may then support intelligent search, recovery, and interchange of content that is not feasible at a large scale using conventional systems.
  • As an illustrative example, consider the online store database schema model in FIG. 5, comprising three entity tables (customer, order, item) and one join table representing a many-to-many, or HasAndBelongsToMany (HABTM), relationship between entities in the order and item tables.
  • the example XML schema for products in FIG. 6 contains product and distributor entities that are linked in a HABTM relationship by the root element products.
  • Table 7 illustrates the related fields in this example.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system comprising a general framework for perceiving data as points in n-dimensional space, and methods to map data in specialized structures to the framework. The framework provides flexibility and scale in integrating data for system interoperability in a non-intrusive manner that does not impose a standard on the external data resources. The framework comprises an abstraction layer that forms a network, allowing disparate data elements to converge in n-dimensional space, and methods to map existing data structures to the abstraction layer.

Description

SYSTEM AND METHODS FOR DATA INTEGRATION IN N-DIMENSIONAL SPACE
FIELD OF THE INVENTION
[0001] The present invention relates to data integration and in particular to a system and methods for integrating disparate data in n-dimensional space using a generalized framework.
BACKGROUND OF THE INVENTION
[0002] Data is stored and transported in disparate formats, such as spreadsheets (flat file databases), structured databases (e.g. relational, object, object-relational, hierarchical, network, triplestore or document), graph structures (e.g. binary graph, graph, or hypergraph), etc. These disparate formats inhibit system interoperability. Portable data formats aim to solve this problem by introducing intermediate data structures to support the transport of data (import/export) between systems, such as Extensible Markup Language (XML) and specializations thereof, such as Graph Markup Language (GraphML) and Resource Description Framework XML (RDF/XML). Existing portability formats are either cumbersome to marshal data into and out of, such as XML, or too restrictive in form, such as RDF/XML. Many portability solutions support the "aggregation" of data, with limited support for "linking" aggregate data in a common format. Typically, linking is supported by the addition of extraneous data, such as metadata and tags, which exacerbates the complexity and spatial footprint of the original data.
[0003] Both aggregation and linking are necessary to "integrate" data. What is needed is a generalized solution that views all data formats as specializations of the general form. The system should be non-intrusive to external data resources, and not require a standard be adopted for structuring or sharing data.
[0004] The present disclosure provides a system comprising a general framework for perceiving data as points in n-dimensional space, and methods to map data in specialized structures to the framework. The framework provides flexibility and scale in integrating data for system interoperability in a non-intrusive manner that does not impose a standard on the external data resources. The framework comprises an abstraction layer that forms a network, allowing disparate data elements to converge in n-dimensional space, and methods to map existing data structures to the abstraction layer.
SUMMARY OF THE INVENTION
[0005] To overcome challenges associated with data stored in different locations, and disparate formats, the present invention describes a system and methods for integrating data in an n-dimensional framework, comprising steps to aggregate and relate data across a plurality of formats.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:
[0007] FIG. 1 is two relational database schema models for a lab;
[0008] FIG. 2 is an XML schema for a clinic;
[0009] FIG. 3 is part of an XML schema from DrugBank;
[0010] FIG. 4 is part of a spreadsheet from NCBI Gene;
[0011] FIG. 5 is a relational database schema model for a store; and
[0012] FIG. 6 is an XML schema for inventory management.
DETAILED DESCRIPTION OF THE INVENTION
[0013] The present disclosure provides a system for managing disparate locations and formats of data, and methods for data mapping that assimilate/integrate data in a framework that allows the data to be linked without adding additional data as annotations (meta-data, keywords, tags, etc.) on the data elements. The system uses a framework for unsupervised data assimilation, and methods illustrate how to map data from disparate formats to the general framework.
[0014] Aggregating and linking rules in the framework are related to the format of the original data "input". Some data formats include:
1. Relational database schema, where a database may contain a plurality of tables. Each row in a table represents an entity (record), columns (fields) hold keys, and cells hold values. Special keys are assigned to some columns as primary key/foreign key to form relationships.
2. Flat file or spreadsheet, where a file contains a single spreadsheet with row, column, and cell semantics similar to a singular table in a relational database. Entities in the same spreadsheet are related.
3. Extensible markup language (XML), where a document contains entities, which are elements identified by a pair of tags. Attributes are stored on elements as key/value pairs, or as simple "child" elements. A root element that contains other elements may exist. An element's "child" elements may also refer to other elements in the document to form relationships.
4. Document database, where the database contains documents as entities, and key/value pairs are stored on the entities. Collections of documents may also exist within a single database. Relationships may be formed by one document referencing or embedding other documents.
5. General graph, where the graph structure may be stored in a file using a GraphML, adjacency list, or other format. Nodes in the graph represent entities, and may be annotated with attributes as key/value pairs. Relationships are formed by edges connecting nodes.
6. Generalized hypergraph (described below), where each element is an entity, and key/value pairs are stored on entities. Relationships are present in an element's internal/external sets.
7. C4 model (described below), where each element is an entity, which may be annotated with attributes as key/value pairs. Relationships are present in the sublists of each entity.
[0015] Equivalent types of components for some formats are shown in Table 1, where the last row denotes a specific data element.
Table 1 Equivalent types of components in different input types
[Table 1: see image imgf000006_0001]
[0016] The present invention may also be applied to data modeled in new ways, such as the generalized hypergraph and C4 models described below, with components mapped in a similar manner as Table 1.
[0017] Generalized Hypergraph
[0018] From patent application number US 13/463,603: a graph is defined as G(V, E), where V is a set of vertices (nodes) on the graph and E is a set of edges (links) between two vertices. A hypergraph is a generalization of a graph where an edge can connect any number of vertices, and we have E = {e_i | e_i ⊆ V}. We refer to the nodes and edges of the graph as elements, and define this set ε = V ∪ E. We extend the hypergraph model to allow links among edges; then E = {e_i | e_i ⊆ ε}. An element in this model may be defined as consisting of the components:
1. Unique Identifier (UID)
2. Attribute Set: key/value pairs
3. Internal Element Set: UIDs of elements contained in this element
4. External Element Set: UIDs of elements that contain this element
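The four components above can be sketched as a small data structure. This is a minimal illustration only, not the implementation from the referenced application; the names `Element`, `Hypergraph`, `add`, and `contain`, and the sample UIDs, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One element (node or edge) of the generalized hypergraph."""
    uid: str                                          # 1. unique identifier
    attributes: dict = field(default_factory=dict)    # 2. key/value pairs
    internal: set = field(default_factory=set)        # 3. UIDs contained in this element
    external: set = field(default_factory=set)        # 4. UIDs of elements containing this one


class Hypergraph:
    """Hypothetical container that keeps internal/external sets consistent."""

    def __init__(self):
        self.elements = {}

    def add(self, uid, **attributes):
        self.elements[uid] = Element(uid, dict(attributes))
        return self.elements[uid]

    def contain(self, container_uid, member_uid):
        # Record the containment on both sides.
        self.elements[container_uid].internal.add(member_uid)
        self.elements[member_uid].external.add(container_uid)


g = Hypergraph()
g.add("v1", label="A")
g.add("v2", label="B")
g.add("e1", kind="link")   # an edge is itself an element
g.contain("e1", "v1")
g.contain("e1", "v2")
g.add("e2", kind="link")   # an edge that links another edge
g.contain("e2", "e1")
```

Because edges and nodes share one element type, an edge such as `e2` can contain another edge, which is the extension that allows links among edges.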
[0019] C4 Model
[0020] Knowledge may be defined by a collection of related entities. An entity may be defined by a unique identifier, and a pair of ordered lists comprising a number of sublists each. For example, the following four sublists represent four ways in which an entity relates to another entity, as described in Table 2.
Table 2 Four sublists of an entity (C4).
[Table 2: see image imgf000007_0001]
[0021] In Table 2 the pair of lists are denoted α and β, and the four sublists are collectively referred to as C4. These lists, along with a unique identifier (UID), define an entity as:
UID, α, β (1)
[0022] Additionally, we may add a set of key/value pairs to an entity to capture additional attributes and/or annotations on the entity, resulting in the definition of an entity as:
UID, Attribute_Set, α, β (2)
[0023] The lists α and β in Table 2 have a reciprocal relationship. For instance, given an entity, x, the sublists in x(α) contain the UIDs of other entities that relate to this entity by the semantic meaning given in Table 2, as:
1. composed of (has-a): x is an entity that is made up of the entities in this sublist
2. includes: x is a general classification of the specific entities in this sublist
3. derived from: x is a concept or entity that is derived from the combination of the entities in this sublist
4. caused by: x is an effect that is caused by the entities in this sublist
[0024] The sublists in x(β) contain the UIDs of other entities that relate to this entity by the semantic meaning given in Table 2, as:
1. part of: x is a part of the entities in this sublist
2. member of (is-a): x is a member of the classification entities in this sublist
3. contributes to: x is one of the plurality of interacting entities that contribute to the derived entity or concept in this sublist
4. effects: x is a cause that results in the effects in this sublist
[0025] By these definitions, the first sublist allows for abstractions, where an entity x can be viewed as a singular entity, or expanded and viewed by the sum of its parts using α(Composition).
[0026] Further, a sublist may begin with a binary digit specifying whether or not an ordering is imposed on the items in the list, where 0 denotes unordered and 1 denotes ordered.
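The entity definition in (2) and the reciprocal relationship between the α and β lists can be illustrated with a short sketch. The class and function names (`Entity`, `relate`), the sublist keys, and the sample entities are assumptions made for illustration; the optional leading ordering digit described in [0026] is omitted for brevity.

```python
from dataclasses import dataclass, field

# Sublist names follow the semantics listed above; position i in ALPHA
# is the reciprocal of position i in BETA.
ALPHA = ("composed_of", "includes", "derived_from", "caused_by")
BETA = ("part_of", "member_of", "contributes_to", "effects")


@dataclass
class Entity:
    uid: str
    attributes: dict = field(default_factory=dict)
    alpha: dict = field(default_factory=lambda: {k: [] for k in ALPHA})
    beta: dict = field(default_factory=lambda: {k: [] for k in BETA})


def relate(x, y, index):
    """Place y in the index-th alpha sublist of x, and x in the matching beta sublist of y."""
    x.alpha[ALPHA[index]].append(y.uid)
    y.beta[BETA[index]].append(x.uid)


car = Entity("car")
wheel = Entity("wheel")
vehicle = Entity("vehicle")
relate(car, wheel, 0)     # car is composed of wheel; wheel is part of car
relate(vehicle, car, 1)   # vehicle includes car; car is a member of vehicle
```

Writing both sides in one call keeps the two lists reciprocal, so a relationship can be traversed from either entity.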
[0027] System Framework
[0028] Points in two-dimensional space may be defined by an (x, y) coordinate system. In the data formats shown in Table 1, both a relational database table and a spreadsheet store data in rows and columns, which intersect at "cells" that can be referenced by (row, column) as a two-dimensional coordinate. Points in three-dimensional space may be defined by an (x, y, z) coordinate system. The relational database potentially comprises many tables, and we may reference a cell in a table by (row, column, table). Finally, we may have more than one database, and following this method, we may reference a cell using a four-dimensional coordinate system as (row, column, table, database).
[0029] To facilitate discussion, in the present system, we may want to transpose the coordinate system to refer to a point as (database, table, column, row), which maps the last three coordinates as (z, y, x) in 3-dimensional space. This provides a framework similar to Internet Protocol (IP) addressing, having a form shown in (3), where α refers to a location, which may be an IP address, and β, γ, δ, ε refer to points in a four-dimensional coordinate system.
α, β, γ, δ, ε (3)
[0030] The structure in (3) is minimally sufficient to locate a piece of data, but can be expanded to additional dimensions. Looking again at Table 1, the first four rows have a direct correspondence to β, γ, ε, δ.
[0031] In addition to defining coordinates to locate data, the data itself may be fetched by the system and stored with the coordinate using another dimension ω, then we have (4). Data may be fetched on demand, or pre-fetched and shared (exported/imported) with other systems along with the coordinate system.
α, β, γ, δ, ε, ω (4)
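As a rough illustration of forms (3) and (4), a coordinate can be modeled as a tuple whose last slot optionally carries the fetched data. The `Coordinate` type, the `fetch` helper, the nested-dictionary `stores` layout, and the sample names are all hypothetical and stand in for whatever storage the system actually addresses.

```python
from typing import Any, NamedTuple, Optional


class Coordinate(NamedTuple):
    location: str               # α: where the data store lives (e.g. a host)
    database: str               # β
    table: str                  # γ
    column: str                 # δ
    row: int                    # ε
    data: Optional[Any] = None  # ω: fetched value, if any

    def address(self) -> str:
        # Dotted form of the five locating dimensions, as in (3).
        return ".".join(str(part) for part in self[:5])


def fetch(coord, stores):
    """Return a copy of coord with the data dimension filled in, as in (4)."""
    value = stores[coord.location][coord.database][coord.table][coord.row][coord.column]
    return coord._replace(data=value)


# Hypothetical in-memory stand-in for one remote data store.
stores = {"host1": {"lab": {"samples": [{"gene": "g1", "count": 3}]}}}

c = Coordinate("host1", "lab", "samples", "gene", 0)
filled = fetch(c, stores)
```

Keeping the data slot optional mirrors the choice between fetching on demand and pre-fetching: the same coordinate is valid with or without its value attached.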
[0032] We may specify to iterate over all columns Y and rows X using a "mask" as shown in (5) and (6).
α, β, γ, Y, X (5)
<location>.<database>.<table>.Y.X (6)
[0033] Then, (5) refers to all of the complete data records (all columns and rows) stored in table < table > of < database > stored at < location >. The text references in (6), and text values for each y 6 Y , may map to numbers generated by and known to the system and system servers may share these map tables, and mapped values, similar to how domain name servers (DNS) operate on the Internet.
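The mask in (5) and the text-to-number map table can be sketched as follows. This is an assumed, simplified reading: `MapTable` hands out sequential integers, and `expand_mask` enumerates every cell of a table held as a list of row dictionaries. Neither name comes from the source.

```python
class MapTable:
    """Hypothetical registry that assigns a stable integer to each text name."""

    def __init__(self):
        self._ids = {}

    def id_for(self, name):
        # Reuse the existing number, or assign the next one.
        if name not in self._ids:
            self._ids[name] = len(self._ids) + 1
        return self._ids[name]


def expand_mask(location, database, table, rows):
    """Yield (location, database, table, column, row, value) for every cell."""
    for x, record in enumerate(rows):        # X: all rows
        for y, value in record.items():      # Y: all columns
            yield (location, database, table, y, x, value)


rows = [{"id": 1, "gene": "a"}, {"id": 2, "gene": "b"}]
cells = list(expand_mask("host1", "lab", "samples", rows))

names = MapTable()
# Replace the text parts of each coordinate with mapped numbers.
numeric = [
    (names.id_for(loc), names.id_for(db), names.id_for(tb), names.id_for(col), row)
    for loc, db, tb, col, row, _ in cells
]
```

A shared map table of this kind is what would let separate servers agree on the numeric form of a name, in the spirit of the DNS comparison above.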
[0034] Once all of the desired data is fully mapped in n dimensional space, the next step is to align the data so that related data intersects. This may be handled by maintaining an offset to entities known to the system. This offset for an entity may be stored in the map table. Computing the offset from one data store to the next allows them to overlap spatially to recover related data. An offset computed on one dimension provides one degree of freedom to each coordinate stored in the system. In some embodiments, k ≤ n dimensions may be offset to provide k degrees of freedom. An offset, Δ, may be represented in the coordinate system associated with its coordinate dimension as shown in (7).
α.β.γ.δ, Δ.ε.ω (7)
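One possible reading of the offset, offered as an assumption rather than as the disclosed implementation, is that each entity has a shared position on the δ axis and each data store records the shift that moves its own column index onto that position. The names `ENTITY_POSITION`, `OFFSETS`, `offset_for`, and `shifted_delta`, the position value 100, and the data store labels "A" and "B" are hypothetical.

```python
# Assumed shared position on the δ axis for each entity known to the system.
ENTITY_POSITION = {"gene": 100}


def offset_for(column_index, entity):
    """Offset Δ that moves a column index onto the entity's shared position."""
    return ENTITY_POSITION[entity] - column_index


# Hypothetical map table entries keyed by (α, β, γ, δ).
OFFSETS = {
    ("A", 1, 2, 2): offset_for(2, "gene"),  # data store A keeps genes in column 2
    ("B", 1, 1, 3): offset_for(3, "gene"),  # data store B keeps genes in column 3
}


def shifted_delta(coordinate):
    """Return δ + Δ for an (α, β, γ, δ, ε) coordinate; unmapped columns are unchanged."""
    alpha, beta, gamma, delta, _epsilon = coordinate
    return delta + OFFSETS.get((alpha, beta, gamma, delta), 0)


in_a = shifted_delta(("A", 1, 2, 2, 5))
in_b = shifted_delta(("B", 1, 1, 3, 9))
```

After the shift, the two gene columns occupy the same δ position even though they sit at different column indices in their own data stores, which is the sense in which the data sets intersect.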
[0035] As the system is introduced to new shared columns, new offsets are created and stored in the map table. This may be a semi-automated process, wherein upon accessing a new data set to integrate, the system:
1. for each δ component,
2. looks up the value (including mapped "synonyms") in the map table,
3. if it exists, checks that values are of similar type and format, and whether there is overlap with existing data,
4. on success, adds an offset,
5. if the name is not found, but a similar value set exists under another name, logs a query to later be addressed by a human operator.
[0036] Step 5 of the method asserts that when column names do not match, the system will look for columns with similar types/formats of values, and log a hypothesis that any found column names refer to the same entity. Then, a system administrator can check this hypothesis, and instruct the system to either accept or reject it.
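The five steps may be sketched as a single pass over the columns of a new data set. The function names, the dictionary layouts, the crude type comparison used for "similar type and format", and the sample gene symbols are all assumptions made for illustration.

```python
def same_kind(values_a, values_b):
    """Crude similarity check: both value sets hold exactly the same Python types."""
    return {type(v) for v in values_a} == {type(v) for v in values_b}


def integrate(dataset, columns, known, synonyms, offsets, review_log):
    """Apply steps 1-5 to one new data set.

    columns: {column name: (column index, set of values)}
    known:   {entity name: (shared position, set of values already seen)}
    """
    for name, (index, values) in columns.items():            # step 1
        entity = synonyms.get(name, name)                     # step 2
        if entity in known:
            position, seen = known[entity]
            if same_kind(values, seen) and values & seen:     # step 3
                offsets[(dataset, name)] = position - index   # step 4
            continue
        for other, (_position, seen) in known.items():        # step 5
            if same_kind(values, seen) and values & seen:
                # Hypothesis for a human operator: `name` may mean `other`.
                review_log.append((dataset, name, other))


known = {"gene": (100, {"KRAS", "TP53"})}
synonyms = {"symbol": "gene"}
columns = {
    "symbol": (3, {"TP53", "BRCA1"}),  # known synonym with overlapping values
    "hugo": (4, {"KRAS"}),             # unknown name, but values look like genes
    "age": (2, {41, 57}),              # unrelated numeric column
}
offsets, review_log = {}, []
integrate("sheet", columns, known, synonyms, offsets, review_log)
```

In this sketch the logged tuples are the hypotheses described in step 5; accepting one would amount to adding the new name to `synonyms`.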
[0037] In some embodiments, a geospatial, spatiotemporal, or any other coordinate may be included in the coordinate system.
[0038] In some embodiments, the system may behave in an unsupervised manner, by including learning methods to automatically test hypotheses and accept or reject them without human user intervention.
[0039] In some embodiments, dimensions may be offset and mapped using a transposed form of the data, where rows and columns both represent entities, and each entity may map to entities in other data sets.
[0040] Rows and columns in a relational database or spreadsheet are ordered, while their counterparts in other formats (e.g. elements, documents, nodes, etc.) may not be. However, an order may be imposed. For an XML document, the order of elements may be based on the order in which they appear in the document schema. Documents in a document database may be ordered by the document's UID. Nodes in a graph may be ordered by UID, or the order in which they appear in a specific graph representation (e.g. adjacency lists). Child elements in XML, attributes, keys, and annotations may be numbered by the order in which they appear.
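Two of these orderings may be sketched with the Python standard library. The sample XML fragment, the document UIDs, and the variable names are hypothetical; child elements are numbered here by their order of appearance in an instance document, which is only one of the options listed above.

```python
import xml.etree.ElementTree as ET

# Assumed XML fragment; the element names are illustrative only.
xml_text = (
    "<patients><patient>"
    "<id>7</id><name>A</name><birthdate>1970-01-01</birthdate>"
    "</patient></patients>"
)
record = ET.fromstring(xml_text)[0]

# Number child elements by the order in which they appear (1-based).
key_index = {child.tag: i for i, child in enumerate(record, start=1)}

# Order documents from a document store by UID, then number them as rows.
documents = [{"uid": "c3"}, {"uid": "a1"}, {"uid": "b2"}]
ordered = sorted(documents, key=lambda doc: doc["uid"])
row_index = {doc["uid"]: i for i, doc in enumerate(ordered, start=1)}
```

Once such indices exist, keys and documents can be addressed with the same integer coordinates used for columns and rows.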
[0041] The present illustration maps relational databases. Similar methods can be applied to other data formats. In some embodiments, existing methods may be applied to convert other data formats, such as XML and graph, to a relational database format, and then apply the illustrated method.
[0042] Semantic Column Mapping by Abstraction
[0043] In some embodiments, feature generation techniques may be used to align columns with the same semantic meaning, but different types of data, or data measured in different ways. Disparate types and measures may be defined as:
[0044] disparate data types: numeric (integer, floating point), string (alphanumeric), etc.
[0045] disparate measures: quantitative vs qualitative, count vs percentage, age in days vs age in years, etc.
[0046] Feature Generation
[0047] An exemplary feature generation technique takes as input descriptors defined by key/value pairs, and applies a set of rules to generate features.
[0048] The Descriptors may be represented by a set of tuples, where each tuple is one of:
(name, numeric_value, measure) (8)
(name, string_value[, measure]) (9)
[0049] The elements of the tuples in (8) and (9) may be defined as:
[0050] name: a string that identifies the descriptor (e.g. age, color, price, a particular gene name, etc.)
[0051] numeric_value: a numeric value
[0052] string_value: a string of alphanumeric characters
[0053] measure: a particular measure (quantity) on the value (e.g. years, Euros, PPM, kg/mol)
[0054] The name in (8) and (9) may be any attribute or measure that represents an aspect of an entity. The values in (8) and (9) may be any type of data, including numeric, alpha, alphanumeric, or a reference to a location that contains a media file (e.g. image, audio, video, etc.). In some embodiments, the location referenced by a value may refer to a location in an external database or web server. This model may map to a relational database model, as (field, value, "table"), when a row in a table relates to an entity, and different tables are used for different measures with the same semantic meaning on the same entities.
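The two tuple forms may be sketched as one named tuple whose measure is required for numeric values, as in (8), and optional for string values, as in (9). The names `Descriptor` and `make_descriptor` and the sample descriptors are hypothetical.

```python
from typing import NamedTuple, Optional, Union


class Descriptor(NamedTuple):
    """Hypothetical representation of the tuples in (8) and (9)."""
    name: str
    value: Union[int, float, str]
    measure: Optional[str] = None


def make_descriptor(name, value, measure=None):
    """Build a descriptor, enforcing that numeric values carry a measure."""
    is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_numeric and measure is None:
        raise ValueError("a numeric descriptor requires a measure, as in (8)")
    return Descriptor(name, value, measure)


age = make_descriptor("age", 42, "years")  # form (8)
color = make_descriptor("color", "red")    # form (9), measure omitted
```

Because every descriptor has the same three positions, a rule set can operate on descriptors from any data set without format-specific handling.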
[0055] Defining descriptors in this way is optimally concise and provides consistency across data sets. Being optimally concise provides the maximum flexibility to support disparate data, and the minimum space to support scalability and portability. Having this consistent format across all data sets reduces the necessary complexity of a rule set that operates on the descriptors.
[0056] The Rule Set may include equivalence rules for translating Descriptors to Features. Rules may be defined based on the domain of the input data (e.g. equivalence rules for molecular data). Rules may form associated equivalencies for disparate types and measures. A rule may exist for all string_values, and all measures (quantities), wherein the first requirement handles disparate data types (e.g. associating string values to equivalent numeric values - equivalent values), and the second requirement handles disparate measures with the same semantic meaning (e.g. associating numeric values with the same semantic meaning that were measured in a different way - equivalent quantities).
[0057] As an example of a rule, descriptors that have discrete values may translate directly to features. As a further example of a rule, descriptors that have discrete values may translate into features that represent finite ranges or sets of discrete values. As a further example of a rule, descriptors that have continuous values may have associated rules that apply statistical techniques to categorize the values into discrete ranges or sets, where each set is a unique feature on the descriptor.
[0058] Examples of some rules are shown in Table 3.
[0059] The first two rules in Table 3 equate a numeric value to a semantically equivalent string value for a given quantity (count). The next two rules equate a string value to another semantically equivalent string value. The last two rules equate disparate quantities to a semantically equivalent feature (age). In the last case, more concise rules may exist to create age ranges.
Table 3 Example equivalence rules
Figure imgf000013_0001
[0060] The rules in Table 3 include statistically derived values that may be computed as:
[0061] mean: a statistical mean
[0062] std: one standard deviation
[0063] The simplicity of a rule set defined in this way is permitted by the preferred method of representing descriptors using the tuples (8) and (9). The tuples and the rule set combine to form a generalized format for providing input to a Processor for feature generation. The generalization provides maximum flexibility as it captures descriptors and rules in a straightforward and concise fashion. Any method that does not rely on generalizing the inputs would require significantly more complex rule sets to capture the same semantics of feature generation, resulting in a system that is not flexible and does not scale. The present approach overcomes the complexities associated with harmonizing across disparate data sets, thereby alleviating the need to perform data pre-processing steps required by other approaches.
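A rule that categorizes continuous values using the mean and one standard deviation may be sketched as follows. The thresholds (mean minus or plus one standard deviation), the labels "low", "normal", and "high", and the sample values are assumptions for illustration; they are not taken from Table 3, which is not reproduced in this text.

```python
from statistics import mean, stdev


def range_rule(values):
    """Build a rule mapping a numeric value to a discrete feature label."""
    m, s = mean(values), stdev(values)

    def feature(value):
        if value < m - s:
            return "low"
        if value > m + s:
            return "high"
        return "normal"

    return feature


def generate_features(descriptors):
    """Turn (name, numeric_value, measure) descriptors into (name, label) features."""
    grouped = {}
    for name, value, measure in descriptors:
        grouped.setdefault((name, measure), []).append(value)
    # One rule per (name, measure) pair, derived from that pair's own values.
    rules = {key: range_rule(vals) for key, vals in grouped.items()}
    return [(name, rules[(name, measure)](value)) for name, value, measure in descriptors]


counts = [10, 12, 11, 13, 12, 40]
descriptors = [("expression", c, "count") for c in counts]
features = generate_features(descriptors)
```

Grouping by (name, measure) keeps values recorded under different measures from sharing thresholds, which matches the role the measure plays in (8).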
[0064] Following feature generation, features may be stored as an abstraction using the data models described in sections "Generalized Hypergraph" and " 4 Model". Then, those abstracted feature columns may be aligned using the methods of the present invention.
[0065] Exemplary Uses
[0066] Embodiments of the present invention may be used in integrating internet user data (for example from previous search history on search engines like Google, BING, and Yahoo, and social interaction and preferences data from social network services like Facebook). The system in the present disclosure can be used to create a very rich situational context to personalize the search experience when consumers are using online resources to find content (movies, books, music, consumer products, services, or any content or resource on the Internet). Using the present invention, it is feasible to link consumer preferences, and even an entire online footprint (including profiles from Facebook, etc.), to an extremely wide variety of online consumer product and service resources.
[0067] Embodiments of the present invention may be used in military intelligence and law enforcement surveillance of threats, achieved by integrating knowledge from disparate data content distributed throughout the Internet, relating to individuals, organizations, and states. For example, there may be thousands of databases across the Internet that hold disparate data about a single terrorist suspect. Some of the content may not be linked to the suspect's name, but may have contextual patterns and semantic concepts that, when integrated into a coherent context, can implicate that suspect in a particular situation. Intelligently linking to those disparate data elements so they can be analyzed requires formulating such a context, and using it to recover relevant content across the Internet, which may be achieved by using the present invention.
[0068] Example 1
[0069] Embodiments of the present invention may be used in intelligently linking and federating medical and scientific databases as a higher order multiscalar knowledge network. Instead of metadata tags, mapping of semantically meaningful data structures to enable content to be linked at the abstracted level provides true integration of content across a federated network. For example, a genomic database, a pharmacogenomics knowledge base, and a clinical trials information database may all have content related to a particular drug, such as Gemcitabine. The genomic database has gene mutations that affect a pathway, the pharmacogenomic database has linked that pathway to regulating response to Gemcitabine, and the trial database contains information about a clinical trial to evaluate Gemcitabine. Traditionally, metadata tags would need to be matched between the genomic and clinical trial databases in order to link content. In this example, there is no direct link between the two. Using the present invention, content may be intelligently linked through the network, without having to create hundreds of thousands of tags to describe the content in each of the three networked databases. The present invention may integrate content at an abstracted overlay network level, and link disparate data formats within each content source in a generalized n-dimensional framework. The framework may then be used to find a clinical trial for a patient given only the genomic context stored in the genomic database, without the need to match metadata tags.
[0070] For example, consider the exemplary relational database model in FIG. 1, the XML schemas in FIG. 2 and FIG. 3, and the spreadsheet in FIG. 4.
[0071] The XML schema in FIG. 3 is part of the DrugBank schema. (DrugBank: a comprehensive resource for in silico drug discovery and exploration. Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D668-72.)
[0072] The spreadsheet in FIG. 4 is part of the National Center for Biotechnology Information (NCBI) Gene database (http://www.ncbi.nlm.nih.gov/gene).
[0073] For ease of exposition, the Location (α) of each of these datasets is listed in Table 4, and referenced by Label (Lx, where x = [1..4]) in Table 5.
[0074] The coordinate system is derived from the models in FIG. 1, the schemas in FIG. 2 and FIG. 3, and the spreadsheet header row in FIG. 4. Coordinates may be defined as shown in Table 5.
Table 4 Location for example datasets
Label | Location
L1 | <lab URL>
L2 | <clinic URL>
L3 | http://www.drugbank.ca/system/downloads/current/
L4 |
[0075] The datasets at L1, L2, L3 and L4 are related on fields representing patients, genes and drugs. These relationships are shown in Table 6.
[0076] Additionally, relations exist within a dataset, and may be aligned within the dataset. For example, L1.DrugTarget. A maps to L1.DrugTarget.2A .
[0077] In Table 5, two additional fields (age and birthdate) are related, but do not contain the same types of data:
L1.Patient.1 (10)
L2.clinic.xml.1.4 (11)
[0078] The feature generation method described in section "Semantic Column Mapping by Abstraction" may be applied by creating a rule to generate a common feature for (10) and (11). The feature may be stored as an abstraction on each dataset using one of the data models described in sections "Generalized Hypergraph" and " 4 Model".
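A rule producing a common feature for an age field and a birthdate field may be sketched as follows. The reference date, the ten-year range width, the feature label format, and the sample values are assumptions; the disclosure states only that such a rule may be created.

```python
from datetime import date


def age_from_birthdate(birthdate, today):
    """Whole years elapsed between birthdate and the reference date."""
    years = today.year - birthdate.year
    # Subtract one if this year's birthday has not yet occurred.
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def age_feature(age_years):
    """Map an age in years to a ten-year range feature (assumed range width)."""
    decade = (age_years // 10) * 10
    return f"age_{decade}_{decade + 9}"


reference = date(2015, 3, 26)  # assumed reference date for illustration
from_age_field = age_feature(44)
from_birthdate_field = age_feature(age_from_birthdate(date(1970, 6, 15), reference))
```

Both fields then yield the same feature value for the same patient, so the resulting feature columns can be aligned like any other shared column.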
[0079] We can link the data in Table 6, and other related fields, by using an offset to intersect entities in the 6th dimension. Assuming a map table has been previously generated that identifies fields with different names as having the same semantic meaning (using methods to harmonize data across different nomenclatures and ontologies), and provides an offset Δ for this entity, the data sets in Table 6 intersect when shifted in the 6th dimension by Δ. Then, data may be identified as:
α.β.γ.Γ, Δ. (12)
[0080] Once the coordinate system is offset using the map file, the adjusted coordinates may be used to relate data within and across respective data stores.
Table 5 Coordinates for example datasets
Ref | α | Database, Document or File (β) | Table, Root Element, or Spreadsheet (γ) | Column or Key (δ)
FIG. 1 | L1 | Patient | patient 1 | idpatient 1, gender 2, age 3
 | | | expression 2 | idexpression 1, gene 2, RPKM 3, idpatient 4
 | L2 | DrugTarget | gene 1 | idgene 1, symbol 2
 | | | drug | iddrug 1, name 2
 | | | target 3 | drug_iddrug 1, gene_idgene 2, action 3
FIG. 2 | L2 | clinic.xml | patients 1 | id 1, name 2, address 3, birthdate 4, gender 5
FIG. 3 | L2 | drugbank.xml | drugs 1 | drugbank-id 1, name 2
 | | | partners 2 | id 1, name 2, gene-name 3
FIG. 4 | | Homo_sapiens.gene_info | sheet 1 1 | tax_id 1, geneID 2, symbol 3, chromosome 4, map localization 5
Table 6 Related fields in example datasets.
Figure imgf000018_0001
[0081] Example 2
[0082] Entities linked via the Internet may have extraneous data, such as metadata and semantic tags added to achieve some superficial level of situational awareness and content awareness. This approach is very limiting in many ways, and does not scale well for rich contextualization of data sets. Extraneous "tagging" exacerbates the complexity and spatial footprint of the original data. This invention revolutionizes how data is linked across large scale networks that integrate very disparate types of entities, and can be applied to achieve unprecedented situational and semantic content awareness within the Internet, and networks beyond. Conventionally, the Internet of Things (IoT) uses uniquely identifiable objects and maps these to virtual representations in an Internet-like structure, making it possible to connect a wide variety of devices (and the data they generate) through the Internet. Similarly, the Internet of Everything (IoE) goes further to use metadata tags to connect people, processes, data, and other things to make networked connections more relevant and valuable. Given that the goal for the IoE is to turn information into actions that create new capabilities, richer experiences, and unprecedented economic opportunity for businesses, individuals, and countries, the present invention may be directly applied for the scalable capturing of much richer situational awareness of data content without adding complexity. Using the system and methods described in this disclosure may create unprecedented interoperability that can scale to any size network, along with the ability to go beyond superficially described connections by truly integrating data entities. The present invention therefore empowers the IoT and IoE by enabling a completely novel way of integrating disparate types of information content. For example, in some embodiments, the present invention may enable deep content awareness and situational knowledge to be captured in an overlay network, which may then support intelligent search, recovery, and interchange of content that is not feasible at a large scale using conventional systems.
[0083] As an illustrative example, consider the online store database schema model in FIG. 5, comprising three entity tables (customer, order, item) and one join table representing a many-to-many, or HasAndBelongsToMany (HABTM), relationship between entities in the order and item tables. The example XML schema for products in FIG. 6 contains product and distributor entities that are linked in a HABTM relationship by the root element products. Table 7 illustrates the related fields in this example.
Table 7 Related fields in example datasets.
Figure imgf000019_0001
[0084] Generating an offset for the barcode entity aligns these two datasets, effectively mapping customers to products to distributors, enabling analysis such as identifying customers with a tendency to purchase products originating from the same distributor.
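Once the barcode fields are aligned, the analysis mentioned above may be sketched as a chain of lookups from customers to distributors. The in-memory structures, field layouts, and sample records are hypothetical, since FIG. 5, FIG. 6, and Table 7 are not reproduced in this text.

```python
from collections import defaultdict


def distributors_by_customer(orders, order_items, items, products):
    """Count, per customer, how many purchased items came from each distributor.

    orders:      list of (order id, customer)
    order_items: join table as a list of (order id, item id)
    items:       {item id: barcode}
    products:    {barcode: list of distributors}
    """
    customer_of = dict(orders)
    tally = defaultdict(lambda: defaultdict(int))
    for order_id, item_id in order_items:
        barcode = items[item_id]  # the aligned field shared by both data sets
        for distributor in products.get(barcode, []):
            tally[customer_of[order_id]][distributor] += 1
    return {customer: dict(counts) for customer, counts in tally.items()}


orders = [(1, "alice"), (2, "alice"), (3, "bob")]
order_items = [(1, 10), (2, 11), (3, 12)]
items = {10: "0001", 11: "0002", 12: "0003"}
products = {"0001": ["Acme"], "0002": ["Acme"], "0003": ["Bolt"]}
result = distributors_by_customer(orders, order_items, items, products)
```

A customer whose counts concentrate on one distributor would be a candidate for the purchasing tendency described in the paragraph above.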

Claims

CLAIMS: What is claimed:
1. A system to integrate data in a common framework in n-dimensional space comprising:
a) A coordinate system for identifying data locale comprising a map file for translating textual descriptions to numbers
b) Methods to map data formats to the coordinate system
c) Methods to integrate data using the coordinate system comprising:
i. Identifying the physical location of the data.
ii. Identifying the format of the data.
iii. Identifying the components of the input (e.g. Table 1).
iv. Generating a mask for the data.
v. Generating offsets for dimensions of the data.
vi. Adding offsets to the mask.
2. The method of claim 1 further comprising a plurality of data sets, wherein the data sets are persisted in different geographic locations.
3. The method of claim 1 further comprising a plurality of data sets, wherein the data sets are persisted in different formats.
PCT/US2015/022595 2014-03-26 2015-03-26 System and methods for data integration in n-dimensional space WO2015148739A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461970412P 2014-03-26 2014-03-26
US61/970,412 2014-03-26

Publications (1)

Publication Number Publication Date
WO2015148739A1 true WO2015148739A1 (en) 2015-10-01

Family

ID=54196375

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/022595 WO2015148739A1 (en) 2014-03-26 2015-03-26 System and methods for data integration in n-dimensional space

Country Status (1)

Country Link
WO (1) WO2015148739A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5642524A (en) * 1994-09-29 1997-06-24 Keeling; John A. Methods for generating N-dimensional hypercube structures and improved such structures
US5917500A (en) * 1998-01-05 1999-06-29 N-Dimensional Visualization, Llc Intellectual structure for visualization of n-dimensional space utilizing a parallel coordinate system
US20020194167A1 (en) * 1999-08-04 2002-12-19 Reuven Bakalash Relational database management system having integrated non-relational multi-dimensional data store of aggregated data elements
US6799115B1 (en) * 2002-02-28 2004-09-28 Garmin Ltd. Systems, functional data, and methods to pack n-dimensional data in a PDA
US20040215655A1 (en) * 2003-01-13 2004-10-28 Vasudev Rangadass Enterprise solution framework incorporating a master data management system for centrally managing core reference data associated with an enterprise

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15769746

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.01.17)

122 Ep: pct application non-entry in european phase

Ref document number: 15769746

Country of ref document: EP

Kind code of ref document: A1