DE102009037848A1

DE102009037848A1 - Method for computer-aided processing of digital semantically annotated information, e.g. medical image data, involving generation of a digital data set that contains combinations of matrix elements as semantic relations

Info

Publication number: DE102009037848A1
Application number: DE102009037848A
Authority: DE
Inventors: Markus Bundschus; Yi Huang; Achim Rettinger; Volker Dr. Tresp
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 2009-05-15
Filing date: 2009-08-18
Publication date: 2010-11-25

Abstract

The method involves generating a matrix with a first and a second dimension for an entity set comprising at least some of the entities contained in a set of triples (a1-a7, b1-b7). For matrix entries equal to zero, prediction values are determined using a multivariate prediction model, and these prediction values replace the zero entries. A digital data set is generated that contains the combinations of the elements of the matrix as semantic relations and the corresponding matrix entries as probability measures for the truth of the respective semantic relation. Independent claims are also included for the following: (1) a method for querying information; (2) a computer program product with program code stored on a machine-readable medium for performing the method for computer-aided processing of digital semantically annotated information.

Description

The invention relates to a method for the computer-aided processing of digital semantically annotated information, to a method for querying information processed in this way, and to a corresponding computer program product.

In many technical fields of application it is desirable to annotate digitally stored information semantically in a suitable manner, so that queries can be used to extract semantic meaning from the information or to derive further semantic knowledge from it.

Today, particularly in the area of the World Wide Web, there are various approaches for semantically annotating the large amount of available information in a suitable manner. In this context, the graph-based representation of knowledge via the so-called Resource Description Framework (abbreviated RDF) is particularly well known. Here, semantic knowledge is represented in the form of triple-based statements, each containing a subject, a predicate in the form of a property, and an object in the form of a property value.

Machine learning methods are known from the prior art with which further semantic knowledge is derived by probabilistic methods from semantically annotated knowledge, in particular based on the triples described above. Methods are known which rely on a global probabilistic model and can predict the probability of semantic statements in a particular domain. These approaches have the disadvantage that, because of their poor scalability, they are not suitable for applications with a large amount of semantic information. In addition, approaches are known which rely on so-called Markov blanket models, such as ILP (Inductive Logic Programming). These methods have the disadvantage that they often start from the so-called closed-world assumption, according to which semantic relations that are not known to be true are assumed to be false. This assumption is not appropriate for many application scenarios.

The object of the invention is therefore to provide a method for the computer-aided processing of digital semantically annotated information with which large amounts of semantic data are processed in a suitable manner such that further semantic knowledge can be extracted, taking into account statistical dependencies in the semantically annotated information.

This object is achieved by the independent claims. Further developments of the invention are defined in the dependent claims.

In the method according to the invention, semantically annotated information comprising a plurality of triples is processed, wherein a triple contains a semantic entity in the form of a subject, a semantic entity in the form of a predicate representing a property, and a semantic entity in the form of an object representing a property value. The triples can be based on any semantic description language. In particular, the triples can be represented by the RDF graph already mentioned above.

In the method according to the invention, a matrix comprising a first and a second dimension is generated for an entity set made up of at least some of the entities contained in the plurality of triples. The first and second dimensions are to be understood in particular as the rows and columns of a matrix (or vice versa). The elements of the first dimension of the matrix comprise the entities of the entity set, and the elements of the second dimension comprise predetermined features of triples, wherein at least one triple in the plurality of triples comprises a predetermined feature and an element of the first dimension. For the purposes of the invention, predetermined features are in particular components or parts of triples. In preferred variants described further below, the predetermined features are, for example, only a predicate, or a combination of subject and predicate or of predicate and object.

According to the invention, the entries of the matrix are defined such that, for each combination of an element of the first dimension and an element of the second dimension for which a triple comprising both elements exists in the plurality of triples, the matrix has a predetermined non-zero value as its entry. This predetermined value is fixed once and is always the same for all non-zero entries. Preferably, this predetermined value is set to one. In contrast, all other entries, i.e. all other combinations of an element of the first dimension and an element of the second dimension, are set to zero.
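The matrix construction described above can be illustrated with a minimal Python sketch. It assumes that rows are subjects from the entity set and that columns are predicate-object pairs, which is only one of the feature choices the text mentions. The example triples and the name `build_matrix` are hypothetical and do not come from the patent.

```python
# Hypothetical example triples (subject, predicate, object); illustrative only.
triples = [
    ("Jack", "knows", "Jane"),
    ("Jack", "type", "Person"),
    ("Jane", "type", "Person"),
    ("Jane", "livesIn", "Munich"),
]


def build_matrix(triples, entities):
    """Build a binary entity-by-feature matrix.

    Rows are the given entities; columns are (predicate, object) pairs that
    occur in at least one triple whose subject is in the entity set.
    An entry is 1 if such a triple exists and 0 otherwise.
    """
    row_index = {entity: i for i, entity in enumerate(entities)}
    columns = sorted({(p, o) for s, p, o in triples if s in row_index})
    col_index = {feature: j for j, feature in enumerate(columns)}

    matrix = [[0] * len(columns) for _ in entities]
    for s, p, o in triples:
        if s in row_index:
            matrix[row_index[s]][col_index[(p, o)]] = 1  # predetermined value one
    return matrix, columns


entities = ["Jack", "Jane"]
matrix, columns = build_matrix(triples, entities)
for entity, row in zip(entities, matrix):
    print(entity, row)
```

A zero entry here only means that no matching triple is known; under the approach described in the text it is a candidate for prediction, not a statement that the relation is false.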

According to the invention, the generated matrix is processed further such that, for the entries of the matrix equal to zero, prediction values are determined by means of a multivariate prediction model, and these values replace the respective zero entries. Multivariate prediction models, also known as "multivariate structured prediction", are based on the fact that all outputs to be predicted can be predicted jointly, so that statistical strength can be shared between the outputs. Multivariate prediction models are well known from the prior art and are used here for the statistical prediction of the corresponding zero-valued matrix entries. According to the invention, the result is generated as a digital data set which contains the combinations of the elements of the matrix as semantic relations and the respective associated matrix entry as a probability measure for the truth or correctness of the semantic relation. In the data set, all matrix entries that were originally non-zero retain their predetermined value. Matrix entries with this predetermined value can thus be identified as semantic relations that are certain, i.e. present with a probability of 100%. In contrast, for the other matrix entries only a measure can be given which represents a confidence or probability that the corresponding semantic relation is true (i.e. present).
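The generation of the data set can be sketched as follows. The sketch assumes that some prediction model has already produced a matrix `predicted` of the same shape as the original binary matrix; how those values are obtained is left open here. The function name `build_dataset` and the example values are hypothetical.

```python
PREDETERMINED_VALUE = 1.0  # value kept for relations that are known to hold


def build_dataset(original, predicted, entities, features):
    """Combine known entries and predicted values into one data set.

    Each record is (entity, feature, probability measure). Entries that were
    non-zero in the original matrix keep the predetermined value; zero entries
    are replaced by the corresponding prediction value.
    """
    dataset = []
    for i, entity in enumerate(entities):
        for j, feature in enumerate(features):
            if original[i][j] != 0:
                measure = PREDETERMINED_VALUE
            else:
                measure = predicted[i][j]
            dataset.append((entity, feature, measure))
    return dataset


# Illustrative values only; the predictions are made up for the example.
original = [[1, 0], [0, 1]]
predicted = [[0.9, 0.7], [0.2, 0.8]]
dataset = build_dataset(original, predicted,
                        ["Jack", "Jane"],
                        [("knows", "Joe"), ("livesIn", "Munich")])
for record in dataset:
    print(record)
```

Note that the prediction for an originally non-zero entry (0.9 and 0.8 above) is deliberately ignored, so known relations remain distinguishable from predicted ones.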

The method according to the invention is characterized in that new semantic relations, in combination with a corresponding probability measure for their correctness, can be identified in a suitable manner from a large amount of digital semantically annotated information. The quality of the prediction is very good, as the inventors were able to confirm through corresponding experiments.

As already mentioned above, the triples processed in the method according to the invention can be based on RDF graphs. Preferably, the semantic entities of the entity set under consideration are resources, a resource in the context of RDF being identified by a unique identifier (a so-called URI). In RDF, the subject of a triple is always a resource, whereas the object of a triple can be either a resource or a literal. Literals are character strings that are defined or permitted for representing the values of basic types.

Various multivariate prediction models can be used in the method according to the invention to generate the data set. Preferably, these models are based on a matrix approximation, in particular a low-rank matrix approximation. Such approximations are well known from the prior art and are therefore not described in detail. In preferred variants of the invention, the approximations used can be, in particular, a singular value decomposition, a non-negative matrix factorization, or a latent Dirichlet allocation.
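To make the idea of a low-rank approximation concrete, the following sketch computes a truncated singular value decomposition in pure Python. It uses power iteration with deflation as a simple stand-in for a library SVD routine; this choice, the function names, and the example matrix are assumptions made for illustration and are not taken from the patent.

```python
import math
import random


def matvec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_triplet(a, iterations=200, seed=0):
    """Estimate the largest singular value and its vectors by power iteration."""
    rng = random.Random(seed)
    a_t = [list(col) for col in zip(*a)]
    v = [rng.random() + 0.1 for _ in a_t]  # positive start vector
    for _ in range(iterations):
        v = matvec(a_t, matvec(a, v))  # one step of power iteration on A^T A
        length = norm(v)
        if length == 0.0:
            break
        v = [x / length for x in v]
    u = matvec(a, v)
    sigma = norm(u)
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


def low_rank_approximation(a, rank):
    """Sum of the `rank` leading singular components of a."""
    residual = [[float(x) for x in row] for row in a]
    approx = [[0.0] * len(a[0]) for _ in a]
    for _ in range(rank):
        sigma, u, v = top_singular_triplet(residual)
        if sigma < 1e-12:
            break
        for i in range(len(a)):
            for j in range(len(a[0])):
                term = sigma * u[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term  # deflate the component just found
    return approx


# Illustrative binary matrix: entry [0][2] is unknown (zero) in the input.
a = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
rank_one = low_rank_approximation(a, 1)
print(rank_one[0][2])
```

Because the approximation reconstructs every entry from shared latent factors, the formerly zero entry receives a non-zero score that reflects the co-occurrence structure of the other rows and columns. Whether such a score can be read directly as a probability depends on the model and any normalization, which this sketch does not address.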

In a particularly preferred embodiment, after the matrix has been generated, those elements of the second dimension whose number of entries with the predetermined value is smaller than a predefined threshold are removed from the matrix. This increases both the computational efficiency and the performance of the method.
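This pruning step can be sketched in a few lines, assuming that the second dimension corresponds to the columns of a list-of-rows matrix. The function name `prune_columns` and the example data are hypothetical.

```python
def prune_columns(matrix, columns, threshold):
    """Drop columns with fewer than `threshold` non-zero entries."""
    counts = [sum(1 for row in matrix if row[j] != 0)
              for j in range(len(columns))]
    keep = [j for j, count in enumerate(counts) if count >= threshold]
    pruned = [[row[j] for j in keep] for row in matrix]
    return pruned, [columns[j] for j in keep]


# Illustrative data: the middle column occurs only once and is removed.
matrix = [[1, 0, 1],
          [1, 1, 1],
          [0, 0, 1]]
columns = ["f1", "f2", "f3"]
pruned, kept = prune_columns(matrix, columns, threshold=2)
print(kept, pruned)
```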

In one variant of the invention, the predetermined features of the triples comprise first elements in the form of pairs of predicate and object, for each of which at least one triple exists with an entity from the entity set as subject and the respective pair as predicate and object. For each combination of an element of the first dimension and a first element of the second dimension for which a triple exists in the plurality of triples with the element of the first dimension as subject and the first element of the second dimension as predicate and object, the matrix has the predetermined non-zero value as its entry; for the other combinations of an element of the first dimension and a first element of the second dimension, it has the value zero.

In a further variant of the method according to the invention, the predetermined features comprise second elements in the form of pairs of subject and predicate, for each of which at least one triple exists with the respective pair as subject and predicate and an entity from the entity set as object. For each combination of an element of the first dimension and a second element of the second dimension for which a triple exists in the plurality of triples with the second element of the second dimension as subject and predicate and the element of the first dimension as object, the matrix has the predetermined value as its entry; for the other combinations of an element of the first dimension and a second element of the second dimension, it has the value zero.

In a further variant of the method according to the invention, the predetermined features comprise third elements in the form of predicates, for each of which at least one triple exists with an entity from the entity set as subject, the respective predicate as predicate, and any object contained in the plurality of triples. For each combination of an element of the first dimension and a third element of the second dimension for which a triple exists in the plurality of triples with the element of the first dimension as subject, the third element of the second dimension as predicate, and any object contained in the plurality of triples, the matrix has the predetermined value as its entry; for the other combinations of an element of the first dimension and a third element of the second dimension, it has the value zero.

In a further preferred embodiment of the method according to the invention, the predetermined features of the triples can optionally comprise fourth elements in the form of predicates, for each of which at least one triple exists with any subject contained in the plurality of triples, the respective predicate as predicate, and an entity from the entity set as object. For each combination of an element of the first dimension and a fourth element of the second dimension for which a triple exists in the plurality of triples with any subject contained in the plurality of triples, the fourth element of the second dimension as predicate, and the element of the first dimension as object, the matrix has the predetermined value as its entry; for the other combinations of an element of the first dimension and a fourth element of the second dimension, it has the value zero.
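The four feature variants described above can be summarized in one small sketch that derives, for a single triple, all (entity, feature) combinations whose matrix entry would be set to the predetermined value. The tags `"PO"`, `"SP"`, `"P_out"` and `"P_in"`, the function name, and the example triples are hypothetical labels chosen for this illustration.

```python
def feature_pairs(triple, entity_set):
    """Return the (row entity, column feature) pairs contributed by one triple."""
    s, p, o = triple
    pairs = []
    if s in entity_set:
        pairs.append((s, ("PO", p, o)))   # first elements: predicate-object pair
        pairs.append((s, ("P_out", p)))   # third elements: predicate, entity as subject
    if o in entity_set:
        pairs.append((o, ("SP", s, p)))   # second elements: subject-predicate pair
        pairs.append((o, ("P_in", p)))    # fourth elements: predicate, entity as object
    return pairs


# Illustrative triples; "Munich" is not in the entity set and gets no row.
triples = [("Jack", "knows", "Jane"), ("Jane", "livesIn", "Munich")]
entity_set = {"Jack", "Jane"}

# Sparse representation: the set of matrix positions holding the value one.
entries = {pair for t in triples for pair in feature_pairs(t, entity_set)}
for entity, feature in sorted(entries):
    print(entity, feature)
```

A sparse set of positions is used here instead of a dense matrix because most entity-feature combinations are zero; every position not in `entries` is implicitly zero.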

In a further, particularly preferred embodiment of the invention, the matrix also takes into account properties of entities that do not directly concern a property of an element of the first dimension. In this case, the matrix further contains, as fifth elements, property values aggregated from the plurality of triples of one or more entities that are contained in one or more common triples with a respective element of the first dimension of the matrix. In this way, the statistical analysis can take into account properties of entities that have a relation with the respective entity in the matrix.
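One possible reading of such aggregated features is sketched below. The text does not specify the aggregation function at this point, so the sketch assumes a simple count: for each row entity, it counts how many related entities (those sharing a triple with it) have a given predicate-object property. The function name and example data are hypothetical.

```python
from collections import defaultdict


def aggregated_features(triples, entity_set):
    """Count, per row entity, the properties of entities it shares a triple with.

    Assumption for this sketch: aggregation is a count over related entities.
    """
    # Direct (predicate, object) properties of every subject.
    properties = defaultdict(set)
    for s, p, o in triples:
        properties[s].add((p, o))

    # Entities that appear in a common triple with a row entity.
    related = defaultdict(set)
    for s, p, o in triples:
        if s in entity_set:
            related[s].add(o)
        if o in entity_set:
            related[o].add(s)

    result = {}
    for entity in entity_set:
        counts = defaultdict(int)
        for neighbour in related[entity]:
            for prop in properties.get(neighbour, ()):
                counts[prop] += 1
        result[entity] = dict(counts)
    return result


# Illustrative triples: both people Jack knows live in Munich.
triples = [
    ("Jack", "knows", "Jane"),
    ("Jack", "knows", "Joe"),
    ("Jane", "livesIn", "Munich"),
    ("Joe", "livesIn", "Munich"),
]
print(aggregated_features(triples, {"Jack"}))
```

In this example the row for Jack gains a feature indicating that two related entities live in Munich, even though no triple states where Jack himself lives.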

The method according to the invention can be used for any type of semantically annotated digital information. In a preferred application, the semantically annotated information comprises medical data, in particular information relating to medical image data.

A further preferred field of application of the invention is the semantic annotation of information on the Internet. In this case, the semantically annotated digital information is contained in semantically annotated web pages.

In addition to the method according to the invention described above, with which data sets comprising semantic relations and associated probability measures are generated, the invention further relates to a method for querying information from these generated data sets. Here, one or more queries for a semantic relation are directed at the digital data set, taking into account the probability measure for the truth of the semantic relation. Preferably, the query or queries search for semantic relations whose probability measure lies in a predetermined value interval. Likewise, the results of the query or queries can contain the probability measures assigned to the semantic relations found by the respective query.
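A query of this kind can be illustrated with a minimal sketch over an in-memory data set. It assumes the data set is a list of (subject, predicate, object, probability measure) records; this representation, the function name `query`, and the example values are hypothetical.

```python
def query(dataset, predicate=None, low=0.0, high=1.0):
    """Return relations whose probability measure lies in [low, high].

    If `predicate` is given, only relations with that predicate are returned.
    The probability measure is included in every result record.
    """
    return [
        (s, p, o, measure)
        for s, p, o, measure in dataset
        if (predicate is None or p == predicate) and low <= measure <= high
    ]


# Illustrative data set: one known relation and two predicted ones.
dataset = [
    ("Jack", "knows", "Jane", 1.0),
    ("Jack", "knows", "Joe", 0.8),
    ("Jane", "knows", "Joe", 0.3),
]

# Likely but not yet known "knows" relations.
likely = query(dataset, predicate="knows", low=0.5, high=0.99)
print(likely)
```

Choosing an upper bound below the predetermined value, as in the usage above, excludes relations that are already known and returns only predicted ones.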

In a particularly preferred embodiment, the query or queries are generated using a syntax that is based on the query language SPARQL known from the prior art and is supplemented accordingly by commands for taking probability measures into account.

In addition to the methods described above, the invention further relates to a computer program product with program code, stored on a machine-readable medium, for carrying out any variant of the methods described above when the program runs on a computer.

Exemplary embodiments of the invention are described in detail below with reference to the attached figures.

The figures show:

Fig. 1 an exemplary representation, based on an RDF graph, of the semantic information processed in an embodiment of the method according to the invention;

Fig. 2 a schematic representation of an example of a model of semantic relations on the basis of which embodiments of the method according to the invention were tested; and

Fig. 3 a diagram comparing the results achieved with embodiments of the method against those of other methods.

An embodiment of the method according to the invention is described below on the basis of the so-called Resource Description Framework, also referred to as RDF. This is a family of standards of the World Wide Web Consortium (W3C) for the formal description of information in the World Wide Web. A resource is an example of a semantic entity within the meaning of claim 1. A resource can be uniquely identified by a so-called Uniform Resource Identifier (URI). In RDF, semantic information is described by means of semantic relations in the form of triples consisting of subject, property and property value or, equivalently, of subject, predicate and object. From these triples, RDF produces a graph structure in which a triple is represented as a directed path from a node representing the subject to a node representing the object or property value. This path is labeled with the corresponding property of the triple. In the embodiment described here, it is assumed that a complete database of triples based on an RDF graph is available in digital form. This graph may, for example, be contained in web pages on the Internet and serve for the semantic description of web content.
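The triple and graph structure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the resource and property names (ex:Anna, ex:knows and so on) and the helper build_graph are assumptions made for the example and do not come from the described embodiment.

```python
from collections import defaultdict

# Hypothetical example triples (subject, predicate, object); all names are invented.
triples = [
    ("ex:Anna", "ex:knows", "ex:Ben"),
    ("ex:Anna", "ex:livesIn", "ex:CityX"),
    ("ex:Ben", "ex:livesIn", "ex:CityY"),
]


def build_graph(triples):
    """Map each subject node to its outgoing edges (predicate, object).

    Each triple becomes one directed, labeled edge from subject to object.
    """
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(triples)
```

Each key of graph is a subject node, and each list entry is one directed edge labeled with its property.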

The semantic descriptions in the RDF graph are based on the vocabulary of a corresponding description language, referred to as RDF Schema or RDFS. RDF and RDFS together form a common RDF graph for describing the corresponding triples. There are, however, various further ways of semantically describing content that can be used in the method according to the invention, such as OWL ontologies (OWL = Web Ontology Language), which build on RDF graphs and add further semantic expressiveness to them.

RDF graphs can be queried in a suitable manner in order to obtain corresponding semantic information from the graphs. For example, queries can be directed at the RDF graph that search for subjects having a property with a particular property value. Such information is queried, for example, with the query language SPARQL, which is a standard for querying RDF-specific information and for outputting the corresponding query results. SPARQL allows expressive queries to be formulated. In its basic functionality, a SPARQL query searches for graph patterns; where required, more expressive query patterns can also be generated, in particular in order to perform filtering and to format the output accordingly. As described in more detail further below, a corresponding extension of the SPARQL syntax allows new, previously unknown queries to be directed at the data set generated according to the invention. For a better understanding of the later explanations, some basic SPARQL commands are therefore briefly explained here.

A SPARQL query usually contains a so-called PREFIX command, which specifies the namespace, i.e. a corresponding directory in which the semantic entities under consideration are contained. The SPARQL command SELECT defines the output pattern of the query, and the WHERE command specifies which patterns of the corresponding RDF graph are to be searched. The WHERE command may contain suitable variables. More complicated queries can be achieved by means of groupings, optional patterns and alternative patterns. Furthermore, filters can be used to restrict the search pattern further. Filters can comprise numeric comparisons such as "greater than", "less than" and "equal to", special operators, Boolean operators and arithmetic operations. The output format can be modified via the commands CONSTRUCT, DESCRIBE and ASK. With CONSTRUCT, the output can be formatted as an RDF document. With the command MODIFY, output patterns can be manipulated. By means of keywords such as ORDER BY, DISTINCT and the like, the redundancy in the results found can be reduced.

So that the SPARQL queries described above lead to more meaningful results, information not explicitly contained in the original RDF graph can be derived by means of corresponding inference methods, also referred to as reasoning methods. A large number of software programs that can perform reasoning are known. The method according to the invention can be applied both to an original RDF graph and to an RDF graph in which the original knowledge base has been extended beforehand by reasoning.

In the following, embodiments of a novel method are described with which further triples that cannot be taken from the graph are derived from the known triples of an RDF graph by machine learning. These triples are provided with corresponding measures that can be interpreted as probabilities and indicate the probability with which the corresponding triple exists, i.e. with which the statement based on this triple is true. The prior art already includes approaches for learning semantic relations in the form of triples, based for example on global probabilistic models or on Markov blanket models. The known methods have disadvantages, however; in particular, they do not scale to larger data volumes or rely on assumptions that are only of limited suitability for analyzing the information.

In the embodiments of the invention described below, a statistical analysis of the information in an RDF graph is carried out. First, statistical units and a population are defined, for example by a user. In the present embodiment, statistical units are semantic entities in the form of corresponding resources in the RDF graph. These resources may, for example, be persons specified in the RDF graph who have particular properties or relations to other persons. A population is a set of statistical units for which the statistical inference described below is carried out. The population can be defined in various ways. For example, in a particular RDF graph it may comprise persons from a particular country or, alternatively, female students of a particular university. In the statistical analysis described below, only a subset of the population, a so-called sample, is examined.

Based on the definition of a statistical unit and a population, a node set is defined for each statistical unit; it is referred to below as a SUNS set (SUNS = Statistical Unit Node Set). In the following, U = {u} denotes the set of statistical units of the sample under consideration. The SUNS set for the statistical unit u, denoted SUNS_u, comprises all probabilistic nodes corresponding to all triples present in the RDF graph and all potential triples in which u is either a subject or an object. This is illustrated with reference to Fig. 1, whose left-hand part shows a fragment of an RDF graph processed in the invention with two resources or statistical units A and B. These resources are nodes in the graph and are connected via semantic relations in the form of corresponding triples to other semantic entities, likewise in the form of nodes. Relations that are present in the graph, or that are potentially given but not contained in the graph, are denoted by correspondingly directed arrows a₁, a₂, ..., a₇ and b₁, b₂, ..., b₇. If an arrow ends at resource A or B, that resource is the object of the corresponding triple. If an arrow starts from resource A or B, the resource is the subject of the corresponding triple. Triples present in the RDF graph are shown as solid arrows, whereas potential triples are shown as dashed arrows. The probabilistic nodes and SUNS sets described above can be derived from the RDF graph, which is indicated schematically in Fig. 1 by the arrow P. In Fig. 1, the SUNS set for the statistical unit A is indicated by the probabilistic nodes within the circle C1. By contrast, the SUNS set for the statistical unit B is indicated by the probabilistic nodes within the circle C2.
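As a rough illustration of how the nodes of a SUNS set might be collected, the following Python sketch separates the triples present in the graph, in which a unit u is subject or object, from potential triples. It rests on assumptions not stated in the text: the function names are hypothetical, the toy triples are invented, and potential triples are limited here to (u, p, o) combinations whose pair (p, o) already occurs for some other statistical unit.

```python
# Invented toy data: A and B are the statistical units of the sample.
units = ["A", "B"]
triples = [
    ("A", "likes", "X"),
    ("B", "likes", "Y"),
    ("Z", "knows", "A"),
]


def present_nodes(u, triples):
    """Triples present in the graph in which u is the subject or the object."""
    as_subject = [t for t in triples if t[0] == u]
    as_object = [t for t in triples if t[2] == u]
    return as_subject, as_object


def potential_nodes(u, units, triples):
    """Potential triples (u, p, o) that are not (yet) contained in the graph.

    Simplifying assumption: only pairs (p, o) observed for some unit are candidates.
    """
    present = set(triples)
    candidates = {(p, o) for s, p, o in triples if s in units}
    return sorted((u, p, o) for p, o in candidates if (u, p, o) not in present)


subject_nodes, object_nodes = present_nodes("A", triples)
missing_nodes = potential_nodes("A", units, triples)
```

In terms of Fig. 1, the present triples would correspond to solid arrows and the entries of missing_nodes to dashed arrows.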

Starting from the definition just given of SUNS sets of probabilistic nodes, a first restriction is that, where there are triples between statistical units of the form (u_i, p, u_j) with u_i, u_j ∈ U, the corresponding probabilistic node is a member of the SUNS set of one of the two statistical units but not of the SUNS set of the other. It should be noted that, in the following, the notation (x, p, y) denotes a triple with subject x, predicate p and object y. This restriction is necessary because the same probabilistic node would otherwise occur in two different SUNS sets, which would make the two SUNS sets strongly dependent on each other. The second restriction is that only those types of probabilistic nodes are considered below which occur in the RDF graph with a certain frequency, since a statistical analysis cannot make statements about rare events. Based on the above definitions and restrictions, a matrix is generated in the embodiment described here in the following four steps:

Step 1:

Based on the sample U = {u} of statistical units, the data matrix contains one row per statistical unit. Pairs (p, o) are considered for which a triple of the form (u, p, o) exists in the RDF graph for at least one u ∈ U. For each distinct pair (p, o), a column is created in the matrix. The matrix entry for a statistical unit u and a pair (p, o) is equal to 1 if the triple (u, p, o) is present in the RDF graph. Otherwise the entry is set to 0.

Step 2:

In addition, a column is created for each distinct p from the set of the above pairs (p, o). The matrix entry for a statistical unit u and a predicate p is equal to 1 if the triple (u, p, o) exists in the RDF graph for at least one object o; otherwise the corresponding matrix entry is 0.

Step 3:

In this step, pairs (s, p) are considered for which a triple of the form (s, p, u) exists in the RDF graph for at least one statistical unit u ∈ U. For each distinct pair (s, p), a column is created in the data matrix. The matrix entry for a statistical unit u and a pair (s, p) is equal to 1 if the triple (s, p, u) exists in the RDF graph. Otherwise the entry is 0.

Step 4:

In addition, a column is created for each distinct p from the set of the above pairs (s, p). The matrix entry for a statistical unit u and the predicate p is equal to 1 if the triple (s, p, u) exists in the RDF graph for at least one subject s. Otherwise the entry is 0.
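The four steps above can be sketched as follows. This is a minimal illustration under assumptions: the function name build_data_matrix, the tuple encoding of the columns ("po", "p_out", "sp", "p_in") and the toy triples are hypothetical and are not prescribed by the described embodiment.

```python
def build_data_matrix(units, triples):
    """Build the binary data matrix with one row per statistical unit."""
    triple_set = set(triples)
    unit_set = set(units)

    # Step 1: one column per distinct pair (p, o) with a triple (u, p, o), u in U.
    po_pairs = sorted({(p, o) for s, p, o in triples if s in unit_set})
    # Step 2: one column per distinct predicate p taken from those pairs.
    out_preds = sorted({p for p, _ in po_pairs})
    # Step 3: one column per distinct pair (s, p) with a triple (s, p, u), u in U.
    sp_pairs = sorted({(s, p) for s, p, o in triples if o in unit_set})
    # Step 4: one column per distinct predicate p taken from those pairs.
    in_preds = sorted({p for _, p in sp_pairs})

    columns = (
        [("po", p, o) for p, o in po_pairs]
        + [("p_out", p) for p in out_preds]
        + [("sp", s, p) for s, p in sp_pairs]
        + [("p_in", p) for p in in_preds]
    )

    matrix = []
    for u in units:
        row = []
        for col in columns:
            if col[0] == "po":
                hit = (u, col[1], col[2]) in triple_set
            elif col[0] == "p_out":
                hit = any(s == u and p == col[1] for s, p, _ in triples)
            elif col[0] == "sp":
                hit = (col[1], col[2], u) in triple_set
            else:  # "p_in"
                hit = any(o == u and p == col[1] for _, p, o in triples)
            row.append(1 if hit else 0)
        matrix.append(row)
    return columns, matrix


# Invented toy data: A and B are the statistical units.
units = ["A", "B"]
triples = [("A", "likes", "X"), ("B", "likes", "Y"), ("Z", "knows", "A")]
columns, matrix = build_data_matrix(units, triples)
```

For the toy data, the rows for A and B are [1, 0, 1, 1, 1] and [0, 1, 1, 0, 0]. A 0 only means that the triple is not known to exist.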

In a post-processing step following steps 1 to 4 above, those columns for which the number of ones is smaller than a threshold t are removed from the generated matrix. If there are triples between the statistical units of the form (u_i, p, u_j) with u_i, u_j ∈ U, the columns for u_j in which a statistical unit u_j acts as the object are removed. Thus each semantic statement or relation occurs only once in the data matrix.
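The post-processing step might look as follows in Python. The sketch assumes a hypothetical column encoding in which ("po", p, o) stands for a column of step 1 and ("sp", s, p) for a column of step 3. It further assumes that the duplicate statement between two statistical units is removed by dropping the ("sp", s, p) columns whose subject s is itself a statistical unit.

```python
def prune_columns(columns, matrix, units, t):
    """Drop rare columns and duplicate columns for triples between units."""
    unit_set = set(units)
    keep = []
    for j, col in enumerate(columns):
        ones = sum(row[j] for row in matrix)
        if ones < t:
            continue  # fewer than t ones: too rare for a statistical statement
        if col[0] == "sp" and col[1] in unit_set:
            continue  # (u_i, p, u_j) is already covered by a ("po", p, u_j) column
        keep.append(j)
    pruned_columns = [columns[j] for j in keep]
    pruned_matrix = [[row[j] for j in keep] for row in matrix]
    return pruned_columns, pruned_matrix


# Invented toy data: the graph contains only the triple (A, knows, B).
units = ["A", "B"]
columns = [("po", "knows", "B"), ("sp", "A", "knows"), ("po", "likes", "X")]
matrix = [
    [1, 0, 0],  # row for unit A
    [0, 1, 0],  # row for unit B
]
pruned_columns, pruned_matrix = prune_columns(columns, matrix, units, t=1)
```

After pruning, the statement (A, knows, B) is represented only once, in the row of A.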

It should be noted that the derivation of probabilistic statements described below depends strongly on the definition of a statistical unit, the definition of a population and the selection of a sample from the population. The probability measures may therefore vary if these quantities change.

In many applications it is also desirable to integrate into the matrix described above, which describes SUNS sets for given statistical units, information from outside the SUNS sets as well. For example, in a corresponding RDF graph, the state of health of a first person may under certain circumstances also be predicted from the state of health of a second person who, according to the semantic relation of friendship, is a friend of the first person; infectious diseases, for instance, can be transmitted between friends. This information can easily be added to the matrix described above in the form of additional columns. Corresponding probabilistic nodes (i.e. semantic relations) that describe properties of statistical units having semantic relations to the statistical units of the matrix rows are treated as fixed inputs in the matrix. These inputs are aggregated (e.g. averaged) property values. For example, a column may be provided which states, for the person of a given row, the average state of health of all friends of that person. The right-hand part of Fig. 1 again illustrates the use of probabilistic nodes outside a SUNS set as inputs in the matrix. For the probabilistic nodes of statistical unit A, corresponding probabilistic nodes of statistical unit B can be used as input I1. Likewise, for probabilistic nodes of statistical unit B, the probabilistic nodes of statistical unit A can be used as input I2.
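An aggregated input column of this kind could be computed as in the following sketch. The names, the numeric coding of the state of health (1.0 = ill, 0.0 = healthy) and the value 0.0 for persons without friends are all assumptions made for illustration.

```python
def mean_friend_value(person, friends, value):
    """Average a property value over all friends of a person.

    Returns 0.0 if the person has no friends with a known value (assumption).
    """
    known = [value[f] for f in friends.get(person, []) if f in value]
    return sum(known) / len(known) if known else 0.0


# Invented toy data: friendship relation and a numeric state of health.
friends = {"A": ["B", "C"], "B": ["A"]}
health = {"A": 1.0, "B": 0.0, "C": 1.0}

persons = ["A", "B", "C"]
# One additional fixed input column, aligned with the matrix rows for A, B and C.
extra_column = [mean_friend_value(p, friends, health) for p in persons]
```

The resulting column is appended to the data matrix as a fixed input and is not itself predicted.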

The matrix resulting from the above method is usually relatively large, binary (i.e. it contains only the values 0 and 1) and sparse. A value of 1 in the matrix means that there is a triple in the RDF graph containing the corresponding elements of the row and column. A value of 0 means that it is not known whether a triple with the corresponding row and column elements exists. Such a scenario has been studied in various contexts in the past. According to the invention, various matrix approximation methods known per se, in the form of matrix completion methods, are used for further processing of the matrix. In the embodiments of the invention described here, a method based on an eigenvector analysis of the matrix (described, for example, in document [1]), a method based on non-negative matrix factorization (described, for example, in document [2]) and a method based on latent Dirichlet allocation (described, for example, in document [3]) were implemented. The eigenvector analysis is based on a singular value decomposition. Non-negative matrix factorization is a matrix decomposition under the constraint that all terms in the factor matrices are non-negative. Latent Dirichlet allocation is based on a Bayesian treatment of a generative topic model. The methods are known per se, but their use in the context of the invention is new. All three matrix approximations estimate unknown matrix entries, i.e. matrix entries that were assigned the value 0 as described above, by approximating the matrix with a matrix of low rank.

To illustrate the principle of this matrix approximation, it is briefly explained on the basis of the singular value decomposition.

The original data matrix generated according to the method above, denoted M in the following, is decomposed as follows: $M = U \cdot D \cdot V^{T}$.

The matrix D is a diagonal matrix with diagonal entries of increasing size. This matrix is approximated by setting small diagonal values, i.e. diagonal values below a predetermined threshold, to 0. The approximated matrix $\tilde{D}$ is then used to compute back to the original matrix, i.e. the following is calculated: $\tilde{M} = U \cdot \tilde{D} \cdot V^{T}$.

In this way, a correspondingly approximated matrix is obtained in which the original entries of 0 are now occupied by values which, according to the invention, are measures of the truth of the triple, and thus of the semantic relation, formed from the corresponding row and column elements of the matrix. These measures are therefore interpreted as confidence values or probabilities. A digital data set is thereby generated in which a probability measure is additionally assigned to the individual semantic relations on the basis of the matrix entries. It should be noted that the matrix entries originally set to 1 retain this value even if it is changed in the approximated matrix $\tilde{M}$. This is because these entries are known to occur as triples in the semantically annotated information; these triples must therefore have a probability measure of 1.
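The low-rank approximation and the subsequent restoration of the known entries can be illustrated with a deliberately small Python sketch. It is a simplification under stated assumptions: only the largest singular value is kept (rank 1), it is computed by power iteration instead of a full singular value decomposition, the input matrix must be non-zero, and all function names and the example matrix are hypothetical.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank1_approximation(m, iterations=200):
    """Best rank-1 approximation sigma * u * v^T, found by power iteration."""
    mt = [list(col) for col in zip(*m)]  # transpose of m
    v = normalize([1.0] * len(m[0]))
    for _ in range(iterations):
        v = normalize(matvec(mt, matvec(m, v)))  # iterate on M^T M
    mv = matvec(m, v)
    sigma = math.sqrt(sum(x * x for x in mv))  # largest singular value
    u = [x / sigma for x in mv]
    return [[sigma * ui * vj for vj in v] for ui in u]


def complete(m, iterations=200):
    """Fill the unknown entries (0) with estimates; known entries (1) stay 1."""
    approx = rank1_approximation(m, iterations)
    return [
        [1.0 if m[i][j] == 1 else approx[i][j] for j in range(len(m[0]))]
        for i in range(len(m))
    ]


# Invented binary data matrix: rows are statistical units, columns are relations.
M = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]
completed = complete(M)
```

In this example the unknown entry in the second row and third column receives a value of about 0.58, which would be read as the probability measure of the corresponding triple. An implementation for realistic matrix sizes would keep more than one singular value and use a numerical library.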

The matrix approximations described above are specific embodiments of multivariate prediction models, also referred to as "multiple output prediction". These prediction models are characterized in that all outputs, i.e. all entries in the corresponding matrix, are predicted jointly. The reason is that some or all model parameters are sensitive to all outputs, so that such a prediction improves the estimation of these parameters.

In summary, the method according to the invention is implemented such that the statistical units and the population are first defined via a configuration file. The probabilistic nodes in the SUNS set are then selected automatically, and the data matrix is generated based on a sample of the population of suitable size. The size of the sample depends on the computing time that can be tolerated for training in the respective application. In a learning process, the probabilities for the existence of SUNS statements whose truth is not known are then estimated based on the matrix approximations described above. In a particularly preferred embodiment, the knowledge gained in this way is used to pose queries to the generated data set that take into account the statistical information in the form of the probability measures. In the embodiment described here, these queries are realized through extensions to the SPARQL query language already described above.
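The learning step summarized above can be illustrated with a small sketch. It is not the implementation of the patent: it assumes a tiny invented binary matrix `X` and approximates it by a rank-2 matrix using power iteration with deflation, a simple stand-in for the singular value decomposition named in the text. The function names `top_singular_triplet` and `low_rank_approximation` are hypothetical.

```python
import math
import random


def top_singular_triplet(a, iters=300, seed=0):
    """Dominant singular value and vectors of `a` (list of rows) via power iteration."""
    n_rows, n_cols = len(a), len(a[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]  # random start vector
    for _ in range(iters):
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = A v
        w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # nothing left to approximate
            return 0.0, [0.0] * n_rows, [0.0] * n_cols
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma == 0.0:
        return 0.0, [0.0] * n_rows, v
    return sigma, [x / sigma for x in u], v


def low_rank_approximation(a, rank):
    """Sum of the `rank` leading singular components, found one at a time (deflation)."""
    residual = [[float(x) for x in row] for row in a]
    approx = [[0.0] * len(a[0]) for _ in a]
    for _ in range(rank):
        sigma, u, v = top_singular_triplet(residual)
        for i in range(len(a)):
            for j in range(len(a[0])):
                term = sigma * u[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term
    return approx


# Invented example: rows are entities, columns are features; 0 means "not known".
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],  # entry (2, 1) is unknown; similar rows suggest it may be true
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
X_HAT = low_rank_approximation(X, rank=2)
```

In this toy case the reconstructed score `X_HAT[2][1]` comes out clearly larger than `X_HAT[2][2]`, which is the behavior the text relies on: entries stored as zero are replaced by values that can be ranked. Treating these scores as probabilities requires further assumptions that the sketch does not address.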

For illustration, a conventional query in the SPARQL language is first shown below:

The above query means that actors (actor) are searched for in the namespace ex (identified by the URL http://example.org/ ), subject to the following conditions:
Movies (movie) that were filmed in a city (city) are searched for. This is expressed by the line:
?movie ex:filmedIN ?city

Note that in SPARQL, variables are always prefixed with a ?.

Cities (city) located in the country Italy (Italy) are searched for. This is expressed by the line:
?city ex:inCountry ex:Italy.

Based on this, actors (actor) who appear in such movies are searched for. This is expressed by the line:
?actor ex:actIn ?movie.
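How the three triple patterns quoted above combine can be sketched in plain Python. This is only an illustration of pattern matching with shared variables, not a SPARQL engine; the triples in `TRIPLES` and the helper names `extend` and `solve` are invented for the example.

```python
# Toy triple store; every identifier below is invented for illustration.
TRIPLES = [
    ("ex:movie1", "ex:filmedIN", "ex:cityA"),
    ("ex:movie2", "ex:filmedIN", "ex:cityB"),
    ("ex:cityA", "ex:inCountry", "ex:Italy"),
    ("ex:cityB", "ex:inCountry", "ex:France"),
    ("ex:actor1", "ex:actIn", "ex:movie1"),
    ("ex:actor2", "ex:actIn", "ex:movie2"),
    ("ex:actor3", "ex:actIn", "ex:movie1"),
]

# The three patterns from the text; terms starting with "?" are variables.
PATTERNS = [
    ("?movie", "ex:filmedIN", "?city"),
    ("?city", "ex:inCountry", "ex:Italy"),
    ("?actor", "ex:actIn", "?movie"),
]


def extend(binding, pattern, triple):
    """Return `binding` extended so that `pattern` matches `triple`, or None on conflict."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def solve(patterns, triples):
    """Find all variable bindings that satisfy every pattern at once."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = extend(binding, pattern, triple)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


ACTORS = sorted({b["?actor"] for b in solve(PATTERNS, TRIPLES)})
# ACTORS == ["ex:actor1", "ex:actor3"]
```

Because ?movie and ?city occur in more than one pattern, only actors whose movie is linked to a city in ex:Italy survive all three steps.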

The above SPARQL query thus searches for all actors who appear in movies that were shot in an Italian city. With the database generated according to the invention, which contains corresponding probability values for relations, a modified query of the following form can now be created:

The above SPARQL query resembles the conventional query described earlier, but the construct WITH PROB and the variable ?prob have been added. This variable corresponds to the probability measure described above, and the WITH PROB construct causes the corresponding probability measures to be output. That is, the query searches for actors who probably appear in movies that were filmed in an Italian city. The generated output is additionally ordered by the ORDER BY command according to the magnitude of the probability measures. The query thus produces a list headed by actors who are known with certainty to have appeared in movies filmed in an Italian city; these are the relations with probability measure 1. The other actors found then follow in the list in descending order of their probability measures. If desired, the query may also contain the DISTINCT command, which removes duplicate entries from the list.
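The effect described for WITH PROB, ORDER BY and DISTINCT can be mimicked in plain Python. The sketch below does not reproduce the syntax of the extended query, which is not given here; it assumes, for illustration only, that the ex:actIn relation carries the probability measure while the filming locations are known. All data and the name `probable_actors` are invented.

```python
# Assumption: these movies are known to have been filmed in an Italian city.
ITALIAN_MOVIES = {"ex:movie1", "ex:movie3"}

# Assumption: probability measure for each (actor, movie) pair of ex:actIn.
# A value of 1.0 stands for a triple that is known to be true.
ACT_IN_PROB = {
    ("ex:actor1", "ex:movie1"): 1.0,
    ("ex:actor1", "ex:movie3"): 1.0,  # yields the same output row as the line above
    ("ex:actor2", "ex:movie1"): 0.7,
    ("ex:actor3", "ex:movie2"): 0.9,  # ex:movie2 is not in ITALIAN_MOVIES
    ("ex:actor4", "ex:movie3"): 0.2,
}


def probable_actors(act_in_prob, italian_movies):
    """Rows (actor, prob), duplicates removed, ordered by descending probability."""
    rows = {
        (actor, prob)
        for (actor, movie), prob in act_in_prob.items()
        if movie in italian_movies
    }  # building a set plays the role of DISTINCT
    return sorted(rows, key=lambda row: (-row[1], row[0]))  # ORDER BY prob, descending


RANKED = probable_actors(ACT_IN_PROB, ITALIAN_MOVIES)
# RANKED == [("ex:actor1", 1.0), ("ex:actor2", 0.7), ("ex:actor4", 0.2)]
```

The actor known with certainty heads the list, and the remaining actors follow in descending order of their probability measures, as the text describes.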

According to the invention, SPARQL queries can also be generated based on several probability measures. For example, a corresponding data set can be searched for patients who have diabetes with high probability and hepatitis with high probability. It is not possible, however, to search for a patient who with high probability has both diabetes and hepatitis, since this probability requires joint probabilistic information that is not stored in the data set.
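The distinction drawn above can be made concrete with a short sketch. It assumes invented per-patient probability measures and an arbitrary threshold of 0.7 for "high probability"; neither comes from the source, and the name `likely_both_separately` is hypothetical.

```python
# Assumption: stored probability measures per patient and diagnosis (invented values).
DIAGNOSIS_PROB = {
    "ex:patient1": {"ex:Diabetes": 0.9, "ex:Hepatitis": 0.8},
    "ex:patient2": {"ex:Diabetes": 0.9, "ex:Hepatitis": 0.1},
    "ex:patient3": {"ex:Diabetes": 0.75, "ex:Hepatitis": 0.95},
}


def likely_both_separately(diagnosis_prob, threshold=0.7):
    """Patients whose two individual probability measures each reach the threshold.

    This applies two separate conditions. It does not compute the probability
    that a patient has both diseases at once, because the data set holds no
    joint probabilistic information.
    """
    return sorted(
        patient
        for patient, probs in diagnosis_prob.items()
        if probs.get("ex:Diabetes", 0.0) >= threshold
        and probs.get("ex:Hepatitis", 0.0) >= threshold
    )


MATCHES = likely_both_separately(DIAGNOSIS_PROB)
# MATCHES == ["ex:patient1", "ex:patient3"]
```

Multiplying the two values to obtain a joint probability would only be valid under an independence assumption, which nothing in the data set justifies.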

In the following, results of the method according to the invention are presented based on an example data set and compared with conventional methods. A data set based on so-called friend-of-a-friend data (also referred to as FOAF) is considered. These data originate from the FOAF project (see also [4]), whose goal is to create a web of machine-readable pages describing people, their relations, their activities and their interests based on RDF technology. The FOAF data set is based on an RDFS/OWL ontology that is formally specified in the FOAF Vocabulary Specification 0.91.

To test the method according to the invention, a FOAF data set is considered that was generated from user profiles of the community website LiveJournal.com ( http://www.livejournal.com/ ). Fig. 2 shows the semantic relations contained in this data set that were taken into account in the test of the method. The representation in Fig. 2 is based on the usual RDF graph structure. Suitably specified persons P are considered as semantic entities. These persons have various relations, in the form of properties or predicates, to other semantic entities. According to Fig. 2, the relation k can exist between persons, which means that one person knows another person. According to Fig. 2, the following further relations exist, given below as triples together with their meaning:

(P, db, D): a person P has the date of birth D;
(P, h, I): a person P has the image I;
(P, a, S): a person P attends the school S;
(P, r, L): a person P resides at the location L;
(P, ho, OCA): a person P holds the online chat account OCA;
(P, po, NBP): a person P has posted the number of blogs NBP online.

In the test carried out, 636 persons were selected. These persons have friendship relations among one another, specified according to Fig. 2 by the attribute "k", according to which two persons know each other. On average, a person in the data set has 18 friends. Numerical property values according to Fig. 2, such as the date of birth D or the number of blogs NBP, were suitably discretized. This results in a data matrix that, after omitting columns with only few ones as entries, comprises 636 persons (i.e. 636 rows) and 491 columns. Of the 491 columns, 462 relate to the friendship relation according to the attribute k of Fig. 2. The remaining columns relate to the other attributes shown in Fig. 2, such as age, place of residence, number of posted blogs, school attended and the like.
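The construction of such a data matrix can be sketched as follows. The example assumes that each column stands for a (predicate, object) pair, that an entry is 1 when the corresponding triple exists, and that columns with fewer than `min_ones` ones are dropped. The triples, the threshold value and the name `build_matrix` are invented for illustration.

```python
def build_matrix(triples, min_ones=2):
    """Build a binary subject-by-feature matrix and drop sparse columns.

    Rows are the subjects found in `triples`; each column is a
    (predicate, object) pair. Columns with fewer than `min_ones` ones
    are omitted.
    """
    subjects = sorted({s for s, _, _ in triples})
    holders = {}  # (predicate, object) -> set of subjects having that feature
    for s, p, o in triples:
        holders.setdefault((p, o), set()).add(s)
    columns = sorted(f for f, subj in holders.items() if len(subj) >= min_ones)
    matrix = [[1 if s in holders[c] else 0 for c in columns] for s in subjects]
    return subjects, columns, matrix


# Invented toy data using the relation names k (knows), r (resides), a (attends).
TOY_TRIPLES = [
    ("ex:p1", "k", "ex:p2"),
    ("ex:p3", "k", "ex:p2"),
    ("ex:p1", "r", "ex:Munich"),
    ("ex:p2", "r", "ex:Munich"),
    ("ex:p3", "a", "ex:SchoolX"),  # only one person: this column is dropped
]
SUBJECTS, COLUMNS, MATRIX = build_matrix(TOY_TRIPLES)
# COLUMNS == [("k", "ex:p2"), ("r", "ex:Munich")]
# MATRIX  == [[1, 1], [0, 1], [1, 0]]
```

Numerical attributes such as a date of birth would first have to be discretized into categories before they can serve as objects in such pairs; the sketch omits that step.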

To test the method according to the invention, friendship relations between persons that were present in the original data set were selected at random and set to 0 in the corresponding matrix, i.e. the corresponding friendship relation was treated as unknown. The data matrix modified in this way was then processed according to the methods described above, i.e. a matrix approximation was carried out based on singular value decomposition, on non-negative matrix factorization and on latent Dirichlet allocation. The quality of the method can then be checked using the randomly selected friendship relations that were treated as unknown. If these held-out friendship entries receive correspondingly large probability measures compared with other unknown friendship relations, the method according to the invention achieves a good prediction.
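The hold-out step of this test protocol can be sketched as below. The sketch assumes a small invented matrix in which every column is a friendship column; the number of hidden entries, the seed and the name `mask_entries` are illustrative choices, not values from the source.

```python
import random


def mask_entries(matrix, friend_columns, n_hidden, seed=0):
    """Set `n_hidden` randomly chosen known friendship entries to 0.

    Returns the modified copy and the list of hidden (row, column) positions,
    which later serve as ground truth when the predictions are ranked.
    """
    rng = random.Random(seed)
    known = [
        (i, j)
        for i, row in enumerate(matrix)
        for j in friend_columns
        if row[j] == 1
    ]
    hidden = rng.sample(known, n_hidden)
    masked = [row[:] for row in matrix]  # leave the original untouched
    for i, j in hidden:
        masked[i][j] = 0
    return masked, hidden


ORIGINAL = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
MASKED, HIDDEN = mask_entries(ORIGINAL, friend_columns=[0, 1, 2], n_hidden=2)
```

After a matrix approximation has been applied to `MASKED`, the scores at the positions in `HIDDEN` can be compared with the scores of the other zero entries.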

To assess the quality of the data sets generated with the matrix approximations, the so-called NDCG value (NDCG = Normalized Discounted Cumulative Gain) was used to evaluate a generated ranking of the predicted friendships, ordered by descending probability measure. The NDCG value is a measure well known from statistics and is therefore not explained further here. The larger the NDCG value, the better the prediction according to the ranking. The various matrix approximations according to the invention were compared with benchmark methods. A baseline method was considered in which a random ranking was generated for all unknown friendship relations, i.e. each unknown friendship relation was assigned a random probability. A one-class support vector machine (SVM) was considered as a further benchmark method, with two different feature sets as input features. The first feature set contained only general attributes of persons, such as age, place of residence, number of posted blogs and the like. The support vector machine based on this feature set is referred to below as SVM1. The second feature set additionally took into account the friendship information for all persons. The support vector machine based on this second feature set is referred to below as SVM2.
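Since the text does not spell out the NDCG computation, the following sketch uses one common formulation as an assumption: binary gains and a discount of log2(rank + 1). The variant used in the reported test may differ. The item names and the function names `dcg` and `ndcg` are invented.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a gain list given in ranked order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(scores, relevant):
    """NDCG of the ranking induced by `scores` (item -> predicted score).

    `relevant` is the set of items that are truly positive, for example
    the friendship entries that were hidden before training.
    """
    ranked = sorted(scores, key=lambda item: -scores[item])
    gains = [1.0 if item in relevant else 0.0 for item in ranked]
    ideal = dcg(sorted(gains, reverse=True))  # best achievable ordering
    return dcg(gains) / ideal if ideal > 0 else 0.0


SCORES = {"a": 0.9, "b": 0.5, "c": 0.1}
BEST = ndcg(SCORES, relevant={"a"})   # relevant item ranked first  -> 1.0
WORST = ndcg(SCORES, relevant={"c"})  # relevant item ranked third  -> 0.5
```

A ranking that places all truly existing relations at the top yields the value 1, and the value decreases the further down those relations appear.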

Fig. 3 shows, based on the NDCG value described above, the results obtained with the variants of the method according to the invention in comparison with the benchmark methods. Fig. 3 is a diagram DI in which the NDCG value is plotted against the number of latent variables LV considered in the respective methods. The latent variables are the variables taken into account in the approximation and can in particular be equated with the rank of the correspondingly approximated matrix. In Fig. 3, curve K1 shows the baseline benchmark method, curve K2 the benchmark method based on SVM1, curve K3 the benchmark method based on SVM2, curve K4 the variant of the method according to the invention based on singular value decomposition, curve K5 the variant based on non-negative matrix factorization, and curve K6 the variant based on latent Dirichlet allocation.

As can be seen from Fig. 3, the highest NDCG values are always achieved by the variants of the method according to the invention. The error bars shown on the curves indicate the 95% confidence intervals based on the standard error of the mean. In particular, the variant based on latent Dirichlet allocation is not very sensitive to the number of latent variables, provided that the number is sufficiently large. The latent Dirichlet allocation variant reaches its maximum NDCG value at 50 latent variables, and performance does not degrade substantially when the number of latent variables is increased further. By contrast, the variants based on singular value decomposition and on non-negative matrix factorization are sensitive to the number of latent variables. Both reach their maximum NDCG value at 20 latent variables.
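Error bars of the kind mentioned above can be computed as in the following sketch. It assumes a normal approximation, so the interval is the mean plus or minus 1.96 standard errors; the source does not state which multiplier was used. The NDCG values and the name `mean_ci95` are invented.

```python
import math
import statistics


def mean_ci95(values):
    """Mean and approximate 95% confidence interval from the standard error of the mean."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error of the mean
    half_width = 1.96 * sem  # assumption: normal approximation
    return mean, mean - half_width, mean + half_width


# Invented NDCG values from repeated runs of one method at one setting.
RUNS = [0.2, 0.4, 0.6, 0.8]
MEAN, LOW, HIGH = mean_ci95(RUNS)
```

With only a few runs, a Student t multiplier would give a wider and more cautious interval than the factor 1.96 used here.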

As follows from the above explanations and in particular from the diagram of Fig. 3, various embodiments of the method according to the invention perform very well in that probability measures for unknown relations can be predicted well, with results substantially better than those of conventional methods. It has thus been demonstrated that probabilistic statements about semantic relations can be derived very well with the method according to the invention. The generated information can then be used to formulate suitable queries that take probability measures into account. In particular, the known query language SPARQL can be suitably extended in order to generate such queries.

Bibliography

[1] C. Lippert, Y. Huang, S. H. Weber, V. Tresp, M. Schubert, H. P. Kriegel: Relation prediction in multirelational domains using matrix factorization. Technical report, Siemens (2008)
[2] H. S. Seung, D. D. Lee: Learning the parts of objects by non-negative matrix factorization. Nature (1999)
[3] D. M. Blei, A. Y. Ng, M. I. Jordan: Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003)
[4] D. Brickley, L. Miller: The Friend of a Friend (FOAF) project, http://www.foaf-project.org/

CITATIONS INCLUDED IN THE DESCRIPTION

This list of documents cited by the applicant was generated automatically and is included solely for the reader's information. The list is not part of the German patent or utility model application. The DPMA accepts no liability for any errors or omissions.

Cited non-patent literature

- http://example.org/ [0054]
- http://www.livejournal.com/ [0062]

Claims

Method for the computer-aided processing of digital semantically annotated information comprising a plurality of triples (a₁, a₂, ..., b₇), wherein a triple (a₁, a₂, ..., b₇) contains a semantic entity (A, B) in the form of a subject, a semantic entity in the form of a predicate representing a property, and a semantic entity (A, B) in the form of an object representing a property value, in which:
– for an entity set consisting of at least some of the entities (A, B) contained in the plurality of triples (a₁, a₂, ..., b₇), a matrix comprising a first and a second dimension is generated, wherein the first dimension comprises as elements the entities (A, B) of the entity set and the second dimension comprises as elements predetermined features of triples (a₁, a₂, ..., b₇), wherein at least one triple (a₁, a₂, ..., b₇) in the plurality of triples (a₁, a₂, ..., b₇) comprises a predetermined feature and an element of the first dimension, wherein the matrix has a predetermined non-zero value as its entry for each combination of an element of the first dimension and an element of the second dimension for which a triple (a₁, a₂, ..., b₇) comprising the element of the first dimension and the element of the second dimension exists in the plurality of triples (a₁, a₂, ..., b₇), and has the value zero as its entry for the other combinations of an element of the first dimension and an element of the second dimension;
– for the zero entries of the matrix, prediction values are determined by means of a multivariate prediction model and replace the respective zero entries, whereby a digital data set is generated that contains the combinations of the elements of the matrix as semantic relations and the respective associated entry of the matrix as a probability measure for the truth of the semantic relation.

Method according to claim 1, in which the triples (a₁, a₂, ..., b₇) are described by an RDF graph, wherein the semantic entities (A, B) of the entity set are preferably resources.

Method according to claim 1 or 2, in which the predetermined value is the value one.

Method according to one of the preceding claims, in which the multivariate prediction model is based on a matrix approximation, in particular on a low-rank matrix approximation.

Method according to claim 4, in which the matrix approximation is based on one of the following methods:
– singular value decomposition;
– non-negative matrix factorization;
– latent Dirichlet allocation.

Method according to one of the preceding claims, in which, after the matrix has been generated, those elements of the second dimension whose number of entries with the predetermined value is smaller than a predetermined threshold are removed from the matrix.

Method according to one of the preceding claims, in which the matrix comprises, as elements of the second dimension, first elements in the form of pairs of predicate and object, for each of which at least one triple (a₁, a₂, ..., b₇) exists having an entity (A, B) from the entity set as subject and the respective pair as predicate and object, wherein the matrix has the predetermined value as its entry for each combination of an element of the first dimension and a first element of the second dimension for which a triple (a₁, a₂, ..., b₇) exists in the plurality of triples (a₁, a₂, ..., b₇) having the element of the first dimension as subject and the first element of the second dimension as predicate and object, and has the value zero as its entry for the other combinations of an element of the first dimension and a first element of the second dimension.

Method according to one of the preceding claims, in which the matrix comprises, as elements of the second dimension, second elements in the form of pairs of subject and predicate, for each of which at least one triple (a₁, a₂, ..., b₇) exists having the respective pair as subject and predicate and an entity (A, B) from the entity set as object, wherein the matrix has the predetermined value as its entry for each combination of an element of the first dimension and a second element of the second dimension for which a triple (a₁, a₂, ..., b₇) exists in the plurality of triples (a₁, a₂, ..., b₇) having the second element of the second dimension as subject and predicate and the element of the first dimension as object, and has the value zero as its entry for the other combinations of an element of the first dimension and a second element of the second dimension.

Method according to one of the preceding claims, in which the matrix comprises, as elements of the second dimension, third elements in the form of predicates, for each of which at least one triple (a₁, a₂, ..., b₇) exists having an entity (A, B) from the entity set as subject, the respective predicate as predicate and any object contained in the plurality of triples (a₁, a₂, ..., b₇), wherein the matrix has the predetermined value as its entry for each combination of an element of the first dimension and a third element of the second dimension for which a triple (a₁, a₂, ..., b₇) exists in the plurality of triples (a₁, a₂, ..., b₇) having the element of the first dimension as subject, the third element of the second dimension as predicate and any object contained in the plurality of triples (a₁, a₂, ..., b₇), and has the value zero as its entry for the other combinations of an element of the first dimension and a third element of the second dimension.

Method according to one of the preceding claims, in which the matrix comprises, as elements of the second dimension, fourth elements in the form of predicates, for each of which at least one triple (a₁, a₂, ..., b₇) exists with any subject contained in the plurality of triples (a₁, a₂, ..., b₇), the respective predicate as predicate and an entity (A, B) from the entity set as object, wherein the matrix has the predetermined value as its entry for each combination of an element of the first dimension and a fourth element of the second dimension for which a triple (a₁, a₂, ..., b₇) exists in the plurality of triples (a₁, a₂, ..., b₇) with any subject contained in the plurality of triples (a₁, a₂, ..., b₇), the fourth element of the second dimension as predicate and the element of the first dimension as object, and has the value zero as its entry for the other combinations of an element of the first dimension and a fourth element of the second dimension.
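
The predicate-based columns of the two preceding claims (entity as subject of a predicate, entity as object of a predicate) might be sketched as follows. All names, the sample triples and the value 1 for the predetermined value are hypothetical; concatenating both column groups into one row is a design choice of this sketch, not something the claims prescribe.

```python
# Hypothetical example triples (subject, predicate, object); not taken from the patent.
TRIPLES = [
    ("Jack", "knows", "Jane"),
    ("Jane", "livesIn", "Munich"),
    ("Siemens", "employs", "Jack"),
]
ENTITY_SET = ["Jack", "Jane"]  # elements of the first dimension (rows)
PREDETERMINED_VALUE = 1  # assumption: any fixed non-zero value would do


def predicate_columns(triples, entities):
    """Return predicates seen with an entity as subject and with an entity as object."""
    members = set(entities)
    outgoing = sorted({p for s, p, o in triples if s in members})
    incoming = sorted({p for s, p, o in triples if o in members})
    return outgoing, incoming


def build_rows(triples, entities, outgoing, incoming):
    """One row per entity: outgoing-predicate indicators, then incoming-predicate indicators."""
    has_out = {(s, p) for s, p, _ in triples}  # entity is subject, object is arbitrary
    has_in = {(o, p) for _, p, o in triples}   # entity is object, subject is arbitrary
    rows = []
    for e in entities:
        out_part = [PREDETERMINED_VALUE if (e, p) in has_out else 0 for p in outgoing]
        in_part = [PREDETERMINED_VALUE if (e, p) in has_in else 0 for p in incoming]
        rows.append(out_part + in_part)
    return rows


OUTGOING, INCOMING = predicate_columns(TRIPLES, ENTITY_SET)
ROWS = build_rows(TRIPLES, ENTITY_SET, OUTGOING, INCOMING)
print(OUTGOING, INCOMING)
print(ROWS)
```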

Method according to one of the preceding claims, in which the matrix further contains, as fifth elements, property values aggregated from the plurality of triples (a₁, a₂, ..., b₇) for one or more entities (A, B) that are contained together with a respective element of the first dimension of the matrix in one or more common triples (a₁, a₂, ..., b₇).
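
One possible reading of such an aggregated column is sketched below: for each row entity, a numeric property of the entities that share a triple with it is averaged. The claim does not name an aggregation function or a property, so the mean, the `age` property, the `knows` link and all identifiers are assumptions of this sketch.

```python
# Hypothetical example triples (subject, predicate, object); not taken from the patent.
TRIPLES = [
    ("Jack", "knows", "Jane"),
    ("Jack", "knows", "Joe"),
    ("Jane", "age", 30),
    ("Joe", "age", 40),
    ("Jack", "age", 25),
]
ENTITY_SET = ["Jack", "Jane", "Joe"]  # elements of the first dimension (rows)


def aggregated_property(triples, entities, link_predicate, prop):
    """For each entity, average `prop` over entities sharing a `link_predicate` triple with it."""
    values = {s: o for s, p, o in triples if p == prop}
    column = []
    for e in entities:
        neighbours = set()
        for s, p, o in triples:
            if p != link_predicate:
                continue
            if s == e:
                neighbours.add(o)
            elif o == e:
                neighbours.add(s)
        observed = [values[n] for n in neighbours if n in values]
        # Assumption: 0.0 stands for "no linked entity with a known value".
        column.append(sum(observed) / len(observed) if observed else 0.0)
    return column


MEAN_AGE_OF_ACQUAINTANCES = aggregated_property(TRIPLES, ENTITY_SET, "knows", "age")
print(MEAN_AGE_OF_ACQUAINTANCES)
```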

Method according to one of the preceding claims, in which the semantically annotated digital information comprises medical data, in particular information on medical image data.

Method according to one of the preceding claims, in which the semantically annotated digital information is contained in semantically annotated web pages.

Method for querying information from a digital data set generated by a method according to one of the preceding claims, wherein one or more queries for a semantic relation are directed at the digital data set, taking into account the probability measure for the truth of the semantic relation.

Method according to claim 14, in which the query or queries search for semantic relations that have a probability measure within a predetermined value interval.

Method according to claim 14 or 15, in which the results of the query or queries contain the probability measures assigned to the semantic relations found.
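
Querying by probability interval and returning the probability measure with each hit could look roughly like the following sketch. The record type `WeightedRelation`, the function `query`, the sample relations and their probability values are invented for illustration and do not come from the patent.

```python
from typing import NamedTuple


class WeightedRelation(NamedTuple):
    """A semantic relation (triple) together with its probability measure."""
    subject: str
    predicate: str
    obj: str
    probability: float


# Hypothetical data set: known relations carry 1.0, predicted ones a model score.
DATA_SET = [
    WeightedRelation("Jack", "knows", "Jane", 1.0),
    WeightedRelation("Jane", "knows", "Joe", 0.72),
    WeightedRelation("Joe", "knows", "Jack", 0.31),
    WeightedRelation("Jane", "livesIn", "Munich", 0.9),
]


def query(data_set, predicate, lower, upper):
    """Return relations with the given predicate whose probability lies in [lower, upper]."""
    hits = [
        r for r in data_set
        if r.predicate == predicate and lower <= r.probability <= upper
    ]
    # Most probable relations first; the probability stays part of each result.
    return sorted(hits, key=lambda r: r.probability, reverse=True)


RESULTS = query(DATA_SET, "knows", 0.5, 1.0)
for r in RESULTS:
    print(r.subject, r.predicate, r.obj, r.probability)
```

Keeping the probability inside each result record lets a caller rank or threshold the hits further without a second lookup.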

Method according to one of claims 14 to 16, in which the query or queries contain a syntax based on SPARQL.

Computer program product with program code stored on a machine-readable carrier for carrying out a method according to one of the preceding claims when the program runs on a computer.