WO2023208396A1 - A computer-implemented active learning method in the field of an imbalanced matching situation and a corresponding system - Google Patents

A computer-implemented active learning method in the field of an imbalanced matching situation and a corresponding system

Info

Publication number
WO2023208396A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
sampling strategy
data points
uncertainty
executing
Prior art date
Application number
PCT/EP2022/076440
Other languages
French (fr)
Inventor
Jonathan Fuerst
Bin Cheng
Original Assignee
NEC Laboratories Europe GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Laboratories Europe GmbH
Publication of WO2023208396A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/091 Active learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Definitions

  • the present invention relates to a computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets.
  • the present invention relates to a corresponding system for carrying out the computer-implemented active learning method.
  • This document discloses a feature generation method for long-tail classification using a deep learning approach and describes that the method includes the creation of input signals/samples as data input to a generative classifier model and the generative classifier model generates labels for the corresponding input data samples. The said labels and the output of the generative model are transferred to train the discriminative model to generate long-tail classification predictions. Furthermore, the document discloses a two-stage sampling approach including instance sampling and balanced sampling referred to as long-tail sampling.
  • users are interested in finding same-as/equal relationships between elements in disjoint datasets.
  • users might need to integrate various data sources related to a non-residential building such as Industry Foundation Classes, IFC, building management system, BMS, data, energy management system, EMS, data, facility management system, FMS, Calendar entries, OpenStreetMap, Security System etc. to realize some novel application, such as advanced energy/carbon saving applications.
  • This integration scenario involves matching different data schemas/ontologies and subsequently matching records that belong to the same real-world entity, e.g., records that refer to the same room in the building.
  • Active Learning describes ML methods which actively query an oracle, e.g., a system, expert user, database etc., to label data points. These labeled data points are then used to train a supervised ML model.
  • an oracle e.g., a system, expert user, database etc.
  • the most informative samples are usually the ones which the trained model is most uncertain about, i.e., they are at the decision boundary. Different metrics have been proposed to quantify this uncertainty, e.g.:
  • Entropy is the basis for mutual information, which quantifies the relation between two variables, and for relative entropy - Kullback-Leibler divergence - and cross entropy. Higher values of entropy indicate greater uncertainty.
  • Active learning can be bootstrapped with a random sampling strategy, which is not efficient for imbalanced datasets such as the described matching problems, where only < 0.1 % of all combinations are matches [5].
  • weak supervision provides an initial set of noisy weak outputs in this context. Active learning is then used to improve the quality of these noisy weak outputs by providing high-quality, i.e., highly accurate, labels for a small set of informative samples.
  • the problem with using existing uncertainty measures in this scenario is that the query selection strategy is biased to the provided weak-supervision signals. This is especially an issue for imbalanced classification problems.
  • weak supervision signals e.g., encoded in data programming-style labeling functions
  • weak supervision signals might only use a subset of attributes of the entity in their matching heuristics, i.e., to decide if a combination is a match or a non-match.
  • Diversity sampling-based approaches are in general addressing this issue by trying to sample data points that are more representative of the underlying population, e.g., through:
  • the aforementioned object is accomplished by a computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, comprising the following steps:
  • a system for carrying out the computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets comprising:
  • - outputting means for outputting a final model for prediction, if a defined budget of labels for the data points is depleted.
  • the method and system according to the invention provide a high degree of performance of the method by simple means.
  • the uncertainty-based sampling strategy can comprise providing a set X of candidate match pairs of data points from a first source A and a second source B.
  • candidate match pairs can for example comprise ontology class pairs or entity pairs.
  • the uncertainty-based sampling strategy can comprise creating of a set of weak supervision signals or functions for a matching problem and outputting a set of weak supervision sources L.
  • Such weak supervision signals can be in the form of weak supervision functions.
  • the matching problem can comprise for example ontology, schema, entity etc.
  • the weak supervision signals or functions can be combined in a generative process and at least one label can be created for non-abstained candidate match pairs.
  • the input in this step can be the data pairs X and the set of weak supervision sources L.
  • the output can be predictions and probabilities for non-abstained input data.
  • the at least one label and a set of features can be used to train a discriminative machine learning model to then create predictions for all candidate matches or candidate match pairs or create a temporary classifier.
  • the input in this step can be the input data pairs X and the predictions mentioned in the last paragraph.
  • the output can be predictions and probabilities for all the input data.
  • sampled data points can be provided to an oracle, wherein the oracle can be a computer system or database.
  • the oracle can provide at least one label for the sampled data points.
  • the oracle can comprise a defined budget of labels.
  • the at least one label can be used to re-train a discriminative model or to tune a generative modeling process or to suggest/create new weak supervision sources or signals or functions.
  • the generative modeling process can be tuned through influencing the modelled accuracy of each affected weak supervision source.
  • performing a detection of a defined saturation level in an improvement of a model performance can be based on validating the trained model or uncertainty-based sampling strategy with some test dataset or on a defined value of distance between the outputs of the generative model and the discriminative model.
  • the input data X and the predictions and/or probabilities for all the input data can be inputted.
  • the output can be oracle labels/predictions for selected data points.
  • the long-tail sampling strategy can comprise computing a feature vector for each data point that needs to be matched. This will contribute to high performance of the method.
  • the long-tail sampling strategy can comprise a MinHash Locality-Sensitive Hashing, LSH, based similarity computation.
  • a MinHash signature can be created for each data point.
  • the signature encodes one or more feature vectors. This is a very efficient encoding of feature vectors.
  • the long-tail sampling strategy can comprise computing an informativeness metric for each data point based on a linear combination of uncertainty, similarity and weak label availability. This allows for a flexible and adjustable linear combination of information contained in weak labels, model uncertainty and Jaccard similarity.
  • a set of candidate match pairs of data points from Source A and Source B, e.g., ontology class pairs or entity pairs.
  • a set of weak supervision sources L = {L_1, L_2, L_3, ..., L_N}.
  • Input: The input data X and the set of weak supervision sources L.
  • Predictions and probabilities for all the input data, y_all. 4) Perform saturation point detection based on the outputs of the generative and discriminative models and then, based on the outcome: a. perform a traditional active learning sampling strategy, e.g., entropy-based; b. perform the long-tail focused sampling strategy.
  • Input: The input data X and the predictions y_all.
  • Embodiments of the invention provide a more efficient sampling strategy for imbalanced matching problems. With state-of-the-art methods, one needs to sample and label many more data points to achieve the same performance. This results in more effort/cost. Embodiments of the invention are able to sample more informative data points and thus are able to achieve better matching performance.
  • An embodiment of the present invention comprises a two-step active learning method for imbalanced matching problems, combining weak-supervision with two subsequent active learning phases, preferably for detecting difficult currently occurring long-tail matches.
  • a further embodiment of the present invention can be a weakly-supervised active learning method.
  • Embodiments of the invention can present a two-step active learning method for imbalanced matching problems, e.g., ontology, schema or entity matching, that can combine weak supervision with two subsequent active learning phases to detect difficult but commonly occurring long-tail matches. It can use traditional uncertainty-based sampling with performance saturation detection and can then switch to a second sampling method that combines traditional uncertainty metrics, an efficient MinHash LSH based similarity computation, and the available weak supervision signals. Overall, this can achieve, among other things, a better cost-performance trade-off compared to existing methods.
  • Fig. 1 shows in a diagram an overview of an embodiment of the present invention
  • Fig. 2 shows an algorithm sketch of an embodiment of a long-tail sampling strategy
  • Fig. 3 shows a further embodiment wherein matched data is used to create a digital twin and execute optimized building control
  • Fig. 4 shows a further embodiment wherein different city data is matched to enable cross-silo optimization applications and
  • Fig. 5 shows a further embodiment comprising creation and deployment of matchings between heterogeneous railroad device infrastructure to enable Smart Railway Management.
  • Embodiments of the invention present a two-phase active learning strategy that addresses the “long tail” problem of finding potential informative matches efficiently.
  • in the first phase, a traditional sampling strategy is executed until a saturation point in model performance improvement is detected.
  • the method then switches to a second sampling strategy that focuses on these long-tail samples using a computationally efficient method tailored to such matching problems, based on MinHash LSH similarity computation, traditional uncertainty metrics, and the available weak supervision signals.
  • this two-step method achieves higher performance with the same amount of labeling effort, i.e., less time and effort is needed for annotating data points to achieve equal or even better matching quality.
  • a set of weak-supervision functions creates a set of labels in a generative process for the data points. These labels are then used to train a discriminative model, creating a temporary classifier. The temporary classifier then labels the data. As long as labeling budget is available, active learning iterations are performed as follows: First, a saturation detection is performed to decide whether default, uncertainty-based sampling strategies are still sufficient. Then either such a default sampling strategy is executed or, if saturation has been reached, a long-tail sampling strategy is performed. In either case the sampled data points are provided to the oracle, such as a computer system or database, which then provides labels for the sampled data points. These new, annotated labels can be used 1.) directly to retrain the discriminative model, 2.) to tune the generative modeling process - through influencing the modelled accuracy of each affected weak supervision source - and 3.) as an extension, to suggest/create new weak-supervision sources.
  • the last trained discriminative classifier is outputted. This classifier is then used as final model for prediction.
  • the simplest way to detect saturation would be to validate the trained model with some test dataset.
  • obtaining a representative test data set is costly and active learning is often applied in scenarios where users exactly want to avoid collecting large, labeled datasets.
  • Compute a feature vector for each data point e.g., a class in an ontology, a data record in a database.
  • Such a feature vector may employ word and sentence embedding models for text attributes or apply data augmentation techniques to augment the raw data.
  • Output A feature vector for each data point that needs to be matched.
  • Non-residential buildings such as office buildings account for a large amount of that energy.
  • These buildings usually consist of several sub-systems, such as for lighting, heating, ventilation and air conditioning, HVAC, access control, room booking etc.
  • different data schemas/ontologies exist that usually originate in the respective sub-systems and might thus use different naming conventions.
  • BIM Building Information Models
  • IFC Industry Foundation Classes
  • Embodiments of the present invention will accelerate the integration and with little effort enable a digital twin construction and a more optimal operation of the physical building twin.
  • Smart railways are usually managed with a large number of heterogeneous devices from various manufacturers and suppliers. For better system efficiency and automation, those devices very often need to interact with each other either directly or indirectly via some intermediate data processing.
  • domain experts must spend a lot of effort and time on mapping their data schemas. This is a tedious and time-consuming task. Also, it is not a one-time task. As old devices become defective and are replaced with new devices over time, and as the same device might come from different vendors, domain experts must always be engaged in this data integration loop to ensure the correctness of the data schema mapping.
  • Embodiments of the present invention could help domain experts to map the data schemas between devices more efficiently and faster.
  • embodiments of the long-tail sampling strategy can also be used as the only sampling strategy, fully replacing traditional sampling strategies. This can make sense either when the weak supervision sources are representative and accurate, so that there is no (or not much) room for improvement, or when, instead of weak-supervision signals, the ML model is trained with a representative set of labeled data points.
  • the oracle might also provide new weak-supervision signals or the oracle’s annotations might be used to generate new weak-supervision signals or tune existing signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

For providing a high degree of performance of a computer-implemented active learning method by simple means a computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, is provided, comprising the following steps: executing an uncertainty-based sampling strategy on a defined set of data points; performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model; further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching to executing a long-tail sampling strategy, if such a saturation level is detected; and outputting a final model for prediction, if a defined budget of labels for the data points is depleted. Further, a corresponding system for carrying out the computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, is provided, comprising: executing means for executing an uncertainty-based sampling strategy on a defined set of data points; performing means for performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model; executing means for further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching means for switching to executing means for executing a long-tail sampling strategy, if such a saturation level is detected; and outputting means for outputting a final model for prediction, if a defined budget of labels for the data points is depleted.

Description

A COMPUTER-IMPLEMENTED ACTIVE LEARNING METHOD IN THE FIELD OF AN IMBALANCED MATCHING SITUATION AND A CORRESPONDING SYSTEM
The present invention relates to a computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets.
Further, the present invention relates to a corresponding system for carrying out the computer-implemented active learning method.
Corresponding prior art documents are listed as follows:
[1] Vlachos, Andreas. "A stopping criterion for active learning." Computer Speech & Language 22, no. 3, 2008: 295-312.
[2] Laws, F. and Schutze, H., 2008, August. Stopping criteria for active learning of named entity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics, Coling, 2008, pp. 465-472.
[3] Broder, Andrei Z. "Identifying and filtering near-duplicate documents." In Annual Symposium on Combinatorial Pattern Matching, pp. 1-10. Springer, Berlin, Heidelberg, 2000.
[4] Ruhling Cachay, Salva, Benedikt Boecking, and Artur Dubrawski. "End-to-End Weak Supervision." Advances in Neural Information Processing Systems 34, 2021.
[5] Stephen Mussmann, Robin Jia, and Percy Liang. 2020. On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks. In Findings of the Association for Computational Linguistics: EMNLP 2020. 3400-3413.
Further prior art documents:
Sunita Sarawagi, Anuradha Bhamidipaty. “Interactive Deduplication using Active Learning”, 2002. This document discloses an interactive deduplication method using an active learning approach and describes that the method includes the creation of weak supervision signals as noisy data input to a generative classifier model and the generative classifier model generates labels for the corresponding noisy data. The said labels and the output of the generative model are transferred to train the discriminative model to generate predictions. Furthermore, the uncertainty or saturation point is detected based on a comparison of the outputs of both the classifiers and then weighted sampling is performed. The final output of the last discriminative model is used for further predictions.
Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, Makarand Tapaswi. “Feature Generation for Long-tail Classification”, November 10, 2021. This document discloses a feature generation method for long-tail classification using a deep learning approach and describes that the method includes the creation of input signals/samples as data input to a generative classifier model and the generative classifier model generates labels for the corresponding input data samples. The said labels and the output of the generative model are transferred to train the discriminative model to generate long-tail classification predictions. Furthermore, the document discloses a two-stage sampling approach including instance sampling and balanced sampling referred to as long-tail sampling.
Shyamgopal Karthik, Jerome Revaud, Chidlovskii Boris. “Learning From Long-Tailed Data With Noisy Labels”, September 12, 2021. This document discloses a learning approach for long-tailed data with noisy labels and describes that the method includes the creation of input signals/samples as noisy data input to the first - generative classifier - model and the respective model generates labels for the corresponding input data samples. The said labels and the output of the first model are transferred to train the discriminative model to generate long-tail classification predictions. Furthermore, the document discloses a two-stage sampling approach including instance sampling and long-tail sampling.
In data integration/preparation, users are faced with several matching problems, from schema and ontology matching to entity matching/resolution. In essence, in these problems users are interested in finding same-as/equal relationships between elements in disjoint datasets. E.g., users might need to integrate various data sources related to a non-residential building, such as Industry Foundation Classes, IFC, building management system, BMS, data, energy management system, EMS, data, facility management system, FMS, calendar entries, OpenStreetMap, security system etc., to realize some novel application, such as advanced energy/carbon saving applications. This integration scenario involves matching different data schemas/ontologies and subsequently matching records that belong to the same real-world entity, e.g., records that refer to the same room in the building. To solve these matching problems, others have applied various solutions, from heuristic and unsupervised approaches, e.g., based on matching heuristics and lexical similarity, to, increasingly, supervised machine learning, ML, solutions, most recently based on deep learning, which tend to outperform previous works. However, one problem with supervised machine learning is the need for large, labeled datasets.
One way to reduce this labeling bottleneck is Active Learning.
Active Learning describes ML methods which actively query an oracle, e.g., a system, expert user, database etc., to label data points. These labeled data points are then used to train a supervised ML model. As the oracle represents a bottleneck in terms of cost/effort, it is of paramount importance to select the “most informative” samples for labeling, as opposed to just random sampling. The most informative samples are usually the ones which the trained model is most uncertain about, i.e., they are at the decision boundary. Different metrics have been proposed to quantify this uncertainty, e.g.:
• Least Confidence Uncertainty
Based on the difference between the most confident class prediction and 100% confidence, selecting the sample which has the lowest probability.
• Smallest Margin Uncertainty/Margin of confidence
Select based on the smallest difference/margin between the two topmost predictions.
• Largest Margin Uncertainty
Select based on the smallest difference/margin between the top and the bottom prediction.
• Ratio of Uncertainty
Based on the ratio of the top two predicted classes.
• Entropy Reduction
Entropy is the basis for mutual information, which quantifies the relation between two variables, and for relative entropy - Kullback-Leibler divergence - and cross entropy. Higher values of entropy indicate greater uncertainty.
$$\text{Entropy} = -\sum_{y} p(y)\,\log p(y)$$
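The uncertainty metrics listed above can be illustrated with a minimal Python sketch. It assumes that each data point comes with a vector of predicted class probabilities summing to one; the function names, the direction chosen for the ratio (second-best divided by best) and the helper `most_uncertain` are illustrative assumptions and are not taken from the cited works.

```python
import math
from typing import Dict, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Gap between 100% confidence and the most confident class prediction."""
    return 1.0 - max(probs)


def smallest_margin(probs: Sequence[float]) -> float:
    """Difference between the two topmost predictions (smaller = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def ratio_of_uncertainty(probs: Sequence[float]) -> float:
    """Ratio of the top two predictions; assumed here as second / top,
    so values close to 1 indicate high uncertainty."""
    top, second = sorted(probs, reverse=True)[:2]
    return second / top


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy -sum p*log(p); zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_uncertain(pool: Dict[str, Sequence[float]]) -> str:
    """Return the key of the data point with the highest entropy."""
    return max(pool, key=lambda key: entropy(pool[key]))
```

For a binary match/non-match problem, a pair predicted as `[0.55, 0.45]` would be queried before a pair predicted as `[0.99, 0.01]`, since its entropy is higher.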
Active learning can be bootstrapped with a random sampling strategy, which is not efficient for imbalanced datasets such as the described matching problems, where only < 0.1 % of all combinations are matches [5]. Thus, to bootstrap active learning faster, with less effort, weak supervision has been proposed [4]. Weak supervision provides an initial set of noisy weak outputs in this context. Active learning is then used to improve the quality of these noisy weak outputs by providing high-quality, i.e., highly accurate, labels for a small set of informative samples. However, the problem with using existing uncertainty measures in this scenario is that the query selection strategy is biased to the provided weak-supervision signals. This is especially an issue for imbalanced classification problems. For example, in entity matching, weak supervision signals, e.g., encoded in data programming-style labeling functions, might only use a subset of attributes of the entity in their matching heuristics, i.e., to decide if a combination is a match or a non-match.
If an unused attribute is, however, a strong discriminator for the matching task, users might miss the affected entity pairs and consider them non-matches based on the other attributes. These false negatives are unfortunately also unlikely to be found through training a discriminative ML model. This is because the potentially discriminative features will be deemed not discriminative with the provided labeled weak-supervision data. Even worse, this leads to the problem that data samples appear to have a low uncertainty, as they are considered clear non-matches based on the labeling functions and features used, and are thus not likely to be selected with priority by the active learning sampling strategy. In practical applications of active learning, this observed behavior results in a quick improvement during the first iterations, i.e., the active learning query strategy is selecting valuable data points, but then leads to a saturation in which the selection strategy is often not better than random and the trained ML model is only improving marginally.
Diversity sampling-based approaches are in general addressing this issue by trying to sample data points that are more representative of the underlying population, e.g., through:
• Discovering model-based outliers by picking data points with low activation in logits and hidden layers.
• Clustering techniques to pre-segment the data and per-cluster sampling technique.
• Representative sampling by finding items most representative of the target domain.
These approaches work only sub-optimally in matching problems due to the strong imbalance, which requires strategies tailored to sampling additional, potential matches. E.g., data points with a low activation will in most cases be non-matches, and sampling diversely from different clusters still mostly results in non-match picks.
It is an object of the present invention to improve and further develop a computer-implemented active learning method in the field of an imbalanced matching situation and a corresponding system for carrying out this method for providing a high degree of performance of the method by simple means.
In accordance with the invention, the aforementioned object is accomplished by a computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, comprising the following steps:
- executing an uncertainty-based sampling strategy on a defined set of data points;
- performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model;
- further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching to executing a long-tail sampling strategy, if such a saturation level is detected; and
- outputting a final model for prediction, if a defined budget of labels for the data points is depleted.
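The control flow of these steps can be sketched as follows. This is only an illustrative skeleton under stated assumptions: the two sampling strategies, the saturation detector, the oracle and the retraining routine are passed in as hypothetical callables, and the batch size is an assumed parameter that the text above does not define.

```python
from typing import Any, Callable, Dict, Hashable, List, Tuple

Sampler = Callable[[List[Hashable], int], List[Hashable]]


def active_learning_loop(
    pool: List[Hashable],
    budget: int,
    batch_size: int,
    uncertainty_sampler: Sampler,
    long_tail_sampler: Sampler,
    saturation_detected: Callable[[], bool],
    oracle: Callable[[Hashable], int],
    retrain: Callable[[Dict[Hashable, int]], Any],
) -> Tuple[Any, Dict[Hashable, int]]:
    """Run sampling iterations until the label budget is depleted.

    Returns the last trained model and all oracle labels collected.
    """
    remaining = list(pool)
    labels: Dict[Hashable, int] = {}
    model: Any = None
    while budget > 0 and remaining:
        # Choose the strategy for this iteration based on saturation detection.
        sampler = long_tail_sampler if saturation_detected() else uncertainty_sampler
        batch = sampler(remaining, min(batch_size, budget))
        if not batch:
            break  # the sampler has nothing left to propose
        for item in batch:
            labels[item] = oracle(item)  # each oracle label consumes budget
            remaining.remove(item)
        budget -= len(batch)
        model = retrain(labels)  # hypothetical retraining hook
    return model, labels
```

Passing the strategies as callables keeps the skeleton independent of any particular uncertainty metric or long-tail scoring function.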
Further, the aforementioned object is accomplished by a system for carrying out the computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, comprising:
- executing means for executing an uncertainty-based sampling strategy on a defined set of data points;
- performing means for performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model;
- executing means for further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching means for switching to executing means for executing a long-tail sampling strategy, if such a saturation level is detected; and
- outputting means for outputting a final model for prediction, if a defined budget of labels for the data points is depleted.
According to the invention it has been recognized that it is possible to provide a high degree of performance of the method by using different sampling strategies or methods in a smart way. It has been further recognized that simply performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model, provides a suitable moment in time during performing the whole method for switching from one sampling strategy or method to another sampling strategy or method. Further according to the invention, firstly an uncertainty-based sampling strategy is executed on a defined set of data points. Then, if such a saturation level is detected, switching from the uncertainty-based sampling strategy to a long-tail sampling strategy is executed. As a result, more potential matches between data points can be found efficiently. At the end of the method a final model for prediction is output, if a defined budget of labels for the data points is depleted.
Thus, the method and system according to the invention provide a high degree of performance of the method by simple means.
According to an embodiment of the invention the uncertainty-based sampling strategy can comprise providing a set X of candidate match pairs of data points from a first source A and a second source B. Such candidate match pairs can for example comprise ontology class pairs or entity pairs.
Within a further embodiment the uncertainty-based sampling strategy can comprise creating of a set of weak supervision signals or functions for a matching problem and outputting a set of weak supervision sources L. Such weak supervision signals can be in the form of weak supervision functions. The matching problem can comprise for example ontology, schema, entity etc.
In a further embodiment the weak supervision signals or functions can be combined in a generative process and at least one label can be created for non-abstained candidate match pairs. The input in this step can be the data pairs X and the set of weak supervision sources L. The output can be predictions and probabilities for non-abstained input data.
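The combination of weak supervision sources can be illustrated with a minimal sketch. It assumes three hypothetical labeling functions (`lf_exact_name`, `lf_token_overlap`, `lf_length_gap`) and uses a plain majority vote as a simple stand-in for the generative process, which the description does not specify further. Pairs on which every source abstains receive no weak label.

```python
ABSTAIN, NO_MATCH, MATCH = -1, 0, 1

def lf_exact_name(pair):
    """Vote MATCH when both names are identical ignoring case, else abstain."""
    a, b = pair
    return MATCH if a.lower() == b.lower() else ABSTAIN

def lf_token_overlap(pair):
    """Vote MATCH when the names share at least one token, else abstain."""
    a, b = (set(s.lower().split("_")) for s in pair)
    return MATCH if a & b else ABSTAIN

def lf_length_gap(pair):
    """Vote NO_MATCH when the name lengths differ strongly, else abstain."""
    a, b = pair
    return NO_MATCH if abs(len(a) - len(b)) > 10 else ABSTAIN

def combine(pairs, sources):
    """Majority vote over non-abstaining sources (stand-in for a generative model).

    Returns {pair_index: (label, probability_of_MATCH)} for non-abstained pairs only.
    """
    out = {}
    for i, pair in enumerate(pairs):
        votes = [v for v in (lf(pair) for lf in sources) if v != ABSTAIN]
        if not votes:
            continue  # every source abstained: no weak label for this pair
        p_match = sum(v == MATCH for v in votes) / len(votes)
        out[i] = (MATCH if p_match >= 0.5 else NO_MATCH, p_match)
    return out

# Hypothetical candidate match pairs X and weak supervision sources L.
X = [("room_temp", "Room_Temp"), ("room_temp", "temp_sensor"),
     ("door", "ventilation_unit_status"), ("door", "window")]
L = [lf_exact_name, lf_token_overlap, lf_length_gap]
y_non_abstained = combine(X, L)
```

In this toy input the last pair is left unlabeled because all three sources abstain; such pairs are only covered once the discriminative model is trained.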
According to a further embodiment the at least one label and a set of features can be used to train a discriminative machine learning model to then create predictions for all candidate matches or candidate match pairs or create a temporary classifier. The input in this step can be the input data pairs X and the predictions mentioned in the last paragraph. The output can be predictions and probabilities for all the input data.
Within a further embodiment sampled data points can be provided to an oracle, wherein the oracle can be a computer system or database.
In a further embodiment the oracle can provide at least one label for the sampled data points. The oracle can comprise a defined budget of labels.
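As an illustration of how an uncertainty-based sampling step could interact with an oracle that has a defined budget of labels, the following sketch ranks unlabeled pairs by binary entropy. The `Oracle` class, its ground-truth list and the example probabilities are hypothetical and serve only to make the example runnable.

```python
import math

def entropy(p_match):
    """Binary entropy (in bits) of a predicted match probability."""
    if p_match in (0.0, 1.0):
        return 0.0
    return -(p_match * math.log2(p_match) + (1 - p_match) * math.log2(1 - p_match))

class Oracle:
    """Hypothetical labeling source with a fixed budget of labels."""

    def __init__(self, truth, budget):
        self.truth, self.budget = truth, budget

    def label(self, indices):
        granted = indices[: self.budget]  # never exceed the remaining budget
        self.budget -= len(granted)
        return {i: self.truth[i] for i in granted}

def sample_most_uncertain(probabilities, labeled, batch_size):
    """Return the indices of the unlabeled pairs with the highest entropy."""
    candidates = [i for i in range(len(probabilities)) if i not in labeled]
    candidates.sort(key=lambda i: entropy(probabilities[i]), reverse=True)
    return candidates[:batch_size]

# Assumed match probabilities of the discriminative model for five pairs.
y_all = [0.95, 0.52, 0.10, 0.45, 0.80]
oracle = Oracle(truth=[1, 0, 0, 1, 1], budget=3)
batch = sample_most_uncertain(y_all, labeled=set(), batch_size=2)
y_oracle = oracle.label(batch)
```

The two pairs closest to a probability of 0.5 are selected first, and the oracle budget shrinks by the number of labels it hands out.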
According to a further embodiment the at least one label can be used to re-train a discriminative model or to tune a generative modeling process or to suggest/create new weak supervision sources or signals or functions. The generative modeling process can be tuned through influencing the modelled accuracy of each affected weak supervision source.
Within a further embodiment performing a detection of a defined saturation level in an improvement of a model performance can be based on validating the trained model or uncertainty-based sampling strategy with some test dataset or on a defined value of distance between the outputs of the generative model and the discriminative model. In this step the input data X and the predictions and/or probabilities for all the input data can be inputted. The output can be oracle labels/predictions for selected data points.
According to a further embodiment the long-tail sampling strategy can comprise computing a feature vector for each data point that needs to be matched. This will contribute to high performance of the method.
In a further embodiment the long-tail sampling strategy can comprise a MinHash Locality-Sensitive Hashing, LSH, based similarity computation.
In a further embodiment a MinHash signature can be created for each data point.
According to a further embodiment the signature encodes one or more feature vectors. This is a very efficient encoding of feature vectors.
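A minimal sketch of such a signature is given below. It assumes that each feature vector has already been reduced to a non-empty set of string tokens, and it uses seeded BLAKE2b hashes from the Python standard library in place of random permutations; both choices are assumptions made for illustration. The fraction of equal signature positions estimates the Jaccard similarity of the underlying sets.

```python
import hashlib

def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(tokens, num_perm=128):
    """Keep, for each seeded hash function, the minimum hash over the token set."""
    tokens = set(tokens)  # must be non-empty
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which both signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical token sets derived from two data points.
tokens_a = {"room", "temperature", "sensor", "celsius"}
tokens_b = {"room", "temperature", "setpoint", "celsius"}
sig_a = minhash_signature(tokens_a)
sig_b = minhash_signature(tokens_b)
estimate = estimate_jaccard(sig_a, sig_b)
exact = exact_jaccard(tokens_a, tokens_b)
```

The signature length (`num_perm`) trades accuracy for memory: longer signatures give a tighter estimate at a proportionally higher storage cost.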
Within a further embodiment the long-tail sampling strategy can comprise computing an informative metric for each data point based on a linear combination of uncertainty, similarity and weak label availability. This allows for a flexible and adjustable linear combination of the information contained in weak labels, model uncertainty and Jaccard similarity.
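The linear combination can be sketched as follows. The sketch assumes that weak label availability is expressed as the number of weak supervision sources that did not abstain for a data point, and the weights `alpha`, `beta` and `gamma` as well as the candidate values are chosen purely for illustration. A negative `alpha` is used here so that data points without weak labels are favoured; this sign is an assumption, not a statement of the description.

```python
def informativeness(weak_label_count, uncertainty, jaccard_similarity,
                    alpha=1.0, beta=1.0, gamma=1.0):
    """Linear combination of weak label availability, uncertainty and similarity."""
    return alpha * weak_label_count + beta * uncertainty + gamma * jaccard_similarity

# Hypothetical candidates with assumed metric inputs.
candidates = {
    "A1-B7": dict(weak_label_count=0, uncertainty=0.9, jaccard_similarity=0.8),
    "A2-B3": dict(weak_label_count=3, uncertainty=0.2, jaccard_similarity=0.9),
    "A4-B1": dict(weak_label_count=1, uncertainty=0.6, jaccard_similarity=0.5),
}

def score(key):
    return informativeness(**candidates[key], alpha=-0.5, beta=1.0, gamma=1.0)

# Most informative candidates first.
ranked = sorted(candidates, key=score, reverse=True)
```

With these weights, the pair that is uncertain, similar and not covered by any weak label is ranked first.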
Advantages and aspects of embodiments of the present invention are summarized as follows:
Embodiments can comprise an efficient two-step weakly-supervised active learning method based on the combination of:
1. The detection of an active learning saturation point for traditional, uncertainty-based sampling.
2. A long-tail focused sampling strategy tailored for matching problems, to provide more valuable matches in later stages of active learning.
Further embodiments can comprise the following steps:
Overall input data: A set of candidate match pairs of data points from Source A and Source B, e.g., ontology class pairs or entity pairs. The set is: X = {(A1, B1), (A1, B2), (A2, B3), ...}.
The method comprising the steps of
1) Creation of a set of weak supervision signals, e.g., in form of weak supervision functions, for a matching problem, e.g., ontology, schema, entity...
Output: A set of weak supervision sources L = {L1, L2, L3, ..., LN}.
2) Combination of these weak signals in a generative process, creating labels for the non-abstained candidate matches.
Input: The input data X and the set of weak supervision sources L.
Output: Predictions and probabilities for the non-abstained input data: y_non_abstained.
3) Use the labels and a set of features to train a discriminative machine learning model to then create predictions for all the candidate matches.
Input: The input data X and the predictions y_non_abstained.
Output: Predictions and probabilities for all the input data: y_all.
4) Perform saturation point detection based on the outputs of the generative and discriminative models and then, based on the outcome:
a. Perform a traditional active learning sampling strategy, e.g., entropy based.
b. Perform the long-tail focused sampling strategy.
Input: The input data X and the predictions y_all.
Output: Oracle labels/predictions for selected data points: y_oracle.
5) Finish when oracle budget is depleted.
6) Output final predictions for matches.
Embodiments of the invention provide a more efficient sampling strategy for imbalanced matching problems. With state-of-the-art methods, one needs to sample and label many more data points to achieve the same performance. This results in more effort/cost. Embodiments of the invention are able to sample more informative data points and thus are able to achieve better matching performance.
An embodiment of the present invention comprises a two-step active learning method for imbalanced matching problems, combining weak-supervision with two subsequent active learning phases, preferably for detecting difficult but commonly occurring long-tail matches.
A further embodiment of the present invention can be a weakly-supervised active learning method.
Based on embodiments of the present invention more potential matches can be found efficiently, with less time and effort compared with prior art methods. The embodiments provide a key to improve matching performance.
Embodiments of the invention can present a two-step active learning method for imbalanced matching problems, e.g., ontology, schema, entity matching, that can combine weak-supervision with two subsequent active learning phases to detect difficult but commonly occurring long-tail matches. It can use traditional uncertainty-based sampling with performance saturation detection and can then switch to a second sampling method that combines traditional uncertainty metrics, an efficient MinHash LSH based similarity computation, and the available weak supervision signals. Overall, this can achieve, among other things, a better cost-performance trade-off compared to existing methods.
There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end it is to be referred to the following explanation of examples of embodiments of the invention, illustrated by the drawing. In the drawing
Fig. 1 shows in a diagram an overview of an embodiment of the present invention,
Fig. 2 shows an algorithm sketch of an embodiment of a long-tail sampling strategy,
Fig. 3 shows a further embodiment wherein matched data is used to create a digital twin and execute optimized building control,
Fig. 4 shows a further embodiment wherein different city data is matched to enable cross-silo optimization applications and
Fig. 5 shows a further embodiment comprising creation and deployment of matchings between heterogeneous railroad device infrastructure to enable Smart Railway Management.
Embodiments of the invention present a two-phase active learning strategy that addresses the “long tail” problem of finding potentially informative matches efficiently. In the first phase, a traditional sampling strategy is executed until a saturation point in model performance improvement is detected. The method then switches to a second sampling strategy that focuses on these long-tail samples using a computationally efficient method tailored to such matching problems, which is based on a MinHash LSH based similarity computation, traditional uncertainty metrics, and the available weak supervision signals. Compared to existing works, this two-step method achieves higher performance with the same amount of labeling effort, i.e., less time and effort is needed for annotating data points to achieve equal or even better matching quality.
The overall flow of an embodiment of the invention is depicted in Fig. 1. A set of weak-supervision functions creates a set of labels for the data points in a generative process. These labels are then used to train a discriminative model, creating a temporary classifier. The temporary classifier then labels the data. As long as labeling budget is available, active learning iterations are performed as follows: First, a saturation detection is performed to decide whether default, uncertainty-based sampling strategies are still sufficient. Then either such a default sampling strategy is executed or, if saturation has been reached, a long-tail sampling strategy is performed. In either case, the sampled data points are provided to the oracle, such as a computer system or database, which then provides labels for the sampled data points. These new, annotated labels can be used 1.) directly to retrain the discriminative model, 2.) to tune the generative modeling process, through influencing the modelled accuracy of each affected weak supervision source, and 3.) as an extension, to suggest/create new weak-supervision sources.
When the defined labeling budget has been depleted, the last trained discriminative classifier is outputted. This classifier is then used as final model for prediction.
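The control flow described above can be sketched as a loop that checks for saturation before each iteration and stops when the labeling budget is depleted. All callables passed to `run_active_learning` are hypothetical placeholders, and the one-way switch from the uncertainty phase to the long-tail phase is an assumption of this sketch; the stubs in the usage part exist only to make the example runnable.

```python
def run_active_learning(pool, budget, batch_size, train, uncertainty_sampler,
                        long_tail_sampler, is_saturated, oracle):
    """Two-phase active learning loop; returns the last trained model."""
    labels = {}
    model = train(labels)
    phase, history = "uncertainty", []
    while budget > 0 and len(labels) < len(pool):
        # Saturation detection decides whether to leave the default strategy.
        if phase == "uncertainty" and is_saturated(history):
            phase = "long_tail"
        sampler = uncertainty_sampler if phase == "uncertainty" else long_tail_sampler
        batch = sampler(model, pool, labels)[:min(batch_size, budget)]
        if not batch:
            break
        for item in batch:
            labels[item] = oracle(item)  # oracle provides labels for sampled points
        budget -= len(batch)
        model = train(labels)            # retrain with the new annotations
        history.append(phase)
    return model, labels, history

# Toy stubs: the "model" is just a copy of the labels collected so far.
pool = list(range(10))
model, labels, history = run_active_learning(
    pool, budget=5, batch_size=1,
    train=lambda labels: dict(labels),
    uncertainty_sampler=lambda m, pool, labels: [i for i in pool if i not in labels],
    long_tail_sampler=lambda m, pool, labels: [i for i in reversed(pool) if i not in labels],
    is_saturated=lambda history: len(history) >= 2,
    oracle=lambda i: i % 2,
)
```

In the toy run, two iterations use the default sampler, after which the stubbed saturation check fires and the remaining budget is spent by the long-tail sampler.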
In the following, two core building blocks of embodiments, Saturation Detection and Long-Tail Sampling Strategy, are described in more detail:
Saturation Detection:
The simplest way to detect saturation would be to validate the trained model with some test dataset. However, obtaining a representative test dataset is costly, and active learning is often applied precisely in scenarios where users want to avoid collecting large, labeled datasets.
Active learning stopping mechanisms have been developed that rely on the overall, average uncertainty, based on some uncertainty measure such as entropy or margin uncertainty [1]. This approach is, however, model dependent and has shown instability, especially in multiclass settings [2]. Embodiments of the present invention propose a different method, which exploits the fact that there are two prediction outputs: 1.) the predictions of the generative model and 2.) the predictions of the discriminative model. The insight is to implement a stopping mechanism based on the overall distance between these two outputs, e.g., using the sum of squared residuals. The initial sampling strategy is then stopped when the change in this distance between successive active learning iterations becomes small enough, as defined by some parameter, and the method switches to the long-tail sampling strategy.
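A minimal sketch of this stopping mechanism is shown below. It assumes that both models output match probabilities for the same pairs (in practice only the pairs for which the generative model did not abstain could be compared), and the parameter name `epsilon` and the example values are hypothetical.

```python
def sum_squared_residuals(p_generative, p_discriminative):
    """Overall distance between the two prediction outputs."""
    return sum((g - d) ** 2 for g, d in zip(p_generative, p_discriminative))

def is_saturated(distances, epsilon):
    """True when the distance changed by less than epsilon between the last two iterations."""
    if len(distances) < 2:
        return False
    return abs(distances[-1] - distances[-2]) < epsilon

# Assumed outputs: one generative prediction and three successive
# discriminative predictions, one per active learning iteration.
generative = [0.9, 0.2, 0.7]
discriminative_per_iteration = [
    [0.50, 0.50, 0.50],
    [0.80, 0.30, 0.60],
    [0.82, 0.28, 0.62],
]
distances = [sum_squared_residuals(generative, d) for d in discriminative_per_iteration]
switch_after_second = is_saturated(distances[:2], epsilon=0.05)
switch_after_third = is_saturated(distances, epsilon=0.05)
```

Here the distance drops sharply between the first two iterations and barely changes afterwards, so the switch to the long-tail strategy would be triggered after the third iteration.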
Long-Tail Sampling Strategy:
The following approach is proposed for sampling informative data points for long-tail examples, see also Algorithm 1 for a sketch of an implementation of its core functions:
1. Compute a feature vector for each data point, e.g., a class in an ontology, a data record in a database. Such a feature vector may employ word and sentence embedding models for text attributes or apply data augmentation techniques to augment the raw data.
Output: A feature vector for each data point that needs to be matched.
2. Create MinHash signatures [3] for each data point. These signatures encode the feature vector efficiently. With these MinHash signatures, the Jaccard similarity between combinations of data points can be estimated quickly, in linear time. The Jaccard similarity of two feature sets $S_a$ and $S_b$ is:
$J(S_a, S_b) = \frac{|S_a \cap S_b|}{|S_a \cup S_b|}$
3. While oracle budget is available, start from a defined similarity_threshold parameter (e.g., start from 1, the highest possible Jaccard similarity):
a. Create an LSH index with the defined similarity threshold, which impacts the chosen number and size of the bands used.
b. Insert all MinHash signatures from dataset A, using the data point ID as key.
c. Query with all signatures from dataset B. This will return all matches above the defined similarity threshold and only requires constant time for each query (thus linear overall).
d. Filter the results by removing data points already annotated by the oracle in previous iterations.
e. Compute an informative metric for each data point based on a linear combination of uncertainty, similarity and weak label availability:
$I(x) = \alpha \cdot |\mathrm{weak\_label}| + \beta \cdot \mathrm{uncertainty} + \gamma \cdot \mathrm{jaccard\_similarity}$
This allows for a flexible and adjustable (based on the parameters $\alpha$, $\beta$, $\gamma$) linear combination of the information contained in weak labels, model uncertainty and Jaccard similarity.
f. Sort the data points based on the computed informative metric.
g. While oracle budget is available:
i. Pick a batch of the most informative samples to ask the oracle.
ii. If all samples have been provided to the oracle, or the LSH query has not returned any samples, reduce the similarity_threshold by a defined stepsize parameter. This will loosen the Jaccard similarity restriction and thus return more candidates for active sampling.
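The threshold relaxation of step 3 can be sketched with a small banded LSH index written in plain Python. This is not Algorithm 1 of Fig. 2. The sketch assumes token sets as input, a signature length of 64, a simple heuristic (`choose_bands`) for mapping the similarity threshold to a number of bands and rows, and the estimated Jaccard similarity as a stand-in for the informative metric; all function and parameter names are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (assumed value)

def minhash(tokens, num_perm=NUM_PERM):
    """MinHash signature of a non-empty token set, one seeded hash per position."""
    def h(token, seed):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                                 salt=seed.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little")
    return tuple(min(h(t, s) for t in tokens) for s in range(num_perm))

def choose_bands(threshold, num_perm=NUM_PERM):
    """Pick (bands, rows) whose S-curve midpoint (1/bands)**(1/rows) is closest to threshold."""
    best = None
    for rows in range(1, num_perm + 1):
        bands = num_perm // rows
        gap = abs((1.0 / bands) ** (1.0 / rows) - threshold)
        if best is None or gap <= best[0]:  # on ties prefer more rows per band
            best = (gap, bands, rows)
    return best[1], best[2]

def lsh_matches(sigs_a, sigs_b, threshold):
    """Index dataset A by band, query with dataset B, keep pairs above the threshold."""
    bands, rows = choose_bands(threshold)
    index = defaultdict(set)
    for a_id, sig in sigs_a.items():                 # step b: insert dataset A
        for k in range(bands):
            index[(k, sig[k * rows:(k + 1) * rows])].add(a_id)
    matches = {}
    for b_id, sig in sigs_b.items():                 # step c: query with dataset B
        for k in range(bands):
            for a_id in index.get((k, sig[k * rows:(k + 1) * rows]), ()):
                est = sum(x == y for x, y in zip(sigs_a[a_id], sig)) / len(sig)
                if est >= threshold:                 # drop banding false positives
                    matches[(a_id, b_id)] = est
    return matches

def long_tail_batches(sigs_a, sigs_b, budget, batch_size, start=1.0, stepsize=0.1):
    """Yield (threshold, batch) pairs, relaxing the threshold when candidates run out."""
    asked, threshold = set(), start
    while budget > 0 and threshold > 0:
        found = lsh_matches(sigs_a, sigs_b, threshold)
        fresh = sorted((p for p in found if p not in asked),  # step d: skip annotated pairs
                       key=found.get, reverse=True)           # steps e-f: score stand-in
        while fresh and budget > 0:                           # step g.i
            n = min(batch_size, budget)
            batch, fresh = fresh[:n], fresh[n:]
            asked.update(batch)
            budget -= len(batch)
            yield threshold, batch
        threshold = round(threshold - stepsize, 10)           # step g.ii

# Hypothetical token sets for data points of dataset A and dataset B.
A = {"A1": {"room", "temperature", "sensor"}, "A2": {"door", "access", "log"}}
B = {"B1": {"room", "temperature", "sensor"},
     "B2": {"door", "access", "log", "event", "entry"},
     "B3": {"co2"}}
sigs_a = {k: minhash(v) for k, v in A.items()}
sigs_b = {k: minhash(v) for k, v in B.items()}
batches = list(long_tail_batches(sigs_a, sigs_b, budget=2, batch_size=1))
```

In the toy data, the identical pair is returned at the initial threshold of 1, while the partially overlapping pair only appears after the threshold has been loosened. A production system would more likely rely on an established LSH library than on this hand-written index.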
Further Embodiment 1: Match Building Related Data for Digital Twin and Smart Building Applications
Buildings account for roughly 40% of all energy consumed in developed countries. Non-residential buildings, such as office buildings, account for a large amount of that energy. These buildings usually consist of several sub-systems, such as for lighting, heating, ventilation and air conditioning, HVAC, access control, room booking etc. For each of these sub-systems, different data schemas/ontologies exist that usually originate in the respective sub-systems and might thus use different naming conventions.
In the architecture, engineering and construction/facility management industry, there exists a common standard for the representation of Building Information Models, BIM, namely the Industry Foundation Classes, IFC, standard. As the BIM represented in IFC contains various in-depth information about the building, including its structure and materials, it can serve as a good basis towards a common data model of a building digital twin. In this digital twin, the detailed IFC information can be used to perform building simulations and as input to optimally actuate the building components. It can also be used to calculate building carbon emissions and develop strategies to reduce these emissions.
The problem is that the underlying data schemas/ontologies are not aligned, and thus a digital twin would need to be constructed individually for each building, at the expense of many man-hours. It could be shown that weakly-supervised ontology and schema matching shows great potential for such problems but suffers when the labeling functions do not cover the specific matching problems of the building.
On the basis of embodiments of the present invention, active learning can be added effectively, discovering matches beyond the ones covered by the knowledge of the labeling functions. Embodiments of the present invention will accelerate the integration and with little effort enable a digital twin construction and a more optimal operation of the physical building twin.
Further Embodiment 2: Use to match smart city data for Super Smart City Applications
In cities, data is generated independently by different departments and even private businesses. To create a common data space between these data producers, their non-integrated data needs to be transferred to a common data model. A key challenge here is to provide automated matching on schema and entity level with little available machine learning training data. Enabling such automated matching is greatly accelerated with embodiments of this invention. The integrated data space can then be used for many purposes that affect the city and its operation. For example, data from businesses, current events, public transport, and deployed transport sensing infrastructure can be used jointly to optimize traffic flow in the city, e.g., by adjusting traffic lights or deploying more buses.
Further Embodiment 3: Use to match data schemas between devices for smart railway management
Smart railways are usually managed with a large number of heterogeneous devices from various manufacturers and suppliers. For better system efficiency and automation, those devices very often need to interact with each other, either directly or indirectly via some intermediate data processing. To enable such interoperability between devices, domain experts must spend a lot of effort and time on mapping their data schemas. This is a tedious and time-consuming task. Also, it is not a one-time task. As old devices become defective and get replaced with new devices over time, and the same device might come from different vendors, domain experts must always be engaged in this data integration loop to ensure the correctness of the data schema mapping. Embodiments of the present invention could help domain experts to map the data schemas between devices more efficiently and faster.
Extension 1:
As an extension/alternative use, embodiments of the long-tail sampling strategy can also be used as the only sampling strategy, fully replacing traditional sampling strategies. This can make sense either when the weak supervision sources are representative and accurate, so that there is no (or not much) room for improvement, or when, instead of weak-supervision signals, the ML model is trained with a representative set of labeled data points.
Extension 2:
The oracle might also provide new weak-supervision signals or the oracle’s annotations might be used to generate new weak-supervision signals or tune existing signals.
Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, comprising the following steps:
- executing an uncertainty-based sampling strategy on a defined set of data points;
- performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model;
- further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching to executing a long-tail sampling strategy, if such a saturation level is detected; and
- outputting a final model for prediction, if a defined budget of labels for the data points is depleted.
2. The method according to claim 1, wherein the uncertainty-based sampling strategy comprises providing a set of candidate match pairs of data points from a first source (A) and a second source (B).
3. The method according to claim 1 or 2, wherein the uncertainty-based sampling strategy comprises creating of a set of weak supervision signals or functions for a matching problem and outputting a set of weak supervision sources (L).
4. The method according to claim 3, wherein the weak supervision signals or functions are combined in a generative process and at least one label is created for non-abstained candidate match pairs.
5. The method according to claim 4, wherein the at least one label and a set of features are used to train a discriminative machine learning model to then create predictions for all candidate matches or candidate match pairs or create a temporary classifier.
6. The method according to any one of claims 1 to 5, wherein sampled data points are provided to an oracle, wherein the oracle can be a computer system or database.
7. The method according to claim 6, wherein the oracle provides at least one label for the sampled data points.
8. The method according to claim 7, wherein the at least one label is used to retrain a discriminative model or to tune a generative modeling process or to suggest/create new weak supervision sources or signals or functions.
9. The method according to any one of claims 1 to 8, wherein performing a detection of a defined saturation level in an improvement of a model performance is based on validating the trained model or uncertainty-based sampling strategy with some test dataset or on a defined value of distance between the outputs of the generative model and the discriminative model.
10. The method according to any one of claims 1 to 9, wherein the long-tail sampling strategy comprises computing a feature vector for each data point that needs to be matched.
11. The method according to any one of claims 1 to 10, wherein the long-tail sampling strategy comprises a MinHash Locality-Sensitive Hashing, LSH, based similarity computation.
12. The method according to any one of claims 1 to 11, wherein a MinHash signature is created for each data point.
13. The method according to claim 12, wherein the signature encodes one or more feature vectors.
14. The method according to any one of claims 1 to 13, wherein the long-tail sampling strategy comprises computing of informative metric for each data point based on a linear combination of uncertainty, similarity and weak label availability.
15. A system for carrying out the computer-implemented active learning method in the field of an imbalanced matching situation, according to any one of claims 1 to 14, for finding at least one relationship between data points in different data sets, comprising:
- executing means for executing an uncertainty-based sampling strategy on a defined set of data points;
- performing means for performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model;
- executing means for further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching means for switching to executing means for executing a long-tail sampling strategy, if such a saturation level is detected; and
- outputting means for outputting a final model for prediction, if a defined budget of labels for the data points is depleted.
PCT/EP2022/076440 2022-04-29 2022-09-22 A computer-implemented active learning method in the field of an imbalanced matching situation and a corresponding system WO2023208396A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP22170719.3 2022-04-29
EP22170719 2022-04-29

Publications (1)

Publication Number Publication Date
WO2023208396A1 (en) 2023-11-02

Family

ID=81448838

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/076440 WO2023208396A1 (en) 2022-04-29 2022-09-22 A computer-implemented active learning method in the field of an imbalanced matching situation and a corresponding system

Country Status (1)

Country Link
WO (1) WO2023208396A1 (en)

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
AJAY J JOSHI ET AL: "Coverage optimized active learning for k - NN classifiers", ROBOTICS AND AUTOMATION (ICRA), 2012 IEEE INTERNATIONAL CONFERENCE ON, IEEE, 14 May 2012 (2012-05-14), pages 5353 - 5358, XP032194649, ISBN: 978-1-4673-1403-9, DOI: 10.1109/ICRA.2012.6225054 *
BRODER, ANDREI Z.: "Annual Symposium on Combinatorial Pattern Matching", 2000, SPRINGER, article "Identifying and filtering near-duplicate documents", pages: 1 - 10
LAWS, F.; SCHUTZE, H.: "Stopping criteria for active learning of named entity recognition", PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS, COLING, August 2008 (2008-08-01), pages 465 - 472, XP058104428
RAHUL VIGNESWARAN; MARC T. LAW; VINEETH N. BALASUBRAMANIAN; MAKARAND TAPASWI, FEATURE GENERATION FOR LONG-TAIL CLASSIFICATION, 10 November 2021 (2021-11-10)
RUHLING CACHAY, SALVA; BENEDIKT BOECKING; ARTUR DUBRAWSKI: "End-to-End Weak Supervision", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, vol. 34, 2021
SHYAMGOPAL KARTHIK; JEROME REVAUD; CHIDLOVSKII BORIS, LEARNING FROM LONG-TAILED DATA WITH NOISY LABELS, 12 September 2021 (2021-09-12)
STEPHEN MUSSMANN; ROBIN JIA; PERCY LIANG: "On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks", FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: EMNLP, 2020, pages 3400 - 3413
SUNITA SARAWAGI; ANURADHA BHAMIDIPATY, INTERACTIVE DEDUPLICATION USING ACTIVE LEARNING, 2002
VLACHOS, ANDREAS: "A stopping criterion for active learning", COMPUTER SPEECH & LANGUAGE, vol. 22, no. 3, 2008, pages 295 - 312

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22786036

Country of ref document: EP

Kind code of ref document: A1