WO2023208396A1

WO2023208396A1 - A computer-implemented active learning method in the field of an imbalanced matching situation and a corresponding system

Info

Publication number: WO2023208396A1
Application number: PCT/EP2022/076440
Authority: WO
Inventors: Jonathan Fuerst; Bin Cheng
Original assignee: NEC Laboratories Europe GmbH
Priority date: 2022-04-29
Filing date: 2022-09-22
Publication date: 2023-11-02

Abstract

For providing a high degree of performance of a computer-implemented active learning method by simple means a computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, is provided, comprising the following steps: executing an uncertainty-based sampling strategy on a defined set of data points; performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model; further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching to executing a long-tail sampling strategy, if such a saturation level is detected; and outputting a final model for prediction, if a defined budget of labels for the data points is depleted. Further, a corresponding system for carrying out the computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, is provided, comprising: executing means for executing an uncertainty-based sampling strategy on a defined set of data points; performing means for performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model; executing means for further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching means for switching to executing means for executing a long-tail sampling strategy, if such a saturation level is detected; and outputting means for outputting a final model for prediction, if a defined budget of labels for the data points is depleted.

Description

A COMPUTER-IMPLEMENTED ACTIVE LEARNING METHOD IN THE FIELD OF AN IMBALANCED MATCHING SITUATION AND A CORRESPONDING SYSTEM

The present invention relates to a computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets.

Further, the present invention relates to a corresponding system for carrying out the computer-implemented active learning method.

Corresponding prior art documents are listed as follows:

[1] Vlachos, Andreas. "A stopping criterion for active learning." Computer Speech & Language 22, no. 3, 2008: 295-312.

[2] Laws, F. and Schutze, H., 2008, August. Stopping criteria for active learning of named entity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics, Coling, 2008, pp. 465-472.

[3] Broder, Andrei Z. "Identifying and filtering near-duplicate documents." In Annual Symposium on Combinatorial Pattern Matching, pp. 1-10. Springer, Berlin, Heidelberg, 2000.

[4] Ruhling Cachay, Salva, Benedikt Boecking, and Artur Dubrawski. "End-to-End Weak Supervision." Advances in Neural Information Processing Systems 34, 2021.

[5] Stephen Mussmann, Robin Jia, and Percy Liang. 2020. On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks. In Findings of the Association for Computational Linguistics: EMNLP 2020. 3400-3413.

Further prior art documents:

Sunita Sarawagi, Anuradha Bhamidipaty. “Interactive Deduplication using Active Learning”, 2002. This document discloses an interactive deduplication method using an active learning approach and describes that the method includes the creation of weak supervision signals as noisy data input to a generative classifier model and the generative classifier model generates labels for the corresponding noisy data. The said labels and the output of the generative model are transferred to train the discriminative model to generate predictions. Furthermore, the uncertainty or saturation point is detected based on a comparison of the outputs of both the classifiers and then weighted sampling is performed. The final output of the last discriminative model is used for further predictions.

Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, Makarand Tapaswi. “Feature Generation for Long-tail Classification”, November 10, 2021. This document discloses a feature generation method for long-tail classification using a deep learning approach and describes that the method includes the creation of input signals/samples as data input to a generative classifier model and the generative classifier model generates labels for the corresponding input data samples. The said labels and the output of the generative model are transferred to train the discriminative model to generate long-tail classification predictions. Furthermore, the document discloses a two-stage sampling approach including instance sampling and balanced sampling referred to as long-tail sampling.

Shyamgopal Karthik, Jerome Revaud, Chidlovskii Boris. “Learning From Long-Tailed Data With Noisy Labels”, September 12, 2021. This document discloses a learning approach for long-tailed data with noisy labels and describes that the method includes the creation of input signals/samples as noisy data input to the first - generative classifier - model and the respective model generates labels for the corresponding input data samples. The said labels and the output of the first model are transferred to train the discriminative model to generate long-tail classification predictions. Furthermore, the document discloses a two-stage sampling approach including instance sampling and long-tail sampling.

In data integration/preparation, users are faced with several matching problems, from schema and ontology matching to entity matching/resolution. In essence, in these problems users are interested in finding same-as/equal relationships between elements in disjoint datasets. E.g., users might need to integrate various data sources related to a non-residential building, such as Industry Foundation Classes, IFC, building management system, BMS, data, energy management system, EMS, data, facility management system, FMS, calendar entries, OpenStreetMap, security system etc., to realize some novel application, such as advanced energy/carbon saving applications. This integration scenario involves matching different data schemas/ontologies and subsequently matching records that belong to the same real-world entity, e.g., records that refer to the same room in the building. To solve these matching problems, others have applied various solutions, from heuristic and unsupervised approaches, e.g., based on matching heuristics and lexical similarity, to increasingly supervised machine learning, ML, and most recently deep learning based solutions, which tend to outperform previous works. However, one problem with supervised machine learning is the need for large, labeled datasets. One way to reduce this labeling bottleneck is Active Learning.

Active Learning describes ML methods which actively query an oracle, e.g., a system, expert user, database etc., to label data points. These labeled data points are then used to train a supervised ML model. As the oracle represents a bottleneck in terms of cost/effort, it is of paramount importance to select the “most informative” samples for labeling, as opposed to just random sampling. The most informative samples are usually the ones which the trained model is most uncertain about, i.e., they are at the decision boundary. Different metrics have been proposed to quantify this uncertainty, e.g.:

• Least Confidence Uncertainty

Based on the difference between the most confident class prediction and 100% confidence, selecting the sample which has the lowest probability.

• Smallest Margin Uncertainty/Margin of confidence

Select based on the smallest difference/margin between the two topmost predictions.

• Largest Margin Uncertainty

Select based on the smallest difference/margin between the top and the bottom prediction.

• Ratio of Uncertainty

Based on the ratio of the top two predicted classes.

• Entropy Reduction

Entropy is the basis for mutual information, which quantifies the relation between two things, and the basis for relative entropy (Kullback-Leibler distance) and cross entropy. Higher values of entropy indicate greater uncertainty.

$\mathrm{Entropy} = -\sum_{x} p(x) \log p(x)$
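The uncertainty metrics listed above can be illustrated with a minimal Python sketch. The function names and the example probabilities are assumptions chosen for illustration only; the source does not prescribe a particular implementation.

```python
import math


def least_confidence(probs):
    """Difference between 100% confidence and the most confident prediction."""
    return 1.0 - max(probs)


def smallest_margin(probs):
    """Margin between the two topmost predictions (small margin = uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def largest_margin(probs):
    """Margin between the top and the bottom prediction (small margin = uncertain)."""
    return max(probs) - min(probs)


def ratio_of_uncertainty(probs):
    """Ratio of the top two predicted classes (close to 1 = uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top / second if second > 0 else math.inf


def entropy(probs):
    """Shannon entropy; higher values indicate greater uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical [P(match), P(non-match)] outputs for three candidate pairs.
candidates = {
    "pair_1": [0.98, 0.02],
    "pair_2": [0.55, 0.45],
    "pair_3": [0.80, 0.20],
}

# Entropy-based sampling picks the pair closest to the decision boundary.
most_uncertain = max(candidates, key=lambda k: entropy(candidates[k]))
```

In this sketch, the pair whose probabilities lie closest to 0.5 is the one selected for labeling by the oracle.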

Active learning can be bootstrapped with a random sampling strategy, which is not efficient for imbalanced datasets such as the described matching problems, where only < 0.1 % of all combinations are matches [5]. Thus, to bootstrap active learning faster and with less effort, weak supervision has been proposed [4]. Weak supervision provides an initial set of noisy weak outputs in this context. Active learning is then used to improve the quality of these noisy weak outputs by providing high-quality, i.e., highly accurate, labels for a small set of informative samples. However, the problem with using existing uncertainty measures in this scenario is that the query selection strategy is biased towards the provided weak-supervision signals. This is especially an issue for imbalanced classification problems. For example, in entity matching, weak supervision signals, e.g., encoded in data programming-style labeling functions, might only use a subset of attributes of the entity in their matching heuristics, i.e., to decide if a combination is a match or a non-match.

If an unused attribute is however a strong discriminator for the matching task, users might miss the affected entity pairs and consider them non-matches based on the other attributes. These false negatives are unfortunately also unlikely to be found through training a discriminative ML model. This is because the potentially discriminative features will be deemed not discriminative with the provided labeled weak-supervision data. Even worse, this leads to the problem that data samples appear to have a low uncertainty, as they are considered clear non-matches based on the labeling functions and features used, and are thus not likely to be selected with priority by the active learning sampling strategy. In practical applications of active learning, this observed behavior results in a quick improvement during the first iterations, i.e., the active learning query strategy is selecting valuable data points, followed by a saturation in which the selection strategy is often not better than random and the trained ML model is only improving marginally.

Diversity sampling-based approaches are in general addressing this issue by trying to sample data points that are more representative of the underlying population, e.g., through:

• Discovering model-based outliers by picking data points with low activation in logits and hidden layers.

• Clustering techniques to pre-segment the data and per-cluster sampling technique.

• Representative sampling by finding items most representative of the target domain.

These approaches work only sub-optimally in matching problems due to the strong imbalance, which requires strategies tailored to sampling additional, potential matches. E.g., data points with a low activation will in most cases be non-matches, and sampling diversely from different clusters still mostly results in non-match picks.

It is an object of the present invention to improve and further develop a computer-implemented active learning method in the field of an imbalanced matching situation and a corresponding system for carrying out this method for providing a high degree of performance of the method by simple means.

In accordance with the invention, the aforementioned object is accomplished by a computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, comprising the following steps:

- executing an uncertainty-based sampling strategy on a defined set of data points;

- performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model;

- further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching to executing a long-tail sampling strategy, if such a saturation level is detected; and

- outputting a final model for prediction, if a defined budget of labels for the data points is depleted.

Further, the aforementioned object is accomplished by a system for carrying out the computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, comprising:

- executing means for executing an uncertainty-based sampling strategy on a defined set of data points;

- performing means for performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model;

- executing means for further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching means for switching to executing means for executing a long-tail sampling strategy, if such a saturation level is detected; and

- outputting means for outputting a final model for prediction, if a defined budget of labels for the data points is depleted.

According to the invention it has been recognized that it is possible to provide a high degree of performance of the method by using different sampling strategies or methods in a smart way. It has been further recognized that simply performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model, provides a suitable moment in time during performing the whole method for switching from one sampling strategy or method to another sampling strategy or method. Further according to the invention, firstly an uncertainty-based sampling strategy is executed on a defined set of data points. Then, if such a saturation level is detected, switching from the uncertainty-based sampling strategy to a long-tail sampling strategy is executed. As a result, more potential matches between data points can be found efficiently. At the end of the method a final model for prediction is output, if a defined budget of labels for the data points is depleted.

Thus, the method and system according to the invention provide a high degree of performance of the method by simple means.

According to an embodiment of the invention the uncertainty-based sampling strategy can comprise providing a set X of candidate match pairs of data points from a first source A and a second source B. Such candidate match pairs can for example comprise ontology class pairs or entity pairs.

Within a further embodiment the uncertainty-based sampling strategy can comprise creating of a set of weak supervision signals or functions for a matching problem and outputting a set of weak supervision sources L. Such weak supervision signals can be in the form of weak supervision functions. The matching problem can comprise for example ontology, schema, entity etc.

In a further embodiment the weak supervision signals or functions can be combined in a generative process and at least one label can be created for non-abstained candidate match pairs. The input in this step can be the data pairs X and the set of weak supervision sources L. The output can be predictions and probabilities for non-abstained input data.
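The combination of weak supervision functions described above can be sketched as follows. This is only an illustration under stated assumptions: the labeling functions, the record attributes (`name`, `floor`), the per-source accuracies and the accuracy-weighted vote are all hypothetical, and the weighted vote is a simple stand-in for the generative process, which the source does not specify in detail.

```python
ABSTAIN, NON_MATCH, MATCH = -1, 0, 1  # assumed label encoding


def lf_same_name(a, b):
    """Hypothetical weak source: identical names (ignoring case) indicate a match."""
    if not a.get("name") or not b.get("name"):
        return ABSTAIN
    return MATCH if a["name"].lower() == b["name"].lower() else ABSTAIN


def lf_different_floor(a, b):
    """Hypothetical weak source: different floors indicate a non-match."""
    if a.get("floor") is None or b.get("floor") is None:
        return ABSTAIN
    return NON_MATCH if a["floor"] != b["floor"] else ABSTAIN


def combine(pair, sources):
    """Accuracy-weighted vote over all non-abstaining sources.

    Returns (label, P(match)), or None if every source abstains.
    """
    a, b = pair
    weight_match = weight_non_match = 0.0
    for labeling_function, accuracy in sources:
        vote = labeling_function(a, b)
        if vote == MATCH:
            weight_match += accuracy
        elif vote == NON_MATCH:
            weight_non_match += accuracy
    total = weight_match + weight_non_match
    if total == 0:
        return None
    p_match = weight_match / total
    return (MATCH if p_match >= 0.5 else NON_MATCH), p_match


# Candidate match pairs X and weak supervision sources L (with assumed accuracies).
X = [
    ({"name": "Room 101", "floor": 1}, {"name": "room 101", "floor": 1}),
    ({"name": "Room 101", "floor": 1}, {"name": "Lobby", "floor": 0}),
    ({"name": "Room 7", "floor": None}, {"name": "R7", "floor": None}),
]
L = [(lf_same_name, 0.9), (lf_different_floor, 0.7)]

y_non_abstained = {}
for index, pair in enumerate(X):
    result = combine(pair, L)
    if result is not None:
        y_non_abstained[index] = result
```

The third pair receives no weak label because both sources abstain. Such pairs are only covered later by the discriminative model, which is trained on the labels of the non-abstained pairs.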

According to a further embodiment the at least one label and a set of features can be used to train a discriminative machine learning model to then create predictions for all candidate matches or candidate match pairs or create a temporary classifier. The input in this step can be the input data pairs X and the predictions mentioned in the last paragraph. The output can be predictions and probabilities for all the input data.

Within a further embodiment sampled data points can be provided to an oracle, wherein the oracle can be a computer system or database.

In a further embodiment the oracle can provide at least one label for the sampled data points. The oracle can comprise a defined budget of labels.

According to a further embodiment the at least one label can be used to re-train a discriminative model or to tune a generative modeling process or to suggest/create new weak supervision sources or signals or functions. The generative modeling process can be tuned through influencing the modelled accuracy of each affected weak supervision source.

Within a further embodiment performing a detection of a defined saturation level in an improvement of a model performance can be based on validating the trained model or uncertainty-based sampling strategy with some test dataset or on a defined value of distance between the outputs of the generative model and the discriminative model. In this step the input data X and the predictions and/or probabilities for all the input data can be inputted. The output can be oracle labels/predictions for selected data points.

According to a further embodiment the long-tail sampling strategy can comprise computing a feature vector for each data point that needs to be matched. This will contribute to high performance of the method.

In a further embodiment the long-tail sampling strategy can comprise a MinHash Locality-Sensitive Hashing, LSH, based similarity computation.

In a further embodiment a MinHash signature can be created for each data point.

According to a further embodiment the signature encodes one or more feature vectors. This is a very efficient encoding of feature vectors.
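A minimal sketch of how a MinHash signature can encode a data point and be used to estimate Jaccard similarity is given below. It assumes that the feature vector of a data point is represented as a non-empty set of tokens, and it uses seeded SHA-1 digests as an illustrative hash family; both choices, as well as all names and example tokens, are assumptions and not taken from the source.

```python
import hashlib


def _hash(token, seed):
    """One member of a seeded hash family, built from SHA-1 (illustrative choice)."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_perm=128):
    """MinHash signature: the minimum hash value per seeded hash function.

    `tokens` must be a non-empty collection.
    """
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity: size of the intersection over size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical token sets for one record from each of two sources.
record_a = {"meeting", "room", "101", "floor", "1"}
record_b = {"meeting", "room", "101", "level", "1"}

sig_a = minhash_signature(record_a)
sig_b = minhash_signature(record_b)

estimated = estimate_jaccard(sig_a, sig_b)
exact = exact_jaccard(record_a, record_b)  # 4 shared tokens out of 6 distinct ones
```

The signature has a fixed length regardless of the size of the token set, and the estimate becomes more accurate as `num_perm` grows.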

Within a further embodiment the long-tail sampling strategy can comprise computing an informative metric for each data point based on a linear combination of uncertainty, similarity and weak label availability. This allows for a flexible and adjustable linear combination of information contained in weak labels, model uncertainty and Jaccard similarity.
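The linear combination can be sketched in a few lines of Python. The weights `alpha`, `beta` and `gamma`, the example values, and the encoding of the weak label as +1 (match), -1 (non-match) or 0 (abstained), so that its absolute value expresses weak label availability, are assumptions made for this illustration.

```python
def informative_metric(weak_label, uncertainty, jaccard_similarity,
                       alpha=1.0, beta=1.0, gamma=1.0):
    """Linear combination of weak label availability, uncertainty and similarity."""
    return (alpha * abs(weak_label)
            + beta * uncertainty
            + gamma * jaccard_similarity)


# Hypothetical candidates: (weak_label, uncertainty, jaccard_similarity).
candidates = {
    ("a1", "b1"): (0, 0.2, 0.9),
    ("a2", "b3"): (1, 0.6, 0.4),
    ("a4", "b2"): (-1, 0.1, 0.3),
}

# Sort candidates by the metric, most informative first.
ranked = sorted(
    candidates,
    key=lambda pair: informative_metric(*candidates[pair], alpha=0.5),
    reverse=True,
)

# The highest-ranked candidates form the next batch for the oracle.
batch = ranked[:2]
```

Adjusting the three weights shifts the selection between pairs that already carry a weak label, pairs the model is unsure about, and pairs that are merely similar.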

Advantages and aspects of embodiments of the present invention are summarized as follows:

Embodiments can comprise an efficient two-step weakly-supervised active learning method based on the combination of:

1. The detection of an active learning saturation point for traditional, uncertainty-based sampling.

2. A long-tail focused sampling strategy tailored for matching problems, to provide more valuable matches in later stages of active learning.

Further embodiments can comprise the following steps:

Overall input data: A set of candidate match pairs of data points from Source A and Source B, e.g., ontology class pairs or entity pairs. The set is: X = {(A₁,B₁), (A₁,B₂), (A₂,B₃), ...}.

The method comprising the steps of

1) Creation of a set of weak supervision signals, e.g., in form of weak supervision functions, for a matching problem, e.g., ontology, schema, entity...

Output: A set of weak supervision sources L = {L₁, L₂, L₃, ..., L_N}.

2) Combination of these weak signals in a generative process, creating labels for the non-abstained candidate matches.

Input: The input data X and the set of weak supervision sources L.

Output: Predictions and probabilities for the non-abstained input data: y_non_abstained.

3) Use the labels and a set of features to train a discriminative machine learning model to then create predictions for all the candidate matches.

Input: The input data X and the predictions y_non_abstained.

Output: Predictions and probabilities for all the input data: y_all.

4) Perform saturation point detection based on outputs of generative and discriminative model and then based on outcome:

a. Perform traditional active learning sampling strategy, e.g., entropy based.

b. Perform long-tail focused sampling strategy.

Input: The input data X and the predictions y_all.

Output: Oracle labels/predictions for selected data points: y_oracle.

5) Finish when oracle budget is depleted.

6) Output final predictions for matches.

Embodiments of the invention provide a more efficient sampling strategy for imbalanced matching problems. With state-of-the-art methods, one needs to sample and label many more data points to achieve the same performance. This results in more effort/cost. Embodiments of the invention are able to sample more informative data points and thus are able to achieve better matching performance.

An embodiment of the present invention comprises a two-step active learning method for imbalanced matching problems, combining weak-supervision with two subsequent active learning phases, preferably for detecting difficult but commonly occurring long-tail matches.

A further embodiment of the present invention can be a weakly-supervised active learning method.

Based on embodiments of the present invention more potential matches can be found efficiently, with less time and effort compared with prior art methods. The embodiments provide a key to improve matching performance.

Embodiments of the invention can present a two-step active learning method for imbalanced matching problems, e.g., ontology, schema, entity matching, that can combine weak-supervision with two subsequent active learning phases to detect difficult but commonly occurring long-tail matches. It can use traditional uncertainty-based sampling with performance saturation detection and can then switch to a second sampling method that combines traditional uncertainty metrics, an efficient MinHash LSH based similarity computation, and the available weak supervision signals. Overall, this can achieve among others a better cost-performance trade-off compared to existing methods.

There are several ways how to design and further develop the teaching of the present invention in an advantageous way. To this end it is to be referred to the following explanation of examples of embodiments of the invention, illustrated by the drawing. In the drawing

Fig. 1 shows in a diagram an overview of an embodiment of the present invention,

Fig. 2 shows an algorithm sketch of an embodiment of a long-tail sampling strategy,

Fig. 3 shows a further embodiment wherein matched data is used to create a digital twin and execute optimized building control,

Fig. 4 shows a further embodiment wherein different city data is matched to enable cross-silo optimization applications and

Fig. 5 shows a further embodiment comprising creation and deployment of matchings between heterogeneous railroad device infrastructure to enable Smart Railway Management.

Embodiments of the invention present a two-phase active learning strategy that addresses the “long tail” problem of finding potential informative matches efficiently. In the first phase, a traditional sampling strategy is executed until a saturation point in model performance improvement is detected. Then, the method switches to a second sampling strategy that focuses on these long-tail samples using a computationally efficient method, tailored to such matching problems, that is based on MinHash LSH based similarity computation, traditional uncertainty metrics, and the available weak supervision signals. Compared to existing works, this two-step method achieves higher performance with the same amount of labeling effort, i.e., less time and effort is needed for annotating data points to achieve equal or even better matching quality.

The overall flow of an embodiment of the invention is depicted in Fig. 1. A set of weak-supervision functions creates a set of labels in a generative process for the data points. These labels are then used to train a discriminative model, creating a temporary classifier. The temporary classifier then labels the data. As long as there is labeling budget available, active learning iterations are performed as follows: First, a saturation detection is performed to decide if default, uncertainty-based sampling strategies are still sufficient. Then either such a default sampling strategy is executed or, if saturation has been reached, a long-tail sampling strategy is performed. In either case the sampled data points are provided to the oracle, such as a computer system or database, which then provides labels for the sampled data points. These new, annotated labels can be used 1.) directly to retrain the discriminative model, 2.) to tune the generative modeling process, through influencing the modelled accuracy of each affected weak supervision source, and 3.) as an extension, to suggest/create new weak-supervision sources.

When the defined labeling budget has been depleted, the last trained discriminative classifier is outputted. This classifier is then used as final model for prediction.
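The control flow described above can be summarized in a short Python skeleton. All callables (`train`, `is_saturated`, `uncertainty_sample`, `long_tail_sample`, `oracle`) are hypothetical placeholders that a concrete embodiment would have to supply; the toy stand-ins at the bottom only exist to make the sketch executable.

```python
def active_learning_loop(pool, oracle, budget, batch_size, train,
                         is_saturated, uncertainty_sample, long_tail_sample):
    """Two-phase active learning loop; returns the final model once the budget is spent."""
    labeled, phases = {}, []
    model = train(labeled)
    while budget > 0:
        unlabeled = [x for x in pool if x not in labeled]
        # Saturation detection decides which sampling strategy to use.
        saturated = is_saturated(model)
        sampler = long_tail_sample if saturated else uncertainty_sample
        batch = sampler(model, unlabeled)[:min(batch_size, budget)]
        if not batch:
            break
        # The oracle provides labels for the sampled data points.
        for x in batch:
            labeled[x] = oracle(x)
        budget -= len(batch)
        phases.append("long_tail" if saturated else "uncertainty")
        # The new labels are used to retrain the discriminative model.
        model = train(labeled)
    return model, labeled, phases


# Toy stand-ins (assumptions for illustration only).
model, labeled, phases = active_learning_loop(
    pool=list(range(10)),
    oracle=lambda x: x % 5 == 0,
    budget=6,
    batch_size=2,
    train=lambda labels: dict(labels),
    is_saturated=lambda m: len(m) >= 4,
    uncertainty_sample=lambda m, unlabeled: unlabeled,
    long_tail_sample=lambda m, unlabeled: unlabeled[::-1],
)
```

With these stand-ins, the loop runs two uncertainty-based iterations, detects saturation, and spends the remaining budget in the long-tail phase.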

In the following, two core building blocks of embodiments, Saturation Detection and Long-Tail Sampling Strategy, are described in more detail:

Saturation Detection:

The simplest way to detect saturation would be to validate the trained model with some test dataset. However, obtaining a representative test data set is costly and active learning is often applied in scenarios where users exactly want to avoid collecting large, labeled datasets.

Active learning stopping mechanisms have been developed that rely on the overall, average uncertainty, based on some uncertainty measure such as entropy or margin uncertainty [1]. This is, however, model dependent and has shown instability, especially in multiclass settings [2]. Embodiments of the present invention propose a different method that exploits the fact that there are two prediction outputs: 1.) the predictions of the generative model and 2.) the predictions of the discriminative model. The insight is to implement a stopping mechanism based on the overall distance between these two output datasets, e.g., using the sum of squared residuals. The initial sampling strategy is then stopped when the change between successive active learning iterations becomes small enough, as defined by some parameter, and the method switches to the long-tail sampling strategy.
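A minimal sketch of such a saturation check is shown below. It assumes that both models output a match probability for the same data points, that the sum of squared residuals is used as the distance, and that the threshold parameter is called `epsilon`; the names and the example probabilities are hypothetical.

```python
def sum_squared_residuals(p_generative, p_discriminative):
    """Overall distance between the outputs of the two models."""
    return sum((g - d) ** 2 for g, d in zip(p_generative, p_discriminative))


def is_saturated(distance_history, epsilon=0.01):
    """True once the distance changed by less than epsilon between two iterations."""
    if len(distance_history) < 2:
        return False
    return abs(distance_history[-1] - distance_history[-2]) < epsilon


# Hypothetical match probabilities of the generative model for three data points.
p_generative = [0.9, 0.1, 0.8]

# Hypothetical discriminative outputs after successive active learning iterations.
p_discriminative_per_iteration = [
    [0.50, 0.50, 0.50],
    [0.70, 0.30, 0.60],
    [0.75, 0.25, 0.70],
    [0.76, 0.24, 0.70],
]

history = []
switch_iteration = None
for iteration, p_discriminative in enumerate(p_discriminative_per_iteration):
    history.append(sum_squared_residuals(p_generative, p_discriminative))
    if is_saturated(history):
        # From here on the long-tail sampling strategy would be used.
        switch_iteration = iteration
        break
```

No labeled test dataset is needed for this check, since it only compares predictions that are available in every iteration.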

Long-Tail Sampling Strategy:

The following approach is proposed for sampling informative data points for long-tail examples, see also Algorithm 1 for a sketch of an implementation of its core functions:

1. Compute a feature vector for each data point, e.g., a class in an ontology, a data record in a database. Such a feature vector may employ word and sentence embedding models for text attributes or apply data augmentation techniques to augment the raw data.

Output: A feature vector for each data point that needs to be matched.

2. Create MinHash signatures [3] for each data point. These signatures encode the feature vector efficiently. With these MinHash signatures, the Jaccard similarity between combinations can be quickly estimated in linear time. For two feature sets S and T, the Jaccard similarity is:

$J(S, T) = |S \cap T| / |S \cup T|$

3. While oracle budget is available, it is started from a defined similarity_threshold parameter, e.g., it is started from 1, the highest possible Jaccard similarity:

a. Create an LSH index with the defined similarity threshold that impacts the chosen number and size of bands used.

b. Insert all MinHash signatures from dataset A using the data point ID as key.

c. Query with all signatures from dataset B. This will return all matches above the defined similarity threshold and only requires constant time for each query (thus linear overall).

d. Filter results by already oracle annotated data points in previous iterations.

e. Compute informative metric for each data point based on a linear combination of uncertainty, similarity and weak label availability:

$I(x) = \alpha \cdot |\mathit{weak\_label}| + \beta \cdot \mathit{uncertainty} + \gamma \cdot \mathit{jaccard\_similarity}$

This allows for a flexible and adjustable (based on parameters α, β, γ) linear combination of information contained in weak labels, model uncertainty and Jaccard similarity.

f. Sort data points based on computed informative metric.

g. While oracle budget available:

i. Pick batch of highest informative samples to ask the Oracle.

ii. If all samples have been provided to the oracle, or the LSH query has not returned any samples, reduce the similarity_threshold by a defined stepsize parameter. This will loosen the Jaccard similarity restriction and thus return more candidates for active sampling.
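The candidate retrieval in steps a. to d., together with the stepwise loosening of the similarity threshold, can be sketched as follows. This is a simplified, standard-library illustration and not the implementation of Algorithm 1: the banding heuristic in `choose_bands`, the seeded SHA-1 hash family, the class and function names, and the example records are assumptions.

```python
import hashlib
from collections import defaultdict


def minhash(tokens, num_perm=64):
    """MinHash signature of a non-empty token set (seeded SHA-1 as hash family)."""
    return tuple(
        min(int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for seed in range(num_perm)
    )


def choose_bands(threshold, num_perm):
    """Pick (bands, rows) whose approximate threshold (1/b)**(1/r) is closest."""
    best = None
    for rows in range(1, num_perm + 1):
        if num_perm % rows:
            continue
        bands = num_perm // rows
        midpoint = (1.0 / bands) ** (1.0 / rows)
        if best is None or abs(midpoint - threshold) < abs(best[0] - threshold):
            best = (midpoint, bands, rows)
    return best[1], best[2]


class LshIndex:
    """Banded LSH index: signatures sharing at least one band become candidates."""

    def __init__(self, threshold, num_perm):
        self.bands, self.rows = choose_bands(threshold, num_perm)
        self.buckets = defaultdict(set)

    def _band_keys(self, signature):
        for b in range(self.bands):
            yield b, signature[b * self.rows:(b + 1) * self.rows]

    def insert(self, key, signature):
        for band_key in self._band_keys(signature):
            self.buckets[band_key].add(key)

    def query(self, signature):
        found = set()
        for band_key in self._band_keys(signature):
            found |= self.buckets.get(band_key, set())
        return found


def long_tail_candidates(sigs_a, sigs_b, annotated, start=1.0, step=0.1, num_perm=64):
    """Yield (threshold, new candidate pairs), loosening the threshold each round."""
    threshold = start
    while threshold > 0:
        index = LshIndex(threshold, num_perm)
        for key, signature in sigs_a.items():
            index.insert(key, signature)
        pairs = {(a, b) for b, sig in sigs_b.items() for a in index.query(sig)}
        yield threshold, pairs - annotated  # drop pairs the oracle already labeled
        threshold = round(threshold - step, 10)


# Hypothetical token sets for datasets A and B.
sigs_a = {"a1": minhash({"room", "101"}), "a2": minhash({"lobby"})}
sigs_b = {"b1": minhash({"room", "101"}), "b2": minhash({"garage", "north"})}

first_threshold, first_pairs = next(long_tail_candidates(sigs_a, sigs_b, set()))
```

At the highest threshold only pairs with identical signatures are returned. Each later round uses more, smaller bands and therefore returns more candidates, which would then be ranked by the informative metric before being sent to the oracle.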

Further Embodiment 1: Match Building Related Data for Digital Twin and Smart Building Applications

Buildings account for roughly 40% of all energy consumed in developed countries. Non-residential buildings, such as office buildings, account for a large amount of that energy. These buildings usually consist of several sub-systems, such as for lighting, heating, ventilation and air conditioning, HVAC, access control, room booking etc. For each of these sub-systems, different data schemas/ontologies exist that usually originate in the respective sub-systems and might thus use different naming conventions.

In the architecture, engineering and construction/facility management industry, there exists a common standard for the representation of Building Information Models, BIM, namely the Industry Foundation Classes, IFC, standard. As the BIM represented in IFC contains various in-depth information about the building, including its structure and materials, it can serve as a good basis towards a common data model of a building digital twin. In this digital twin, the detailed IFC information can be used to perform building simulations and as input to optimally actuate the building components. It can also be used to calculate building carbon emissions and develop strategies to reduce these emissions.

The problem is that the underlying data schemas/ontologies are not aligned, and thus a digital twin would need to be constructed for each individual building expensively with lots of man hours. It could be shown that weakly-supervised ontology and schema matching shows great potential for such problems but suffers when labeling functions are not covering the specific matching problems of the building.

On the basis of embodiments of the present invention, active learning can be added effectively, discovering matches beyond the ones covered by the knowledge of the labeling functions. Embodiments of the present invention will accelerate the integration and with little effort enable a digital twin construction and a more optimal operation of the physical building twin.

Further Embodiment 2: Use to match smart city data for Super Smart City Applications

In cities, data is generated independently by different departments and even private businesses. To create a common data space between these data producers, their non-integrated data needs to be transferred to a common data model. A key challenge here is to provide automated matching on schema and entity level with little available machine learning training data. Enabling such automated matching is greatly accelerated with embodiments of this invention. The then integrated data space can be used for many purposes that affect the city and its operation. E.g., data from businesses, current events, public transport, and deployed transport sensing infrastructure can be used jointly to optimize traffic flow in the city, e.g., by adjusting traffic lights or deploying more buses.

Further Embodiment 3: Use to match data schemas between devices for smart railway management

Smart railways are usually managed with a large number of heterogeneous devices from various manufacturers and suppliers. For better system efficiency and automation, those devices very often need to interact with each other either directly or indirectly via some intermediate data processing. To enable such interoperability between devices, domain experts must spend lots of effort and time on mapping their data schemas. This is a tedious and time-consuming task. Also, it is not a one-time task. As old devices become defective and get replaced with new devices over time, and the same device might come from different vendors, domain experts must always be engaged in this data integration loop to ensure the correctness of the data schema mapping. Embodiments of the present invention could help domain experts to map the data schemas between devices more efficiently and faster.

Extension 1:

As an extension/alternative use, embodiments of the long-tail sampling strategy can also be used as the only sampling strategy, fully replacing traditional sampling strategies. This can make sense either when the weak supervision sources are representative and accurate, so that there is no (or not much) room for improvement, or when, instead of weak-supervision signals, the ML model is trained with a representative set of labeled data points.
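A minimal sketch of how a long-tail sampling step of this kind might rank data points, using the linear combination of uncertainty, similarity and weak label availability recited in claim 14. The class `Candidate`, the function names, the weights and the sign conventions are all assumptions made for illustration; the description does not fix them.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    pair_id: str
    match_prob: float          # predicted match probability of the discriminative model
    max_sim_to_labeled: float  # highest similarity (0..1) to any already-labeled point
    has_weak_label: bool       # True if at least one weak supervision source voted


def informativeness(c, w_unc=1.0, w_sim=1.0, w_weak=1.0):
    """Linear combination of uncertainty, similarity and weak label availability.

    Assumed conventions (not taken from the source): a point scores higher
    when the model is unsure, when it is dissimilar from labeled points,
    and when no weak supervision source covers it.
    """
    uncertainty = 1.0 - abs(2.0 * c.match_prob - 1.0)  # 1 at p=0.5, 0 at p=0 or p=1
    novelty = 1.0 - c.max_sim_to_labeled
    uncovered = 0.0 if c.has_weak_label else 1.0
    return w_unc * uncertainty + w_sim * novelty + w_weak * uncovered


def sample_long_tail(candidates, k):
    """Return the k candidates with the highest informativeness score."""
    return sorted(candidates, key=informativeness, reverse=True)[:k]


pool = [
    Candidate("a1-b7", match_prob=0.95, max_sim_to_labeled=0.9, has_weak_label=True),
    Candidate("a2-b3", match_prob=0.50, max_sim_to_labeled=0.2, has_weak_label=False),
    Candidate("a4-b4", match_prob=0.60, max_sim_to_labeled=0.8, has_weak_label=True),
]
picked = sample_long_tail(pool, k=1)
```

With these assumed conventions, the uncovered, dissimilar and uncertain pair `a2-b3` is the one forwarded to the oracle. When the strategy is used as the only sampling strategy, the same ranking would simply be applied from the first iteration onward.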

Extension 2:

The oracle might also provide new weak-supervision signals or the oracle’s annotations might be used to generate new weak-supervision signals or tune existing signals.
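One way to picture how oracle annotations might be turned into a new weak-supervision signal is sketched below. The token-overlap heuristic, the label constants and the names `tokens` and `make_labeling_function` are hypothetical and chosen only for illustration; the description does not prescribe how such signals are derived.

```python
ABSTAIN, NON_MATCH, MATCH = -1, 0, 1


def tokens(name):
    """Lower-case a name and split it into a set of tokens."""
    return set(name.lower().replace("_", " ").split())


def make_labeling_function(oracle_labels):
    """Build a new labeling function from oracle-annotated pairs.

    oracle_labels: list of ((name_a, name_b), label), label being MATCH or NON_MATCH.
    Assumed heuristic: a token shared by confirmed matches, but never shared
    by confirmed non-matches, is treated as indicative of a match.
    """
    match_tokens, non_match_tokens = set(), set()
    for (a, b), label in oracle_labels:
        shared = tokens(a) & tokens(b)
        (match_tokens if label == MATCH else non_match_tokens).update(shared)
    indicative = match_tokens - non_match_tokens

    def lf(a, b):
        # Vote MATCH only when an indicative token is shared; otherwise abstain.
        return MATCH if tokens(a) & tokens(b) & indicative else ABSTAIN

    return lf


oracle_labels = [
    (("zone_temp_sensor", "Zone Temp"), MATCH),
    (("zone_air_flow", "Zone Humidity"), NON_MATCH),
]
new_lf = make_labeling_function(oracle_labels)
vote_a = new_lf("supply_temp", "Supply Air Temp")
vote_b = new_lf("zone_co2", "Zone Pressure")
```

Here `vote_a` is a MATCH vote because both names share the token "temp", while `vote_b` abstains because "zone" was also shared by a confirmed non-match. The resulting function could be added to the existing set of weak supervision sources.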

Many modifications and other embodiments of the invention set forth herein will come to mind to the one skilled in the art to which the invention pertains having the benefit of the teachings presented in the foregoing description and the associated drawings. Therefore, it is to be understood that the invention is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims

1. A computer-implemented active learning method in the field of an imbalanced matching situation, for finding at least one relationship between data points in different data sets, comprising the following steps:

- executing an uncertainty-based sampling strategy on a defined set of data points;

- performing a detection of a defined saturation level in an improvement of a model performance, wherein the detection is based on outputs of a generative model and a discriminative model;

- further executing the uncertainty-based sampling strategy, if such a saturation level is not detected, and switching to executing a long-tail sampling strategy, if such a saturation level is detected; and

- outputting a final model for prediction, if a defined budget of labels for the data points is depleted.

2. The method according to claim 1, wherein the uncertainty-based sampling strategy comprises providing a set of candidate match pairs of data points from a first source (A) and a second source (B).

3. The method according to claim 1 or 2, wherein the uncertainty-based sampling strategy comprises creating a set of weak supervision signals or functions for a matching problem and outputting a set of weak supervision sources (L).

4. The method according to claim 3, wherein the weak supervision signals or functions are combined in a generative process and at least one label is created for non-abstained candidate match pairs.

5. The method according to claim 4, wherein the at least one label and a set of features is used to train a discriminative machine learning model to then create predictions for all candidate matches or candidate matches or match pairs or create a temporary classifier.

6. The method according to any one of claims 1 to 5, wherein sampled data points are provided to an oracle, wherein the oracle can be a computer system or database.

7. The method according to claim 6, wherein the oracle provides at least one label for the sampled data points.

8. The method according to claim 7, wherein the at least one label is used to retrain a discriminative model or to tune a generative modeling process or to suggest/create new weak supervision sources or signals or functions.

9. The method according to any one of claims 1 to 8, wherein performing a detection of a defined saturation level in an improvement of a model performance is based on validating the trained model or uncertainty-based sampling strategy with some test dataset or on a defined value of distance between the outputs of the generative model and the discriminative model.

10. The method according to any one of claims 1 to 9, wherein the long-tail sampling strategy comprises computing a feature vector for each data point that needs to be matched.

11. The method according to any one of claims 1 to 10, wherein the long-tail sampling strategy comprises a MinHash Locality-Sensitive Hashing, LSH, based similarity computation.

12. The method according to any one of claims 1 to 11, wherein a MinHash signature is created for each data point.

13. The method according to claim 12, wherein the signature encodes one or more feature vectors.

14. The method according to any one of claims 1 to 13, wherein the long-tail sampling strategy comprises computing an informative metric for each data point based on a linear combination of uncertainty, similarity and weak label availability.

15. A system for carrying out the computer-implemented active learning method in the field of an imbalanced matching situation, according to any one of claims 1 to 14, for finding at least one relationship between data points in different data sets, comprising: