CN112836753B - Method, apparatus, device, medium, and article for domain adaptive learning


Info

Publication number
CN112836753B
Authority
CN
China
Prior art keywords
target video
target
video samples
domain
samples
Prior art date
Legal status
Active
Application number
CN202110162210.0A
Other languages
Chinese (zh)
Other versions
CN112836753A (en)
Inventor
许鹏飞
王雅田
宋晓林
赵思成
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202110162210.0A
Publication of CN112836753A
Priority to PCT/CN2022/072612
Application granted
Publication of CN112836753B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to methods, apparatus, devices, media, and products for domain adaptive learning. The method described herein includes obtaining a source video sample set of a source domain and a target video sample set of a target domain, the source video samples in the source video sample set being labeled as belonging to one of a plurality of known categories; determining probabilities that a plurality of target video samples in the target video sample set each belong to an unknown category, based on a plurality of similarities between the plurality of target video samples and the source video sample set; and adapting a multi-classification model previously pre-trained with the source video sample set to the target domain, based at least on the plurality of target video samples and their respective probabilities. In this way, more reliable domain adaptive learning for video classification can be achieved.

Description

Method, apparatus, device, medium, and article for domain adaptive learning
Technical Field
The present disclosure relates generally to the field of computer vision, and more particularly to methods, apparatus, electronic devices, computer-readable storage media, and computer program products for domain adaptive learning.
Background
Domain adaptation (DA) is an advanced capability in the field of computer vision. Domain adaptation refers to migrating knowledge learned from a related domain with sufficient supervision information (called the "source domain") to another domain without supervision information (called the "target domain"). For example, domain adaptation may migrate a classifier trained on training data with classification labels to a data domain where the classification labels are unknown, for use in classification tasks.
Most domain adaptive learning methods proposed to date are based on the closed set assumption, i.e., the assumption that the source domain contains all the classes in the target domain. However, this assumption may not hold in many practical applications. A more common scenario in practice is that the target domain has classes that do not correspond to the source domain, e.g., the target domain has more classes than the source domain. Such domain adaptation is referred to as "open set" domain adaptation. Compared with the closed set assumption, the existence of unknown classes in open set domain adaptation makes transfer learning of a model from the source domain to the target domain more difficult, and erroneous transfer problems easily occur.
Furthermore, because video data carries more complex information in both the spatial and temporal dimensions, domain adaptive learning for video data faces even greater challenges. It is desirable to provide a more reliable domain adaptive learning scheme, in particular for video data.
Disclosure of Invention
According to some embodiments of the present disclosure, a scheme for domain adaptive learning is provided.
In a first aspect of the present disclosure, a method for domain adaptive learning is provided. The method includes obtaining a source video sample set of a source domain and a target video sample set of a target domain, the source video samples in the source video sample set being labeled as belonging to one of a plurality of known categories; determining probabilities that the plurality of target video samples each belong to an unknown class based on a plurality of similarities between the plurality of target video samples in the target video sample set and the source video sample set; and adapting a multi-classification model previously pre-trained with the set of source video samples to the target domain based at least on the plurality of target video samples and their respective probabilities.
In a second aspect of the present disclosure, an apparatus for domain adaptive learning is provided. The apparatus includes an acquisition module configured to acquire a source video sample set of a source domain and a target video sample set of a target domain, the source video samples in the source video sample set being labeled as belonging to one of a plurality of known categories; a probability determination module configured to determine probabilities that the plurality of target video samples each belong to an unknown class based on a plurality of similarities between the plurality of target video samples in the target video sample set and the source video sample set; and a model adaptation module configured to adapt a multi-classification model previously pre-trained with the set of source video samples to the target domain based at least on the plurality of target video samples and their respective probabilities.
In a third aspect of the present disclosure, there is provided an electronic device comprising a memory and a processor, wherein the memory is for storing computer executable instructions that are executed by the processor to implement a method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-executable instructions, wherein the computer-executable instructions are executed by a processor to implement a method according to the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement a method according to the first aspect of the present disclosure.
Drawings
Features, advantages, and other aspects of various implementations of the disclosure will become apparent with reference to the following detailed description when taken in conjunction with the accompanying drawings. Several implementations of the present disclosure are illustrated herein by way of example and not by way of limitation, in the figures of the accompanying drawings:
FIG. 1 illustrates a block diagram of an example environment for domain adaptive learning, according to some embodiments of the present disclosure;
FIG. 2 illustrates an architectural block diagram for domain adaptive learning in accordance with some embodiments of the present disclosure;
Fig. 3 illustrates a block diagram of an example structure of an Open Set Discriminator (OSD) of fig. 2, according to some embodiments of the present disclosure;
FIG. 4 illustrates a block diagram of an example structure of the domain discriminator of FIG. 2, according to some embodiments of the disclosure;
FIG. 5 illustrates a flow chart of a process for domain adaptive learning according to some embodiments of the present disclosure;
FIG. 6 illustrates a block diagram of an apparatus for domain adaptive learning, according to some embodiments of the present disclosure; and
FIG. 7 illustrates a block diagram of a computing system in which one or more embodiments of the disclosure may be implemented.
Detailed Description
Preferred implementations of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred implementations of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the implementations set forth herein. Rather, these implementations are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein means open ended, i.e., "including but not limited to. The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example implementation" and "one implementation" mean "at least one example implementation". The term "another implementation" means "at least one additional implementation". The terms "first," "second," and the like, may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As used herein, the term "model" may learn the association between the respective inputs and outputs from training data so that, for a given input, a corresponding output may be generated after training is completed. The generation of the model may be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs through the use of multiple layers of processing units. The neural network model is one example of a deep learning-based model. The "model" may also be referred to herein as a "machine learning model," "machine learning network," or "learning network," which terms are used interchangeably herein.
Generally, machine learning may include three phases: a training phase, a testing phase, and a use phase (also referred to as an inference phase). The training phase is also called the learning phase. During the training phase, a given model may be trained using a large amount of training data, iterating until the model reaches a desired goal. Through training, the model may be considered able to learn the association between input and output (also referred to as an input-to-output mapping) from the training data. The parameter values of the trained model are determined. In the testing phase, test inputs are applied to the trained model to test whether the model is capable of providing the correct outputs, thereby determining the performance of the model. In the use phase, the model may be used to process actual inputs based on the trained parameter values to determine the corresponding outputs.
As mentioned above, it is desirable to be able to provide a more reliable domain adaptive learning scheme, in particular for video data.
Embodiments of the present disclosure propose an improved domain adaptive learning scheme. According to this scheme, a multi-classification model for multiple known classes has been pre-trained in the source domain using a set of source video samples. The probability that each target video sample belongs to an unknown class is determined by measuring the respective similarities between a plurality of target video samples in the target domain and the source video sample set in the source domain. The multi-classification model is adapted to the target domain based at least on the plurality of target video samples and their probabilities of belonging to the unknown class. In this way, more reliable domain adaptive learning for video classification can be achieved.
Some example embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Example Environment
Referring initially to fig. 1, an example environment 100 for domain adaptive learning is illustrated in accordance with some embodiments of the present disclosure. In environment 100, multi-classification model 110 is configured to classify video. The multi-classification model 110 has been pre-trained in the source domain 120. Specifically, the multi-classification model 110 is pre-trained with a set of source video samples 140 in the source domain 120. The source video sample set 140 includes a plurality of source video samples, such as source video samples 142-1, 142-2, etc. (collectively or individually referred to as "source video samples 142" for ease of discussion).
Herein, "source domain" refers to a domain that includes videos that can be divided into known multiple known categories. The multi-classification model 110 trained in the source domain can have been trained with video samples with classification labels in the source domain 120. The classification labels serve as supervision information in the training process. The multi-classification model 110 is trained to learn how to distinguish between the various classes of video in the source domain. For example, if source field 120 includes a video that presents multiple action categories, including riding a horse, playing a football, etc., multi-classification model 110 can learn characteristics of actions in the video, such as riding a horse, playing a football, etc., so that the probability of occurrence of these action categories in the newly input video can be predicted.
In some cases, it is desirable to be able to generalize the multi-classification model 110 that has been pre-trained in the source domain 120 to the target domain 130. Herein, "target domain" refers to a domain in which the categories of videos are not annotated. That is, the categories of videos in the target domain 130 may be the same as the categories in the source domain 120, and may also include unknown categories not involved in the source domain 120. However, because the categories of the individual videos in the target domain 130 are not annotated, the number of categories in the target domain 130 and their differences from the categories in the source domain 120 cannot be accurately determined. For example, in a scenario of classifying the actions presented in a video, the target domain 130 may include the same action categories as the source domain 120, such as riding a horse and playing football, while also including action categories not present in the source domain 120, such as archery.
The generalization of the model from the source domain 120 to the target domain 130 is referred to as domain adaptive learning. The computing system 112 is configured to perform domain adaptive learning for the multi-classification model 110 to obtain a multi-classification model 110 that is adapted to the target domain 130. Domain adaptive learning enables the multi-classification model 110 to learn the characteristics of the various categories of video in the target domain 130. Thus, the multi-classification model 110 adapted to the target domain 130 is able to accurately classify videos in the target domain 130.
The computing system 112 utilizes the source video sample set 140 in the source domain 120 and the target video sample set 150 in the target domain 130 to perform the domain adaptive learning process for the multi-classification model 110. The source video sample set 140 has associated classification labels 144. The classification labels 144 mark each source video sample 142 as belonging to one of a plurality of known categories. The target video sample set 150 includes a plurality of target video samples, such as target video samples 152-1, 152-2, 152-3, etc. (collectively or individually referred to as "target video samples 152" for ease of discussion). The target video samples 152 may not have associated category labels marking their respective categories.
The source video samples 142 and the target video samples 152 may each comprise a plurality of frames. Although not specifically shown, each category may include multiple video samples in the source video sample set 140 and the target video sample set 150. It should be appreciated that fig. 1 only shows an example of a source video sample set 140 and a target video sample set 150. There may be more or fewer video samples, and more or fewer categories.
In environment 100, computing system 112 may be any of a variety of devices with computing capabilities. For example, computing system 112 may be a server device or a terminal device. The server device may be, for example, a centralized server, a distributed server, a mainframe, an edge computing device, a cloud, etc. The terminal device may be, for example, any of various portable or fixed terminals such as smartphones, tablet computers, desktop computers, notebook computers, in-vehicle devices, navigation devices, multimedia player devices, smart speakers, smart wearable devices (such as smart watches and smart glasses), and so on. Note that while shown as a single system in fig. 1, in some cases the functionality of computing system 112 may be implemented by multiple physical devices/systems. It should be understood that the video samples shown in fig. 1 are merely examples and are not intended to limit the scope of the present disclosure in any way.
Some embodiments of domain adaptive learning of the present disclosure will be described in detail below with continued reference to the accompanying drawings.
Example architecture
Fig. 2 illustrates a block diagram of an architecture 200 for domain adaptive learning, according to some embodiments of the present disclosure. For ease of discussion, domain adaptive learning is discussed with reference to fig. 1. Accordingly, architecture 200 is suitable for implementation by computing system 112 for performing domain adaptation of multi-classification model 110 from source domain 120 to target domain 130.
In embodiments of the present disclosure, rather than a closed set assumption for the source domain 120 and the target domain 130, domain adaptive learning is performed based on an open set assumption. The open set assumption does not require the source domain 120 to include all of the categories in the target domain 130, but rather allows the target domain 130 to have one or more categories that do not correspond to the source domain 120. Open set domain adaptation for video may also be referred to as open set video domain adaptation (OSVDA).
Assume that $\mathcal{C}_s$ and $\mathcal{C}_t$ represent the semantic category sets of the source domain 120 and the target domain 130, respectively. The goal of domain adaptive learning is typically to use the labeled source video sample set 140 and the unlabeled target video sample set 150 to improve the classification, by the multi-classification model 110, of target video samples belonging to $\mathcal{C}_t$. Assume that the number of categories in common between the source domain 120 and the target domain 130 is $M$. In an embodiment of domain adaptive learning of the present disclosure, each category common to the source domain 120 and the target domain 130 is identified as a known category, and the remaining one or more categories that may exist in the source domain 120 and the target domain 130 are identified as a single category, referred to as the "unknown category". The open set domain adaptive learning problem is thereby converted into a closed set domain adaptation problem over $(M+1)$ categories.
In accordance with embodiments of the present disclosure, in domain adaptive learning, it is proposed to discern whether each target video sample 152 in the target video sample set 150 may belong to a known category (e.g., the plurality of known categories to which the source video samples 142 in the source domain 120 belong) or to the unknown category (e.g., a category other than the plurality of known categories in the source domain 120), based on the similarity between each target video sample 152 in the target domain 130 and the source video sample set 140 in the source domain 120.
That is, in embodiments of the present disclosure, the probability that a target video sample 152 belongs to a category other than the plurality of known categories is determined based on the similarity between each target video sample 152 and the source video sample set 140 in the source domain 120. Because the target video samples 152 in the target domain 130 carry no annotation information, the number of categories other than the plurality of known categories (collectively referred to herein as the unknown category) and their specific category information are unknown. The probability that a target video sample 152 belongs to the unknown category is used together with the target video sample 152 to train the multi-classification model 110, resulting in a multi-classification model 110 that is adapted to the target domain 130.
To better explain how the multi-classification model 110 is trained, an example model structure of the multi-classification model 110 is briefly described in conjunction with FIG. 2. In the example of fig. 2, the multi-classification model 110 may be configured to explore features of the video from multiple levels to complete classification decisions. The plurality of levels may include two or more of a frame level, a temporal level across the plurality of frames, and a video level.
Specifically, as shown in FIG. 2, the multi-classification model 110 includes a feature extractor (denoted as "φ(·)") 210 configured to perform frame-by-frame feature extraction on an input video sample to obtain initial frame-level features 212 of the video sample. The multi-classification model 110 may also include a frame embedding layer 220 configured to perform a linear transformation on the initial frame-level features 212 to obtain frame-level features (denoted as F_f) 222 of the video sample. The frame-level features F_f 222 may be considered to characterize, at the frame level, the spatial information of the individual frames of the video sample in their two-dimensional space.
Compared with images, the richer information in video is mainly represented in the temporal dimension. Thus, in some embodiments, the multi-classification model 110 may need to focus not only on the features of each frame as spatial information, but also on the dynamically changing features that are presented when multiple frames are combined together in chronological order. For the video domain adaptation task, because domain shift may occur in both the spatial and temporal dimensions, features can be extracted for the spatial information while features in the temporal dimension are fully mined at the same time, so that open set recognition and transfer between the two domains can be better realized.
Accordingly, the multi-classification model 110 may also include a plurality of temporal integration portions 230-1 through 230-3 (collectively, "temporal integration portions 230"), each temporal integration portion 230 including a temporal integration unit 232 configured to extract temporal relationships between frames in different temporal regions of the video sample. For example, the temporal integration units 232 in different temporal integration portions 230 may be configured to determine the temporal relationships between different numbers of frames (e.g., two consecutive frames, three frames, etc.) in the video sample. For a given plurality of frames to be considered, the input of the temporal integration unit 232 is the frame-level features F_f 222 of those frames, and the output is the temporal-level features (denoted as F_t) 232 for those frames. The temporal-level features F_t 232 can be used to characterize the feature information of the video sample at the temporal level.
The multi-classification model 110 may further include a video-level integration unit 240 configured to combine the plurality of temporal-level features F_t 232 to generate a video-level feature (denoted as "F_v") 242 of the video sample. For example, the video-level integration unit 240 may be configured to compute a weighted sum of the plurality of temporal-level features F_t 232.
The feature extractor 210, the frame embedding layer 220, the temporal integration portions 230, and the video-level integration unit 240 together perform feature extraction on the input video sample at the frame level, the temporal level, and the video level. The resulting video-level feature F_v 242 is provided to a video classifier (denoted as "G_cls") 250 in the multi-classification model 110, which is configured to determine the category of the video sample based on the video-level feature F_v 242, e.g., to determine the probability that the video sample belongs to a particular category.
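To make the data flow concrete, the following is a minimal PyTorch sketch of the multi-level pipeline described above (feature extractor 210, frame embedding layer 220, temporal integration portions 230, video-level integration unit 240, and video classifier 250). All dimensions, the use of precomputed frame descriptors, the mean-pooling temporal rule, and the softmax-weighted video-level sum are illustrative assumptions, not the patent's prescribed implementation.

```python
import torch
import torch.nn as nn

class MultiLevelVideoClassifier(nn.Module):
    """Sketch of the frame / temporal / video level pipeline of FIG. 2."""
    def __init__(self, frame_dim=2048, embed_dim=256, num_classes=12, num_scales=3):
        super().__init__()
        # phi(.): frame-wise feature extractor 210; a CNN backbone in practice,
        # stubbed here as a linear map over precomputed frame descriptors.
        self.feature_extractor = nn.Linear(frame_dim, embed_dim)
        # Frame embedding layer 220: linear transform to frame-level features F_f.
        self.frame_embedding = nn.Linear(embed_dim, embed_dim)
        # Temporal integration units 232: each scale integrates a different
        # number of consecutive frames (2, 3, ...) into temporal features F_t.
        self.temporal_units = nn.ModuleList(
            [nn.Linear(embed_dim, embed_dim) for _ in range(num_scales)]
        )
        # Video-level integration unit 240: weighted sum of temporal features.
        self.scale_weights = nn.Parameter(torch.ones(num_scales) / num_scales)
        # Video classifier G_cls 250 over the (M+1) categories.
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, frames):                       # frames: (B, T, frame_dim)
        f = self.feature_extractor(frames)           # initial frame-level features
        f_f = self.frame_embedding(f)                # F_f: (B, T, embed_dim)
        f_ts = []
        for k, unit in enumerate(self.temporal_units):
            n = k + 2                                # integrate n consecutive frames
            pooled = f_f.unfold(1, n, 1).mean(-1)    # (B, T-n+1, embed_dim)
            f_ts.append(unit(pooled).mean(dim=1))    # F_t: (B, embed_dim)
        w = torch.softmax(self.scale_weights, dim=0)
        f_v = sum(wi * ft for wi, ft in zip(w, f_ts))  # F_v: video-level feature
        return self.classifier(f_v), f_f, f_ts, f_v
```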
It should be appreciated that the structure of the multi-classification model 110 shown in fig. 2 is only one example, and that the multi-classification model 110 may be configured with other structures, as desired for a particular application. For example, in the feature extraction stage, the multi-classification model 110 may lack one or more levels of feature extraction. The multi-classification model 110 may also include more or fewer processing portions for performing feature extraction. Embodiments of the disclosure are not limited in this respect.
In the architecture 200 of fig. 2, it is proposed to determine the similarity between the target video sample 152 and the source video sample set 140 in the source domain 120 by means of a discriminator, thereby determining the corresponding probability that the target video sample 152 belongs to an unknown class. The discriminator may perform corresponding processing based on features extracted from the video samples. Since in embodiments of the present disclosure, domain adaptive learning is performed based on open set assumptions, such a discriminator may be referred to as an open set discriminator (open set discriminator, OSD).
In the example shown in fig. 2, if the multi-classification model 110 is configured to explore features of the input target video sample 152 from at least one of the frame level, the temporal level, and the video level, then the similarity between the target video samples 152 and the source video sample set 140 may be determined at each corresponding level, and the probability that a target video sample 152 belongs to the unknown category may be determined at each corresponding level. The probabilities of the target video samples 152 belonging to the unknown category at multiple levels among the frame level, temporal level, and video level are aggregated for training of the multi-classification model 110.
For example, as shown in FIG. 2, architecture 200 includes a spatial OSD 202-1, a temporal OSD 202-2, and a video OSD 202-3, which explore features of the video samples from the frame level, temporal level, and video level, respectively, to perform classification discrimination and probability determination. In some embodiments, one or two of the spatial OSD 202-1, the temporal OSD 202-2, and the video OSD 202-3 may also be omitted. Hereinafter, for ease of discussion, the spatial OSD 202-1, the temporal OSD 202-2, and the video OSD 202-3 may be collectively or individually referred to as the OSD 202. The functionality of a single OSD 202 will be discussed in detail first below.
In some embodiments, the architecture 200 may optionally include a domain discriminator (DD) in addition to the OSD 202, configured to discriminate whether a video sample input to the multi-classification model 110 belongs to the source domain 120 or the target domain 130. The DD may output the probability that the video sample belongs to the source domain 120 or the target domain 130. In FIG. 2, architecture 200 is shown to include a spatial DD 204-1, a temporal DD 204-2, and a video DD 204-3 for distinguishing the domain to which the video samples belong at the frame level, temporal level, and video level, respectively. Hereinafter, for ease of discussion, the spatial DD 204-1, the temporal DD 204-2, and the video DD 204-3 may be referred to collectively or individually as the DD 204.
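As a rough illustration, each level-specific DD can be a small binary classifier over the features of the corresponding level. The two-layer MLP below is an assumed structure; the patent does not fix the DD's internals.

```python
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    """Predicts the probability that a feature comes from the source domain.
    One instance can be attached at each of the frame, temporal, and video
    levels (DD 204-1, 204-2, 204-3)."""
    def __init__(self, feat_dim=256, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),              # p(domain = source)
        )

    def forward(self, features):       # features: (B, feat_dim)
        return self.net(features).squeeze(-1)
```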
The OSD 202 and DD 204 are mainly used for training the multi-classification model 110 in the domain adaptive learning phase. After the multi-classification model 110 is trained, these discriminators do not participate in the classification process of the multi-classification model 110 on the actually input video.
Hereinafter, the processing in the OSD 202 will first be discussed in detail. The processing in the DD 204, and the specific training process for the multi-classification model 110 based on the OSD 202 and the DD 204, are then described.
Open set discrimination based on similarity
As mentioned above, it is desirable to achieve an initial partitioning of the target video samples 152 by measuring the similarity between each target video sample 152 in the target video sample set 150 and the source video sample set 140. In some embodiments, the first similarity (also referred to as "implicit similarity") between each target video sample 152 and the source video sample set 140 may be measured by training a binary classifier. Alternatively or additionally, a second similarity (also referred to as "explicit similarity") between each target video sample 152 and the source video sample set 140 may also be measured by an optimal transmission distance between the target video sample 152 and known reference features of a plurality of known categories. In some embodiments, the first similarity and the second similarity may also be considered simultaneously to determine a respective probability that each target video sample 152 belongs to an unknown class.
Fig. 3 illustrates an example structure of an OSD 202 according to some embodiments of the present disclosure. As shown in fig. 3, OSD 202 includes a double-metric discriminator (DMD) 305 configured to measure a first similarity and/or a second similarity between each target video sample 152 and source video sample set 140.
DMD 305 includes a binary classifier (denoted as "G c") 310 and a ranking and grouping module 330 configured to measure a first degree of similarity between each target video sample 152 and source video sample set 140 to divide the plurality of target video samples 152 into a first known category candidate group and a first unknown category candidate group. DMD 305 may also include a typical optimal transmission module 320 configured to measure a second similarity between each target video sample 152 and source video sample set 140 for partitioning the plurality of target video samples 152 into a second known class candidate set and a second unknown class candidate set.
In some embodiments, the partitioning of the plurality of target video samples 152 based on the first similarity and/or the second similarity may be used to determine discrimination learning weights 342 for each of the plurality of target video samples 152, which is implemented by the re-weighting module 340. The discrimination learning weights for each of the plurality of target video samples 152 may be provided to a binary discriminator (denoted as "G b") 350 included in the OSD 202 for affecting the determination of the probability that the target video sample 152 belongs or does not belong to an unknown class.
Next, the determination of the first similarity (i.e., implicit similarity) and the second similarity (i.e., explicit similarity) will be described first.
Determination of first similarity
In the OSD 202, the binary classifier G_c 310 is configured to assign respective probabilities of whether the target video samples 152 in the target video sample set 150 belong to the multiple known categories or to the unknown category. As mentioned above, the plurality of known categories refers to the categories in common between the source domain 120 and the target domain 130, whose number is assumed to be M. The binary classifier G_c 310 may be configured to include a plurality of binary classification modules, each configured to determine the probability that a video sample belongs to one of the M known categories and the unknown category ((M+1) categories in total).
In some embodiments, the binary classifier G_c 310 is trained using the source video sample set 140. In general, the similarity between target video samples 152 that belong to the M known categories and the source video samples 142 (which are all known to belong to the M known categories) is greater than the similarity between target video samples 152 that belong to the unknown category and the source video samples 142. Thus, by training the binary classifier G_c 310 with the source video sample set 140, the binary classifier G_c 310 can be enabled to score target video samples 152 as more or less similar to the source video sample set 140.
Since the source video sample set 140 has an associated classification tag 144 that marks which of the M known categories the source video sample 142 belongs to, the binary classifier G c 310 can be trained in a supervised learning manner. In some embodiments, training may be performed by constructing a learning penalty when training binary classifier G c 310,310. For example, the learning penalty of binary classifier G c 310 can be constructed as follows:
wherein the method comprises the steps of Representing the learning loss of the binary classifier G c 310; /(I)Representing a source video sample set 140; /(I)Representing features extracted from the video sample x (represented in fig. 3 as features 302) that are input to a binary classifier G c 310 for determining classification results; l bce denotes cross entropy loss, y denotes the class of source video samples x marked for training during the training phase. Since binary classifier G c 310 includes binary class unit components for the (M + 1) classes,For controlling the processing for each category. If y=c,/>Otherwise the first set of parameters is selected,
Based on the learning loss $\mathcal{L}_{G_c}$, during training, the values of the parameter set of the binary classifier G_c 310 are iteratively updated until the learning loss is minimized or reaches a desired goal, at which point training is complete. Training may be accomplished using a variety of machine learning or deep learning training techniques; embodiments of the disclosure are not limited in this respect.
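Under the reconstruction of equation (1), the (M+1)-unit classification loss might be computed as below. Representing G_c as a single logit vector with one-versus-rest binary cross-entropy per unit is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def binary_classifier_loss(logits, labels):
    """Equation (1) sketch. logits: (B, M+1) raw scores from G_c over source
    features; labels: (B,) with values in [0, M), since source samples never
    carry the unknown class -- the (M+1)-th unit only ever sees target 0."""
    targets = F.one_hot(labels, num_classes=logits.size(1)).float()  # 1_{[y=c]}
    return F.binary_cross_entropy_with_logits(logits, targets)
```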
During training of the binary classifier G_c 310, the features of the source video samples 142 in the source domain 120, namely $\hat{F}(x)$, are input to the binary classifier G_c 310 for forward propagation. In some embodiments, depending on whether the OSD 202 in which the binary classifier G_c 310 is located measures the similarity between the target video samples 152 and the source video sample set 140 at the frame level, the temporal level, or the video level, the frame-level features, temporal-level features, or video-level features of the video samples are provided to the binary classifier G_c 310 for performing the classification processing.
As can be seen from equation (1), although the binary classifier G_c 310 classifies over the $(M+1)$ categories, the binary classifier G_c 310 may not learn the relevant features of the $(M+1)$-th category, since the source video sample set 140 contains no video samples of the $(M+1)$-th category, i.e., the unknown category.
After binary classifier G c 310 is trained, target video samples 152 in target video sample set 150 may be input to trained binary classifier G c 310. More specifically, features may be extracted from the target video sample 152, and the binary classifier G c 310 determines, based on the features of the target video sample 152, the respective probabilities 312 that the target video sample 152 belongs to M known categories. Note again that this feature may be a frame-level feature, a temporal-level feature, or a video-level feature, depending on where OSD 202 is deployed in architecture 200.
Suppose the $j$-th frame $x_{ij}$ of the $i$-th target video sample 152 is input to the binary classifier G_c 310, which can output the corresponding probabilities that $x_{ij}$ belongs to the M known categories, expressed as $\{p_1, p_2, \ldots, p_M\}$, where $p_m$ represents the probability (which may also be referred to as a "score") of belonging to the $m$-th category of the M known categories. At other levels, such as the temporal level or the video level, the binary classifier G_c 310 in the corresponding OSD 202 may likewise derive the probability that the target video sample 152 belongs to the unknown category at the corresponding level.
In some embodiments, a first similarity between the target video sample 152 and the set of source video samples 140 may be obtained by determining the probability 312 of the target video sample 152 with the trained binary classifier G c 310. In some examples, for each target video sample 152, the probability 312 output by the binary classifier G c may include M probability values corresponding to M known categories, respectively, and the first similarity may be determined based on a largest probability value of the M probability values. For example, for each target video sample 152, the first similarity may be determined to be equal to the maximum probability value in the probabilities 312. The first similarity may also be determined as other values. In general, for each target video sample 152, the greater the maximum probability value in the probabilities 312, the greater the first similarity.
Further, some of the target video samples 152 in the target video sample set 150 are partitioned into a first known class candidate group (denoted as $\mathcal{D}_k^1$), and other target video samples 152 in the target video sample set 150 are partitioned into a first unknown class candidate group (denoted as $\mathcal{D}_u^1$).
Specifically, in the example of fig. 3, the probabilities 312 are provided to the ranking and grouping module 330. The ranking and grouping module 330 may utilize the probability 312 of each target video sample 152 to obtain the first similarity between that target video sample 152 and the source video sample set 140. When determining whether a target video sample 152 is partitioned into the first known class candidate group $\mathcal{D}_k^1$ or the first unknown class candidate group $\mathcal{D}_u^1$, the ranking and grouping module 330 may compare the first similarity of each target video sample 152 to a threshold. In some embodiments, if the first similarity of the target video sample 152 exceeds (e.g., is greater than or equal to) a first threshold, the ranking and grouping module 330 may partition the target video sample 152 into the first known class candidate group $\mathcal{D}_k^1$; if the first similarity of the target video sample 152 is below (e.g., less than or equal to) a second threshold, the ranking and grouping module 330 may partition the target video sample 152 into the first unknown class candidate group $\mathcal{D}_u^1$.
In some embodiments, the second threshold does not exceed the first threshold. For example, the second threshold may be the same as the first threshold, that is, a single threshold is set to divide the target video samples 152. In some examples, the second threshold may be less than the first threshold. In this way, target video samples 152 with a relatively high first similarity can be partitioned into the first known class candidate group $\mathcal{D}_k^1$, and target video samples 152 with a lower first similarity can be partitioned into the first unknown class candidate group $\mathcal{D}_u^1$.
In some embodiments, in partitioning the target video samples 152, the ranking and grouping module 330 may rank the plurality of target video samples 152 in the target video sample set 150 based on the first similarity. The ranking and grouping module 330 may partition a number of the highest-ranked target video samples 152 in the target video sample set 150 into the first known class candidate group $\mathcal{D}_k^1$, and partition a number of the lowest-ranked target video samples 152 into the first unknown class candidate group $\mathcal{D}_u^1$. The numbers of target video samples 152 included in the first known class candidate group $\mathcal{D}_k^1$ and the first unknown class candidate group $\mathcal{D}_u^1$ may be predetermined, and may be the same or different. In such embodiments, the first threshold and the second threshold may be considered to be set based on the current ranking result. In some embodiments, the first and second thresholds may be set to other values.
In some embodiments in which the partitioning is performed based on dual thresholds, some of the target video samples 152 in the target video sample set 150 (e.g., those target video samples 152 whose first similarity lies between the first threshold and the second threshold) may be partitioned neither into the first known class candidate group $\mathcal{D}_k^1$ nor into the first unknown class candidate group $\mathcal{D}_u^1$. This is because the binary classifier G_c 310 does not have high confidence in determining whether these target video samples 152 belong to the M known categories.
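The implicit-similarity grouping described above might be sketched as follows: the maximum of the M known-class probabilities serves as the first similarity, the samples are ranked, and the two ends of the ranking form the candidate groups. The fixed group sizes stand in for the first and second thresholds and are assumptions for illustration.

```python
import torch

def group_by_implicit_similarity(probs, n_known, n_unknown):
    """probs: (N, M) known-class probabilities 312 from the trained G_c for
    N target samples. Returns index tensors for the first known / unknown
    candidate groups; samples in the middle of the ranking stay ungrouped."""
    first_sim = probs.max(dim=1).values           # implicit similarity per sample
    order = torch.argsort(first_sim, descending=True)
    known_candidates = order[:n_known]            # highest similarity -> "known"
    unknown_candidates = order[-n_unknown:]       # lowest similarity -> "unknown"
    return known_candidates, unknown_candidates
```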
By utilizing the binary classifier G_c 310, target video samples 152 belonging to the M known categories and target video samples 152 belonging to the unknown category can be identified from the target video sample set 150 with higher confidence. Because the first known class candidate group $\mathcal{D}_k^1$ and the first unknown class candidate group $\mathcal{D}_u^1$ are obtained by the partitioning of the binary classifier G_c 310 trained from the source video sample set 140, rather than by directly computing a similarity, the probability 312 given by the binary classifier G_c 310 may be considered an "implicit" similarity between the target video sample 152 and the source video sample set 140.
In some embodiments, the target video samples 152 partitioned into the first known class candidate group $\mathcal{D}_k^1$ may be labeled with a "pseudo" known category label, and the target video samples 152 partitioned into the first unknown class candidate group $\mathcal{D}_u^1$ may be labeled with a "pseudo" unknown category label.
Determination of second similarity
As mentioned above, in OSD 202, DMD 305 may also include a typical optimal transmission module 320 configured to measure a second similarity between the plurality of target video samples 152 and source video sample set 140.
In domain adaptive learning, the presence of domain shift makes it difficult to guarantee the division of the target video samples 152 into known and unknown categories in a supervised manner. Accordingly, it is desirable to explore the similarity between the plurality of target video samples 152 and the source video sample set 140 further, in order to divide the target video samples 152 into known and unknown categories as accurately as possible.
In some embodiments, the typical optimal transmission module 320 is configured to divide some of the target video samples 152 in the target video sample set 150 into a second known class candidate group, and to divide other target video samples 152 in the target video sample set 150 into a second unknown class candidate group, based on a method of optimal transmission distance (optimal transport distance).
Specifically, the typical optimal transmission module 320 may be configured to determine, based on the source video sample set 140, a reference feature for each of the $(M+1)$ categories (including the M known categories and one unknown category) with respect to the source domain 120. Each reference feature corresponds to one of the $(M+1)$ categories and may be considered a representative feature representation of that category in the source domain 120. The set of reference features corresponding to the $(M+1)$ categories may be represented as $\{z_k\}_{k=1}^{M+1}$, where $z_k$ represents the reference feature of the $k$-th category with respect to the source domain 120, with $k$ ranging from 1 to $(M+1)$.
In some examples, the reference feature corresponding to each category may be iteratively updated based on a momentum averaging strategy. For example,

$$z_k \leftarrow m\, z_k + (1-m)\,\frac{1}{\left|\mathcal{B}_k\right|}\sum_{x\in\mathcal{B}_k}\hat{F}(x)$$

where $m$ represents a momentum factor, which may be a preset value; $k$ represents the $k$-th of the $(M+1)$ categories; $\mathcal{B}_k$ represents the source video samples 142 in a batch of source video samples of the source video sample set 140 that are labeled as belonging to the $k$-th category; and $\left|\mathcal{B}_k\right|$ represents the number of such source video samples. In the initial stage, $z_k$ is initialized to zero. As source video samples 142 are input, $z_k$ is continuously updated.
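A hypothetical implementation of this momentum-averaged update, under the reconstruction above, is:

```python
import torch

def update_reference_features(z, feats, labels, m=0.9):
    """z: (M+1, D) reference features, initialized to zero; feats: (B, D)
    features of a batch of labeled source samples; labels: (B,). The momentum
    factor m = 0.9 is an assumed preset value. Classes absent from the batch
    keep their current reference feature."""
    for k in labels.unique():
        batch_mean = feats[labels == k].mean(dim=0)
        z[k] = m * z[k] + (1 - m) * batch_mean
    return z
```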
The optimal transmission problem refers to finding an optimal mapping between two probability distributions that achieves a minimum transmission cost (also referred to as the "optimal transmission distance"). In the task of dividing the target video samples, the optimal transmission problem refers to finding the optimal mapping from the plurality of target video samples 152 to the plurality of reference features when the optimal transmission distance is reached. The typical optimal transmission module 320 may be configured to determine the optimal transmission distance, such as a Wasserstein distance, between each target video sample 152 in the target video sample set 150 and the plurality of reference features $\{z_k\}_{k=1}^{M+1}$. In some embodiments, target features of the target video sample 152 may be extracted and utilized to calculate the optimal transmission distance from the reference features.
In some examples, by determining the optimal transmission distance, an optimal mapping between each target video sample 152 in the target video sample set 150 and the plurality of reference features $\{z_k\}_{k=1}^{M+1}$ may be obtained. The optimal mapping indicates the reference feature $z_k$ to which each target video sample 152 is matched (or mapped). The target video sample 152 and its matching reference feature have the optimal transmission distance between them. In some embodiments, the optimal transmission distance may be determined by the Sinkhorn algorithm to achieve the optimal mapping between each target video sample 152 in the target video sample set 150 and the reference features $\{z_k\}_{k=1}^{M+1}$.
In some embodiments, the typical optimal transmission module 320 may be configured to determine the second similarity between each target video sample 152 and the source video sample set 140 based on the optimal transmission distance determined for that target video sample 152. For example, the second similarity may be determined as, or proportional to, the optimal transmission distance determined for each target video sample 152. In the latter case, the larger the optimal transmission distance, the larger the second similarity that may be determined.
Based on the second similarity of each target video sample 152, the typical optimal transmission module 320 may partition some of the target video samples 152 in the target video sample set 150 into a second known class candidate group (denoted as $\mathcal{D}_k^2$) and partition other target video samples 152 in the target video sample set 150 into a second unknown class candidate group (denoted as $\mathcal{D}_u^2$). In particular, the typical optimal transmission module 320 may be configured to determine, from the plurality of reference features, the one reference feature that matches each target video sample 152, based on the second similarity between each target video sample 152 and the source video sample set 140. That is, each target video sample 152 is assigned a matching reference feature. The second similarity of each target video sample is obtained from the optimal transmission distance between the target video sample and the matching reference feature. The typical optimal transmission module 320 may partition the target video samples 152 that match reference features corresponding to the M known categories into the second known class candidate group $\mathcal{D}_k^2$, and partition target video samples that match the reference feature corresponding to the unknown category into the second unknown class candidate group $\mathcal{D}_u^2$.
For example, assume a target video sample 152 matches the reference feature $z_\xi$ of category $\xi$. If $\xi \le M$ (i.e., category $\xi$ belongs to the M known categories common to the source domain 120 and the target domain 130), the target video sample 152 is partitioned into the second known class candidate group $\mathcal{D}_k^2$ corresponding to the M known categories. If $\xi = M+1$ (i.e., category $\xi$ does not belong to the M known categories), the target video sample 152 is partitioned into the second unknown class candidate group $\mathcal{D}_u^2$ corresponding to the unknown category.
The target video samples 152 partitioned into the second known class candidate group $\mathcal{D}_k^2$ are considered more likely to belong to the M known categories, and the target video samples 152 partitioned into the second unknown class candidate group $\mathcal{D}_u^2$ are considered more likely to belong to the unknown category. Based on the optimal transmission distance, the typical optimal transmission module 320 may classify each target video sample 152 in the target video sample set 150 into $\mathcal{D}_k^2$ or $\mathcal{D}_u^2$.
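The matching step can be illustrated with a basic entropic Sinkhorn routine. The squared Euclidean cost, uniform marginals, regularization weight, and iteration count below are all assumptions made for the sketch.

```python
import torch

def sinkhorn_match(target_feats, z, eps=0.05, n_iters=50):
    """target_feats: (N, D) target sample features; z: (M+1, D) reference
    features. Returns, per target sample, the index of its matched reference
    feature under the (approximately) optimal transport plan."""
    cost = torch.cdist(target_feats, z) ** 2    # squared-Euclidean transport cost
    cost = cost / cost.max()                    # normalize for numerical stability
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)               # uniform marginal over samples
    b = torch.full((m,), 1.0 / m)               # uniform marginal over references
    u = torch.ones(n)
    for _ in range(n_iters):                    # Sinkhorn fixed-point iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # entropic transport plan
    return plan.argmax(dim=1)                   # matched reference per sample
```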
In some embodiments, the target video samples 152 partitioned into the second known class candidate group $\mathcal{D}_k^2$ may be labeled with a "pseudo" known category label, and the target video samples 152 partitioned into the second unknown class candidate group $\mathcal{D}_u^2$ may be labeled with a "pseudo" unknown category label.
Classification discrimination based on double similarity
As mentioned above, the re-weighting module 340 may utilize the results of the partitioning of the set of target video samples 150 based on the first similarity and/or the second similarity to determine the discriminative learning weights 342 for each of the plurality of target video samples 152. The discrimination learning weights are used to influence the determination of the probability that the target video sample 152 belongs or does not belong to an unknown class by the binary discriminator G b.
In general, the partitioning based on the first similarity may give better classification accuracy, while the partitioning based on the second similarity may give high robustness for model generalization. Thus, by combining the first similarity and the second similarity, the classification of the target video sample set 150 may be further enhanced.
In some embodiments, the re-weighting module 340 may determine the discrimination learning weight of a target video sample 152 based on how that target video sample 152 is partitioned among the first known class candidate group $\mathcal{D}_k^1$, the first unknown class candidate group $\mathcal{D}_u^1$, the second known class candidate group $\mathcal{D}_k^2$, and the second unknown class candidate group $\mathcal{D}_u^2$. The discrimination learning weight of each target video sample 152 indicates the degree of attention given to that target video sample 152 in the training of the binary discriminator G_b.
Specifically, for each target video sample 152 in the target video sample set 150, the re-weighting module 340 may determine whether the target video sample 152 is partitioned into a known class candidate group based on both the first similarity and the second similarity (e.g., partitioned into both the first known class candidate group $\mathcal{D}_k^1$ and the second known class candidate group $\mathcal{D}_k^2$), or is partitioned into an unknown class candidate group based on both (e.g., partitioned into both the first unknown class candidate group $\mathcal{D}_u^1$ and the second unknown class candidate group $\mathcal{D}_u^2$). If so, the re-weighting module 340 may assign a larger first discrimination learning weight to the target video sample 152. If the target video sample is partitioned into different ones of the known class candidate group and the unknown class candidate group based on the first similarity and based on the second similarity, the re-weighting module 340 may assign a smaller second discrimination learning weight to the target video sample. That is, the second discrimination learning weight may be smaller than the first discrimination learning weight.
That is, when both the first and second similarities lead to partitioning a certain target video sample 152 into a known class candidate group, or both lead to partitioning it into an unknown class candidate group, the confidence of such a partition is higher, and thus the target video sample 152 may have a higher discrimination learning weight. Otherwise, the target video sample 152 has a relatively low discrimination learning weight. In some examples, the first discrimination learning weight may be determined to be 1, and the second discrimination learning weight may be determined to be a value between 0 and 1 and less than 1, such as 0.5, 0.4, 0.6, etc. Of course, the first discrimination learning weight and the second discrimination learning weight may also be set to other values.
The determination of the discrimination learning weights by the re-weighting module 340 may be expressed as follows:

$$w_a(x)=\begin{cases}1, & \text{if } x\in\mathcal{D}_a^1\cap\mathcal{D}_a^2\\ \delta, & \text{otherwise}\end{cases}\tag{2}$$

where $a\in\{k,u\}$ and $w_a(x)$ represents the discrimination learning weight of the target video sample $x$. According to equation (2), the first discrimination learning weight is determined to be 1, and the second discrimination learning weight is determined to be $\delta$, which may be, for example, 0.5 or another value.
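Equation (2) reduces to set-membership tests over the four candidate groups. A sketch, using the example value δ = 0.5 mentioned above and index tensors for the groups, follows.

```python
import torch

def discrimination_weights(n_samples, known1, unknown1, known2, unknown2, delta=0.5):
    """Equation (2) sketch: weight 1 where both similarity measures agree on
    the grouping (known/known or unknown/unknown), delta everywhere else."""
    def mask(indices):
        m = torch.zeros(n_samples, dtype=torch.bool)
        m[indices] = True
        return m

    agree = (mask(known1) & mask(known2)) | (mask(unknown1) & mask(unknown2))
    weights = torch.full((n_samples,), delta)
    weights[agree] = 1.0
    return weights
```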
In some embodiments, the binary discriminator G_b 350 may be trained based on the respective discrimination learning weights of the target video samples 152, as well as the target video samples 152 themselves. The binary discriminator G_b is used to discriminate whether a video sample belongs to the known categories (i.e., the M known categories) or to the unknown category. Through training, the binary discriminator 350 becomes able to determine the probability that a video sample belongs to the unknown category. The binary discriminator 350 may output the probability that the video sample belongs to a known category and the probability of the unknown category simultaneously. Alternatively, the binary discriminator G_b may output only the probability that the video sample belongs to a known category or to the unknown category, and the other probability may be obtained by subtracting the output probability from 1.
In some embodiments, training of the binary discriminator G_b 350 may be performed by constructing a learning loss. For example, the learning loss of the binary discriminator G_b may be constructed as follows:

$$\mathcal{L}_{G_b}=\sum_{a\in\{k,u\}}\mathbb{E}_{x\sim\hat{\mathcal{D}}_t^{a}}\; w_a(x)\, L_{bce}\left(G_b\big(\hat{F}(x)\big),\ y\right)\tag{3}$$

where $\mathcal{L}_{G_b}$ represents the learning loss of the binary discriminator G_b; $a\in\{k,u\}$, and $\hat{\mathcal{D}}_t^{a}$ represents the distribution of target video samples partitioned into the known class candidate groups ($a=k$) or the unknown class candidate groups ($a=u$); $L_{bce}$ represents the binary cross-entropy loss; $\hat{F}(x)$ represents the features extracted from the target video sample $x$ (represented in fig. 3 as features 302), which are input to the binary discriminator G_b 350 for determining the discrimination result; and $y$ represents the (candidate) category determined for the target video sample $x$.
As can be appreciated from equation (3), for a target video sample with a higher partition confidence (whether partitioned into a known class candidate group or an unknown class candidate group), its discrimination learning weight is set relatively large (e.g., 1), so that the target video sample is more focused on in the training of the binary discriminator G_b and has a greater influence on the learning of the binary discriminator G_b. Conversely, target video samples with lower partition confidence have less impact on the learning of the binary discriminator G_b, which may focus less on such target video samples during training.
In some embodiments, training based on the learning loss $\mathcal{L}_b$ proceeds by iteratively updating the values of the parameter set of the binary discriminator G_b 350 until the learning loss $\mathcal{L}_b$ is minimized or reaches a desired goal, at which point the training is complete. Training may be accomplished using a variety of machine learning or deep learning training techniques; embodiments of the disclosure are not limited in this respect.
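As a minimal sketch of a per-sample weighted loss in the spirit of equation (3) (PyTorch for illustration; the function name, tensor shapes, and the binary label encoding are assumptions, not the disclosed implementation):

```python
import torch.nn.functional as F

def binary_discriminator_loss(logits, pseudo_labels, weights):
    """Weighted binary cross entropy for the binary discriminator G_b.

    logits: (N, 1) raw outputs of G_b for N target video samples.
    pseudo_labels: (N,) 1.0 for the unknown candidate group, 0.0 for known.
    weights: (N,) per-sample discrimination learning weights w_a(x).
    """
    per_sample = F.binary_cross_entropy_with_logits(
        logits.squeeze(-1), pseudo_labels, reduction="none")
    # Samples with consistent partitions (weight 1) dominate the loss.
    return (weights * per_sample).mean()
```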
After training is complete, the trained binary discriminator G_b can determine the respective probabilities that each target video sample 152 in the set of target video samples 150 belongs to the unknown class and/or the respective probabilities that each belongs to a known class.
Similar to the training process and subsequent application of the binary classifier G_c 310, depending on whether the OSD 202 is deployed at the frame level, the temporal level, or the video level, the features of the target video samples x input to the binary discriminator G_b for discrimination measure the similarity between the plurality of target video samples 152 and the set of source video samples 140 at the corresponding level.
It should be appreciated that the above embodiments discuss determining the respective probabilities that the target video samples 152 belong to the unknown class based on the first and second similarities. In some embodiments, only one of the first and second similarities may be considered. In other embodiments, the similarity between the plurality of target video samples 152 and the source video sample set 140 may additionally or alternatively be measured in other ways.
Training of the multi-classification model
After the OSD 202 is able to provide, for each target video sample 152 in the target video sample set 150, a respective probability of belonging to the unknown class, the computing system 112 may train the multi-classification model 110 based on the respective target video samples 152 and the respective probabilities determined for each, to enable adaptation of the multi-classification model 110 from the source domain 120 to the target domain 130. In some embodiments, still referring to the example of Fig. 2, the spatial OSD 202-1, the temporal OSD 202-2, and the video OSD 202-3 are deployed at the frame level, the temporal level, and the video level, respectively, each with processing functions similar to the OSD 202 discussed above with reference to Fig. 3, except that their respective input features differ.
In some embodiments, the training of the multi-classification model 110 may be accomplished using an attention mechanism. The probabilities that target video samples 152 in the target video sample set 150 belong to the unknown class may be used to determine classification learning weights for the corresponding target video samples 152 when training the multi-classification model 110. The classification learning weight of each target video sample 152 indicates the degree of interest of that target video sample 152 in the training of the multi-classification model 110.
In some embodiments, in training the multi-classification model 110, for the video classifier G_cls, the classification learning weight of each target video sample 152 may be assigned based on the probability that the target video sample 152 belongs to the unknown class, such that a target video sample 152 with a smaller probability is assigned a smaller first classification learning weight. That is, the first classification learning weight of each of the plurality of target video samples 152 is positively correlated with the probability that it belongs to the unknown class. In some examples, the first classification learning weight of the target video sample 152 (denoted as x) may be set equal to the probability that it belongs to the unknown class (denoted as σ_u(x)). In other examples, the first classification learning weight may be set to other values, as long as its magnitude is positively correlated with σ_u(x).
The computing system 112 may train the multi-classification model 110 with the plurality of target video samples 152 and their respective first classification learning weights such that the multi-classification model 110 learns characteristics of unknown categories from the plurality of target video samples 152 to fit into the target domain 130. In some embodiments, based on the respective first classification learning weights of the target video samples 152, the computing system 112 may construct a first learning penalty when the multi-classification model 110 classifies the target video samples 152 into unknown categories, and train the multi-classification model 110 based on the first learning penalty. That is, training of the multi-classification model 110 is performed with the target video sample set 150 in the target domain 130 and the "pseudo" unknown class labels labeled for the target video sample set 150. In some examples, the first learning penalty may be constructed as follows:
$$\mathcal{L}_{t,u}=\mathbb{E}_{x\sim\hat{\mathcal{T}}}\left[\sigma_u(x)\,L_{bce}\big(G_{cls}(\tilde{f}(x)),\,\hat{y}_u\big)\right]\qquad(4)$$

where $\mathcal{L}_{t,u}$ represents the first learning loss for the multi-classification model 110 constructed on the basis of the target video sample set 150; $\hat{\mathcal{T}}$ represents the target video sample set 150, indicating that each target video sample 152 therein is labeled with the "pseudo" unknown class; σ_u(x) represents the first classification learning weight of the target video sample x, set equal to the probability that the sample belongs to the unknown class; $L_{bce}$ represents the cross entropy loss; $\tilde{f}(x)$ represents the features extracted from the target video sample x (which may include, for example, frame-level features, temporal-level features, or video-level features); and $\hat{y}_u$ represents the "pseudo" unknown class label.
The first learning loss $\mathcal{L}_{t,u}$ lets the multi-classification model 110 learn from the target video sample set 150 how to classify video samples into the unknown class, with the "pseudo" unknown class labels of the target video samples 152 serving as supervisory information during training. However, since the classification of the target video samples 152 is not entirely confident, the classification learning weights σ_u(x) are used to control the degree of interest of the multi-classification model 110 in the respective target video samples 152 during training. According to equation (4), the multi-classification model 110 may focus more on target video samples that have a greater probability of being classified into the unknown class (i.e., a greater σ_u(x)).
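A minimal PyTorch sketch of the weighted pseudo-label loss in equation (4); the function name is illustrative, and the unknown class is assumed to occupy the last index of the (M+1)-way classifier output:

```python
import torch
import torch.nn.functional as F

def pseudo_unknown_loss(cls_logits, sigma_u):
    """Cross entropy against the 'pseudo' unknown label, weighted by
    sigma_u(x), in the spirit of equation (4).

    cls_logits: (N, M+1) video classifier outputs for N target samples.
    sigma_u: (N,) probability of each sample belonging to the unknown class.
    """
    unknown_index = cls_logits.shape[1] - 1  # assume unknown is last class
    pseudo = torch.full((cls_logits.shape[0],), unknown_index,
                        dtype=torch.long, device=cls_logits.device)
    per_sample = F.cross_entropy(cls_logits, pseudo, reduction="none")
    return (sigma_u * per_sample).mean()
```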
In some embodiments, in training the multi-classification model 110, an attention mechanism may also be utilized to introduce entropy loss (entropy loss) on the target video sample set 150 for training of the multi-classification model 110 in order to further increase the confidence of the classification of the final video classifier 250.
In particular, in constructing the entropy loss, the probability σ_u(x) that each target video sample 152 in the set of target video samples 150 belongs to the unknown class may be used to determine a second classification learning weight for that target video sample 152. A target video sample 152 with a smaller probability σ_u(x) is determined to have a larger second classification learning weight. That is, the second classification learning weight of each of the plurality of target video samples 152 is inversely related to the probability that it belongs to the unknown class. The second classification learning weight of each target video sample 152 indicates the degree of interest of that target video sample 152 in the training of the multi-classification model 110.
In some examples, the second classification learning weight may be determined as the probability that the target video sample 152 belongs to a known class (denoted as 1 - σ_u(x)), which is inversely related to the probability σ_u(x) that the target video sample 152 belongs to the unknown class. That is, if a target video sample 152 is determined to have a smaller probability of belonging to the unknown class, then the probability that the target video sample 152 belongs to a known class is correspondingly greater. In other examples, the second classification learning weight of the target video sample 152 may be set to other values, as long as its magnitude is inversely related to σ_u(x).
The computing system 112 may train the multi-classification model 110 with the plurality of target video samples 152 and their respective second classification learning weights such that the multi-classification model 110 learns characteristics of the plurality of known classes from the plurality of target video samples 152 to adapt to the target domain 130. In some embodiments, based on the respective second classification learning weights of the plurality of target video samples 152, the computing system 112 may construct a first entropy of the classification results of the respective target video samples 152 at the multi-classification model 110. Here, a classification result indicates the respective probabilities, determined by the video classifier 250, that the target video sample 152 belongs to each of the (M+1) categories. In some embodiments, the computing system 112 may also construct a second entropy of the domain discrimination result for the target video sample 152. The domain discrimination result indicates the respective probability, determined by the DD 204, that the target video sample 152 belongs to the source domain 120 or the target domain 130. The computing system 112 may construct a second learning loss based on the respective second classification learning weights, the first entropy, and the second entropy of the target video samples 152, and train the multi-classification model 110 based on the second learning loss.
The second learning loss may be constructed as an entropy loss, which may be determined as follows:

$$\mathcal{L}_{ent}=\mathbb{E}_{x\sim\hat{\mathcal{T}}}\Big[(1-\sigma_u(x))\Big(H\big(G_{cls}(\tilde{f}(x))\big)+H\big(G_{d,v}(\tilde{f}(x))\big)\Big)\Big]\qquad(5)$$

where $\mathcal{L}_{ent}$ represents the second learning loss (also called the entropy loss); $\hat{\mathcal{T}}$ represents the set of target video samples 150, from which the target video sample x under consideration is drawn; $G_{cls}(\tilde{f}(x))$ represents the classification result of the video classifier 250 for the target video sample x, and $H(G_{cls}(\tilde{f}(x)))$ represents the first entropy determined based on that classification result; $G_{d,v}(\tilde{f}(x))$ represents the domain discrimination result of the video-level DD 204-3 for the target video sample x, and $H(G_{d,v}(\tilde{f}(x)))$ represents the second entropy determined based on that domain discrimination result.
By equation (5), during the training of the multi-classification model 110, the multi-classification model 110 may focus more on target video samples that have a greater probability of being classified into a known class (i.e., a greater 1 - σ_u(x)).
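A sketch of the entropy loss in equation (5), again in PyTorch; the helper names are hypothetical, and the video-level domain discriminator is assumed to emit one Bernoulli probability per sample:

```python
import torch
import torch.nn.functional as F

def entropy(probs, eps=1e-8):
    # Shannon entropy of each row of a probability matrix.
    return -(probs * (probs + eps).log()).sum(dim=-1)

def bernoulli_entropy(p, eps=1e-8):
    # Entropy of a scalar Bernoulli probability per sample.
    p = p.clamp(eps, 1.0 - eps)
    return -(p * p.log() + (1.0 - p) * (1.0 - p).log())

def entropy_loss(cls_logits, domain_prob, sigma_u):
    """Equation (5) sketch: first entropy (classification result) plus
    second entropy (video-level domain discrimination result), weighted
    by 1 - sigma_u(x) so likely-known samples are emphasized."""
    h_cls = entropy(F.softmax(cls_logits, dim=-1))  # first entropy
    h_dom = bernoulli_entropy(domain_prob)          # second entropy
    return ((1.0 - sigma_u) * (h_cls + h_dom)).mean()
```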
In some embodiments, in addition to the target video sample set 150, the source video sample set 140 and the labels of the individual source video samples 142 in the source video sample set 140 over the M known categories (e.g., indicated by the classification labels 144) are also used for training the multi-classification model 110. For example, the computing system 112 may construct another learning loss based on each source video sample 142 and its label over the M known categories, as follows:
$$\mathcal{L}_{s}=\mathbb{E}_{x\sim\mathcal{S}}\left[L_{bce}\big(G_{cls}(\tilde{f}(x)),\,y\big)\right]\qquad(6)$$

where $\mathcal{L}_s$ represents the learning loss for the multi-classification model 110 constructed on the basis of the source video sample set 140; $\mathcal{S}$ represents the source video sample set 140; $L_{bce}$ represents the cross entropy loss; $\tilde{f}(x)$ represents features extracted from the source video sample x (which may include, for example, frame-level features, temporal-level features, or video-level features); and y represents the classification label of the source video sample x, which may be one of the M known classes.
In some embodiments, as shown in Fig. 2, a DD 204 for discriminating between the source domain 120 and the target domain 130 of a video sample is also included in the architecture 200 for training the multi-classification model 110. An example structure of the DD 204 is shown in Fig. 4. As shown, the DD 204 may include a gradient reversal layer (GRL) 410 and a domain classifier 420. The input to the DD 204 is a feature of the video sample. The input features may be frame-level features, temporal-level features, or video-level features, depending on the deployment of the DD 204. The GRL 410 is configured to perform gradient reversal during the training of the multi-classification model 110, and the domain classifier 420 is configured to determine the probability that a video sample belongs to the source domain 120 or the target domain 130.
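One common way to realize a gradient reversal layer such as the GRL 410 is a custom PyTorch autograd function, shown below; this is the standard pattern from domain-adversarial training, offered as a sketch rather than the structure actually used in the disclosure:

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by
    -lambd in the backward pass, so the feature extractor is trained to
    confuse the domain classifier."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient; no gradient for lambd itself.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradientReversal.apply(x, lambd)
```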
Fig. 4 illustrates only one example of the DD 204, which may also be configured with other structures that enable discriminating whether a video sample belongs to the source domain 120 or the target domain 130. In some embodiments, during the training of the multi-classification model 110, if the source video sample set 140 and the target video sample set 150 are used together to train the multi-classification model 110, the DD 204 is used to perform domain discrimination on both the source video samples 142 and the target video samples 152.
In some embodiments, the training of the DD 204 is performed in conjunction with the training of the multi-classification model 110. To train the DD 204, the computing system 112 may assign respective domain discrimination weights to the plurality of target video samples 152 based on the probabilities that the plurality of target video samples 152 each belong to the unknown class, such that a target video sample 152 with a smaller probability is assigned a greater domain discrimination weight. That is, the respective domain discrimination weights of the plurality of target video samples are inversely related to their probabilities of belonging to the unknown class. In some examples, the domain discrimination weight may be determined as the probability that the target video sample 152 belongs to a known class (denoted as 1 - σ_u(x)), which is inversely related to the probability σ_u(x) that the target video sample 152 belongs to the unknown class. That is, if a target video sample 152 is determined to have a smaller probability of belonging to the unknown class, then the probability that the target video sample 152 belongs to a known class is correspondingly greater. The probability that the target video sample 152 belongs to a known class may be determined by subtracting the probability of belonging to the unknown class from 1, or may be directly output by the binary discriminator 350 of the OSD 202. In other examples, the domain discrimination weight of the target video sample 152 may be set to other values, as long as its magnitude is inversely related to the probability σ_u(x). The domain discrimination weight of each target video sample 152 indicates the degree of interest of that target video sample 152 in the training of the DD 204.
The computing system 112 may train the DD 204 based on the plurality of target video samples 152 and their respective domain discrimination weights, thereby enabling training of the multi-classification model 110 at the same time. Specifically, the computing system 112 may construct a learning loss (referred to as a "third learning loss") for the DD 204 based on the domain discrimination weights of each of the plurality of target video samples 152, and train the DD 204 based on the third learning loss while training the multi-classification model 110. The computing system 112 may also utilize an attention mechanism to perform the training of the DD 204.
In some embodiments, if there is a spatial DD 204-1 at the frame level, a temporal DD 204-2 at the temporal level, and a video DD 204-3 at the video level, a corresponding third learning penalty may be constructed for each DD 204. In some examples, the third learning penalty may be determined as follows:
$$\mathcal{L}_{d,l}=\mathbb{E}_{x\sim\mathcal{S}\cup\hat{\mathcal{T}}}\left[(1-\sigma_u(x))\,L_{bce}\big(G_{d,l}(\tilde{f}_l(x)),\,d_x\big)\right]\qquad(7)$$

where l ∈ {f, t, v} represents the frame level, the temporal level, and the video level, respectively; $\mathcal{L}_{d,l}$ represents the third learning loss of the DD 204 at the corresponding level l; $\mathcal{S}\cup\hat{\mathcal{T}}$ represents the union of the source video sample set 140 and the target video sample set 150 from which the video sample x to be domain-discriminated comes; $L_{bce}$ represents the cross entropy loss; $G_{d,l}$ denotes the spatial DD 204-1 at the frame level, the temporal DD 204-2 at the temporal level, or the video DD 204-3 at the video level; $\tilde{f}_l(x)$ represents the level-l features of the video sample x; and $d_x$ represents the domain to which the video sample x belongs (i.e., either the source domain 120 or the target domain 130).
In equation (7), 1 - σ_u(x) represents the probability that the source or target video sample x is determined to belong to a known class. Note that since the labels of the source video samples are known and generally all belong to known classes, the probability σ_u(x) that a source video sample belongs to the unknown class need not be determined by the OSD 202 and may be defaulted to 0. According to equation (7), the DD 204 for performing domain discrimination may focus more on target video samples that have a greater probability of being classified into a known class (i.e., a greater 1 - σ_u(x)). In the domain adaptive learning process, the multi-classification model 110 is generally tuned to obtain higher performance on video samples belonging to known classes in the target domain, while avoiding as much as possible the negative impact of samples of the unknown class on domain adaptive learning; the weight 1 - σ_u(x) therefore makes the multi-classification model 110 pay more attention to video samples belonging to known classes during training.
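A minimal PyTorch sketch of the weighted domain discrimination loss of equation (7) at a single level l; the function name, shapes, and the binary domain encoding are assumptions:

```python
import torch.nn.functional as F

def domain_discrimination_loss(domain_logits, domain_labels, sigma_u):
    """Binary cross entropy between the DD output and the true domain
    label, weighted by 1 - sigma_u(x), in the spirit of equation (7).

    domain_logits: (N,) raw DD outputs for source and target samples.
    domain_labels: (N,) 0.0 for the source domain, 1.0 for the target domain.
    sigma_u: (N,) unknown-class probability; defaulted to 0.0 for source
    samples, as noted above.
    """
    per_sample = F.binary_cross_entropy_with_logits(
        domain_logits, domain_labels, reduction="none")
    return ((1.0 - sigma_u) * per_sample).mean()
```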
Learning losses that can be built during the training of the multi-classification model 110, such as those given by equations (4) through (7), are discussed above. In some embodiments, these learning losses may be jointly considered for training the multi-classification model 110. For example, the total learning loss $\mathcal{L}_{cls}$ for the training of the multi-classification model 110 can be expressed as:

$$\mathcal{L}_{cls}=\mathcal{L}_{s}+\lambda_{u}\,\mathcal{L}_{t,u}+\lambda_{ent}\,\mathcal{L}_{ent}+\sum_{l\in\{f,t,v\}}\mathcal{L}_{d,l}\qquad(8)$$

where $\lambda_u$ and $\lambda_{ent}$ are preset values respectively indicating the corresponding weights of the learning losses $\mathcal{L}_{t,u}$ and $\mathcal{L}_{ent}$ in the total learning loss.
In some embodiments, training based on the learning loss $\mathcal{L}_{cls}$ proceeds by iteratively updating the values of the parameter sets of the multi-classification model 110 until the learning loss $\mathcal{L}_{cls}$ is minimized or reaches a desired goal, at which point the training is complete. Training may be accomplished using a variety of machine learning or deep learning training techniques; embodiments of the disclosure are not limited in this respect.
In some embodiments, one or more learning losses in equation (8) may be disregarded. For example, if the DD 204, or the DD 204 at a certain level, is not included in the architecture 200, the corresponding learning loss may be omitted from equation (8).
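The combination in equation (8), with any absent terms dropped, could be expressed as plain Python; the default weight values below are placeholders, not values given in the disclosure:

```python
def total_classification_loss(loss_src, loss_unknown, loss_entropy,
                              domain_losses, lambda_u=1.0, lambda_ent=1.0):
    """Weighted sum of the individual learning losses, per equation (8).

    domain_losses: iterable of per-level DD losses (frame, temporal,
    video); pass an empty list when no DD is deployed.
    """
    return (loss_src
            + lambda_u * loss_unknown
            + lambda_ent * loss_entropy
            + sum(domain_losses))
```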
In some embodiments, the training of some components in the OSD 202 discussed above may be done outside of the training of the multi-classification model 110, so as to provide the probability information required for the training of the multi-classification model 110. In some embodiments, for the training of the components in the DMD 305 in the OSD 202, the learning loss $\mathcal{L}_{dmd}$ can be summarized as follows:

$$\mathcal{L}_{dmd}=\sum_{l\in\{f,t,v\}}\mathcal{L}_{c,l}\qquad(9)$$

where the learning losses $\mathcal{L}_{c,l}$ at the different levels (frame level f, temporal level t, and video level v) for the binary classifier 310 are combined together for training. In some embodiments, the learning loss in the DMD 305 may also take into account the learning loss $\mathcal{L}_s$ for the multi-classification model 110 constructed on the basis of the source video sample set 140.
In some embodiments, the training of the binary discriminator 350 in the OSD 202 is based on the loss function $\mathcal{L}_b$ given in equation (3). Thus, the learning loss $\mathcal{L}_{osd}$ of the OSD 202 as a whole can be expressed as:

$$\mathcal{L}_{osd}=\mathcal{L}_{dmd}+\mathcal{L}_{b}\qquad(10)$$
Example flow
Fig. 5 illustrates a flow chart of a process 500 for domain adaptive learning according to some embodiments of the present disclosure. In some embodiments, the process 500 may be implemented at the computing system 112 discussed above. For ease of discussion, the process 500 is described from the perspective of the computing system 112.
At block 510, the computing system 112 obtains a source video sample set for a source domain and a target video sample set for a target domain. The source video samples in the source video sample set are labeled as belonging to one of a plurality of known categories.
At block 520, the computing system 112 determines probabilities that the plurality of target video samples each belong to an unknown class based on a plurality of similarities between the plurality of target video samples in the set of target video samples and the set of source video samples.
At block 530, the computing system 112 adapts a multi-classification model previously pre-trained with the set of source video samples to the target domain based at least on the plurality of target video samples and their respective probabilities.
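The three blocks of process 500 could be strung together as in the hypothetical driver below; the two callables stand in for the OSD stage and the weighted adaptation stage and are illustrative, not part of the disclosure:

```python
def domain_adaptive_learning(source_set, target_set, model,
                             estimate_unknown_prob, adapt):
    """Mirror of blocks 510-530: obtain samples, estimate unknown-class
    probabilities, then adapt the pre-trained multi-classification model."""
    # Block 510: labeled source samples and unlabeled target samples given.
    # Block 520: per-target-sample probability of the unknown class, from
    # similarities between the target samples and the source sample set.
    sigma_u = estimate_unknown_prob(source_set, target_set)
    # Block 530: adapt the model using target samples re-weighted by the
    # probabilities (and, optionally, the source set as well).
    return adapt(model, source_set, target_set, sigma_u)
```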
In some embodiments, the plurality of similarities is determined by at least one of: determining probabilities that the plurality of target video samples each belong to the plurality of known categories using a binary classifier to obtain a first similarity between each target video sample and the source video sample set, the binary classifier being pre-trained based on the source video sample set; and determining an optimal transmission distance between each target video sample and one of a plurality of reference features to obtain a second similarity between each target video sample and the source video sample set, the plurality of reference features corresponding to the plurality of known categories and the unknown category and being determined based on the source video sample set.
In some embodiments, determining the probability that each of the plurality of target video samples belongs to an unknown class comprises: dividing the plurality of target video samples into a known category candidate group corresponding to the plurality of known categories and an unknown category candidate group corresponding to the unknown category based on at least one of the first similarity and the second similarity; and determining a probability that each of the plurality of target video samples belongs to an unknown class based on the result of the partitioning.
In some embodiments, partitioning includes: partitioning the target video samples having a first similarity exceeding a first threshold into a first set of known category candidates; and classifying the target video samples having a first similarity below a second threshold into a first unknown class candidate set, wherein the second threshold does not exceed the first threshold.
In some embodiments, partitioning includes: determining a reference feature matching each target video sample from a plurality of reference features, wherein a second similarity of each target video sample is obtained by an optimal transmission distance between the target video sample and the matching reference feature; partitioning target video samples matching reference features corresponding to a plurality of known categories into a second known category candidate set; and dividing the target video samples matched with the reference features corresponding to the unknown categories into a second unknown category candidate set.
In some embodiments, determining the probability that each of the plurality of target video samples belongs to an unknown class based on the result of the partitioning comprises: assigning respective discrimination learning weights to the plurality of target video samples based on the result of the partitioning, wherein the discrimination learning weight of each target video sample indicates a degree of interest of that target video sample in training of a binary discriminator; training the binary discriminator based on the plurality of target video samples and their respective discrimination learning weights; and determining a probability that each of the plurality of target video samples belongs to an unknown class using the trained binary discriminator.
In some embodiments, assigning respective discrimination learning weights to the plurality of target video samples comprises: for each target video sample, if the target video sample is partitioned into a known class candidate group or into an unknown class candidate group based on both the first similarity and the second similarity, assigning a first discrimination learning weight to the target video sample; and if the target video sample is partitioned into different ones of the known class candidate group and the unknown class candidate group based on the first similarity and based on the second similarity, assigning a second discrimination learning weight to the target video sample. The second discrimination learning weight is less than the first discrimination learning weight.
In some embodiments, adapting the multi-classification model to the target domain includes: assigning respective first classification learning weights to the plurality of target video samples based on respective probabilities of the plurality of target video samples, the respective first classification learning weights of the plurality of target video samples being positively correlated with the respective probabilities, wherein the first classification learning weight of each target video sample indicates a degree of interest of the target video sample in training of the multi-classification model; and training the multi-classification model based on the plurality of target video samples and their respective first classification learning weights such that the multi-classification model learns characteristics of the unknown class from the plurality of target video samples to adapt to the target domain.
In some embodiments, adapting the multi-classification model to the target domain includes: assigning respective second classification learning weights to the plurality of target video samples based on respective probabilities of the plurality of target video samples, the respective second classification learning weights of the plurality of target video samples being inversely related to the respective probabilities, wherein the second classification learning weight of each target video sample indicates a degree of interest of the target video sample in training of the multi-classification model; and training the multi-classification model based on the plurality of target video samples and their respective second classification learning weights such that the multi-classification model learns characteristics of the plurality of known classes from the plurality of target video samples to adapt to the target domain.
In some embodiments, adapting the multi-classification model to the target domain includes: assigning respective domain discrimination weights to the plurality of target video samples based on respective probabilities of the plurality of target video samples, the respective domain discrimination weights of the plurality of target video samples being inversely related to the respective probabilities, wherein the domain discrimination weight of each target video sample indicates a degree of interest of that target video sample in training of a domain discriminator configured to discriminate whether a target video sample belongs to the source domain or the target domain; and training the multi-classification model to adapt to the target domain while training the domain discriminator based on the plurality of target video samples and their respective domain discrimination weights.
In some embodiments, the adaptation of the multi-classification model to the target domain is also based on the source video sample set.
In some embodiments, determining the probability that each of the plurality of target video samples belongs to an unknown class comprises: extracting target features of each of the plurality of target video samples at at least one of a frame level, a temporal level across a plurality of frames, and a video level of the plurality of target video samples; and determining a probability that each of the plurality of target video samples belongs to the unknown class at the at least one level based on the target features.
In some embodiments, adapting the multi-classification model to the target domain includes: adapting the multi-classification model to the target domain based on the plurality of target video samples and their probabilities at the at least one level.
Example apparatus
Fig. 6 illustrates a block diagram of an apparatus 600 for domain adaptive learning, according to some embodiments of the disclosure. The apparatus 600 may be implemented as or included in the computing system 112.
As shown, the apparatus 600 includes an acquisition module 610 configured to acquire a source video sample set of a source domain and a target video sample set of a target domain. The source video samples in the source video sample set are labeled as belonging to one of a plurality of known categories. The apparatus 600 further comprises a probability determination module 620 configured to determine a probability that each of the plurality of target video samples belongs to an unknown class based on a plurality of similarities between the plurality of target video samples in the set of target video samples and the set of source video samples. The apparatus 600 further comprises a model adaptation module 630 configured to adapt a multi-classification model previously pre-trained with the set of source video samples to the target domain based at least on the plurality of target video samples and their respective probabilities.
In some embodiments, the apparatus 600 further comprises a similarity determination module configured to determine the plurality of similarities by at least one of: determining probabilities that the plurality of target video samples each belong to the plurality of known categories using a binary classifier to obtain a first similarity between each target video sample and the source video sample set, the binary classifier being pre-trained based on the source video sample set; and determining an optimal transmission distance between each target video sample and one of a plurality of reference features to obtain a second similarity between each target video sample and the source video sample set, the plurality of reference features corresponding to the plurality of known categories and the unknown category and being determined based on the source video sample set.
In some embodiments, the probability determination module 620 includes: a candidate partitioning module configured to partition the plurality of target video samples into a known category candidate group corresponding to the plurality of known categories and an unknown category candidate group corresponding to the unknown category based on at least one of the first similarity and the second similarity; and a division-based probability determination module configured to determine probabilities that the plurality of target video samples each belong to an unknown class based on a result of the division.
In some embodiments, the candidate partitioning module comprises: a first known group partitioning module configured to partition target video samples having a first similarity exceeding a first threshold into a first known category candidate group; and a first unknown group partitioning module configured to partition target video samples having a first similarity below a second threshold into a first unknown class candidate group, wherein the second threshold does not exceed the first threshold.
In some embodiments, the candidate partitioning module comprises: a matching feature determination module configured to determine a reference feature matching each target video sample from a plurality of reference features, wherein a second similarity of each target video sample is obtained by an optimal transmission distance between the target video sample and the matching reference feature; a second known group partitioning module configured to partition target video samples that match reference features corresponding to a plurality of known categories into a second known category candidate group; and a second unknown group partitioning module configured to partition target video samples that match reference features corresponding to the unknown categories into second unknown category candidate groups.
In some embodiments, the partition-based probability determination module includes: a discrimination learning weight determination module configured to assign respective discrimination learning weights to a plurality of target video samples based on a result of the partitioning, wherein the discrimination learning weight of each target video sample indicates a degree of interest of that target video sample in training of the binary discriminator; a discriminator training module configured to train a binary discriminator based on the plurality of target video samples and their respective discriminating learning weights; and a discriminator-based probability determination module configured to determine probabilities that the plurality of target video samples each belong to an unknown class using the trained binary discriminator.
In some embodiments, the discrimination learning weight determination module includes, for each target video sample: a first assignment module configured to assign a first discrimination learning weight to the target video sample if the target video sample is partitioned into a known class candidate group or into an unknown class candidate group based on both the first similarity and the second similarity; and a second assignment module configured to assign a second discrimination learning weight to the target video sample if the target video sample is partitioned into different ones of the known class candidate group and the unknown class candidate group based on the first similarity and based on the second similarity. The second discrimination learning weight is less than the first discrimination learning weight.
In some embodiments, the model adaptation module 630 includes: a first classification learning weight assignment module configured to assign respective first classification learning weights to a plurality of target video samples based on respective probabilities of the plurality of target video samples, the respective first classification learning weights of the plurality of target video samples being positively correlated with the respective probabilities, wherein the first classification learning weights of each target video sample are indicative of a degree of interest of the target video sample in training of a multi-classification model; and a first training module configured to train the multi-classification model based on the plurality of target video samples and their respective first classification learning weights such that the multi-classification model learns characteristics of unknown categories from the plurality of target video samples to fit into the target domain.
In some embodiments, the model adaptation module 630 includes: a second classification learning weight assignment module configured to assign respective second classification learning weights to the plurality of target video samples based on respective probabilities of the plurality of target video samples, the respective second classification learning weights of the plurality of target video samples being inversely related to the respective probabilities, wherein the second classification learning weight of each target video sample indicates a degree of interest of the target video sample in training of the multi-classification model; and a second training module configured to train the multi-classification model based on the plurality of target video samples and their respective second classification learning weights such that the multi-classification model learns characteristics of the plurality of known classes from the plurality of target video samples to adapt to the target domain.
In some embodiments, the model adaptation module 630 includes: a domain discrimination weight assignment module configured to assign respective domain discrimination weights to the plurality of target video samples based on respective probabilities of the plurality of target video samples, the respective domain discrimination weights being inversely related to the respective probabilities, wherein the domain discrimination weight of each target video sample indicates a degree of interest of the target video sample in training of a domain discriminator configured to discriminate whether a target video sample belongs to the source domain or the target domain; and a third training module configured to train the multi-classification model to adapt to the target domain while training the domain discriminator based on the plurality of target video samples and their respective domain discrimination weights.
In some embodiments, the adaptation of the multi-classification model to the target domain is also based on the source video sample set.
In some embodiments, the probability determination module 620 includes: a per-level feature extraction module configured to extract target features of each of the plurality of target video samples at at least one of a frame level, a temporal level across a plurality of frames, and a video level of the plurality of target video samples; and a per-level probability determination module configured to determine a probability that each of the plurality of target video samples belongs to an unknown class at the at least one level based on the target features.
In some embodiments, the model adaptation module 630 includes: an adaptation module based on progressive probabilities is configured to adapt the multi-classification model to the target domain based on the plurality of target video samples and their probabilities at least one level.
Example device
Fig. 7 illustrates a block diagram of a computing device/server 700 in which one or more embodiments of the disclosure may be implemented. It should be understood that the computing device/server 700 illustrated in Fig. 7 is merely exemplary and should not be taken as limiting the functionality and scope of the embodiments described herein. The computing system 112 of Fig. 1 or the apparatus 600 of Fig. 6 may be implemented as or included in the computing device/server 700.
As shown in fig. 7, computing device/server 700 is in the form of a general purpose computing device. Components of computing device/server 700 may include, but are not limited to, one or more processors or processing units 710, memory 720, storage 730, one or more communication units 740, one or more input devices 750, and one or more output devices 760. The processing unit 710 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 720. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to increase the parallel processing capabilities of computing device/server 700.
Computing device/server 700 typically includes a number of computer storage media. Such media may be any available media that are accessible by computing device/server 700, including, but not limited to, volatile and non-volatile media, removable and non-removable media. The memory 720 may be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 730 may be a removable or non-removable medium and may include machine-readable media such as flash drives, magnetic disks, or any other media that can store information and/or data (e.g., training data) and that can be accessed within computing device/server 700.
Computing device/server 700 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in fig. 7, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data medium interfaces. Memory 720 may include a computer program product 725 having one or more program modules configured to perform the various methods or acts of the various embodiments of the disclosure.
Communication unit 740 enables communication with other computing devices via a communication medium. Additionally, the functionality of the components of computing device/server 700 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over a communication connection. Accordingly, computing device/server 700 may operate in a networked environment using logical connections to one or more other servers, a network Personal Computer (PC), or another network node.
The input device 750 may be one or more input devices such as a mouse, keyboard, trackball, etc. The output device 760 may be one or more output devices such as a display, speakers, printer, etc. Computing device/server 700 may also communicate with one or more external devices (not shown), such as storage devices, display devices, etc., as needed through communication unit 740, with one or more devices that enable users to interact with computing device/server 700, or with any device (e.g., network card, modem, etc.) that enables computing device/server 700 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions or programs are stored, wherein the computer-executable instructions or programs are executed by a processor to implement the methods or functions described above. The computer-readable storage medium may include a non-transitory computer-readable medium. According to an exemplary implementation of the present disclosure, there is also provided a computer program product comprising a computer program/instruction which, when executed by a processor, implements the method or function described above. The computer program product may be tangibly embodied on a non-transitory computer-readable medium.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices, and computer program products implemented according to the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of implementations of the present disclosure has been provided for illustrative purposes, is not exhaustive, and is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various implementations described. The terminology used herein was chosen in order to best explain the principles of each implementation, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand each implementation disclosed herein.

Claims (26)

1. A method for domain adaptive learning, comprising:
obtaining a source video sample set of a source domain and a target video sample set of a target domain, the source video samples in the source video sample set being labeled as belonging to one of a plurality of known categories;
Determining a probability that each of the plurality of target video samples belongs to an unknown class based on a plurality of similarities between the plurality of target video samples in the set of target video samples and the set of source video samples; and
Adapting a multi-classification model previously pre-trained with the set of source video samples to the target domain based at least on the plurality of target video samples and their respective probabilities;
wherein the plurality of similarities is determined by at least one of:
Determining probabilities that the plurality of target video samples each belong to the plurality of known categories using a binary classifier to obtain a first similarity between each target video sample and the set of source video samples, the binary classifier being pre-trained based on the set of source video samples; and
determining an optimal transmission distance between each target video sample and one of a plurality of reference features to obtain a second similarity between each target video sample and the set of source video samples, the plurality of reference features corresponding to the plurality of known categories and the unknown category and being determined based on the set of source video samples.
2. The method of claim 1, wherein determining a probability that each of the plurality of target video samples belongs to the unknown class comprises:
Dividing the plurality of target video samples into a known category candidate group corresponding to the plurality of known categories and an unknown category candidate group corresponding to the unknown category based on at least one of the first similarity and the second similarity; and
Based on the results of the partitioning, a probability that each of the plurality of target video samples belongs to the unknown class is determined.
3. The method of claim 2, wherein the partitioning comprises:
partitioning the target video samples having the first similarity exceeding a first threshold into a first known category candidate group; and
partitioning the target video samples having the first similarity below a second threshold into a first unknown category candidate group, wherein the second threshold does not exceed the first threshold.
4. The method of claim 2, wherein the partitioning comprises:
Determining a reference feature matching each target video sample from the plurality of reference features, wherein the second similarity for each target video sample is obtained by an optimal transmission distance between the target video sample and the matching reference feature;
partitioning target video samples matching reference features corresponding to the plurality of known categories into a second known category candidate group; and
partitioning target video samples matching the reference feature corresponding to the unknown category into a second unknown category candidate group.
5. The method of claim 2, wherein determining a probability that each of the plurality of target video samples belongs to the unknown class based on a result of the partitioning comprises:
assigning respective discrimination learning weights to the plurality of target video samples based on the result of the dividing, wherein the discrimination learning weight of each target video sample indicates a degree of interest of that target video sample in training of a binary discriminator;
training the binary discriminator based on the plurality of target video samples and their respective discrimination learning weights; and
Determining a probability that each of the plurality of target video samples belongs to the unknown class using the trained binary discriminator.
6. The method of claim 5, wherein assigning respective discrimination learning weights to the plurality of target video samples comprises: for each of the target video samples,
assigning a first discrimination learning weight to the target video sample if the target video sample is partitioned into the known category candidate group or into the unknown category candidate group based on both the first similarity and the second similarity; and
assigning a second discrimination learning weight to the target video sample if the target video sample is partitioned into different ones of the known category candidate group and the unknown category candidate group based on the first similarity and based on the second similarity,
wherein the second discrimination learning weight is less than the first discrimination learning weight.
7. The method of claim 1, wherein adapting the multi-classification model to the target domain comprises:
Assigning respective first classification learning weights to the plurality of target video samples based on the probabilities of the respective plurality of target video samples, the respective first classification learning weights of the plurality of target video samples being positively correlated with the respective probabilities, wherein the first classification learning weights of each target video sample are indicative of a degree of interest of the target video sample in training of the multi-classification model; and
The multi-classification model is trained based on the plurality of target video samples and their respective first classification learning weights such that the multi-classification model learns characteristics of the unknown class from the plurality of target video samples to adapt to the target domain.
8. The method of claim 1, wherein adapting the multi-classification model to the target domain comprises:
assigning respective second classification learning weights to the plurality of target video samples based on the probabilities of the respective plurality of target video samples, the respective second classification learning weights of the plurality of target video samples being inversely related to the respective probabilities, wherein the second classification learning weight of each target video sample is indicative of a degree of interest of the target video sample in training of the multi-classification model; and
The multi-classification model is trained based on the plurality of target video samples and their respective second classification learning weights such that the multi-classification model learns characteristics of the plurality of known classes from the plurality of target video samples to adapt to the target domain.
9. The method of claim 1, wherein adapting the multi-classification model to the target domain comprises:
assigning respective domain discrimination weights to the plurality of target video samples based on the probabilities of the respective plurality of target video samples, the respective domain discrimination weights being inversely related to the respective probabilities, wherein the domain discrimination weight of each target video sample is indicative of a degree of interest of that target video sample in training of a domain discriminator configured to discriminate whether a target video sample belongs to the source domain or the target domain; and
The multi-classification model is trained to adapt to the target domain while the domain discriminator is trained based on the plurality of target video samples and their respective domain discrimination weights.
10. The method of claim 1, wherein the adaptation of the multi-classification model to the target domain is further based on the source video sample set.
11. The method of any of claims 1-10, wherein determining a probability that each of the plurality of target video samples belongs to the unknown class comprises:
extracting respective target features of the plurality of target video samples at at least one of a frame level, a temporal level across a plurality of frames, and a video level of the plurality of target video samples; and
A probability that each of the plurality of target video samples belongs to the unknown class at the at least one level is determined based on the target features.
12. The method of claim 11, wherein adapting the multi-classification model to the target domain comprises:
The multi-classification model is adapted to the target domain based on the plurality of target video samples and the probabilities thereof at the at least one level.
13. An apparatus for domain adaptive learning, comprising:
An acquisition module configured to acquire a source video sample set of a source domain and a target video sample set of a target domain, the source video samples in the source video sample set being labeled as belonging to one of a plurality of known categories;
A probability determination module configured to determine probabilities that the plurality of target video samples each belong to an unknown class based on a plurality of similarities between the plurality of target video samples in the target video sample set and the source video sample set; and
A model adaptation module configured to adapt a multi-classification model previously pre-trained with the set of source video samples to the target domain based at least on the plurality of target video samples and their respective probabilities;
Wherein the apparatus further comprises a similarity determination module configured to determine the plurality of similarities by at least one of: determining probabilities that the plurality of target video samples each belong to the plurality of known categories using a binary classifier to obtain a first similarity between each target video sample and the set of source video samples, the binary classifier being pre-trained based on the set of source video samples; and determining an optimal transport distance between each target video sample and one of a plurality of reference features, the reference features corresponding to the plurality of known categories and the unknown category and being determined based on the source video sample set, to obtain a second similarity between each target video sample and the source video sample set.
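The two similarity measures in claim 13 can be prototyped as follows. The sklearn-style classifier interface and the entropy-regularized Sinkhorn iteration standing in for "optimal transport distance" are both assumptions, and the second similarity is shown as a simple decreasing function of that distance:

```python
import numpy as np

def first_similarity(binary_classifier, target_feat):
    """Probability that a target sample looks like the (known-class) source set,
    from a binary classifier pre-trained on the source video samples."""
    return binary_classifier.predict_proba(target_feat[None, :])[0, 1]

def sinkhorn_distance(x, y, eps=0.05, n_iter=200):
    """Entropy-regularized optimal transport distance between two feature sets
    (e.g. a target video's frame features x and a reference feature set y),
    with uniform marginals assumed."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)   # (n, m)
    cost = cost / (cost.max() + 1e-9)         # normalize for numerical stability
    K = np.exp(-cost / eps)
    a, b = np.full(len(x), 1 / len(x)), np.full(len(y), 1 / len(y))
    u = np.ones(len(x))
    for _ in range(n_iter):                   # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]        # transport plan
    return float((plan * cost).sum())

def second_similarity(target_frames, reference_feats):
    """Similarity to the source set via the nearest reference feature set."""
    d = min(sinkhorn_distance(target_frames, r) for r in reference_feats)
    return 1.0 / (1.0 + d)
```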
14. The apparatus of claim 13, wherein the probability determination module comprises:
A candidate partitioning module configured to partition the plurality of target video samples into a known category candidate group corresponding to the plurality of known categories and an unknown category candidate group corresponding to the unknown category based on at least one of the first similarity and the second similarity; and
A partition-based probability determination module configured to determine a probability that each of the plurality of target video samples belongs to the unknown class based on a result of the partitioning.
15. The apparatus of claim 14, wherein the candidate partitioning module comprises:
A first known group partitioning module configured to partition target video samples for which the first similarity exceeds a first threshold into a first known category candidate group; and
A first unknown group partitioning module configured to partition target video samples for which the first similarity is below a second threshold into a first unknown category candidate group, wherein the second threshold does not exceed the first threshold.
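A small sketch of the dual-threshold rule in claim 15. Samples whose first similarity falls between the two thresholds are simply assigned to neither group under this criterion, an assumption the claim leaves open:

```python
def partition_by_thresholds(first_sims, t_high, t_low):
    """Claim 15: above t_high -> known-category candidates; below t_low ->
    unknown-category candidates; t_low must not exceed t_high."""
    assert t_low <= t_high
    known = [i for i, s in enumerate(first_sims) if s > t_high]
    unknown = [i for i, s in enumerate(first_sims) if s < t_low]
    return known, unknown
```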
16. The apparatus of claim 14, wherein the candidate partitioning module comprises:
A matching feature determination module configured to determine, from the plurality of reference features, a reference feature matching each target video sample, wherein the second similarity of each target video sample is obtained from the optimal transport distance between the target video sample and the matching reference feature;
A second known group partitioning module configured to partition target video samples matching reference features corresponding to the plurality of known categories into a second known category candidate group; and
A second unknown group partitioning module configured to partition target video samples matching the reference feature corresponding to the unknown category into a second unknown category candidate group.
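Claim 16's matching-based split, sketched with a distance matrix whose last column is assumed to hold the distances to the unknown-category reference feature:

```python
import numpy as np

def partition_by_matching(ot_dists):
    """Claim 16: match each sample to its nearest reference feature by optimal
    transport distance. ot_dists has shape (n_samples, n_known + 1); the final
    column corresponds to the unknown-category reference (an assumed layout)."""
    matches = ot_dists.argmin(axis=1)
    unknown_idx = ot_dists.shape[1] - 1
    known = np.flatnonzero(matches != unknown_idx).tolist()
    unknown = np.flatnonzero(matches == unknown_idx).tolist()
    return known, unknown
```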
17. The apparatus of claim 14, wherein the partition-based probability determination module comprises:
A discrimination learning weight determination module configured to assign respective discrimination learning weights to the plurality of target video samples based on a result of the partitioning, wherein the discrimination learning weight of each target video sample indicates a degree of interest of that target video sample in training of a binary discriminator;
A discriminator training module configured to train the binary discriminator based on the plurality of target video samples and their respective discrimination learning weights; and
A discriminator-based probability determination module configured to determine probabilities that the plurality of target video samples each belong to the unknown class using the trained binary discriminator.
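A hedged PyTorch sketch of claim 17: the binary discriminator is trained on the candidate-group labels with per-sample discrimination learning weights, and its sigmoid output is then read as each sample's unknown-class probability. The loop length and label convention are illustrative:

```python
import torch
import torch.nn.functional as F

def train_binary_discriminator(disc, optimizer, feats, group_labels, weights, steps=100):
    """group_labels: 1 for unknown-category candidates, 0 for known-category
    candidates; weights: discrimination learning weights per sample."""
    for _ in range(steps):
        logits = disc(feats).squeeze(-1)
        per_sample = F.binary_cross_entropy_with_logits(
            logits, group_labels.float(), reduction="none")
        loss = (weights * per_sample).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return torch.sigmoid(disc(feats)).squeeze(-1)   # p(unknown) per sample
```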
18. The apparatus of claim 17, wherein the discrimination learning weight determination module comprises, for each target video sample:
A first assigning module configured to assign a first discrimination learning weight to the target video sample if the target video sample is partitioned into the known category candidate group, or into the unknown category candidate group, based on both the first similarity and the second similarity; and
A second assigning module configured to assign a second discrimination learning weight to the target video sample if the target video sample is partitioned into different ones of the known category candidate group and the unknown category candidate group based on the first similarity and the second similarity,
Wherein the second discrimination learning weight is less than the first discrimination learning weight.
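Claim 18 in two lines: weight each sample by whether the two similarity criteria agree on its candidate group. The concrete values 1.0 and 0.5 are placeholders; the claim only requires the second weight to be smaller:

```python
def discrimination_learning_weights(known_by_first, known_by_second,
                                    w_agree=1.0, w_disagree=0.5):
    """known_by_first / known_by_second: booleans per sample, True when the
    first/second similarity put the sample in the known-category candidate group."""
    return [w_agree if a == b else w_disagree
            for a, b in zip(known_by_first, known_by_second)]
```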
19. The apparatus of claim 14, wherein the model adaptation module comprises:
A first classification learning weight assignment module configured to assign respective first classification learning weights to the plurality of target video samples based on the probabilities of the respective plurality of target video samples, the respective first classification learning weights of the plurality of target video samples being positively correlated with the respective probabilities, wherein the first classification learning weight of each target video sample is indicative of a degree of interest of that target video sample in training of the multi-classification model; and
A first training module configured to train the multi-classification model based on the plurality of target video samples and their respective first classification learning weights such that the multi-classification model learns characteristics of the unknown class from the plurality of target video samples to adapt to the target domain.
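Claim 19 is the mirror image of the known-class step sketched earlier: here the weights are the unknown-class probabilities themselves, and every target sample is pulled toward an extra "unknown" output. The extra-output index is an assumption about how the multi-classification model represents the unknown class:

```python
import torch
import torch.nn.functional as F

def unknown_class_loss(model, target_feats, p_unknown, unknown_idx):
    """Weighted loss in the spirit of claim 19: high-p_unknown samples dominate,
    so the model learns the unknown class's characteristics from the target domain."""
    logits = model(target_feats)
    labels = torch.full((target_feats.size(0),), unknown_idx,
                        dtype=torch.long, device=target_feats.device)
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (p_unknown * per_sample).mean()     # positively correlated weights
```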
20. The apparatus of claim 14, wherein the model adaptation module comprises:
A second classification learning weight assignment module configured to assign respective second classification learning weights to the plurality of target video samples based on the probabilities of the respective plurality of target video samples, the respective second classification learning weights of the plurality of target video samples being negatively correlated with the respective probabilities, wherein the second classification learning weight of each target video sample is indicative of a degree of interest of that target video sample in training of the multi-classification model; and
A second training module configured to train the multi-classification model based on the plurality of target video samples and their respective second classification learning weights such that the multi-classification model learns characteristics of the plurality of known classes from the plurality of target video samples to adapt to the target domain.
21. The apparatus of claim 14, wherein the model adaptation module comprises:
A domain discrimination weight assignment module configured to assign respective domain discrimination weights to the plurality of target video samples based on the probabilities of the respective plurality of target video samples, the respective domain discrimination weights being positively correlated with the respective probabilities, wherein the domain discrimination weights of each target video sample are indicative of a degree of interest of that target video sample in training of a domain discriminator configured to discriminate whether a target video sample belongs to the source domain or the target domain; and
A third training module configured to train the multi-classification model to adapt to the target domain while training the domain discriminator based on the plurality of target video samples and their respective domain discrimination weights.
22. The apparatus of claim 14, wherein the adaptation of the multi-classification model to the target domain is further based on the source video sample set.
23. The apparatus of any of claims 14 to 22, wherein the probability determination module comprises:
A per-level feature extraction module configured to extract target features of each of the plurality of target video samples at at least one of a frame level, a temporal level across a plurality of frames, and a video level of the plurality of target video samples; and
A per-level probability determination module configured to determine, based on the target features, a probability that each of the plurality of target video samples belongs to the unknown class at the at least one level.
24. The apparatus of claim 23, wherein the model adaptation module comprises:
A per-level-probability-based adaptation module configured to adapt the multi-classification model to the target domain based on the plurality of target video samples and their probabilities at the at least one level.
25. An electronic device, comprising:
a processor; and
A memory storing computer-executable instructions that, when executed by the processor, implement the method of any one of claims 1 to 12.
26. A computer-readable storage medium having computer-executable instructions stored thereon, wherein the computer-executable instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 12.
CN202110162210.0A 2021-02-05 2021-02-05 Method, apparatus, device, medium, and article for domain adaptive learning Active CN112836753B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110162210.0A CN112836753B (en) 2021-02-05 2021-02-05 Method, apparatus, device, medium, and article for domain adaptive learning
PCT/CN2022/072612 WO2022166578A1 (en) 2021-02-05 2022-01-18 Method and apparatus for domain adaptation learning, and device, medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110162210.0A CN112836753B (en) 2021-02-05 2021-02-05 Method, apparatus, device, medium, and article for domain adaptive learning

Publications (2)

Publication Number Publication Date
CN112836753A (en) 2021-05-25
CN112836753B (en) 2024-06-18

Family

ID=75932488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110162210.0A Active CN112836753B (en) 2021-02-05 2021-02-05 Method, apparatus, device, medium, and article for domain adaptive learning

Country Status (2)

Country Link
CN (1) CN112836753B (en)
WO (1) WO2022166578A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836753B (en) * 2021-02-05 2024-06-18 北京嘀嘀无限科技发展有限公司 Method, apparatus, device, medium, and article for domain adaptive learning
US20220405634A1 (en) * 2021-06-16 2022-12-22 Moxa Inc. Device of Handling Domain-Agnostic Meta-Learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304876A (en) * 2018-01-31 2018-07-20 国信优易数据有限公司 Disaggregated model training method, device and sorting technique and device
CN110750665A (en) * 2019-10-12 2020-02-04 南京邮电大学 Open set domain adaptation method and system based on entropy minimization

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160078359A1 (en) * 2014-09-12 2016-03-17 Xerox Corporation System for domain adaptation with a domain-specific class means classifier
US20160253597A1 (en) * 2015-02-27 2016-09-01 Xerox Corporation Content-aware domain adaptation for cross-domain classification
CN108197643B (en) * 2017-12-27 2021-11-30 佛山科学技术学院 Transfer learning method based on unsupervised clustering and metric learning
CN111724083B (en) * 2020-07-21 2023-10-13 腾讯科技(深圳)有限公司 Training method and device for financial risk identification model, computer equipment and medium
CN112116025A (en) * 2020-09-28 2020-12-22 北京嘀嘀无限科技发展有限公司 User classification model training method and device, electronic equipment and storage medium
CN112836753B (en) * 2021-02-05 2024-06-18 北京嘀嘀无限科技发展有限公司 Method, apparatus, device, medium, and article for domain adaptive learning

Also Published As

Publication number Publication date
CN112836753A (en) 2021-05-25
WO2022166578A1 (en) 2022-08-11

Similar Documents

Publication Publication Date Title
Mancini et al. Best sources forward: domain generalization through source-specific nets
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN111967294B (en) Unsupervised domain self-adaptive pedestrian re-identification method
CN110321926B (en) Migration method and system based on depth residual error correction network
Rizve et al. Openldn: Learning to discover novel classes for open-world semi-supervised learning
US7724961B2 (en) Method for classifying data using an analytic manifold
US8108324B2 (en) Forward feature selection for support vector machines
Vijayanarasimhan et al. Multi-level active prediction of useful image annotations for recognition
US20220076074A1 (en) Multi-source domain adaptation with mutual learning
WO2022166578A1 (en) Method and apparatus for domain adaptation learning, and device, medium and product
JP6897749B2 (en) Learning methods, learning systems, and learning programs
CN103473556B (en) Hierarchical SVM sorting technique based on rejection subspace
CN113128478B (en) Model training method, pedestrian analysis method, device, equipment and storage medium
CN116910571B (en) Open-domain adaptation method and system based on prototype comparison learning
CN111582371A (en) Training method, device, equipment and storage medium for image classification network
CN116777006A (en) Sample missing label enhancement-based multi-label learning method, device and equipment
CN117153268A (en) Cell category determining method and system
Gholami et al. Task-discriminative domain alignment for unsupervised domain adaptation
CN116451111A (en) Robust cross-domain self-adaptive classification method based on denoising contrast learning
Wang et al. Refining pseudo labels for unsupervised domain adaptive re-identification
US11829442B2 (en) Methods and systems for efficient batch active learning of a deep neural network
CN107993311B (en) Cost-sensitive latent semantic regression method for semi-supervised face recognition access control system
CN109101984B (en) Image identification method and device based on convolutional neural network
CN112766423B (en) Training method and device for face recognition model, computer equipment and storage medium
CN113887357A (en) Face representation attack detection method, system, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant