WO2021097774A1 - Systems and methods for multi-source domain adaptation for semantic segmentation - Google Patents

Systems and methods for multi-source domain adaptation for semantic segmentation

Info

Publication number
WO2021097774A1
Authority
WO
WIPO (PCT)
Prior art keywords
domain
domains
source
dynamic
images
Prior art date
Application number
PCT/CN2019/120053
Other languages
French (fr)
Inventor
Pengfei XU
Sicheng ZHAO
Bo Li
Xiangyu YUE
Yang GU
Tengfei XING
Zhichao Song
Runbo HU
Hua CHAI
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd. filed Critical Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2019/120053 priority Critical patent/WO2021097774A1/en
Publication of WO2021097774A1 publication Critical patent/WO2021097774A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure generally relates to systems and methods for performing multi-source domain adaptation for semantic segmentation of a target image, and in some embodiments, to systems and methods for performing multi-source domain adaptation for semantic segmentation from synthetic data to real data.
  • Semantic segmentation, which aims to assign a semantic label, such as car, cyclist, pedestrian, or road, to each pixel of an image, plays a crucial role in many applications, ranging from autonomous driving and robotic control to medical imaging and fashion recommendation.
  • Convolutional neural network (CNN) based end-to-end approaches have been used for semantic segmentation.
  • Although the end-to-end methods have achieved promising results, they suffer from some limitations.
  • training the end-to-end methods requires large-scale labeled data with pixel-level annotations, which is prohibitively expensive and time-consuming to obtain. For example, it takes about 90 minutes to label each image in the Cityscapes dataset.
  • the learned knowledge in these end-to-end methods cannot be well generalized to new domains, because of the presence of domain shift or dataset bias.
  • a system for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain comprises a storage medium storing a set of instructions and a processor in communication with the storage medium to execute the set of instructions to: perform dynamic adversarial image generation by generating an adapted domain for each single source of a plurality of single sources with dynamic semantic consistency while aligning at the pixel level cycle-consistently towards the target image; perform adversarial domain aggregation by using a sub-domain aggregation discriminator and a cross-domain cycle discriminator to aggregate the adapted domains into an aggregated domain; and perform feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network.
  • the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss.
  • the dynamic semantic consistency and pixel-level alignment are achieved by minimizing a KL divergence between source predictions of a pretrained segmentation model and adapted predictions of a dynamic segmentation model.
  • the sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable.
  • the cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain.
  • the data from different domains are observed in the same image space but exhibit different distributions, and all the domains share the same set of classes.
  • the dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they were drawn from the same target domain distribution.
  • a method for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain on a computing device including a storage medium storing a set of instructions and a processor in communication with the storage medium to execute the set of instructions is disclosed herein.
  • the multiple source domains comprise images from a plurality of single sources.
  • the method comprises performing dynamic adversarial image generation by generating an adapted domain for each single source of the plurality of single sources with dynamic semantic consistency while aligning at the pixel level cycle-consistently towards the target image; performing adversarial domain aggregation by using a sub-domain aggregation discriminator and a cross-domain cycle discriminator to aggregate the adapted domains into an aggregated domain; and performing feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network.
  • the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss.
  • the dynamic semantic consistency and pixel-level alignment are achieved by minimizing a KL divergence between source predictions of a pretrained segmentation model and adapted predictions of a dynamic segmentation model.
  • the sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable.
  • the cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain.
  • the data from different domains are observed in the same image space but exhibit different distributions, and all the domains share the same set of classes.
  • the dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they were drawn from the same target domain distribution.
  • a non-transitory readable medium storing a set of instructions for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain.
  • the multiple source domains comprise images from a plurality of single sources, and when the set of instructions is executed by a processor of an electrical device, the device performs a method comprising: performing dynamic adversarial image generation by generating an adapted domain for each single source of the plurality of single sources with dynamic semantic consistency while aligning at the pixel level cycle-consistently towards the target image; performing adversarial domain aggregation by using a sub-domain aggregation discriminator and a cross-domain cycle discriminator to aggregate the adapted domains into an aggregated domain; and performing feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network.
  • the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss.
  • the dynamic semantic consistency and pixel-level alignment are achieved by minimizing a KL divergence between source predictions of a pretrained segmentation model and adapted predictions of a dynamic segmentation model.
  • the sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable and the cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain.
  • the data from different domains are observed in the same image space but exhibit different distributions, and all the domains share the same set of classes.
  • the dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they were drawn from the same target domain distribution.
  • FIG. 1 is a schematic diagram illustrating the systems and methods disclosed herein integrated into an autonomous vehicle service system according to some embodiments of the present disclosure
  • FIG. 2 is a schematic diagram illustrating components of a computing device according to some embodiments of the present disclosure
  • FIG. 3 is a schematic diagram illustrating hardware and/or software components of an example of the ACU of FIG. 1 according to some embodiments of the present disclosure
  • FIG. 4 is a flow chart illustrating the multi-source domain adaptation for semantic segmentation process according to one embodiment of the present disclosure
  • FIG. 5 is a schematic diagram illustrating the framework of MADAN according to one embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram illustrating qualitative semantic segmentation with single source DA
  • FIG. 7 is a schematic diagram illustrating qualitative semantic segmentation with multi-source adaptation according to one embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram illustrating qualitative semantic segmentation with multi-source adaptation according to another embodiment of the present disclosure.
  • FIG. 9 is a visual example of image translation according to one embodiment of the present disclosure.
  • FIG. 10 is a visual example of image translation according to another embodiment of the present disclosure.
  • The term “module,” “unit,” or “block,” as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions.
  • a module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or another storage device.
  • a software module/unit/block may be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts.
  • Software modules/units/blocks configured for execution on computing devices may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed or installable format that needs installation, decompression, or decryption prior to execution) .
  • Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device.
  • Software instructions may be embedded in firmware, such as an erasable programmable read-only memory (EPROM) .
  • modules/units/blocks may be included in connected logic components, such as gates and flip-flops, and/or can be included in programmable units, such as programmable gate arrays or processors.
  • the modules/units/blocks or computing device functionality described herein may be implemented as software modules/units/blocks but may be represented in hardware or firmware.
  • the modules/units/blocks described herein refer to logical modules/units/blocks that may be combined with other modules/units/blocks or divided into sub-modules/sub-units/sub-blocks despite their physical organization or storage. The description may be applicable to a system, an engine, or a portion thereof.
  • the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may be implemented out of order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
  • Embodiments of the present disclosure may be applied to different transportation systems including but not limited to land transportation, sea transportation, air transportation, space transportation, or the like, or any combination thereof.
  • a vehicle of the transportation systems may include a rickshaw, travel tool, taxi, chauffeured car, hitch, bus, rail transportation (e.g., a train, a bullet train, high-speed rail, and subway) , ship, airplane, spaceship, hot-air balloon, driverless vehicle, or the like, or any combination thereof.
  • the transportation system may also include any transportation system that applies management and/or distribution, for example, a system for sending and/or receiving an express.
  • Existing MDA methods directly extend from classification to segmentation, which is problematic for at least the following reasons.
  • segmentation is a structured prediction task, the decision function of which is more involved than classification because it has to resolve the predictions in an exponentially large label space.
  • the existing MDA methods mainly focus on feature-level alignment, which only aligns high-level information. This may be enough for coarse-grained classification tasks, but is insufficient for fine-grained semantic segmentation, which performs pixel-wise prediction.
  • these MDA methods only align each source and target pair. Although different sources are matched towards the target, there exists significant misalignment across different sources.
  • a Multi-source Adversarial Domain Aggregation Network (MADAN) framework is disclosed herein, which consists of Dynamic Adversarial Image Generation (DAIG) , Adversarial Domain Aggregation (ADA) , and Feature-aligned Semantic Segmentation (FSS) .
  • In DAIG, an adapted domain is generated for each source by a DA method, such as a Generative Adversarial Network (GAN) with cycle-consistency loss, which enforces pixel-level alignment between source images and target images.
  • A semantic consistency loss is imposed by minimizing the KL divergence between the source predictions of a pretrained segmentation model and the adapted predictions of a dynamic segmentation model.
  • Then ADA is performed: instead of training a classifier for each source domain, as in existing methods, a sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable, and a cross-domain cycle discriminator is used to discriminate between the images from each source and the images transferred from other sources, such that different adapted domains are aggregated into a unified domain.
  • Finally, FSS is performed: the segmentation model is trained based on the aggregated domain obtained from ADA, while enforcing feature-level alignment between the aggregated domain and the target domain.
  • Domain adaptation for semantic segmentation from multiple sources is performed herein using the MADAN framework such that, besides feature-level alignment, pixel-level alignment is achieved by generating an adapted domain for each source cycle-consistently with the dynamic semantic consistency loss disclosed herein.
  • A sub-domain aggregation discriminator and a cross-domain cycle discriminator are used to better align the different adapted domains.
  • an autonomous vehicle has an autonomous pilot system, which is used to control the autonomous driving of the vehicle.
  • An arithmetic and control unit (ACU) of the autonomous vehicle may receive and process real time sensor data such as image data from a camera system of the autonomous vehicle.
  • the image data is processed by the systems and methods disclosed herein in real time to generate one or more real time vehicle control (VC) commands.
  • the one or more real time VC commands may include but not limited to acceleration, deceleration, making a turn, switching lanes, or the like, or any combination thereof.
  • FIG. 1 is a schematic diagram illustrating an autonomous vehicle service system according to some embodiments of the present disclosure.
  • autonomous vehicle service system 100 may be an Internet of Things (IoT) platform including a server 110, a storage device 120, a network 130, an autonomous vehicle 140.
  • the server 110 may further include a processing device 112.
  • the server 110 may be a single server, or a server group.
  • the server group may be centralized, or distributed (e.g., the server 110 may be a distributed system) .
  • the server 110 may be local or remote.
  • the server 110 may access at least one of information and data stored in at least one of the autonomous vehicle 140, and the storage device 120 via the network 130.
  • the server 110 may be directly connected to at least one of the autonomous vehicle 140 and the storage device 120 to access at least one of stored information and data.
  • the server 110 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the server 110 may include a processing device 112.
  • the processing device 112 may process at least one of information and data from the autonomous vehicle (s) 140.
  • the processing device 112 may receive a service request from a user (e.g., a driver or a passenger) .
  • the service request may include at least one of a pick-up location and a drop-off location.
  • the processing device 112 may provide one or more routes from the pick-up location to the drop-off location.
  • the processing device 112 may send the one or more routes to the autonomous device 140 via the network 130.
  • the service request may include entertainment needs such as, music needs, radio needs, movie needs, reading needs, etc.
  • the processing device 112 may provide sources to satisfy the entertainment needs of the user in response to the service request.
  • the service request may include one or more commands to operate the autonomous vehicle 140, for example, parking, slowing down, accelerating, controlling in-car temperature, etc.
  • the processing device 112 may remotely operate the autonomous vehicle 140 via a built-in autonomous pilot system in response to the one or more commands.
  • the processing device 112 may include one or more processing engines (e.g., a single-core processor or a multi-core processor) .
  • the processing device 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
  • the storage device 120 may store at least one of data and instructions.
  • the storage device 120 may store data obtained from the autonomous vehicle 140.
  • the storage device 120 may store at least one of data and instructions that the server 110 may execute or use to perform example methods described in the present disclosure.
  • the storage device 120 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof.
  • Example mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc.
  • Example removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • Example volatile read-and-write memory may include a random access memory (RAM) .
  • Example RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), and a zero-capacitor RAM (Z-RAM), etc.
  • Example ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically-erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc.
  • the storage may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the storage device 120 may be connected to the network 130 to communicate with one or more components of the autonomous vehicle service system 100 (e.g., the server 110, the autonomous vehicle 140) .
  • One or more components in the autonomous vehicle service system 100 may access the data or instructions stored in the storage device 120 via the network 130.
  • the storage device 120 may be directly connected to or communicate with one or more components in the autonomous vehicle service system 100 (e.g., the server 110, the autonomous vehicle 140) .
  • the storage device 120 may be part of the server 110.
  • the network 130 may facilitate exchange of at least one of information and data.
  • For example, one or more components in the autonomous vehicle service system 100 (e.g., the server 110, the storage device 120, and the autonomous vehicle 140) may exchange at least one of information and data with each other via the network 130.
  • the server 110 may obtain/acquire at least one of vehicle information and environment information around the vehicle via the network 130.
  • the network 130 may be any type of wired or wireless network, or combination thereof.
  • the network 130 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public telephone switched network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof.
  • the network 130 may include one or more network access points.
  • the network 130 may include wired or wireless network access points such as at least one of base stations and internet exchange points 130-1, 130-2, ..., through which one or more components of the autonomous vehicle service system 100 may be connected to the network 130 to exchange at least one of data and information.
  • the autonomous vehicle 140 may include structures of a conventional vehicle, for example, a chassis, a suspension, steering, a braking, drivetrain components, an engine, and so on.
  • the autonomous vehicle 140 may include vehicles having various autonomous driving levels, such as level 0 (i.e., No Automation: at level 0, the driver performs all operating tasks such as steering, braking, accelerating or slowing down, and so forth), level 1 (i.e., Driver Assistance: at level 1, the vehicle can assist with some functions, but the driver still handles all accelerating, braking, and monitoring of the surrounding environment), level 2 (i.e., Partial Automation: at level 2, the vehicle can assist with steering or acceleration functions and allow the driver to disengage from some of their tasks, but the driver must always be ready to take control of the vehicle and is still responsible for most safety-critical functions and all monitoring of the environment), level 3 (i.e., Conditional Automation: at level 3, the vehicle itself controls all monitoring of the environment, but the driver's attention is still critical), level 4 (i.e., High Automation: at level 4, the automatic pilot system would first notify the driver when conditions are safe, and only then does the driver switch the vehicle into the auto pilot mode), or level 5 (i.e., Complete Automation: at level 5, autonomous driving requires absolutely no human attention; there is no need for pedals, brakes, or a steering wheel, as the automatic pilot system controls all critical tasks, monitoring of the environment, and identification of unique driving conditions such as traffic jams).
  • the autonomous vehicle 140 may be configured with one or more sensors such as a camera to detect at least one of internal information and external information surrounding the vehicle.
  • the external information may include environment information surrounding the vehicle, such as weather information, road condition information, traffic light information, obstacle information, pedestrian information, and so on.
  • the internal information may include user pose information, user interaction information, and so on.
  • the autonomous vehicle 140 may be configured with a computing device 150 for controlling the autonomous vehicle in real time or near real time according to at least one of the internal information and external information.
  • the computing device 150 may refer to an arithmetic and control unit (ACU) .
  • the ACU 150 may be various in forms.
  • the ACU 150 may include a mobile device, a tablet computer, a physical display screen (e.g., an LCD, an electronic ink display (E-Ink) , curved screen, a television device, a touch screen, etc. ) , or the like, or any combination thereof.
  • the mobile device may include, a wearable device, a mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof.
  • the wearable device may include a bracelet, footgear, eyeglasses, a helmet, a watch, clothing, a backpack, a smart accessory, or the like, or any combination thereof.
  • the mobile device may include a mobile phone, a personal digital assistance (PDA) , a laptop, a tablet computer, a desktop, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a Google Glass™, an Oculus Rift™, a Hololens™, a Gear VR™, etc.
  • the ACU may be configured with an autonomous pilot system for controlling the autonomous vehicle.
  • the ACU may include a multi-core processor for processing one or more tasks from the autonomous pilot system.
  • at least one dedicated processing core of the ACU may be dedicated to process one or more real time vehicle controlling (VC) tasks to generate one or more real time VC commands according to the real time sensor data.
  • at least one of the real time VC commands and the non-real time VC commands may be transmitted to a vehicle controlling unit (VCU) for operating the vehicle.
  • the VCU may include one or more controllers of the autonomous vehicle, for example, one or more throttle controllers, one or more spark controllers, one or more brake controllers, one or more steering controllers, an exhaust gas recycling (EGR) controller, a waste gate controller, and so on.
  • the ACU 150 may include one or more other subunits.
  • the ACU 150 may include a storage subunit to store data generated by the subunits in the ACU 150.
  • any two of the subunits may be combined as a single unit.
  • the autonomous vehicle 140 may communicate with one or more components of the autonomous vehicle service system 100 (e.g., the server 110, the storage device 120) via the network 130.
  • the autonomous vehicle 140 may communicate with other vehicles (not shown in FIG. 1) around the vehicle itself.
  • a first vehicle may obtain at least one of distance information and speed information regarding a second vehicle.
  • one of the vehicles may send alert information to the other vehicle, which may help avoid a potential vehicle accident.
  • the autonomous vehicle 140 may be an electric vehicle, a fuel cell vehicle, a hybrid vehicle, or a conventional internal combustion engine vehicle.
  • the autonomous vehicle 140 may include a body 142 and at least one wheel 144.
  • the body 142 may include various body styles, such as a sports vehicle, a coupe, a sedan, a pick-up truck, a station wagon, a sports utility vehicle (SUV) , a minivan, or a conversion van.
  • the autonomous vehicle 140 may include a pair of front wheels and a pair of rear wheels, as illustrated in FIG. 1.
  • the autonomous vehicle 140 may have more or fewer wheels or equivalent structures that enable the vehicle 140 to move around.
  • the autonomous vehicle 140 may be configured to be all-wheel drive (AWD), front-wheel drive (FWD), or rear-wheel drive (RWD).
  • the autonomous vehicle 140 may be configured to be operated by at least one of an operator occupying the vehicle, remotely controlled, and autonomously controlled.
  • the autonomous vehicle 140 may be equipped with an automatic pilot system configured to control the autonomous vehicle 140.
  • the automatic pilot system may be implemented by an arithmetic and control unit (ACU) .
  • the autonomous pilot system may be configured to operate the vehicle automatically.
  • the autonomous pilot system may obtain at least one of data and information from one or more sensors of the vehicle.
  • the autonomous pilot system may be categorized into three layers, that is, perception, planning, and control.
  • the autonomous pilot system may perform one or more operations regarding at least one of the perception, the planning and the control.
  • the autonomous pilot system may perform at least one of environment perception and localization based on the sensor data, such as weather detection, in-car temperature detection, lane detection, free drivable area detection, pedestrian detection, obstacle detection, traffic sign detection, and so on.
  • the autonomous pilot system may perform at least one of a mission planning, a behavior planning, and a motion planning according to at least one of the environment perception and localization.
  • the autonomous pilot system may generate one or more real time VC commands according to results of the perception layer and the planning layer. Specifically, the autonomous pilot system may generate the one or more real time VC commands based on at least one of feedback control and model predictive control.
  • the autonomous vehicle 140 may include one or more sensors to provide information that is used to operate the vehicle automatically.
  • the one or more sensors such as one or more cameras may detect at least one of internal information and external information regarding the autonomous vehicle 140 in real time or near real time.
  • the external information may include environment information around the vehicle, such as weather information, road condition information, traffic light information, obstacle information, pedestrian information, and so on.
  • the internal information may include user pose information, user interaction information, and so on.
  • the one or more sensors may also include various types of sensors, such as visual-sensing systems, laser-sensing systems, infrared-sensing systems, acoustic-sensing systems, thermal-sensing systems, or the like, or any combination thereof.
  • the autonomous vehicle 140 may at least be configured with a positioning system.
  • the positioning system may provide navigation information for the autonomous vehicle 140.
  • the navigation information may include a current location of the vehicle, a destination of the vehicle, a velocity, an acceleration, a current time, or the like, or any combination thereof.
  • the location may be in the form of coordinates, such as a latitude coordinate and a longitude coordinate.
  • the positioning system may include but not limited to a compass navigation system (COMPASS) , a global positioning system (GPS) , a BeiDou navigation satellite system, a Galileo positioning system, a quasi-zenith satellite system (QZSS) , and so on.
  • the visual-sensing system comprises a video or image capture system or image source 170 including cameras 172 and 174 configured to acquire a video composed of a plurality of images (also referred to as “video frames” ) or still images.
  • the camera 172 or 174 may be able to capture one or more images.
  • an image may be a still image, a video, a stream video, or a video frame obtained from a video.
  • the camera 172 or 174 may be a digital camera, a video camera, a security camera, a web camera, a smartphone, a tablet, a laptop, a video gaming console equipped with a web camera, a camera with multiple lenses, etc.
  • the camera 172 or 174 may include a lens, a shutter, a sensor, a processing device, and a storage device.
  • the lens may be an optical device that focuses a light beam by means of refraction to form an image.
  • the lens may include one or more lenses.
  • the lens may be configured to intake a scene.
  • An aperture of the lens may refer to the size of the hole through which light passes to reach the sensor.
  • the aperture may be adjustable to adjust the amount of light that passes through the lens.
  • the focal length of the lens may be adjustable to adjust the coverage of the camera.
  • the shutter may be opened to allow light through the lens when an image is captured.
  • the shutter may be controlled manually or automatically by the processing device.
  • the sensor may be configured to receive light passing through the lens and transform the light signals of the received light into electrical signals.
  • the sensor may include a charge-coupled device (CCD) sensor and/or a complementary metal-oxide semiconductor (CMOS) sensor.
  • the sensor may be in communication with the logic circuits and may be configured to detect the scene from the lens and transform the scene into electronic signals.
  • a “video” provided by the video or image capture system or image source 170 may include a plurality of frames, which may also be referred to as video frames.
  • a frame may be one of a plurality of still images that compose a complete video.
  • the frames of a video are captured at a rate called the frame rate, such as 24 frames per second (fps), 30 fps, 60 fps, etc.
  • the video frames to be transmitted may be stored in a buffer in the ACU 150 in a form of a video frame buffering queue, which may be managed by a buffer manager.
  • the buffer may use a queue based data structure for buffering the video to be transmitted.
  • the buffer may be a storage device for buffering the video to be transmitted.
  • the buffer may include a mass storage device, a removable storage device, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof.
  • Example mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc.
  • Example removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • Example volatile read-and-write memory may include a random-access memory (RAM), such as a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), and a zero-capacitor RAM (Z-RAM).
  • Example ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc.
  • FIG. 2 is a schematic diagram illustrating example hardware and/or software components of an example 200 of the computing device 150 according to some embodiments of the present disclosure.
  • the computing device 200 may be the display control device or a part of it.
  • the computing device 200 may include a processor 222, a storage 227, an input/output (I/O) 226, and a communication port 225.
  • the processor 222 may execute computer instructions (e.g., program code) and perform functions in accordance with techniques described herein.
  • the processor 222 may include interface circuits and processing circuits therein.
  • the interface circuits may be configured to receive electronic signals from a bus 221, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process.
  • the processing circuits may conduct logical operations and calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 221.
  • the computer instructions may include, for example, routines, programs, objects, components, data structures, procedures, modules, and functions, which perform particular functions described herein.
  • the processor 222 may include one or more hardware processors, such as a microcontroller, a microprocessor, a reduced instruction set computer (RISC), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a central processing unit (CPU), a graphics processing unit (GPU), a physics processing unit (PPU), a microcontroller unit, a digital signal processor (DSP), a field programmable gate array (FPGA), an advanced RISC machine (ARM), a programmable logic device (PLD), any circuit or processor capable of executing one or more functions, or the like, or any combination thereof.
  • processors of the computing device 200 may also include multiple processors, thus operations and/or method steps that are performed by one processor as described in the present disclosure may also be jointly or separately performed by the multiple processors.
  • For example, if the processor of the computing device 200 executes both step A and step B, it should be understood that step A and step B may also be performed by two or more different processors jointly or separately in the computing device 200 (e.g., a first processor executes step A and a second processor executes step B, or the first and second processors jointly execute steps A and B).
  • the storage 227 may store data/information obtained from the image source 170 and/or the ACU 150.
  • the storage 227 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM), or the like, or any combination thereof.
  • the mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc.
  • the removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • the volatile read-and-write memory may include a random-access memory (RAM), which may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), and a zero-capacitor RAM (Z-RAM), etc.
  • the ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc.
  • the storage 227 may store one or more programs and/or instructions to perform example methods described in the present disclosure.
  • the storage 227 may store a program for the processing engine (e.g., the server 110) for determining a regularization item.
  • the I/O 226 may input and/or output signals, data, information, etc.
  • the I/O 226 may include one or more input ports and one or more output ports.
  • the one or more input ports (also referred to as data acquisition port) may be configured to acquire data/information, such as a channel of video signal.
  • the communication port 225 may be connected to a network to facilitate data communications.
  • the communication port 225 may establish connections between the image source 170 and/or the ACU 150.
  • the connection may be a wired connection, a wireless connection, any other communication connection that can enable data transmission and/or reception, and/or any combination of these connections.
  • the wired connection may include, for example, an electrical cable, an optical cable, a telephone wire, or the like, or any combination thereof.
  • the wireless connection may include, for example, a Bluetooth™ link, a Wi-Fi™ link, a WiMax™ link, a WLAN link, a ZigBee link, a mobile network link (e.g., 3G, 4G, 5G), or the like, or a combination thereof.
  • the communication port 225 may be and/or include a standardized communication port, such as RS232, RS485, etc.
  • the communication port 225 may be a specially designed communication port.
  • FIG. 3 is a schematic diagram illustrating hardware and/or software components of an example 300 of the ACU 150 according to some embodiments of the present disclosure.
  • the ACU example 300 includes a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a mobile operating system (OS) 370, and storage 390.
  • any other suitable component including but not limited to a system bus or a controller (not shown) , may also be included in the ACU 300.
  • the operating system 370 and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340.
  • the applications 380 may include a browser or any other suitable application for receiving and rendering information relating to positioning or other information from the processing device 112. User interactions with the information stream may be achieved via the I/O 350 and provided to the processing device 112 and/or other components of the autonomous driving system 100 via the network 130.
  • computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein.
  • a computer with user interface elements may be used to implement a personal computer (PC) or any other type of workstation or terminal device.
  • a computer may also act as a server if appropriately programmed.
  • Some embodiments 400 of the systems and methods disclosed herein for multi-source domain adaptation for semantic segmentation are illustrated in the flow chart of FIG. 4.
  • multiple source domains that comprise images from a plurality of single sources and a target domain that comprises corresponding target images are obtained at step 410.
  • An adapted domain is then generated for each source domain with dynamic semantic consistency while aligning at the pixel-level cycle-consistently towards the target in step 420.
  • A sub-domain aggregation discriminator and a cross-domain cycle discriminator are used to make the different adapted domains more closely aggregated in step 430.
  • feature-level alignment is performed between the aggregated domain and the target domain while training the segmentation network in step 440.
  • the method disclosed herein is based on covariate shift and concept drift such as those disclosed by Vishal M Patel et al. in IEEE Signal Processing Magazine, 32(3): 53-69, 2015. Unless otherwise specified, the following assumptions are made when performing the methods disclosed herein: (1) homogeneity, indicating that the data from different domains are observed in the same image space but exhibit different distributions; (2) closed set, indicating that all the domains share the same class label space. An adaptation model that can correctly predict a sample from the target domain is learned based on the labeled source domains and the unlabeled target images X_T.
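  • The symbols of the original setup equations are not reproduced in this text extraction; one notation consistent with the description above (an assumption, not a quotation of the original) is M labeled source domains and one unlabeled target domain,

    S_i = \{(x_i^{k}, y_i^{k})\}_{k=1}^{N_i}, \quad i = 1, \dots, M, \qquad T = \{x_T^{k}\}_{k=1}^{N_T},

    with homogeneity x_i^{k}, x_T^{k} \in \mathcal{X} (same image space) and a shared, closed label space y_i^{k} \in \mathcal{Y}.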
  • Example heterogeneous DA is disclosed by Wen Li et al. in IEEE Transactions on Pattern Analysis and Machine Intelligence, 36 (6) : 1134-1148, 2014 and by Yao-Hung Hubert et al. in IEEE Conference on Computer Vision and Pattern Recognition, pages 5081-5090, 2016.
  • Example open set DA is disclosed by Pau Panareda Busto et al. in IEEE International Conference on Computer Vision, pages 754-763, 2017.
  • Example category shift DA is disclosed by Ruijia Xu et al. in IEEE Conference on Computer Vision and Pattern Recognition, pages 3964-3973, 2018.
  • the Multi-source Adversarial Domain Aggregation Network (MADAN) framework for semantic segmentation adaptation is illustrated in the flow chart of FIG. 5, which consists of three components: Dynamic Adversarial Image Generation (DAIG) , Adversarial Domain Aggregation (ADA) , and Feature-aligned Semantic Segmentation (FSS) .
  • the colored solid arrows represent generators, while the black solid arrows indicate the segmentation network F.
  • the dashed arrows correspond to different losses.
  • DAIG aims to generate adapted images from source domains to the target domain from the perspective of visual appearance while preserving the semantic information with a dynamic segmentation model.
  • In order to reduce the distances among the adapted domains and thus generate a more aggregated unified domain, ADA is performed, including a Cross-domain Cycle Discriminator (CCD) and a Sub-domain Aggregation Discriminator (SAD). Finally, FSS learns domain-invariant representations at the feature level in an adversarial manner. Table 1 compares MADAN with several state-of-the-art DA methods.
  • Table 1 Comparison of the MADAN model with several state-of-the-art domain adaptation methods.
  • the full names of each property from the second to the last columns are pixel-level alignment, feature-level alignment, semantic consistency, cycle consistency, multiple sources, domain aggregation, one task network, and fine-grained prediction, respectively.
  • the goal of DAIG is to make images from different source domains visually similar to the target images, as if they are drawn from the same target domain distribution.
  • for each source domain, a generator mapping that source to the target T is introduced in order to generate adapted images that fool D_T, which is a pixel-level adversarial discriminator.
  • D_T is trained simultaneously with each generator to classify real target images X_T from adapted images.
  • the corresponding GAN loss function is:
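  • The corresponding equation is rendered as an image in the original publication and is not reproduced in this extraction; a standard pixel-level adversarial (GAN) loss consistent with the description, with G_{S_i→T} denoting the source-to-target generator for source i and D_T the pixel-level discriminator (notation assumed), would be:

    \mathcal{L}_{GAN}^{i}\big(G_{S_i \to T}, D_T\big) = \mathbb{E}_{x_T \sim X_T}\big[\log D_T(x_T)\big] + \mathbb{E}_{x_i \sim X_i}\big[\log\big(1 - D_T(G_{S_i \to T}(x_i))\big)\big]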
  • the adapted images contain the same semantic information as original source images.
  • the semantic consistency is partially constrained by the cycle consistency loss.
  • the semantic consistency loss in CyCADA was proposed to better preserve semantic information.
  • Example CyCADA is disclosed by Judy Hoffman et al. in International Conference on Machine Learning, pages 1994-2003, 2018. In CyCADA, the source image x_i and the corresponding adapted image are both fed into a segmentation model F_i pretrained on (X_i, Y_i). However, since x_i and the adapted image are from different domains, employing the same segmentation model, i.e., F_i, to obtain the segmentation results and then compute the semantic consistency loss may be detrimental to image generation.
  • Ideally, the adapted images would be fed into a network F_T trained on the target domain, which is infeasible since target domain labels are not available in UDA.
  • Instead of F_i, the network F_A is dynamically updated, taking the adapted images as input, so that its input domain, i.e., the domain that the network performs best on, gradually changes from that of F_i to that of F_T.
  • The task segmentation model F trained on the adapted domain is employed as F_A, i.e., F_A = F, which has two advantages: (1) the adapted domain becomes the input domain of F_A, and as F is trained to have better performance on the target domain, the semantic loss after F_A promotes the generation of images that are closer to the target domain at the pixel level; (2) since F_A and F can share the parameters, no additional training or memory space is introduced, which is quite efficient.
  • This loss is referred to as the dynamic semantic consistency (DSC) loss.
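  • The DSC equation is not reproduced in this extraction; a reconstruction consistent with the description, with F_i pretrained on (X_i, Y_i) and F_A the dynamically updated model that shares parameters with F (notation assumed), is:

    \mathcal{L}_{DSC}^{i} = \mathbb{E}_{x_i \sim X_i}\, \mathrm{KL}\big( F_i(x_i) \,\big\|\, F_A(G_{S_i \to T}(x_i)) \big)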
  • One strategy is to train a separate segmentation model for each adapted domain and combine the different predictions with specific weights for target images; another is to combine all adapted domains together and train one model.
  • For the first strategy, the weights for different adapted domains need to be selected, and each target image needs to be fed into all segmentation models at inference time.
  • For the second strategy, since the alignment space is high-dimensional, although the adapted domains are relatively aligned with the target, they may be significantly misaligned with each other.
  • adversarial domain aggregation is used to make different adapted domains more closely aggregated with two kinds of discriminators.
  • One is the sub-domain aggregation discriminator, which is designed to directly make the different adapted domains indistinguishable.
  • a discriminator D_i is introduced with the following loss function:
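  • The equation is not reproduced in this extraction; one formulation consistent with the description, in which D_i tries to distinguish the i-th adapted domain from the other adapted domains (notation assumed), is:

    \mathcal{L}_{SAD}^{i} = \mathbb{E}_{x_i \sim X_i}\big[\log D_i\big(G_{S_i \to T}(x_i)\big)\big] + \frac{1}{M-1}\sum_{j \neq i} \mathbb{E}_{x_j \sim X_j}\big[\log\big(1 - D_i\big(G_{S_j \to T}(x_j)\big)\big)\big]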
  • The other is the cross-domain cycle discriminator.
  • For each source domain S_i, images from the adapted domains j = 1, ..., M (j ≠ i) are transferred back to S_i using the corresponding target-to-source generator, and a discriminator is employed to classify the original source images from these transferred images, which corresponds to the following loss function:
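  • Again the equation image is not reproduced; a formulation consistent with the description, with G_{T→S_i} the target-to-source generator and D'_i the cross-domain cycle discriminator for S_i (symbols assumed), is:

    \mathcal{L}_{CCD}^{i} = \mathbb{E}_{x_i \sim X_i}\big[\log D'_i(x_i)\big] + \frac{1}{M-1}\sum_{j \neq i} \mathbb{E}_{x_j \sim X_j}\Big[\log\Big(1 - D'_i\big(G_{T \to S_i}\big(G_{S_j \to T}(x_j)\big)\big)\Big)\Big]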
  • For the task segmentation loss on the adapted images, L is the number of classes; H and W are the height and width of the adapted images; s is the softmax function; and F_{l,h,w}(x′) is the value of F(x′) at index (l, h, w).
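  • The task segmentation loss that these symbols refer to is not reproduced in this extraction; a standard pixel-wise cross-entropy formulation consistent with the definitions above (an assumption), for an adapted image x′ with label map y, is:

    \mathcal{L}_{seg}(F) = -\,\mathbb{E}_{(x', y)} \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{w=1}^{W} \mathbb{1}_{[l = y_{h,w}]} \log\big(s\big(F_{l,h,w}(x')\big)\big)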
  • a feature-level alignment between X′ and X T is imposed to improve the segmentation performance during inference of X T on the segmentation model F by introducing a discriminator D F .
  • the feature-level GAN loss is defined as:
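  • The equation is not reproduced; one standard feature-level adversarial loss consistent with the description, with f(·) denoting the feature extractor of F and X′ the aggregated adapted domain (notation assumed), is:

    \mathcal{L}_{feat}\big(F, D_F\big) = \mathbb{E}_{x_T \sim X_T}\big[\log D_F\big(f(x_T)\big)\big] + \mathbb{E}_{x' \sim X'}\big[\log\big(1 - D_F\big(f(x')\big)\big)\big]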
  • the MADAN learning framework disclosed herein utilizes adaptation techniques including pixel-level alignment, cycle-consistency, semantic consistency, domain aggregation, and feature-level alignment. Combining all these components, the overall objective loss function of MADAN is:
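  • The overall objective, Eq. (9), is rendered as an image in the original publication; one plausible composition of the losses named above, with \mathcal{L}_{cyc}^{i} denoting the cycle-consistency loss of the i-th generator pair and λ denoting trade-off weights (the grouping and weights are assumptions), is:

    \mathcal{L}_{MADAN} = \mathcal{L}_{seg} + \lambda_{feat}\,\mathcal{L}_{feat} + \sum_{i=1}^{M}\Big( \mathcal{L}_{GAN}^{i} + \lambda_{cyc}\,\mathcal{L}_{cyc}^{i} + \lambda_{DSC}\,\mathcal{L}_{DSC}^{i} + \lambda_{SAD}\,\mathcal{L}_{SAD}^{i} + \lambda_{CCD}\,\mathcal{L}_{CCD}^{i} \Big)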
  • the training process corresponds to solving for a target model F according to the optimization:
  • G and D represent all the generators and discriminators in Eq. (9) , respectively.
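  • The optimization itself is not reproduced in this extraction; consistent with the sentence above, it takes the usual adversarial min-max form (a sketch, not the exact equation of the original):

    F^{*} = \arg\min_{F} \min_{G} \max_{D} \; \mathcal{L}_{MADAN}(F, G, D)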
  • At least one of the process and method 400 may be executed by at least one computing device in an autonomous vehicle (e.g., the ACU 150 or the computing device 200) .
  • at least one of the process and method 400 may be implemented as a set of instructions (e.g., an application) stored in a non-transitory computer readable storage medium (e.g., the storage device 227) .
  • At least one processor of the computing device (e.g., the processor 222 of the computing device 200) may execute the set of instructions to perform at least one of the process and method 400.
  • Cityscapes such as those disclosed by Marius Cordts et al. (IEEE Conference on Computer Vision and Pattern Recognition, pages 3213-3223, 2016) contains vehicle-centric urban street images collected from a moving vehicle in 50 cities from Germany and neighboring countries. There are 5,000 images with pixel-wise annotations, including a training set with 2,975 images, a validation set with 500 images, and a test set with 1,525 images. The images have a resolution of 2048 × 1024 and are labeled into 19 classes.
  • GTA such as those disclosed by Stephan R Richter et al. (European Conference on Computer Vision, pages 102-118, 2016) is a vehicle-egocentric image dataset collected in the high-fidelity rendered computer game GTA-V with pixel-wise semantic labels. It contains 24,966 images (video frames) with a resolution of 1914 × 1052. There are 19 classes compatible with Cityscapes.
  • SYNTHIA such as those disclosed by German Ros et al. (IEEE Conference on Computer Vision and Pattern Recognition, pages 3234-3243, 2016) is a large synthetic dataset.
  • a subset, named SYNTHIA-RANDCITYSCAPES, is designed with 9,400 images with a resolution of 960 × 720, which are automatically annotated with 16 object classes, one void class, and some unnamed classes.
  • BDDS such as those disclosed by Fisher Yu et al. (Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv: 1805.04687, 2018) contains thousands of real-world dashcam video frames with accurate pixel-wise annotations. It has a compatible label space with Cityscapes and the image resolution is 1280 x 720. There are 7,000, 1,000 and 2,000 images for training, validation, and testing, respectively.
  • Source only, i.e., training on the source domains and testing on the target domain directly.
  • the source only approach is viewed as a lower bound of DA.
  • Single-source DA, i.e., performing multi-source DA via single-source DA methods, including FCNs in the Wild, CDA, ROAD, AdaptSeg, CyCADA, and DCAN.
  • Multi-source DA, i.e., extending some single-source DA methods to the multi-source setting, including MDAN. For comparison, the results of an oracle setting are also reported, where the segmentation model is both trained and tested on the target domain.
  • two strategies were employed: (1) single source, i.e. performing adaptation on each single source; (2) source-combine, i.e. all source domains are combined into a traditional single source.
  • For MDAN, the original classification network is extended for the segmentation tasks disclosed herein.
  • methods such as those disclosed by Judy Hoffman et al. (Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv: 1612.02649, 2016), Yang Zhang et al. (IEEE International Conference on Computer Vision, pages 2020-2030, 2017), and Judy Hoffman et al. (International Conference on Machine Learning, pages 1994-2003, 2018) were followed to employ class-wise intersection-over-union (IoU) and mean IoU (mIoU) to evaluate the segmentation results of each class and of all classes, respectively.
  • Although the MADAN could be trained in an end-to-end manner, it is trained in three stages in the experiments disclosed herein.
  • First, two CycleGANs, such as those disclosed by Jun-Yan Zhu et al. (IEEE International Conference on Computer Vision, pages 2223-2232, 2017), are trained without semantic consistency losses, one for each source domain, and then an FCN F is trained on the adapted images with corresponding labels from the source domains.
  • Then, an FCN is trained on the newly adapted images in the aggregated domain with feature-level alignment. The above stages are trained iteratively.
  • FCN, such as that disclosed by Jonathan Long et al. (IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015), is used as the segmentation network.
  • VGG-16, such as that disclosed by Karen Simonyan et al. (International Conference on Learning Representations, 2015), is used as the FCN backbone.
  • the weights of the feature extraction layers in the networks are initialized from models trained on ImageNet such as those disclosed by Jia Deng et al (IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255, 2009) .
  • the network is implemented in PyTorch and trained with the Adam optimizer such as that disclosed by Diederik P Kingma et al. (International Conference on Learning Representations, 2015).
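  • The optimizer hyperparameters are not given in this extract. The following minimal PyTorch sketch shows how the Adam optimizers might be configured; the learning rates, betas, and the placeholder modules are illustrative assumptions, not values from the original disclosure.

      import torch
      from torch import nn, optim

      # Placeholder modules standing in for the FCN (VGG-16 backbone), the CycleGAN
      # generators, and the aggregation/feature discriminators of MADAN.
      segmentation_net = nn.Conv2d(3, 19, kernel_size=1)
      generators = nn.ModuleList([nn.Conv2d(3, 3, 3, padding=1) for _ in range(2)])
      discriminators = nn.ModuleList([nn.Conv2d(3, 1, 3, padding=1) for _ in range(3)])

      # Adam optimizers; the hyperparameters below are assumptions for illustration.
      opt_seg = optim.Adam(segmentation_net.parameters(), lr=1e-4, betas=(0.9, 0.999))
      opt_gen = optim.Adam(generators.parameters(), lr=2e-4, betas=(0.5, 0.999))
      opt_dis = optim.Adam(discriminators.parameters(), lr=2e-4, betas=(0.5, 0.999))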
  • mean intersection-over-union (mIoU) is employed to evaluate the segmentation results.
  • The sixteen intersection classes of GTA and SYNTHIA, compatible with Cityscapes, are taken for all mIoU evaluations.
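  • Class-wise IoU and mIoU are standard metrics. The following self-contained NumPy sketch (not taken from the original disclosure) illustrates how they can be computed from predicted and ground-truth label maps over the sixteen evaluated classes.

      import numpy as np

      def compute_iou(preds, labels, num_classes=16, ignore_index=255):
          """Return class-wise IoU and mean IoU from integer label maps."""
          # preds, labels: integer arrays of shape (N, H, W) with class indices.
          mask = labels != ignore_index
          hist = np.bincount(
              num_classes * labels[mask].astype(int) + preds[mask].astype(int),
              minlength=num_classes ** 2,
          ).reshape(num_classes, num_classes)        # confusion matrix: rows = ground truth
          intersection = np.diag(hist)
          union = hist.sum(axis=0) + hist.sum(axis=1) - intersection
          iou = intersection / np.maximum(union, 1)  # classes absent from both get IoU 0
          return iou, iou.mean()

      # Toy usage with random 4x4 "images"; real evaluations use full-resolution maps.
      rng = np.random.default_rng(0)
      labels = rng.integers(0, 16, size=(2, 4, 4))
      preds = rng.integers(0, 16, size=(2, 4, 4))
      class_iou, miou = compute_iou(preds, labels)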
  • Table 2 Comparison with the state-of-the-art DA methods for semantic segmentation from GTA and SYNTHIA to Cityscapes. The best class-wise IoU and mIoU are emphasized in bold.
  • the source-only method i.e. directly transferring the models trained on the source domains to the target domain, performs the worst in most adaptation settings. Due to the presence of domain shift or dataset bias, the joint probability distributions of observed images and class labels greatly differ between the source and target domains. This results in the model’s low transferability from the source domains to the target domain. Simply combining different source domains performs better than each single source, which indicates the superiority of multiple sources over single source despite the domain shift among different sources.
  • the MADAN framework disclosed herein achieves the highest mIoU score among all adaptation methods, which benefits from the joint consideration of pixel- and feature-level alignments, cycle-consistency, dynamic semantic-consistency, domain aggregation, and multiple sources.
  • the MADAN disclosed herein also significantly outperforms source-combine DA, in which domain shift also exists among different sources. By bridging this gap, multi-source DA can boost the adaptation performance.
  • MADAN utilizes more useful information from multiple sources.
  • existing multi-source DA methods only consider feature-level alignment, which may be enough for coarse-grained tasks, e.g., image classification, but is insufficient for fine-grained tasks, e.g., semantic segmentation, a pixel-wise prediction task.
  • pixel-level alignment with a dynamic semantic consistency loss is considered and different adapted domains are further aggregated.
  • the oracle method i.e. testing on the target domain using the model trained on the same domain, achieves the best performance.
  • this model is trained using the ground truth segmentation labels from the target domain, which are unavailable in unsupervised domain adaptation.
  • Qualitative semantic segmentation results from GTA to Cityscapes with single-source DA are shown in FIG. 6. From left to right of FIG. 6 are: (a) original image, (b) ground truth annotation, (c) CycleGAN, (d) CycleGAN+DSC, (e) CycleGAN+DSC+Feat.
  • Qualitative semantic segmentation results from GTA and SYNTHIA to Cityscapes with the multi-source adaptation method, i.e., MADAN, are shown in FIG. 7. From left to right of FIG. 7 are: (a) original image, (b) ground truth annotation, (c) source only from GTA, (d) CycleGANs on GTA and SYNTHIA, (e) +CCD+DSC, (f) +SAD+DSC, (g) +CCD+SAD+DSC, and (h) +CCD+SAD+DSC+Feat (MADAN). As shown in FIG. 7, after adaptation by the method disclosed herein, the visual segmentation results are improved notably.
  • Additional qualitative semantic segmentation results from GTA and SYNTHIA to Cityscapes with the multi-source adaptation method, i.e., MADAN, are shown in FIG. 8. From left to right of FIG. 8 are: (a) original image, (b) ground truth annotation, (c) source only from GTA, (d) CycleGANs on GTA and SYNTHIA, (e) +CCD+DSC, (f) +SAD+DSC, (g) +CCD+SAD+DSC, and (h) +CCD+SAD+DSC+Feat (MADAN).
  • Additional results of image translation with pixel-level alignment from GTA and SYNTHIA to Cityscapes are visualized in FIG. 10. From left to right of FIG. 10 are: (a) original source image, (b) CycleGAN, (c) CycleGAN+DSC, (d) CycleGAN+CCD+DSC, (e) CycleGAN+SAD+DSC, (f) CycleGAN+CCD+SAD+DSC, and (g) target Cityscapes image.
  • Table 4 Ablation study on different components in MADAN. Baseline denotes using pixel-level alignment with cycle-consistency, +SAD denotes using the sub-domain aggregation discriminator, +CCD denotes using the cross-domain cycle discriminator, +DSC denotes using the dynamic semantic consistency loss, and +Feat denotes using feature-level alignment.
  • Both domain aggregation methods, i.e., SAD and CCD, obtain better performance by making different adapted domains more closely aggregated, while SAD outperforms CCD.
  • adding the DSC loss further improves the mIoU score, again demonstrating the effectiveness of DSC.
  • feature-level alignments also contribute to the adaptation task
  • the modules are orthogonal to each other to some extent, since adding each one of them does not introduce performance degradation.
  • Table 5 Domain generalization performance from GTA and SYNTHIA to BDDS.
  • MADAN improved the mIoU score by approximately 10%.
  • Multi-source domain adaptation for semantic segmentation from synthetic data to real data is performed.
  • a framework termed Multi-source Adversarial Domain Aggregation Network (MADAN) is designed with three components. For each source domain, adapted images are generated with a novel dynamic semantic consistency loss. Further, a sub-domain aggregation discriminator and a cross-domain cycle discriminator are used to better aggregate different adapted domains.
  • MADAN achieves 60.5%, 4.0%, 9.9%, and 40.8% relative mIoU improvements compared with the best source-only, the best single-source DA, source-combine DA, and other multi-source DA, respectively, on Cityscapes from GTA and SYNTHIA.
  • aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or context including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely hardware, entirely software (including firmware, resident software, micro-code, etc. ) or combining software and hardware implementation that may all generally be referred to herein as a “module, ” “unit, ” “component, ” “device, ” or “system. ” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).
  • the numbers expressing quantities, properties, and so forth, used to describe and claim certain embodiments of the application are to be understood as being modified in some instances by the term “about, ” “approximate, ” or “substantially. ”
  • “about, ” “approximate” or “substantially” may indicate a ±20% variation of the value it describes, unless otherwise stated.
  • the numerical parameters set forth in the written description and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by a particular embodiment.
  • the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the application are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable.

Abstract

The present disclosure relates to a system and a method for performing multi-source domain adaptation for semantic segmentation by using a Multi-source Adversarial Domain Aggregation Network (MADAN) framework. The MADAN framework achieves the highest mIoU score compared to existing adaptation methods. The MADAN framework is believed to benefit from the joint consideration of pixel- and feature-level alignments, cycle-consistency, dynamic semantic-consistency, domain aggregation, and multiple sources by employing Dynamic Adversarial Image Generation, Adversarial Domain Aggregation, and Feature-aligned Semantic Segmentation.

Description

SYSTEMS AND METHODS FOR MULTI-SOURCE DOMAIN ADAPTATION FOR SEMANTIC SEGMENTATION TECHNICAL FIELD
The present disclosure generally relates to systems and methods for performing multi-source domain adaptation for semantic segmentation of a target image, and in some embodiments, to systems and methods for performing multi-source domain adaptation for semantic segmentation from synthetic data to real data.
BACKGROUND
Semantic segmentation, which aims to assign a semantic label such as car, cyclist, pedestrian, road to each pixel of an image, plays a crucial role in many applications, ranging from autonomous driving and robotic control to medical imaging and fashion recommendation. With the advent of deep learning, especially convolutional neural networks (CNNs) , several end-to-end approaches have been used for semantic segmentation. Although the end-to-end methods have achieved promising results, they suffer from some limitations. On the one hand, training the end-to-end methods requires large-scale labeled data with pixel-level annotations, which is prohibitively expensive and time-consuming to obtain. For example, it takes about 90 minutes to label each image in the Cityscapes dataset. On the other hand, the learned knowledge in these end-to-end methods cannot be well generalized to new domains, because of the presence of domain shift or dataset bias.
To sidestep the cost of data collection and annotation, unlimited amounts of synthetic labeled data can be created from simulators like CARLA and GTA-V, thanks to the progress in graphics and simulation infrastructure. To mitigate the gap between different domains, domain adaptation (DA) or knowledge transfer techniques have been used with both theoretical analysis and algorithm design. Besides the traditional task loss on the labeled source domain, deep unsupervised domain adaptation (UDA) methods are generally trained with another loss to deal with domain shift, such as a discrepancy loss, adversarial loss, and reconstruction loss, etc. Many of the existing simulation-to-real DA methods for semantic segmentation, however, focus on the single-source setting and do not consider a more practical scenario where the labeled data are collected from multiple sources with different distributions. Simply combining different sources into one source and directly employing single-source DA may not perform well, since images from different source domains may interfere with each other during the learning process. Earlier efforts on multi-source DA (MDA) used shallow models. Additionally, some multi-source deep UDA methods have been proposed which only focus on image classification. Directly extending these MDA methods from classification to segmentation, however, does not appear to perform well. Although simulation-to-real domain adaptation for semantic segmentation has been actively studied for various applications such as autonomous driving, existing methods mainly focus on the single-source setting, which cannot well handle a more practical scenario of multiple sources with different distributions. There remains a need to improve the performance of multi-source domain adaptation for semantic segmentation of a target image of a target domain, for example in real-time applications such as autonomous driving.
SUMMARY
According to a first aspect of the present disclosure, a system for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain is disclosed. The multiple source domains comprise images from a plurality of single sources. The system comprises a storage medium storing a set of instructions and a processor in communication with the storage medium to execute the set of instructions to: perform dynamic adversarial image generation by generating an adapted domain for each single source of the plurality of single sources with dynamic semantic consistency while aligning at the pixel level cycle-consistently towards the target image; perform adversarial domain aggregation by using a sub-domain aggregation discriminator and a cross-domain cycle discriminator to aggregate the adapted domains to form aggregated domains; and perform feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network. In some embodiments, in the system, the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss. The dynamic semantic consistency and pixel-level alignment are achieved by minimizing a KL divergence between source predictions of a pretrained segmentation model and adapted predictions of a dynamic segmentation model. The sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable. The cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain. The data from different domains are observed in the same image space but exhibit different distributions, and all the domains share the same set of classes. The dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they were drawn from the same target domain distribution.
According to a second aspect of the present disclosure, a method for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain on a computing device including a storage medium storing a set of instructions and a processor in communication with the storage medium to execute the set of instructions is disclosed herein. The multiple source domains comprise images from a plurality of single sources. The method comprises performing dynamic adversarial image generation by generating an adapted domain for each single source of the plurality of single sources with dynamic semantic consistency while aligning at the pixel level cycle-consistently towards the target image; performing adversarial domain aggregation by using a sub-domain aggregation discriminator and a cross-domain cycle discriminator to aggregate the adapted domains to form aggregated domains; and performing feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network. In some embodiments, in the method, the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss. The dynamic semantic consistency and pixel-level alignment are achieved by minimizing a KL divergence between source predictions of a pretrained segmentation model and adapted predictions of a dynamic segmentation model. The sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable. The cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain. The data from different domains are observed in the same image space but exhibit different distributions, and all the domains share the same set of classes. The dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they were drawn from the same target domain distribution.
According to a third aspect of the present disclosure, a non-transitory readable medium, storing a set of instructions for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain is disclosed herein. The multiple source domains comprise images from a plurality of single sources, and when the set of instructions is executed by a processor of an electrical device, the device performs a method comprising: performing dynamic adversarial image generation by generating an adapted domain for each single source of the plurality of single sources with dynamic semantic consistency while aligning at the pixel level cycle-consistently towards the target image; performing adversarial domain aggregation by using a sub-domain aggregation discriminator and a cross-domain cycle discriminator to aggregate the adapted domains to form aggregated domains; and performing feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network. In some embodiments, in the method instructed by the storage medium, the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss. The dynamic semantic consistency and pixel-level alignment are achieved by minimizing a KL divergence between source predictions of a pretrained segmentation model and adapted predictions of a dynamic segmentation model. The sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable, and the cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain. The data from different domains are observed in the same image space but exhibit different distributions, and all the domains share the same set of classes. The dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they were drawn from the same target domain distribution.
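For illustration only, the following self-contained PyTorch sketch shows the control flow of the three steps described above (dynamic adversarial image generation, adversarial domain aggregation, and feature-aligned semantic segmentation) on toy stand-in networks. The module definitions, loss weights, and tensor sizes are assumptions made for the sketch and do not reproduce the actual MADAN architecture; only the generator-side adversarial terms are shown, and the alternating discriminator updates are omitted.

    import torch
    from torch import nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    NUM_CLASSES, M = 19, 2                       # number of classes; number of source domains

    # Toy stand-ins for the real networks (CycleGAN generators, discriminators, FCN).
    generators = nn.ModuleList([nn.Conv2d(3, 3, 3, padding=1) for _ in range(M)])  # S_i -> T
    sad_discriminator = nn.Conv2d(3, 1, 3, padding=1)     # sub-domain aggregation discriminator
    feat_discriminator = nn.Conv2d(NUM_CLASSES, 1, 1)     # feature-level discriminator
    segmentation_net = nn.Conv2d(3, NUM_CLASSES, 1)       # stand-in for the segmentation network F

    source_images = [torch.rand(2, 3, 32, 32) for _ in range(M)]
    source_labels = [torch.randint(0, NUM_CLASSES, (2, 32, 32)) for _ in range(M)]
    target_images = torch.rand(2, 3, 32, 32)

    # Step 1: dynamic adversarial image generation -- one adapted domain per source.
    adapted = [G(x) for G, x in zip(generators, source_images)]

    # Step 2: adversarial domain aggregation (generator-side term): push each adapted
    # domain toward the other domain's label so the discriminator cannot tell the
    # adapted domains apart; the discriminator itself would be trained with the true
    # domain labels in an alternating step (omitted here).
    p0 = torch.sigmoid(sad_discriminator(adapted[0]))
    p1 = torch.sigmoid(sad_discriminator(adapted[1]))
    loss_sad = F.binary_cross_entropy(p0, torch.ones_like(p0)) + \
               F.binary_cross_entropy(p1, torch.zeros_like(p1))

    # Step 3: feature-aligned semantic segmentation -- task loss on the aggregated
    # adapted images plus a feature-level alignment term on the target images.
    loss_task = sum(F.cross_entropy(segmentation_net(a), y)
                    for a, y in zip(adapted, source_labels))
    p_target = torch.sigmoid(feat_discriminator(segmentation_net(target_images)))
    loss_feat = F.binary_cross_entropy(p_target, torch.ones_like(p_target))

    total = loss_task + 0.1 * loss_sad + 0.01 * loss_feat   # assumed loss weights
    total.backward()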
Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following  and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is further described in terms of example embodiments. These example embodiments are described in detail with reference to the drawings. The drawings are not to scale. These embodiments are non-limiting example embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
FIG. 1 is a schematic diagram illustrating the systems and methods disclosed herein integrated into an autonomous vehicle service system according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram illustrating components of a computing device according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating hardware and/or software components of an example of the ACU of FIG. 1 according to some embodiments of the present disclosure;
FIG. 4 is a flow chart illustrating the multi-source domain adaptation for semantic segmentation process according to one embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating the framework of MADAN according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating qualitative semantic segmentation with single source DA;
FIG. 7 is a schematic diagram illustrating qualitative semantic segmentation with multi-source adaptation according to one embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating qualitative semantic segmentation with multi-source adaptation according to another embodiment of the present disclosure;
FIG. 9 is a visual example of image translation according to one embodiment of the present disclosure; and
FIG. 10 is a visual example of image translation according to another embodiment of the present disclosure.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the present disclosure and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown but is to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise, ” “comprises, ” and/or “comprising, ” “include, ” “includes” and/or “including” when used in this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Generally, the word “module, ” “unit, ” or “block, ” as used herein, refers to logic embodied in hardware or firmware, or to a collection of software instructions. A module, a unit, or a block described herein may be implemented as software and/or hardware and may be stored in any type of non-transitory computer-readable medium or another storage device. In some embodiments, a software module/unit/block may be compiled and linked into an executable program. It will be appreciated that software modules can be callable from other modules/units/blocks or from themselves, and/or may be invoked in response to detected events or interrupts. Software modules/units/blocks configured for execution on computing devices may be provided on a computer-readable medium, such as a compact disc, a digital video disc, a flash drive, a magnetic disc, or any other tangible medium, or as a digital download (and can be originally stored in a compressed  or installable format that needs installation, decompression, or decryption prior to execution) . Such software code may be stored, partially or fully, on a storage device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an erasable programmable read-only memory (EPROM) . It will be further appreciated that hardware modules/units/blocks may be included in connected logic components, such as gates and flip-flops, and/or can be included of programmable units, such as programmable gate arrays or processors. The modules/units/blocks or computing device functionality described herein may be implemented as software modules/units/blocks but may be represented in hardware or firmware. In general, the modules/units/blocks described herein refer to logical modules/units/blocks that may be combined with other modules/units/blocks or divided into sub-modules/sub-units/sub-blocks despite their physical organization or storage. The description may be applicable to a system, an engine, or a portion thereof.
It will be understood that when a unit, engine, module or block is referred to as being “on, ” “connected to, ” or “coupled to, ” another unit, engine, module, or block, it may be directly on, connected or coupled to, or communicate with the other unit, engine, module, or block, or an intervening unit, engine, module, or block may be present, unless the context clearly indicates otherwise. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
These and other features, and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawings, all of which form a part of this disclosure. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.
The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood, the operations of the flowcharts may be implemented not in  order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
Embodiments of the present disclosure may be applied to different transportation systems including but not limited to land transportation, sea transportation, air transportation, space transportation, or the like, or any combination thereof. A vehicle of the transportation systems may include a rickshaw, travel tool, taxi, chauffeured car, hitch, bus, rail transportation (e.g., a train, a bullet train, high-speed rail, and subway) , ship, airplane, spaceship, hot-air balloon, driverless vehicle, or the like, or any combination thereof. The transportation system may also include any transportation system that applies management and/or distribution, for example, a system for sending and/or receiving an express.
Existing MDA methods directly extend from classification to segmentation, which is problematic for at least the following reasons. First of all, segmentation is a structured prediction task, the decision function of which is more involved than classification because it has to resolve the predictions in an exponentially large label space. Secondly, the existing MDA methods mainly focus on feature-level alignment, which only aligns high-level information. This may be enough for coarse-grained classification tasks, but is insufficient for fine-grained semantic segmentation, which performs pixel-wise prediction. Thirdly, these MDA methods only align each source and target pair. Although different sources are matched towards the target, there exists significant misalignment across different sources. To address these inherent limitations of the existing methods, a Multi-source Adversarial Domain Aggregation Network (MADAN) framework is disclosed herein, which consists of Dynamic Adversarial Image Generation (DAIG), Adversarial Domain Aggregation (ADA), and Feature-aligned Semantic Segmentation (FSS). The MADAN framework is trained in an end-to-end manner.
Specifically, when DAIG is performed, for each source, an adapted domain is generated using a DA method such as a Generative Adversarial Network (GAN) with cycle-consistency loss, which enforces pixel-level alignment between source images and target images. To preserve the semantics before and after image translation, a semantic consistency loss is enforced by minimizing the KL divergence between the source predictions of a pretrained segmentation model and the adapted predictions of a dynamic segmentation model. When ADA is performed, instead of training a classifier for each source domain such as those used in existing approaches, a sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable, and a cross-domain cycle discriminator is used to discriminate between the images from each source and the images transferred from other sources, such that different adapted domains are aggregated into a unified domain. When FSS is performed, the segmentation model is trained based on the aggregated domain obtained from ADA, while enforcing feature-level alignment between the aggregated domain and the target domain. Accordingly, domain adaptation for semantic segmentation from multiple sources is performed herein using the MADAN framework such that, besides feature-level alignment, pixel-level alignment is achieved by generating an adapted domain for each source cycle-consistently with a dynamic semantic consistency loss disclosed herein. A sub-domain aggregation discriminator and a cross-domain cycle discriminator are used to better align different adapted domains. To demonstrate the effectiveness of the systems and methods disclosed herein, experiments are conducted on the GTA, SYNTHIA, and Cityscapes datasets. Extensive experimental results on the GTA, SYNTHIA, and Cityscapes datasets demonstrate that the MADAN model disclosed herein outperforms existing approaches.
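As a minimal sketch of the dynamic semantic consistency idea described above, the following PyTorch snippet computes a KL divergence between the predictions of a frozen, pretrained source segmentation model on the source image and the predictions of the dynamic (currently trained) segmentation model on the adapted image. The stand-in networks and tensor sizes are illustrative assumptions and do not reproduce the actual MADAN models.

    import torch
    from torch import nn
    import torch.nn.functional as F

    def dynamic_semantic_consistency_loss(pretrained_net, dynamic_net, source_img, adapted_img):
        """KL divergence between the pretrained model's source predictions and the
        dynamic model's predictions on the corresponding adapted image."""
        with torch.no_grad():                              # the pretrained source model is frozen
            source_probs = F.softmax(pretrained_net(source_img), dim=1)
        adapted_log_probs = F.log_softmax(dynamic_net(adapted_img), dim=1)
        # KL(source || adapted); gradients flow only into dynamic_net (and into the
        # generator that produced adapted_img, if it is part of the graph).
        return F.kl_div(adapted_log_probs, source_probs, reduction="batchmean")

    # Toy usage with 1x1-convolution stand-ins for the segmentation networks.
    pretrained = nn.Conv2d(3, 19, 1)
    dynamic = nn.Conv2d(3, 19, 1)
    source = torch.rand(2, 3, 32, 32)
    adapted = torch.rand(2, 3, 32, 32)        # in MADAN this would come from the generator G_i
    loss = dynamic_semantic_consistency_loss(pretrained, dynamic, source, adapted)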
Various embodiments of the present disclosure may be applied to various applications including autonomous driving, robotic control, medical imaging, and fashion recommendation. Autonomous driving is disclosed specifically herein as an example for such applications. In general, an autonomous vehicle has an autonomous pilot system, which is used to control the autonomous driving of the vehicle. An arithmetic and control unit (ACU) of the autonomous vehicle may receive and process real time sensor data such as image data from a camera system of the autonomous vehicle. The image data is processed by the systems and methods disclosed herein in real time to generate one or more real time vehicle control (VC) commands. The one or more real time VC commands may include but not limited to acceleration, deceleration, making a turn, switching lanes, or the like, or any combination thereof. It should be understood that  application scenarios of the system and method disclosed herein are only some examples or embodiments. Those having ordinary skills in the art, without further creative efforts, may apply these drawings to other application scenarios, for example, another similar transportation system.
FIG. 1 is a schematic diagram illustrating an autonomous vehicle service system according to some embodiments of the present disclosure. In some embodiments, autonomous vehicle service system 100 may be an Internet of Things (IoT) platform including a server 110, a storage device 120, a network 130, an autonomous vehicle 140. The server 110 may further include a processing device 112.
In some embodiments, the server 110 may be a single server, or a server group. The server group may be centralized, or distributed (e.g., the server 110 may be a distributed system) . In some embodiments, the server 110 may be local or remote. For example, the server 110 may access at least one of information and data stored in at least one of the autonomous vehicle 140, and the storage device 120 via the network 130. As another example, the server 110 may be directly connected to at least one of the autonomous vehicle 140, and the storage device 120 to access stored at least one of information and data. In some embodiments, the server 110 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the server 110 may include a processing device 112. The processing device 112 may process at least one of information and data from the autonomous vehicle (s) 140. For example, the processing device 112 may receive a service request from a user (e.g., a driver or a passenger) . In some embodiments, the service request may include at least one of a pick-up location and a drop-off location. The processing device 112 may provide one or more routes from the pick-up location to the drop-off location. The processing device 112 may send the one or more routes to the autonomous device 140 via the network 130. In some embodiments, the service request may include entertainment needs such as, music needs, radio needs, movie needs, reading needs, etc. from the user, the processing device 112 may provide sources to satisfy the entertainment needs of the user in response to the service  request. In some embodiments, the service request may include one or more commands to operate the autonomous vehicle 140, for example, parking, slowing down, accelerating, controlling in-car temperature, etc. The processing device 112 may remotely operate the autonomous vehicle 140 via a built-in autonomous pilot system in response to the one or more commands. In some embodiments, the processing device 112 may include one or more processing engines (e.g., a single-core processor or a multi-core processor) . Merely by way of example, the processing device 112 may include a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
The storage device 120 may store at least one of data and instructions. In some embodiments, the storage device 120 may store data obtained from the autonomous vehicle 140. In some embodiments, the storage device 120 may store at least one of data and instructions that the server 110 may execute or use to perform example methods described in the present disclosure. In some embodiments, the storage device 120 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM), or the like, or any combination thereof. Example mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Example removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Example volatile read-and-write memory may include a random access memory (RAM). Example RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), and a zero-capacitor RAM (Z-RAM), etc. Example ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically-erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), and a digital versatile disk ROM, etc. In some embodiments, the storage may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the storage device 120 may be connected to the network 130 to communicate with one or more components of the autonomous vehicle service system 100 (e.g., the server 110, the autonomous vehicle 140) . One or more components in the autonomous vehicle service system 100 may access the data or instructions stored in the storage device 120 via the network 130. In some embodiments, the storage device 120 may be directly connected to or communicate with one or more components in the autonomous vehicle service system 100 (e.g., the server 110, the autonomous vehicle 140) . In some embodiments, the storage device 120 may be part of the server 110.
The network 130 may facilitate exchange of at least one of information and data. In some embodiments, one or more components in the autonomous vehicle service system 100 (e.g., the server 110, the storage device 120, and the autonomous vehicle 140) may send at least one of information and data to other component (s) in the autonomous vehicle service system 100 via the network 130. For example, the server 110 may obtain/acquire vehicle at least one of information and environment information around the vehicle via the network 130. In some embodiments, the network 130 may be any type of wired or wireless network, or combination thereof. Merely by way of example, the network 130 may include a cable network, a wireline network, an optical fiber network, a tele communications network, an intranet, an Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a wide area network (WAN) , a public telephone switched network (PSTN) , a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network 130 may include one or more network access points. For example, the network 130 may include wired or wireless network access points such as at least one of base stations and internet exchange points 130-1, 130-2, ..., through which one or more components of the autonomous vehicle service system 100 may be connected to the network 130 to exchange at least one of data and information.
In some embodiments, the autonomous vehicle 140 may include structures of a conventional vehicle, for example, a chassis, a suspension, steering, a braking, drivetrain components, an engine, and so on. In some embodiments, the autonomous vehicle 140 may include vehicles having various autonomous driving levels, such as, level 0 (i.e., No Automation, at level 0, the driver performs all operating tasks like steering, braking, accelerating or slowing down, and so forth) , level 1 (i.e., Driver Assistance, at level 1, the vehicle can assist with some functions, but the driver still handles all accelerating, braking, and monitoring of the surrounding environment) , level 2 (i.e., Partial Automation, at level 2, the vehicle can assist with steering or acceleration functions and allow the driver to disengage from some of their tasks, but the driver must always be ready to take control of the vehicle and is still responsible for most safety-critical functions and all monitoring of the environment) , level 3 (i.e., Conditional Automation, at level 3, the vehicle itself controls all monitoring of the environment, and the driver’s attention is still critical at this level, but can disengage from “safety critical” functions like braking and leave it to the technology when conditions are safe ) , level 4 (i.e., High Automation, at level 4, the vehicle is capable of steering, braking, accelerating, monitoring the vehicle and roadway as well as responding to events, determining when to change lanes, turn, and use signals. However, the automatic pilot system would first notify the driver when conditions are safe, and only then does the driver switch the vehicle into the auto pilot mode) , or level 5 (e.g., Complete Automation, at level 5, this level of autonomous driving requires absolutely no human attention. There is no need for pedals, brakes, or a steering wheel, as the automatic pilot system controls all critical tasks, monitoring of the environment and identification of unique driving conditions like traffic jams) .
In some embodiments, the autonomous vehicle 140 may be configured with one or more sensors such as a camera to detect at least one of internal information and external information surrounding the vehicle. For example, the external information may include environment information surrounding the vehicle, such as weather information, road condition information, traffic light information, obstacle information, pedestrian information, and so on. The internal information may include user pose information, user interaction information, and so on. In some embodiments, the autonomous vehicle  140 may be configured with a computing device 150 for controlling the autonomous vehicle in real time or near real time according to at least one of the internal information and external information. As used herein, the computing device 150 may refer to an arithmetic and control unit (ACU) . The ACU 150 may be various in forms. For example, the ACU 150 may include a mobile device, a tablet computer, a physical display screen (e.g., an LCD, an electronic ink display (E-Ink) , curved screen, a television device, a touch screen, etc. ) , or the like, or any combination thereof. In some embodiments, the mobile device may include, a wearable device, a mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the wearable device may include a bracelet, footgear, eyeglasses, a helmet, a watch, clothing, a backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the mobile device may include a mobile phone, a personal digital assistance (PDA) , a laptop, a tablet computer, a desktop, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google Glass TM, an Oculus Rift TM, a Hololens TM, a Gear VR TM, etc.
In some embodiments, the ACU may be configured with an autonomous pilot system for controlling the autonomous vehicle. The ACU may include a multi-core processor for processing one or more tasks from the autonomous pilot system. In some embodiments, at least one dedicated processing core of the ACU may be dedicated to process one or more real time vehicle controlling (VC) tasks to generate one or more real time VC commands according to the real time sensor data. In some embodiments, at least one of the real time VC commands and the non-real time VC commands may be transmitted to a vehicle controlling unit (VCU) for operating the vehicle. The VCU may include one or more controllers of the autonomous vehicle, for example, one or more throttle controllers, one or more spark controllers, one or more brake controllers, one or more steering controllers, an exhaust gas recycling (EGR) controller, a waste gate controller, and so on.
It should be noted that the descriptions above in relation to the ACU 150 is provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, various variations and modifications may be conducted under the guidance of the present disclosure. However, those variations and modifications do not depart the scope of the present disclosure. In some embodiments, the ACU 150 may include one or more other subunits. For example, the ACU 150 may include a storage subunit to store data generated by the subunits in the ACU 150. In some embodiments, any two of the subunits may be combined as a single unit.
In some embodiments, the autonomous vehicle 140 may communicate with one or more components of the autonomous vehicle service system 100 (e.g., the server 110, the storage device 120) via the network 130. In some embodiments, the autonomous vehicle 140 may communicate with other vehicles (not shown in FIG. 1) around the vehicle itself. For example, a first vehicle may obtain at least one of distance information and speed information regarding a second vehicle. When the second vehicle is so close to the first vehicle (e.g., a distance between two vehicles is less than or equal to a distance threshold) , the first vehicle may send an alert information to the second vehicle, which may avoid a potential vehicle accident.
In some embodiments, the autonomous vehicle 140 may be an electric vehicle, a fuel cell vehicle, a hybrid vehicle, or a conventional internal combustion engine vehicle. The autonomous vehicle 140 may include a body 142 and at least one wheel 144. The body 142 may include various body styles, such as a sports vehicle, a coupe, a sedan, a pick-up truck, a station wagon, a sports utility vehicle (SUV) , a minivan, or a conversion van. In some embodiments, the autonomous vehicle 140 may include a pair of front wheels and a pair of rear wheels, as illustrated in FIG. 1. In some embodiments, the autonomous vehicle 140 may have more or less wheels or equivalent structures that enable vehicle 140 to move around. The autonomous vehicle 140 may be configured to be all wheel drive (AWD) , front wheel drive (FWR) , or rear wheel drive (RWD) . In some embodiments, the autonomous vehicle 140 may be configured to be operated by at least one of an operator occupying the vehicle, remotely controlled, and autonomously controlled.
As described in connection with FIG. 1, the autonomous vehicle 140 may be equipped with an automatic pilot system configured to control the autonomous vehicle 140. The automatic pilot system may be implemented by an arithmetic and control unit (ACU) . The autonomous pilot system may be configured to operate the vehicle automatically. In some embodiments, the autonomous pilot system may obtain at least one of data and information from one or more sensors of the vehicle. In some embodiments, the autonomous pilot system may be categorized to three layers, that is, perception, planning, and control. The autonomous pilot system may perform one or more operations regarding at least one of the perception, the planning and the control. For example, in the perception layer, the autonomous pilot system may perform at least one of environment perception and localization based on the sensor data, such as weather detection, in-car temperature detection, lane detection, free drivable area detection, pedestrian detection, obstacle detection, traffic sign detection, and so on. As another example, in the planning layer, the autonomous pilot system may perform at least one of a mission planning, a behavior planning, and a motion planning according to at least one of the environment perception and localization. As a further example, in the control layer, the autonomous pilot system may generate one or more real time VC commands according to results of the perception layer and the planning layer. Specifically, the autonomous pilot system may generate the one or more real time VC commands based on at least one of feedback control and model predictive control. More descriptions regarding the embodiments with respect to the perception layer, the planning layer, and the control layer may be found in, e.g., an article “Pendleton, Scott Drew, et al. ″Perception, planning, control, and coordination for autonomous vehicles. ″Machines 5.1 (2017) : 6” , the contents of which are hereby incorporated by reference.
The autonomous vehicle 140 may include one or more sensors to provide information that is used to operate the vehicle automatically. The one or more sensors such as one or more cameras may detect at least one of internal information and external information regarding the autonomous vehicle 140 in real time or near real time. For example, the external information may include environment information around the vehicle, such as weather information, road condition information, traffic light information, obstacle information, pedestrian information, and so on. The internal  information may include user pose information, user interaction information, and so on. It is understood that the one or more sensors may also include various types of sensors, such as visual-sensing systems, laser-sensing systems, infrared-sensing systems, acoustic-sensing systems, thermal-sensing systems, or the like, or any combination thereof.
In some embodiments, the autonomous vehicle 140 may at least be configured with a positioning system. The positioning system may provide navigation information for the autonomous vehicle 140. The navigation information may include a current location of the vehicle, a destination of the vehicle, a velocity, an acceleration, a current time, or the like, or any combination thereof. The location may be in the form of coordinates, such as a latitude coordinate and a longitude coordinate. The positioning system may include but not limited to a compass navigation system (COMPASS) , a global positioning system (GPS) , a BeiDou navigation satellite system, a Galileo positioning system, a quasi-zenith satellite system (QZSS) , and so on.
In some embodiments, the visual-sensing system comprises a video or image capture system or image source 170 including  cameras  172 and 174 configured to acquire a video composed of a plurality of images (also referred to as “video frames” ) or still images.
The  camera  172 or 174 may be able to capture one or more images. As used in this application, an image may be a still image, a video, a stream video, or a video frame obtained from a video. In some embodiments, the  camera  172 or 174 may be a digital camera, a video camera, a security camera, a web camera, a smartphone, a tablet, a laptop, a video gaming console equipped with a web camera, a camera with multiple lenses, etc.
The  camera  172 or 174 may include a lens, a shutter, a sensor, a processing device, and a storage device. The lens may be an optical device that focuses a light beam by means of refraction to form an image. In some embodiments, the lens may include one or more lenses. The lens may be configured to intake a scene. An aperture of the lens may refer to the size of the hole through which light passes to reach the sensor. The aperture may be adjustable to adjust the amount of light that passes  through the lens. The focal length of the lens may be adjustable to adjust the coverage of the camera.
The shutter may be opened to allow light through the lens when an image is captured. The shutter may be controlled manually or automatically by the processing device.
The sensor may be configured to receive light passing through the lens and transform the light signals of the received light into electrical signals. The sensor may include charge coupled device (CCD) and complementary metal-oxide semiconductor (CMOS) . The sensor may be in communication with the logic circuits and may be configured to detect the scene from the lens and transform the scene into electronic signals.
A “video” provided by the video or image capture system or image source 170 may include a plurality of frames, which may also be referred to as video frames. A frame may be one of a plurality of still images that compose a complete video. The frames of a video are captured at a rate called the frame rate, such as 24 frames per second (fps), 30 fps, 60 fps, etc.
The video frames to be transmitted may be stored in a buffer in the ACU 150 in a form of a video frame buffering queue, which may be managed by a buffer manager. The buffer may use a queue based data structure for buffering the video to be transmitted.
The buffer may be a storage device for buffering the video to be transmitted. The buffer may include a mass storage device, a removable storage device, a volatile read-and-write memory, a read-only memory (ROM), or the like, or any combination thereof. Example mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Example removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Example volatile read-and-write memory may include a random-access memory (RAM), such as a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), and a zero-capacitor RAM (Z-RAM). Example ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), and a digital versatile disk ROM, etc.
FIG. 2 is a schematic diagram illustrating example hardware and/or software components of an example 200 of the computing device 150 according to some embodiments of the present disclosure. For example, the computing device 200 may be the display control device or a part of it. As illustrated in FIG. 2, the computing device 200 may include a processor 222, a storage 227, an input/output (I/O) 226, and a communication port 225.
The processor 222 (e.g., logic circuits) may execute computer instructions (e.g., program code) and perform functions in accordance with techniques described herein. For example, the processor 222 may include interface circuits and processing circuits therein. The interface circuits may be configured to receive electronic signals from a bus 221, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process. The processing circuits may conduct logical operations and calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 221.
The computer instructions may include, for example, routines, programs, objects, components, data structures, procedures, modules, and functions, which perform particular functions described herein. In some embodiments, the processor 222 may include one or more hardware processors, such as a microcontroller, a microprocessor, a reduced instruction set computer (RISC), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a central processing unit (CPU), a graphics processing unit (GPU), a physics processing unit (PPU), a microcontroller unit, a digital signal processor (DSP), a field programmable gate array (FPGA), an advanced RISC machine (ARM), a programmable logic device (PLD), any circuit or processor capable of executing one or more functions, or the like, or any combinations thereof.
Merely for illustration, only one processor is described in the computing device 200. However, it should be noted that the computing device 200 in the present disclosure may also include multiple processors, thus operations and/or method steps  that are performed by one processor as described in the present disclosure may also be jointly or separately performed by the multiple processors. For example, if in the present disclosure the processor of the computing device 200 executes both step A and step B, it should be understood that step A and step B may also be performed by two or more different processors jointly or separately in the computing device 200 (e.g., a first processor executes step A and a second processor executes step B, or the first and second processors jointly execute steps A and B) .
The storage 227 may store data/information obtained from the image source 170, and/or the ACU 150. In some embodiments, the storage 227 may include a mass storage, removable storage, a volatile read-and-write memory, a read-only memory (ROM), or the like, or any combination thereof. For example, the mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. The removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. The volatile read-and-write memory may include a random-access memory (RAM), which may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), and a zero-capacitor RAM (Z-RAM), etc. The ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), and a digital versatile disk ROM, etc. In some embodiments, the storage 227 may store one or more programs and/or instructions to perform example methods described in the present disclosure. For example, the storage 227 may store a program for the processing engine (e.g., the server 103) for determining a regularization item.
The I/O 226 may input and/or output signals, data, information, etc. In some embodiments, the I/O 226 may include one or more input ports and one or more output ports. The one or more input ports (also referred to as data acquisition port) may be configured to acquire data/information, such as a channel of video signal.
The communication port 225 may be connected to a network to facilitate data communications. The communication port 225 may establish connections between the computing device 200 and the image source 170 and/or the ACU 150. The connection may be a wired connection, a wireless connection, any other communication connection that can enable data transmission and/or reception, and/or any combination of these connections. The wired connection may include, for example, an electrical cable, an optical cable, a telephone wire, or the like, or any combination thereof. The wireless connection may include, for example, a Bluetooth TM link, a Wi-Fi TM link, a WiMax TM link, a WLAN link, a ZigBee link, a mobile network link (e.g., 3G, 4G, 5G), or the like, or a combination thereof. In some embodiments, the communication port 225 may be and/or include a standardized communication port, such as RS232, RS485, etc. In some embodiments, the communication port 225 may be a specially designed communication port.
FIG. 3 is a schematic diagram illustrating hardware and/or software components of an example 300 of the ACU 150 according to some embodiments of the present disclosure. As illustrated in FIG. 3, the ACU example 300 includes a communication platform 310, a display 320, a graphics processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a mobile operating system (OS) 370, and storage 390. In some embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown) , may also be included in the ACU 300.
In some embodiments, the operating system 370 and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340. The applications 380 may include a browser or any other suitable application for receiving and rendering information relating to positioning or other information from the processing device 112. User interactions with the information stream may be achieved via the I/O 350 and provided to the processing device 112 and/or other components of the autonomous driving system 100 via the network 130.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein. A computer with user interface elements may be used to implement a personal computer (PC) or any other type of workstation or terminal device. A computer may also act as a server if appropriately programmed.
Some embodiments 400 of the systems and methods disclosed herein for multi-source domain adaptation for semantic segmentation are illustrated in the flow chart of FIG. 4. First, multiple source domains that comprise images from a plurality of single sources and a target domain that comprises corresponding target images are obtained at step 410. An adapted domain is then generated for each source domain with dynamic semantic consistency while aligning at the pixel level cycle-consistently towards the target in step 420. A sub-domain aggregation discriminator and a cross-domain cycle discriminator are used to make the different adapted domains more closely aggregated in step 430. Finally, feature-level alignment is performed between the aggregated domain and the target domain while training the segmentation network in step 440.
For example, suppose M source domains S_1, S_2, …, S_M and one target domain T are first obtained. In an unsupervised domain adaptation (UDA) scenario, S_1, S_2, …, S_M are labeled and T is fully unlabeled. For the i-th source domain S_i, the observed images and corresponding labels drawn from the source distribution p_i (x, y) are $X_i = \{\mathbf{x}_i^j\}_{j=1}^{N_i}$ and $Y_i = \{\mathbf{y}_i^j\}_{j=1}^{N_i}$, where N_i is the number of images in the i-th source domain. The target images drawn from the target distribution p_T (x, y) are $X_T = \{\mathbf{x}_T^j\}_{j=1}^{N_T}$ without label observation, where N_T is the number of target images.
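For illustration only, the M labeled source sets and the unlabeled target set may be organized as follows; the function and argument names are assumptions, and any dataset objects yielding (image, label) pairs for the sources and images only for the target could be substituted.

```python
from torch.utils.data import DataLoader

def make_loaders(source_datasets, target_dataset, batch_size=8):
    # One labeled loader per source domain S_1, ..., S_M (images and pixel-wise labels).
    source_loaders = [DataLoader(ds, batch_size=batch_size, shuffle=True)
                      for ds in source_datasets]
    # A single unlabeled loader for the target images X_T.
    target_loader = DataLoader(target_dataset, batch_size=batch_size, shuffle=True)
    return source_loaders, target_loader
```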
The method disclosed herein is based on covariate shift and concept drift such as those disclosed by Vishal M Patel et al. in IEEE Signal Processing Magazine, 32(3): 53-69, 2015. Unless otherwise specified, the following assumptions are made when performing the methods disclosed herein: (1) homogeneity, i.e. $\mathbf{x}_i^j \in \mathbb{R}^{d}$ and $\mathbf{x}_T^j \in \mathbb{R}^{d}$, indicating that the data from different domains are observed in the same image space but exhibit different distributions; (2) closed set, i.e. $\mathbf{y}_i^j \in \gamma$ and $\mathbf{y}_T^j \in \gamma$, where γ is the class label space, indicating that all the domains share the same set of classes. An adaptation model that can correctly predict a sample from the target domain based on $\{(X_i, Y_i)\}_{i=1}^{M}$ and $\{X_T\}$ is learned. The systems and methods disclosed herein can be easily extended to other MDA applications, such as to tackle heterogeneous DA by changing the network structure of the feature extractor, to open set DA by adding an “unknown” class, or to category shift DA by training the task network with only those samples that belong to a specified category. Example heterogeneous DA is disclosed by Wen Li et al. in IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6): 1134-1148, 2014 and by Yao-Hung Hubert Tsai et al. in IEEE Conference on Computer Vision and Pattern Recognition, pages 5081-5090, 2016. Example open set DA is disclosed by Pau Panareda Busto et al. in IEEE International Conference on Computer Vision, pages 754-763, 2017. Example category shift DA is disclosed by Ruijia Xu et al. in IEEE Conference on Computer Vision and Pattern Recognition, pages 3964-3973, 2018.
The Multi-source Adversarial Domain Aggregation Network (MADAN) framework for semantic segmentation adaptation is illustrated in the flow chart of FIG. 5, which consists of three components: Dynamic Adversarial Image Generation (DAIG) , Adversarial Domain Aggregation (ADA) , and Feature-aligned Semantic Segmentation (FSS) . In FIG. 5, the colored solid arrows represent generators, while the black solid arrows indicate the segmentation network F. The dashed arrows correspond to different losses. DAIG aims to generate adapted images from source domains to the target domain from the perspective of visual appearance while preserving the semantic information with a dynamic segmentation model. In order to reduce the distances among the adapted domains and thus generate a more aggregated unified domain, ADA is performed, including Cross-domain Cycle Discriminator (CCD) and Sub-domain Aggregation Discriminator (SAD) . Finally, FSS learns the domain-invariant representations at the feature-level in an adversarial manner. Table 1 compares MADAN with several state-of-the-art DA methods.
Table 1: Comparison of the MADAN model with several state-of-the-art domain adaptation methods. The full names of each property from the second to the last columns are pixel-level alignment, feature-level alignment, semantic consistency, cycle consistency, multiple sources, domain aggregation, one task network, and fine-grained prediction, respectively.
The goal of DAIG is to make images from different source domains visually similar to the target images, as if they are drawn from the same target domain distribution. To this end, for each source domain S_i, a generator $G_{S_i \to T}$ mapping to the target T is introduced in order to generate adapted images that fool $D_T$, which is a pixel-level adversarial discriminator. $D_T$ is trained simultaneously with each $G_{S_i \to T}$ to classify real target images $X_T$ from adapted images $G_{S_i \to T} (X_i)$. The corresponding GAN loss function is:

$\mathcal{L}_{GAN}^{S_i \to T} (G_{S_i \to T}, D_T, X_i, X_T) = \mathbb{E}_{\mathbf{x}_T \sim X_T} [\log D_T (\mathbf{x}_T)] + \mathbb{E}_{\mathbf{x}_i \sim X_i} [\log (1 - D_T (G_{S_i \to T} (\mathbf{x}_i)))]$  (1)
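Merely as a sketch, the adversarial objective of Eq. (1) may be computed as follows using the common binary cross-entropy surrogate of the log-likelihood terms; G_i2t and D_T stand for user-provided generator and discriminator networks and are not specific modules of the disclosure.

```python
import torch
import torch.nn.functional as F_nn

def pixel_gan_losses(G_i2t, D_T, x_i, x_t):
    """Sketch of Eq. (1): pixel-level adversarial loss for G_{S_i->T} against D_T."""
    bce = F_nn.binary_cross_entropy_with_logits
    fake_t = G_i2t(x_i)                                   # adapted image G_{S_i->T}(x_i)
    real_logits = D_T(x_t)
    fake_logits = D_T(fake_t.detach())
    # Discriminator: real target images -> 1, adapted images -> 0.
    d_loss = (bce(real_logits, torch.ones_like(real_logits)) +
              bce(fake_logits, torch.zeros_like(fake_logits)))
    # Generator: try to make the adapted image be classified as a real target image.
    gen_logits = D_T(fake_t)
    g_loss = bce(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```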
Since the mapping $G_{S_i \to T}$ is highly under-constrained, an inverse mapping $G_{T \to S_i}$ as well as a cycle-consistency loss is employed to enforce $G_{T \to S_i} (G_{S_i \to T} (\mathbf{x}_i)) \approx \mathbf{x}_i$ and vice versa. Similarly, a discriminator $D_i$ is introduced to classify $X_i$ from $G_{T \to S_i} (X_T)$ with the following GAN loss:

$\mathcal{L}_{GAN}^{T \to S_i} (G_{T \to S_i}, D_i, X_T, X_i) = \mathbb{E}_{\mathbf{x}_i \sim X_i} [\log D_i (\mathbf{x}_i)] + \mathbb{E}_{\mathbf{x}_T \sim X_T} [\log (1 - D_i (G_{T \to S_i} (\mathbf{x}_T)))]$  (2)
The cycle-consistency loss, which ensures that the learned mappings $G_{S_i \to T}$ and $G_{T \to S_i}$ are cycle-consistent and thereby prevents them from contradicting each other, is defined as:

$\mathcal{L}_{cyc}^{S_i \leftrightarrow T} (G_{S_i \to T}, G_{T \to S_i}) = \mathbb{E}_{\mathbf{x}_i \sim X_i} [\| G_{T \to S_i} (G_{S_i \to T} (\mathbf{x}_i)) - \mathbf{x}_i \|_1] + \mathbb{E}_{\mathbf{x}_T \sim X_T} [\| G_{S_i \to T} (G_{T \to S_i} (\mathbf{x}_T)) - \mathbf{x}_T \|_1]$  (3)
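A minimal sketch of the cycle-consistency term of Eq. (3) is given below, assuming G_i2t and G_t2i are the two generators of the i-th CycleGAN.

```python
import torch

def cycle_consistency_loss(G_i2t, G_t2i, x_i, x_t):
    """Sketch of Eq. (3): L1 reconstruction error after a round trip through both mappings."""
    rec_i = G_t2i(G_i2t(x_i))   # S_i -> T -> S_i
    rec_t = G_i2t(G_t2i(x_t))   # T -> S_i -> T
    return torch.mean(torch.abs(rec_i - x_i)) + torch.mean(torch.abs(rec_t - x_t))
```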
The adapted images should contain the same semantic information as the original source images. The semantic consistency is only partially constrained by the cycle-consistency loss. The semantic consistency loss in CyCADA was proposed to better preserve semantic information. Example CyCADA is disclosed by Judy Hoffman et al. in International Conference on Machine Learning, pages 1994-2003, 2018. In CyCADA, $\mathbf{x}_i$ and $G_{S_i \to T} (\mathbf{x}_i)$ are both fed into a segmentation model $F_i$ pretrained on $(X_i, Y_i)$. However, since $\mathbf{x}_i$ and $G_{S_i \to T} (\mathbf{x}_i)$ are from different domains, employing the same segmentation model $F_i$ to obtain the segmentation results and then computing the semantic consistency loss may be detrimental to image generation. Ideally, the adapted images $G_{S_i \to T} (X_i)$ would be fed into a network $F_T$ trained on the target domain, which is infeasible since target domain labels are not available in UDA. Instead of employing $F_i$ on $G_{S_i \to T} (X_i)$, a dynamically updated network $F_A$ that takes $G_{S_i \to T} (X_i)$ as input is employed, so that its input domain, i.e. the domain that the network performs best on, gradually changes from that of $F_i$ to that of $F_T$. Specifically, the task segmentation model F trained on the adapted domain is employed as $F_A$, i.e. $F_A = F$, which has two advantages: (1) $G_{S_i \to T} (X_i)$ becomes the input domain of $F_A$, and as F is trained to have better performance on the target domain, the semantic loss after $F_A$ promotes $G_{S_i \to T}$ to generate images that are closer to the target domain at the pixel level; (2) since $F_A$ and F can share parameters, no additional training or memory space is introduced, which is quite efficient. The dynamic semantic consistency (DSC) loss disclosed herein is defined as:

$\mathcal{L}_{sem}^{S_i \to T} (G_{S_i \to T}, F_i, F_A) = \mathbb{E}_{\mathbf{x}_i \sim X_i} \mathrm{KL} \big( F_A (G_{S_i \to T} (\mathbf{x}_i)) \,\|\, F_i (\mathbf{x}_i) \big)$  (4)

where $\mathrm{KL} (\cdot \| \cdot)$ is the KL divergence between two distributions.
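Merely for illustration, the DSC loss of Eq. (4) may be computed per pixel over the class dimension as follows; here F_i is assumed to be the segmentation model pretrained on (X_i, Y_i) and kept fixed, while F_A shares parameters with the task network F.

```python
import torch
import torch.nn.functional as F_nn

def dsc_loss(F_i, F_A, G_i2t, x_i):
    """Sketch of Eq. (4): KL(F_A(G_{S_i->T}(x_i)) || F_i(x_i)), averaged over pixels."""
    with torch.no_grad():
        src_log_p = F_nn.log_softmax(F_i(x_i), dim=1)          # fixed reference F_i(x_i)
    adapted_log_p = F_nn.log_softmax(F_A(G_i2t(x_i)), dim=1)   # dynamic prediction
    adapted_p = adapted_log_p.exp()
    # KL(P || Q) = sum_c P * (log P - log Q), averaged over batch and spatial positions.
    return (adapted_p * (adapted_log_p - src_log_p)).sum(dim=1).mean()
```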
One strategy is to train a separate segmentation model for each adapted domain and combine the predictions of the different models with specific weights for target images; another is to combine all adapted domains together and train a single model. For the first strategy, the weights for the different adapted domains need to be selected, and each target image needs to be fed into all segmentation models at inference time. For the second strategy, since the alignment space is high-dimensional, the adapted domains, although relatively aligned with the target, may be significantly misaligned with each other. In order to mitigate this issue, adversarial domain aggregation is used to make the different adapted domains more closely aggregated with two kinds of discriminators. One is the sub-domain aggregation discriminator, which is designed to directly make the different adapted domains indistinguishable. For $S_i$, a sub-domain aggregation discriminator $D_i^A$ is introduced with the following loss function:
$\mathcal{L}_{SAD}^{i} (G_{S_1 \to T}, \dots, G_{S_M \to T}, D_i^A) = \mathbb{E}_{\mathbf{x}_i \sim X_i} \big[ \log D_i^A (G_{S_i \to T} (\mathbf{x}_i)) \big] + \sum_{j \neq i} \mathbb{E}_{\mathbf{x}_j \sim X_j} \big[ \log \big( 1 - D_i^A (G_{S_j \to T} (\mathbf{x}_j)) \big) \big]$  (5)
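One possible form of the sub-domain aggregation discriminator update of Eq. (5) is sketched below; D_agg_i denotes a hypothetical binary discriminator for the adapted domain of S_i, and the generators would be updated with the labels flipped so that the adapted domains become indistinguishable.

```python
import torch
import torch.nn.functional as F_nn

def sad_discriminator_loss(D_agg_i, adapted_i, adapted_others):
    """Sketch of Eq. (5): separate adapted images of S_i from the other adapted domains."""
    bce = F_nn.binary_cross_entropy_with_logits
    own_logits = D_agg_i(adapted_i)                       # G_{S_i->T}(x_i) -> label 1
    loss = bce(own_logits, torch.ones_like(own_logits))
    for adapted_j in adapted_others:                      # G_{S_j->T}(x_j), j != i -> label 0
        other_logits = D_agg_i(adapted_j)
        loss = loss + bce(other_logits, torch.zeros_like(other_logits))
    return loss
```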
The other is the cross-domain cycle discriminator. For each source domain $S_i$, the images from the other adapted domains $G_{S_j \to T} (X_j)$, j = 1, …, M, j ≠ i, are transferred back to $S_i$ using $G_{T \to S_i}$, and a cross-domain cycle discriminator $D_i^C$ is employed to classify $X_i$ from $G_{T \to S_i} (G_{S_j \to T} (X_j))$, which corresponds to the following loss function:
$\mathcal{L}_{CCD}^{i} (G_{S_1 \to T}, \dots, G_{S_M \to T}, G_{T \to S_i}, D_i^C) = \mathbb{E}_{\mathbf{x}_i \sim X_i} \big[ \log D_i^C (\mathbf{x}_i) \big] + \sum_{j \neq i} \mathbb{E}_{\mathbf{x}_j \sim X_j} \big[ \log \big( 1 - D_i^C (G_{T \to S_i} (G_{S_j \to T} (\mathbf{x}_j))) \big) \big]$  (6)
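Analogously, a sketch of the cross-domain cycle discriminator loss of Eq. (6) is given below; D_cyc_i is a hypothetical binary discriminator for source domain S_i, and adapted_others contains adapted images from the other source domains.

```python
import torch
import torch.nn.functional as F_nn

def ccd_discriminator_loss(D_cyc_i, G_t2i, x_i, adapted_others):
    """Sketch of Eq. (6): separate real S_i images from images cycled back into S_i."""
    bce = F_nn.binary_cross_entropy_with_logits
    real_logits = D_cyc_i(x_i)                            # real x_i -> label 1
    loss = bce(real_logits, torch.ones_like(real_logits))
    for adapted_j in adapted_others:                      # G_{S_j->T}(x_j), j != i
        cycled = G_t2i(adapted_j)                         # transferred back to S_i
        fake_logits = D_cyc_i(cycled)
        loss = loss + bce(fake_logits, torch.zeros_like(fake_logits))
    return loss
```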
After adversarial domain aggregation, the adapted images of the different domains X′_i (i = 1, …, M) are more closely aggregated and aligned. Meanwhile, the semantic consistency loss in dynamic adversarial image generation ensures that the semantic information, i.e. the segmentation labels, is preserved before and after image translation. Suppose the images of the unified aggregated domain are $X' = \bigcup_{i=1}^{M} G_{S_i \to T} (X_i)$ and the corresponding labels are $Y' = \bigcup_{i=1}^{M} Y_i$. A task segmentation model F is then trained based on X′ and Y′ with the following cross-entropy loss:

$\mathcal{L}_{seg} (F, X', Y') = - \mathbb{E}_{(\mathbf{x}', \mathbf{y}') \sim (X', Y')} \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{w=1}^{W} \mathbb{1}_{[l = \mathbf{y}'_{h, w}]} \log \big( \sigma \big( F_{l, h, w} (\mathbf{x}') \big) \big)$  (7)

where L is the number of classes, H and W are the height and width of the adapted images, σ is the softmax function, $\mathbb{1}$ is an indicator function, and $F_{l, h, w} (\mathbf{x}')$ is the value of F (x′) at index (l, h, w). Further, a feature-level alignment between X′ and X_T is imposed to improve the segmentation performance during inference of X_T on the segmentation model F by introducing a discriminator $D_F$. The feature-level GAN loss is defined as:

$\mathcal{L}_{feat} (F, D_F, X', X_T) = \mathbb{E}_{\mathbf{x}_T \sim X_T} \big[ \log D_F (F (\mathbf{x}_T)) \big] + \mathbb{E}_{\mathbf{x}' \sim X'} \big[ \log \big( 1 - D_F (F (\mathbf{x}')) \big) \big]$  (8)
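A sketch of the task segmentation loss of Eq. (7) together with the feature-level adversarial loss of Eq. (8) follows; the split of F into an encoder and a classifier, the void-label index 255, and the direction of the adversarial labels are illustrative assumptions.

```python
import torch
import torch.nn.functional as F_nn

def segmentation_and_feature_losses(encoder, classifier, D_F, x_adapted, y_adapted, x_target):
    """Sketch of Eq. (7) and Eq. (8); encoder and classifier together form the task network F."""
    bce = F_nn.binary_cross_entropy_with_logits
    feat_adapted = encoder(x_adapted)
    feat_target = encoder(x_target)

    # Eq. (7): pixel-wise cross-entropy on the aggregated adapted domain (X', Y').
    logits = classifier(feat_adapted)                     # shape (B, L, H, W)
    seg_loss = F_nn.cross_entropy(logits, y_adapted, ignore_index=255)

    # Eq. (8), discriminator side: D_F separates target features from adapted features.
    t_logits = D_F(feat_target.detach())
    a_logits = D_F(feat_adapted.detach())
    d_feat_loss = (bce(t_logits, torch.ones_like(t_logits)) +
                   bce(a_logits, torch.zeros_like(a_logits)))

    # Eq. (8), network side: push adapted features toward the target feature distribution.
    adv_logits = D_F(feat_adapted)
    g_feat_loss = bce(adv_logits, torch.ones_like(adv_logits))
    return seg_loss, d_feat_loss, g_feat_loss
```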
The MADAN learning framework disclosed herein utilizes adaptation techniques including pixel-level alignment, cycle-consistency, semantic consistency, domain aggregation, and feature-level alignment. Combining all these components, the overall objective loss function of MADAN is:

$\mathcal{L}_{MADAN} = \sum_{i=1}^{M} \big[ \mathcal{L}_{GAN}^{S_i \to T} + \mathcal{L}_{GAN}^{T \to S_i} + \mathcal{L}_{cyc}^{S_i \leftrightarrow T} + \mathcal{L}_{sem}^{S_i \to T} + \mathcal{L}_{SAD}^{i} + \mathcal{L}_{CCD}^{i} \big] + \mathcal{L}_{seg} + \mathcal{L}_{feat}$  (9)

The training process corresponds to solving for a target model F according to the optimization:

$F^{*} = \arg \min_{F} \min_{G} \max_{D} \mathcal{L}_{MADAN} (G, D, F)$  (10)

where G and D represent all the generators and discriminators in Eq. (9), respectively.
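The alternating solution of Eq. (10) may be organized, purely as an illustrative sketch, along the following lines; the losses object is a hypothetical helper that sums the terms of Eqs. (1)-(8), and the single-step alternation between discriminator and generator/task updates is an assumption rather than a schedule specified by the disclosure.

```python
def train_step(batch_sources, batch_target, generators, discriminators,
               F_seg, opt_D, opt_G, opt_F, losses):
    """One illustrative alternating update for the min-max problem of Eq. (10)."""
    # 1) Update all discriminators (D_T, D_i, the aggregation discriminators, and D_F)
    #    to maximize the adversarial terms of Eq. (9).
    opt_D.zero_grad()
    d_total = losses.discriminator_total(batch_sources, batch_target,
                                         generators, discriminators, F_seg)
    d_total.backward()
    opt_D.step()

    # 2) Update the generators and the task network F to minimize Eq. (9).
    opt_G.zero_grad()
    opt_F.zero_grad()
    g_total = losses.generator_and_task_total(batch_sources, batch_target,
                                              generators, discriminators, F_seg)
    g_total.backward()
    opt_G.step()
    opt_F.step()
```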
At least one of the process and method 400 may be executed by at least one computing device in an autonomous vehicle (e.g., the ACU 150 or the computing device 200) . For example, at least one of the process and method 400 may be implemented as a set of instructions (e.g., an application) stored in a non-transitory computer readable storage medium (e.g., the storage device 227) . At least one processor of the computing device (e.g., the processor 222 of the computing device 200) may execute the set of instructions and may accordingly be directed to perform at least one of the process and method 400 via at least one of receiving and sending electronic signals.
It should be noted that the above description is merely provided for the purpose of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.
EXAMPLES
In the section below, the experimental settings are first introduced and then the segmentation results of the MADAN framework and several existing approaches are compared both quantitatively and qualitatively, followed by some ablation studies. The domain generalization performance to unseen domains is also evaluated.
In the following adaptation experiments, the synthetic GTA and SYNTHIA datasets are used as source domains and the real Cityscapes dataset is used as the target domain. BDDS (Berkeley DeepDrive Segmentation) is also used as a target domain in the domain generalization experiments.
Datasets
Cityscapes such as those disclosed by Marius Cordts et al. (IEEE Conference on Computer Vision and Pattern Recognition, pages 3213-3223, 2016) contains vehicle-centric urban street images collected from a moving vehicle in 50 cities from Germany and neighboring countries. There are 5,000 images with pixel-wise annotations, including a training set with 2,975 images, a validation set with 500 images, and a test set with 1,525 images. The images have a resolution of 2048 × 1024 and are labeled into 19 classes.
GTA such as those disclosed by Stephan R Richter et al. (European Conference on Computer Vision, pages 102-118, 2016) is a vehicle-egocentric image dataset collected in the high-fidelity rendered computer game GTA-V with pixel-wise semantic labels. It contains 24,966 images (video frames) with a resolution of 1914 × 1052. There are 19 classes compatible with Cityscapes.
SYNTHIA such as those disclosed by German Ros et al. (IEEE Conference on Computer Vision and Pattern Recognition, pages 3234-3243, 2016) is a large synthetic dataset. To pair with Cityscapes, a subset named SYNTHIA-RAND-CITYSCAPES is designed with 9,400 images with a resolution of 960 × 720, which are automatically annotated with 16 object classes, one void class, and some unnamed classes.
BDDS such as those disclosed by Fisher Yu et al. (Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv: 1805.04687, 2018) contains thousands of real-world dashcam video frames with accurate pixel-wise annotations. It has a label space compatible with Cityscapes and an image resolution of 1280 × 720. There are 7,000, 1,000, and 2,000 images for training, validation, and testing, respectively.
To demonstrate the effectiveness of the MADAN framework for semantic segmentation, the following methods are used for comparison. (1) Source only, i.e. train on the source domains and test on the target domain directly. The source-only approach is viewed as a lower bound of DA. (2) Single-source DA, i.e. perform multi-source DA via single-source DA, including FCNs Wld, CDA, ROAD, AdaptSeg, CyCADA, and DCAN. (3) Multi-source DA, i.e. extend a single-source DA method to the multi-source setting, including MDAN. For comparison, the results of an oracle setting are also reported, where the segmentation model is both trained and tested on the target domain. For the source-only and single-source DA settings, two strategies were employed: (1) single source, i.e. performing adaptation on each single source; (2) source-combine, i.e. all source domains are combined into a traditional single source. For MDAN, the original classification network is extended for the segmentation tasks disclosed herein.
In some embodiments, methods such as those disclosed by Judy Hoffman et al. (Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv: 1612.02649, 2016), Yang Zhang et al. (IEEE International Conference on Computer Vision, pages 2020-2030, 2017), and Judy Hoffman et al. (International Conference on Machine Learning, pages 1994-2003, 2018) were followed to employ class-wise intersection-over-union (cwIoU) and mean IoU (mIoU) to evaluate the segmentation results of each class and of all classes. Let $P_l$ and $G_l$ respectively denote the predicted and ground-truth pixels that belong to class l; then

$\mathrm{cwIoU}_l = \frac{|P_l \cap G_l|}{|P_l \cup G_l|}, \qquad \mathrm{mIoU} = \frac{1}{L} \sum_{l=1}^{L} \mathrm{cwIoU}_l$
where |·| denotes the cardinality of a set. Larger cwIoU and mIoU values represent better performances.
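Merely for illustration, the cwIoU and mIoU metrics may be computed from a confusion matrix as in the following sketch; preds and gts are assumed to be integer label maps of identical shape, and 255 is assumed to mark ignored pixels.

```python
import numpy as np

def compute_iou(preds, gts, num_classes=16, ignore_label=255):
    """Sketch: class-wise IoU and mean IoU over predicted / ground-truth label maps."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        mask = gt != ignore_label
        idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
        conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)                                 # |P_l intersect G_l|
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter   # |P_l union G_l|
    cw_iou = inter / np.maximum(union, 1)
    return cw_iou, cw_iou.mean()
```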
In some embodiments, the MADAN could be trained in an end-to-end manner. In other embodiments, the MADAN is trained in three stages. First, two CycleGANs such as those disclosed by Jun-Yan Zhu et al. (IEEE International Conference on Computer Vision, pages 2223-2232, 2017) are trained without semantic consistency losses, one for each source-to-target mapping $G_{S_i \to T}$ and its inverse $G_{T \to S_i}$, and an FCN F is then trained on the adapted images with corresponding labels from the source domains. Second, after updating $F_A$ with the F trained above, adapted images are generated using CycleGAN with the proposed semantic consistency loss in Eq. (4), and the different adapted domains are aggregated using the cross-domain cycle discriminator and the sub-domain aggregation discriminator. Finally, an FCN is trained on the newly adapted images in the aggregated domain with feature-level alignment. The above stages are trained iteratively.
In the experiments, FCN such as those disclosed by Jonathan Long et al. (IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015) is used as the semantic segmentation network. In some embodiments, VGG-16 such as those disclosed by Karen Simonyan et al. (International Conference on Learning Representations, 2015) is used as the FCN backbone. The weights of the feature extraction layers in the networks are initialized from models trained on ImageNet such as those disclosed by Jia Deng et al. (IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255, 2009). The network is implemented in PyTorch and trained with the Adam optimizer such as those disclosed by Diederik P Kingma et al. (International Conference on Learning Representations, 2015) using a batch size of 8 with an initial learning rate of 1e-4. The machines used are equipped with 4 NVIDIA Tesla P40 GPUs and 20 Intel (R) Xeon (R) E5-2630 CPUs.
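For illustration, the optimizer configuration described above may be set up as follows. Torchvision does not provide an FCN with a VGG-16 backbone, so an FCN with a ResNet-50 backbone is used here purely as a stand-in for the VGG-16-based FCN of the experiments; the learning rate and class count follow the text.

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

# Stand-in segmentation network (the experiments use an FCN with a VGG-16 backbone).
model = fcn_resnet50(num_classes=16)

# Adam optimizer with the stated initial learning rate of 1e-4; a batch size of 8 is used.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```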
All the images in SYNTHIA, GTA, and Cityscapes are resized to 600 × 1080 and are then cropped to 400 × 400 during the training of the pixel-level adaptation for 20 epochs. The sub-domain aggregation discriminator and the cross-domain cycle discriminator are frozen during the first 5 and 10 epochs, respectively. There are 16 and 19 different classes in SYNTHIA and GTA, respectively. In the experiments, the 16 intersection classes are taken for all mIoU evaluations.
In some of the experiments, mean intersection-over-union (mIoU) is employed to evaluate the segmentation results. For example, sixteen intersection classes of GTA and SYNTHIA, compatible with Cityscapes, are taken for all mIoU evaluations.
The performance comparisons between the MADAN model framework and the other baselines, including source-only, single-source DA, and multi-source DA, as measured by class-wise IoU and mIoU are shown in Table 2.
Table 2: Comparison with the state-of-the-art DA methods for semantic segmentation from GTA and SYNTHIA to Cityscapes. The best class-wise IoU and mIoU are emphasized in bold.
As shown in Table 2, the source-only method, i.e. directly transferring the models trained on the source domains to the target domain, performs the worst in most adaptation settings. Due to the presence of domain shift or dataset bias, the joint probability distributions of observed images and class labels greatly differ between the source and target domains. This results in the model’s low transferability from the source domains to the target domain. Simply combining different source domains performs better than each single source, which indicates the superiority of multiple sources over single source despite the domain shift among different sources.
Comparing source-only with single-source DA respectively on GTA and SYNTHIA, it is clear that all adaptation methods perform better, which demonstrates the effectiveness of domain adaptation in semantic segmentation. Comparing the results of CyCADA in the single-source and source-combine settings, it appears that simply combining different source domains and performing single-source DA may result in performance degradation.
The MADAN framework disclosed herein achieves the highest mIoU score among all adaptation methods, which benefits from the joint consideration of pixel- and feature-level alignments, cycle-consistency, dynamic semantic consistency, domain aggregation, and multiple sources. The MADAN disclosed herein also significantly outperforms source-combine DA, in which domain shift also exists among different sources. By bridging this gap, multi-source DA can boost the adaptation performance. On the one hand, compared to single-source DA, MADAN utilizes more useful information from multiple sources. On the other hand, existing multi-source DA methods only consider feature-level alignment, which may be enough for coarse-grained tasks, e.g. image classification, but is insufficient for fine-grained tasks, e.g. semantic segmentation, a pixel-wise prediction task. In addition, pixel-level alignment with a dynamic semantic consistency loss is considered and the different adapted domains are further aggregated.
The oracle method, i.e. testing on the target domain using the model trained on the same domain, achieves the best performance. However, this model is trained using the ground truth segmentation labels from the target domain, which are unavailable in unsupervised domain adaptation.
Qualitative semantic segmentation results from GTA to Cityscapes with single-source DA are shown in FIG. 6. From left to right of FIG. 6 are: (a) original image, (b) ground truth annotation, (c) CycleGAN, (d) CycleGAN+DSC, (e) CycleGAN+DSC+Feat.
Qualitative semantic segmentation results from GTA and SYNTHIA to Cityscapes with the multi-source adaptation method, i.e. MADAN, are shown in FIG. 7. From left to right of FIG. 7 are: (a) original image, (b) ground truth annotation, (c) source only from GTA, (d) CycleGANs on GTA and SYNTHIA, (e) +CCD+DSC, (f) +SAD+DSC, (g) +CCD+SAD+DSC, and (h) +CCD+SAD+DSC+Feat (MADAN). As shown in FIG. 7, after adaptation by the method disclosed herein, the visual segmentation results are improved notably.
Additional qualitative semantic segmentation results from GTA and SYNTHIA to Cityscapes with the multi-source adaptation method, i.e. MADAN, are shown in FIG. 8. From left to right of FIG. 8 are: (a) original image, (b) ground truth annotation, (c) source only from GTA, (d) CycleGANs on GTA and SYNTHIA, (e) +CCD+DSC, (f) +SAD+DSC, (g) +CCD+SAD+DSC, and (h) +CCD+SAD+DSC+Feat (MADAN).
The results of image translation with pixel-level alignment from GTA and SYNTHIA to Cityscapes are visualized in FIG. 9. From left to right of FIG. 9 are: (a) original source image, (b) CycleGAN, (c) CycleGAN+DSC, (d) CycleGAN+CCD+DSC, (e) CycleGAN+SAD+DSC, (f) CycleGAN+CCD+SAD+DSC, and (g) target Cityscapes image. The top two rows and bottom rows are GTA → Cityscapes and SYNTHIA → Cityscapes, respectively. As shown in FIG. 9, the pixel-level alignment method (f) disclosed herein produces image styles that are close to Cityscapes while the semantic information is well preserved.
Additional results of image translation with pixel-level alignment from GTA and SYNTHIA to Cityscapes are visualized in FIG. 10. From left to right of FIG. 10 are: (a) original source image, (b) CycleGAN, (c) CycleGAN+DSC, (d) CycleGAN+CCD+DSC, (e) CycleGAN+SAD+DSC, (f) CycleGAN+CCD+SAD+DSC, and (g) target Cityscapes image.
An ablation study was performed by first comparing the dynamic semantic consistency (DSC) loss disclosed herein in MADAN with the original semantic consistency (SC) loss in CyCADA. As shown in Table 3, for both GTA → Cityscapes and SYNTHIA → Cityscapes adaptations, DSC achieves better results. After demonstrating its value, the DSC loss is employed in subsequent experiments.
Table 3: Comparison between the proposed dynamic semantic consistency (DSC) loss in MADAN and the original SC loss in CyCADA. The better mIoU for each pair is emphasized in bold.
Second, the effectiveness of the different components in MADAN was incrementally investigated. The results are shown in Table 4.
Table 4: Ablation study on different components in MADAN. Baseline denotes using pixel-level alignment with cycle-consistency, +SAD denotes using the sub-domain aggregation discriminator, +CCD denotes using the cross-domain cycle discriminator, +DSC denotes using the dynamic semantic consistency loss, and +Feat denotes using feature-level alignment.
As shown in Table 4, (1) both domain aggregation methods, i.e. SAD and CCD, obtain better performance by making the different adapted domains more closely aggregated, with SAD outperforming CCD; (2) adding the DSC loss further improves the mIoU score, again demonstrating the effectiveness of DSC; (3) feature-level alignment also contributes to the adaptation task; (4) the modules are orthogonal to each other to some extent, since adding each one of them does not introduce performance degradation.
After showing the effectiveness of the MADAN framework on domain adaptation, the trained models are further evaluated on BDDS, which is unseen during training, to test their generalization capabilities. The results are shown in Table 5.
Table 5: Domain generalization performance from GTA and SYNTHIA to BDDS.
As shown in Table 5, compared with the baselines, MADAN improved the mIoU score by approximately 10%.
As disclosed herein, multi-source domain adaptation for semantic segmentation from synthetic data to real data is performed. A framework, termed Multi-source Adversarial Domain Aggregation Network (MADAN), is designed with three components. For each source domain, adapted images are generated with a novel dynamic semantic consistency loss. Further, a sub-domain aggregation discriminator and a cross-domain cycle discriminator are used to better aggregate the different adapted domains. Together with other techniques such as pixel- and feature-level alignments as well as cycle-consistency, MADAN achieves 60.5%, 4.0%, 9.9%, and 40.8% relative mIoU improvements compared with the best source-only, the best single-source DA, source-combine DA, and other multi-source DA, respectively, on Cityscapes from GTA and SYNTHIA.
It should be noted that the above description is merely provided for the purpose of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.
Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur and are intended to those skilled in the art, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the example embodiments of this disclosure.
Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment, ” “an embodiment, ” and “some embodiments” mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the present disclosure.
Further, it will be appreciated by one skilled in the art that aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or in an implementation combining software and hardware that may all generally be referred to herein as a “module,” “unit,” “component,” “device,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof. A  computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).
Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations therefore, is not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose, and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be  implemented as a software only solution, e.g., an installation on an existing server or mobile device.
Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.
In some embodiments, the numbers expressing quantities, properties, and so forth, used to describe and claim certain embodiments of the application are to be understood as being modified in some instances by the term “about,” “approximate,” or “substantially.” For example, “about,” “approximate” or “substantially” may indicate ±20% variation of the value it describes, unless otherwise stated. Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the application are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable.
Each of the patents, patent applications, publications of patent applications, and other material, such as articles, books, specifications, publications, documents, things, and/or the like, referenced herein is hereby incorporated herein by this reference in its entirety for all purposes, excepting any prosecution file history associated with same, any of same that is inconsistent with or in conflict with the present document, or any of same that may have a limiting effect as to the broadest scope of the claims now or later associated with the present document. By way of example, should there be any inconsistency or conflict between the description, definition, and/or the use of a term associated with any of the incorporated material and that associated with the present document, the description, definition, and/or the use of the term in the present document shall prevail.
In closing, it is to be understood that the embodiments of the application disclosed herein are illustrative of the principles of the embodiments of the application. Other modifications that may be employed may be within the scope of the application. Thus, by way of example, but not of limitation, alternative configurations of the embodiments of the application may be utilized in accordance with the teachings herein. Accordingly, embodiments of the present application are not limited to that precisely as shown and described.

Claims (20)

  1. A system for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain, wherein the multiple source domains comprise images from a plurality of single sources, the system comprising:
    a storage medium storing a set of instructions; and
    a processor in communication with the storage medium to execute the set of instructions to:
    perform dynamic adversarial image generation by generating an adapted domain for each single source of the plurality of single sources with dynamic semantic consistency while aligning at pixel-level cycle-consistently towards the target image;
    perform adversarial domain aggregation by using sub-domain aggregation discriminator and cross-domain cycle discriminator to aggregate the adapted domains to form aggregated domains; and
    perform feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network.
  2. The system of claim 1, wherein the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss.
  3. The system of claim 1 or 2, wherein dynamic semantic consistency and pixel level alignment is achieved by minimizing a KL divergence between source predictions of a pretrained segmentation model and adapted predictions of a dynamic segmentation model.
  4. The system of any one of claims 1-3, wherein the sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable.
  5. The system of any one of claims 1-4, wherein the cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain.
  6. The system of any one of claims 1-5, wherein data from different domains are observed in same image space but exhibit different distributions and all the domains share same set of classes.
  7. The system of any one of claims 1-6, wherein the dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they are drawn from the same target domain distribution.
  8. A method for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain, wherein the multiple source domains comprise images from a plurality of single sources, on a computing device including a storage medium storing a set of instructions and a processor in communication with the storage medium to execute the set of instructions, the method comprising:
    performing dynamic adversarial image generation by generating an adapted domain for each single source of the plurality of single sources with dynamic semantic consistency while aligning at pixel-level cycle-consistently towards the target image;
    performing adversarial domain aggregation by using sub-domain aggregation discriminator and cross-domain cycle discriminator to aggregate the adapted domains to form aggregated domains; and
    performing feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network.
  9. The method of claim 8, wherein the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss.
  10. The method of claim 8 or 9, wherein dynamic semantic consistency and pixel level alignment is achieved by minimizing a KL divergence between source predictions of a  pretrained segmentation model and adapted predictions of a dynamic segmentation model.
  11. The method of any one of claims 8-10, wherein the sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable.
  12. The method of any one of claims 8-11, wherein the cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain.
  13. The method of any one of claims 8-12, wherein data from different domains are observed in same image space but exhibit different distributions and all the domains share same set of classes.
  14. The method of any one of claims 8-13, wherein the dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they are drawn from the same target domain distribution.
  15. A non-transitory readable medium, storing a set of instructions for conducting multi-source domain adaptation for semantic segmentation of a target image of a target domain, wherein the multiple source domains comprise images from a plurality of single sources and wherein when the set of instructions is executed by a processor of an electrical device, the device performs a method comprising:
    performing dynamic adversarial image generation by generating an adapted domain for each single source of the plurality of single sources with dynamic semantic consistency while aligning at pixel-level cycle-consistently towards the target image;
    performing adversarial domain aggregation by using sub-domain aggregation discriminator and cross-domain cycle discriminator to aggregate the adapted domains to form aggregated domains; and
    performing feature-aligned semantic segmentation of the target image by performing feature-level alignment between the aggregated domain and the target domain while training a segmentation network.
  16. The medium of claim 15, wherein the adapted domain is generated by using a Generative Adversarial Network (GAN) with cycle-consistency loss.
  17. The medium of claim 15 or 16, wherein dynamic semantic consistency and pixel level alignment is achieved by minimizing a KL divergence between source predictions of a pretrained segmentation model and adapted predictions of a dynamic segmentation model.
  18. The medium of any one of claims 15-17, wherein the sub-domain aggregation discriminator is used to directly make different adapted domains indistinguishable and the cross-domain cycle discriminator is used to discriminate between the images from the single source and the images transferred from other sources to aggregate different adapted domains into a more unified domain.
  19. The medium of any one of claims 15-18, wherein data from different domains are observed in same image space but exhibit different distributions and all the domains share same set of classes.
  20. The medium of any one of claims 15-19, wherein the dynamic adversarial image generation makes images from different source domains visually similar to the target image, as if they are drawn from the same target domain distribution.
PCT/CN2019/120053 2019-11-21 2019-11-21 Systems and methods for multi-source domain adaptation for semantic segmentation WO2021097774A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/120053 WO2021097774A1 (en) 2019-11-21 2019-11-21 Systems and methods for multi-source domain adaptation for semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/120053 WO2021097774A1 (en) 2019-11-21 2019-11-21 Systems and methods for multi-source domain adaptation for semantic segmentation

Publications (1)

Publication Number Publication Date
WO2021097774A1 true WO2021097774A1 (en) 2021-05-27

Family

ID=75980344

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/120053 WO2021097774A1 (en) 2019-11-21 2019-11-21 Systems and methods for multi-source domain adaptation for semantic segmentation

Country Status (1)

Country Link
WO (1) WO2021097774A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10290107B1 (en) * 2017-06-19 2019-05-14 Cadence Design Systems, Inc. Transform domain regression convolutional neural network for image segmentation
CN109558901A (en) * 2018-11-16 2019-04-02 北京市商汤科技开发有限公司 A kind of semantic segmentation training method and device, electronic equipment, storage medium
CN110363122A (en) * 2019-07-03 2019-10-22 昆明理工大学 A kind of cross-domain object detection method based on multilayer feature alignment

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657389A (en) * 2021-07-29 2021-11-16 中国科学院软件研究所 Software-defined satellite semantic segmentation method, device and medium
CN113762187A (en) * 2021-09-14 2021-12-07 中国人民解放军海军航空大学 Multi-source remote sensing image fusion semantic segmentation method and system
CN113762187B (en) * 2021-09-14 2023-12-12 中国人民解放军海军航空大学 Multi-source remote sensing image fusion semantic segmentation method and system
CN113837071A (en) * 2021-09-23 2021-12-24 重庆大学 Partial migration fault diagnosis method based on multi-scale weight selection countermeasure network
CN113837071B (en) * 2021-09-23 2024-02-02 重庆大学 Partial migration fault diagnosis method based on multiscale weight selection countermeasure network
CN114065852A (en) * 2021-11-11 2022-02-18 合肥工业大学 Multi-source combined self-adaption and cohesion feature extraction method based on dynamic weight
CN114065852B (en) * 2021-11-11 2024-04-16 合肥工业大学 Multisource joint self-adaption and cohesive feature extraction method based on dynamic weight
CN116246349A (en) * 2023-05-06 2023-06-09 山东科技大学 Single-source domain generalization gait recognition method based on progressive subdomain mining
CN116246349B (en) * 2023-05-06 2023-08-15 山东科技大学 Single-source domain generalization gait recognition method based on progressive subdomain mining
CN116385808A (en) * 2023-06-02 2023-07-04 合肥城市云数据中心股份有限公司 Big data cross-domain image classification model training method, image classification method and system
CN116385808B (en) * 2023-06-02 2023-08-01 合肥城市云数据中心股份有限公司 Big data cross-domain image classification model training method, image classification method and system
CN116778157A (en) * 2023-06-13 2023-09-19 苏州大学 Cross-domain segmentation method and system for moment-invariant contrast cyclic consistency countermeasure network

Similar Documents

Publication Publication Date Title
WO2021097774A1 (en) Systems and methods for multi-source domain adaptation for semantic segmentation
US11676364B2 (en) Real-time detection of lanes and boundaries by autonomous vehicles
US11604967B2 (en) Stereo depth estimation using deep neural networks
US11966838B2 (en) Behavior-guided path planning in autonomous machine applications
US20230152801A1 (en) Regression-based line detection for autonomous driving machines
US11449713B2 (en) Attention based feature compression and localization for autonomous devices
US20220172369A1 (en) Systems and methods for instance segmentation based on semantic segmentation
US20240127062A1 (en) Behavior-guided path planning in autonomous machine applications
DE112020003043T5 (en) DETECTION AND CLASSIFICATION OF CROSSROADS REGIONS FOR AUTONOMOUS MACHINE APPLICATIONS
KR102539942B1 (en) Method and apparatus for training trajectory planning model, electronic device, storage medium and program
DE112020006410T5 (en) THREE-DIMENSIONAL INTERSECTION STRUCTURE PREDICTION FOR AUTONOMOUS DRIVING APPLICATIONS
US20210406679A1 (en) Multi-resolution image patches for predicting autonomous navigation paths
WO2022104774A1 (en) Target detection method and apparatus
US20220092349A1 (en) Measuring the effects of augmentation artifacts on a machine learning network
CN114155272A (en) Adaptive target tracking algorithm in autonomous machine applications
US20220391766A1 (en) Training perception models using synthetic data for autonomous systems and applications
US20210398301A1 (en) Camera agnostic depth network
US20220318464A1 (en) Machine Learning Data Augmentation for Simulation
DE102022126706A1 (en) 3D surface reconstruction with point cloud compression using deep neural networks for autonomous systems and applications
US20230135234A1 (en) Using neural networks for 3d surface structure estimation based on real-world data for autonomous systems and applications
KR20210024862A (en) Object detecting apparatus detecting object using hierarchical pyramid and object detecting method of the same
Aditya et al. Collision detection: An improved deep learning approach using SENet and ResNext
US11796338B2 (en) Automated semantic mapping
US20220092317A1 (en) Simulating viewpoint transformations for sensor independent scene understanding in autonomous systems
WO2021056327A1 (en) Systems and methods for analyzing human driving behavior

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19952981

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19952981

Country of ref document: EP

Kind code of ref document: A1