WO2019240964A1

WO2019240964A1 - Teacher and student based deep neural network training

Info

Publication number: WO2019240964A1
Application number: PCT/US2019/034841
Authority: WO
Inventors: Prithviraj DHAR; Rajib MONDAL; Rajat Vikram SINGH; Kuan-Chuan Peng; Ziyan Wu; Jan Ernst
Original assignee: Siemens Aktiengesellschaft; Siemens Corporation
Priority date: 2018-06-12
Filing date: 2019-05-31
Publication date: 2019-12-19

Abstract

Examples of techniques for teacher and student based deep neural network (DNN) training are described herein. An aspect includes initializing a first DNN based on a teacher DNN, wherein the teacher DNN is trained to recognize a plurality of old classes. Another aspect includes adding a first new class to the first student DNN. Another aspect includes providing a first input image corresponding to the first new class to the first student DNN and the teacher DNN. Another aspect includes determining a first loss based on an output of the first student DNN corresponding to the first input image and an output of the teacher DNN corresponding to the first input image. Another aspect includes updating the first student DNN based on the first loss, wherein the teacher DNN is not updated base on the first loss.

Description

TEACHER AND STUDENT BASED DEEP NEURAL NETWORK TRAINING

BACKGROUND

[0001] The present techniques relate to neural networks. More specifically, the techniques relate to teacher and student based deep neural network (DNN) training.

[0002] A neural network may include a plurality of processing elements arranged in layers. Interconnections are made between successive layers in the neural network. A neural network may have an input layer, an output layer, and any appropriate number of intermediate layers. The intermediate layers may allow solution of nonlinear problems by the neural network. A layer of a neural network may generate an output signal which may be determined based on a weighted sum of any input signals the layer receives. The input signals to a layer of a neural network may be provided from the neural network input, or from the output of any other layer of the neural network.

SUMMARY

[0003] According to an embodiment described herein, a system can include a processor to initialize a first student deep neural network (DNN) based on a teacher DNN, wherein the teacher DNN is trained to recognize a plurality of old classes. The processor can also add a first new class to the first student DNN. The processor can also provide a first input image corresponding to the first new class to the first student DNN and the teacher DNN. The processor can also determine a first loss based on an output of the first student DNN corresponding to the first input image and an output of the teacher DNN corresponding to the first input image. The processor can also update the first student DNN based on the first loss, wherein the teacher DNN is not updated base on the first loss.

[0004] According to another embodiment described herein, a method can include initializing, via a processor, a first student deep neural network (DNN) based on a teacher DNN, wherein the teacher DNN is trained to recognize a plurality of old classes. The method can also include adding a first new class to the first student DNN. The method can also include providing a first input image corresponding to the first new class to the first student DNN and the teacher DNN. The method can also include determining a first loss based on an output of the first student DNN corresponding to the first input image and an output of the teacher DNN corresponding to the first input image. The method can also include updating the first student DNN based on the first loss, wherein the teacher DNN is not updated base on the first loss.

[0005] According to another embodiment described herein, a computer program product may include a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method including initializing a first student deep neural network (DNN) based on a teacher DNN, wherein the teacher DNN is trained to recognize a plurality of old classes. The method can also include adding a first new class to the first student DNN. The method can also include providing a first input image corresponding to the first new class to the first student DNN and the teacher DNN. The method can also include determining a first loss based on an output of the first student DNN corresponding to the first input image and an output of the teacher DNN

corresponding to the first input image. The method can also include updating the first student DNN based on the first loss, wherein the teacher DNN is not updated base on the first loss.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] Fig. 1 is a block diagram of an example computer system for use in conjunction with teacher and student based deep neural network (DNN) training;

[0007] Figs. 2A-E are block diagrams of example systems for teacher and student based DNN training; [0008] Fig. 3 is a process flow diagram of an example method for teacher and student based DNN training;

[0009] Fig. 4 is a process flow diagram of another example method for teacher and student based DNN training;

[0010] Figs. 5A-B are block diagrams of an example system for teacher and student based DNN training; and

[0011] Fig. 6 is a process flow diagram of another example method for teacher and student based DNN training.

DETAILED DESCRIPTION

[0012] Embodiments of a teacher and student based dense neural network (DNN) training are provided, with exemplary embodiments being discussed below in detail. The weights of a DNN, which are used to determine a weighted sum that gives an output of the DNN, may be determined based on a training process. A DNN may be trained by feeding the DNN a succession of known input patterns and comparing the output of the DNN to a corresponding expected output pattern. The DNN may learn by measuring a difference between the expected output pattern associated with the input pattern and the output pattern that was produced by the current state of the DNN for the input pattern.

The weights of DNN may be adjusted based on the measured difference. DNN training may be an iterative process, requiring a relatively large number of input patterns to be sequentially fed into the DNN. When the weights of a DNN are set to appropriate levels by the training, an input pattern at the input layer of the DNN may successively propagate through the intermediate layers of the DNN to give a correct corresponding output pattern for the input pattern.

[0013] A DNN may provide visual recognition and classification of objects in images, including but not limited to red-green -blue (RGB) images. Such a DNN may be trained to classify objects in images into a predetermined number of classes. A DNN that has been trained to classify objects into a predetermined number of classes may be used as a teacher DNN for a student DNN that is being trained to classify objects into a set of classes that includes the predetermined number of old classes plus one or more new classes. New classes may be incrementally added to the student DNN for object classification while preserving classification performance on the old classes. The weights of the student DNN may be initialized based on the teacher DNN. The architectures of the student and teacher DNN may not differ in the feature extracting module in some embodiments. The student DNN classifier may have a modified last layer, including a number of outputs corresponding to the number of old classes and new classes, as compared to the teacher DNN classifier in some embodiments. The weights for the new classes in a last layer of the classifier of the student DNN may be randomly initialized before the training of the student DNN in some embodiments.

[0014] During training, in some embodiments, the teacher DNN and student DNN may only be provided with input images corresponding to the new classes that are being added to the student DNN. The weights of the student DNN may be updated during the training based any appropriate combination of losses, including but not limited to a cross entropy (CE) classification loss, a distillation loss, and an attention loss. Data from the student DNN and the teacher DNN may be used to calculate the various losses. Only the student DNN may be updated based on the losses; the teacher DNN may not be updated based on the training. In some embodiments, a gradient backpropagation that is used to update the weights of the student DNN may be a weighted sum of the various calculated losses.

[0015] Turning now to FIG. 1, a computer system 100 is generally shown in accordance with an embodiment. The computer system 100 can be an electronic, computer framework comprising and/or employing any number and combination of computing devices and networks utilizing various communication technologies, as described herein. The computer system 100 can be easily scalable, extensible, and modular, with the ability to change to different services or reconfigure some features independently of others. The computer system 100 may be, for example, a server, desktop computer, laptop computer, tablet computer, or smartphone. In some examples, computer system 100 may be a cloud computing node. Computer system 100 may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system 100 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

[0016] As shown in FIG. 1 , the computer system 100 has one or more central processing units (CPU(s)) lOla, lOlb, lOlc, etc. (collectively or generically referred to as processor(s) 101). The processors 101 can be a single-core processor, multi-core processor, computing cluster, or any number of other configurations. The processors 101, also referred to as processing circuits, are coupled via a system bus 102 to a system memory 103 and various other components. The system memory 103 can include a read only memory (ROM) 104 and a random access memory (RAM) 105. The ROM 104 is coupled to the system bus 102 and may include a basic input/output system (BIOS), which controls certain basic functions of the computer system 100. The RAM is read- write memory coupled to the system bus 102 for use by the processors 101. The system memory 103 provides temporary memory space for operations of said instructions during operation. The system memory 103 can include random access memory (RAM), read only memory, flash memory, or any other suitable memory systems.

[0017] The computer system 100 comprises an input/output (I/O) adapter 106 and a communications adapter 107 coupled to the system bus 102. The I/O adapter 106 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 108 and/or any other similar component. The I/O adapter 106 and the hard disk 108 are collectively referred to herein as a mass storage 110.

[0018] Software 111 for execution on the computer system 100 may be stored in the mass storage 110. The mass storage 110 is an example of a tangible storage medium readable by the processors 101 , where the software 111 is stored as instructions for execution by the processors 101 to cause the computer system 100 to operate, such as is described herein below with respect to the various Figures. Examples of computer program product and the execution of such instruction is discussed herein in more detail. The communications adapter 107 interconnects the system bus 102 with a network 112, which may be an outside network, enabling the computer system 100 to communicate with other such systems. In one embodiment, a portion of the system memory 103 and the mass storage 110 collectively store an operating system, which may be any appropriate operating system, to coordinate the functions of the various components shown in FIG. 1.

[0019] Additional input/output devices are shown as connected to the system bus 102 via a display adapter 115 and an interface adapter 116 and. In one embodiment, the adapters 106, 107, 115, and 116 may be connected to one or more TO buses that are connected to the system bus 102 via an intermediate bus bridge (not shown). A display 119 (e.g., a screen or a display monitor) is connected to the system bus 102 by a display adapter 115, which may include a graphics controller to improve the performance of graphics intensive applications and a video controller. A keyboard 121 , a mouse 122, a speaker 123, etc. can be interconnected to the system bus 102 via the interface adapter 116, which may include, for example, a Super EO chip integrating multiple device adapters into a single integrated circuit. Suitable EO buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI). Thus, as configured in Fig. 1, the computer system 100 includes processing capability in the form of the processors 101 , and, storage capability including the system memory 103 and the mass storage 110, input means such as the keyboard 121 and the mouse 122, and output capability including the speaker 123 and the display 119.

[0020] In some embodiments, the communications adapter 107 can transmit data using any suitable interface or protocol, such as the internet small computer system interface, among others. The network 1 12 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others.

An external computing device may connect to the computer system 100 through the network 112. In some examples, an external computing device may be an external Webserver or a cloud computing node.

[0021] It is to be understood that the block diagram of Fig. 1 is not intended to indicate that the computer system 100 is to include all of the components shown in Fig. 1. Rather, the computer system 100 can include any appropriate fewer or additional components not illustrated in Fig. 1 (e.g., additional memory components, embedded controllers, modules, additional network interfaces, etc.). Further, the embodiments described herein with respect to computer system 100 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.

[0022] Figs. 2A-E are block diagrams of example systems 200 A-E for teacher and student based DNN training. Systems 200 A-E may be implemented in conjunction with any suitable computing device, such as the computer system 100 of Fig. 1. For example, embodiments of systems 200A-E may include software 11 1 that is executed by processors 101, and may operate on data stored in system memory 103 and/or mass storage 110. Systems 200A-E include a teacher DNN 203 A and student DNNs 203B-D. DNNs 203 A-D may each include any appropriate number of layers in various

embodiments. Teacher DNN 203A is a DNN that is trained to recognize and classify an initial predetermined number of classes of objects (i.e., old classes) in input images, including but not limited to RBG images. Student DNNs 203B-D are initialized using the weights of teacher DNN 203 A. During the training process, student DNNs 203B-D are trained to recognize one or more new classes while maintaining the ability to recognize the old classes from teacher DNN 203 A, using input images, such as input image 201, corresponding to the new classes. Student DNNs 203B-D share their weights, and the weights of student DNNs 203B-D are updated together during the training.

Teacher DNN 203 A may serve as a feature extractor and reference representation for the old classes, and is not updated during the training.

[0023] The student DNNs 203B-D of Figs. 2A-E may have a modified last layer as compared to the teacher DNN 203 A. The number of outputs of a last layer of the student DNNs 203B-D may be increased by one to fit each additional new class (e.g., the number of outputs may be increased by 5 for training that is adding 5 classes to the student DNN 203B-D). The number of outputs of the last layer of the student DNN 203B-D may be a sum of the number of old classes and new classes. The teacher DNN 203 A and student DNNs 203B-D may be any appropriate convolutional network, including but not limited to a DenseNet.

[0024] The weights of the student DNNs 203B-D of Figs 2A-E may be updated during the training based any appropriate combination of losses (e.g., a weighted sum), including but not limited to a CE classification loss, a distillation loss, and an attention loss. In various embodiments, the distillation loss may be replaced with any other appropriate loss function that encourages similarity of its inputs, and the CE classification loss may also be replaced with any other appropriate loss function. In some

embodiments, a CE classification loss may be calculated by comparing a predicted class generated by the student DNNs 203B-D for an input image with an image level label that gives an actual class of the input image. During backpropagation, the weights of the student DNNs 203B-D may be updated based on application of a stochastic gradient descent (SGD) solver to the CE classification loss. A distillation loss may preserve the ability of the student DNNs 203B-D to identify the old classes. Classification knowledge and feature extracting knowledge may be distilled in various embodiments. In some embodiments, the distillation loss may be determined based on a distance, or difference, between the feature maps generated by the student DNNs 203B-D and the teacher DNN 203 A. In some embodiments, the distillation loss may be determined based on a distance between scores output by classifiers associated with any of the student DNNs 203B-D, which recognizes the new and old classes, and the score output by the classifier associated with the teacher DNN 203 A, which only recognizes the old classes. Any appropriate metric may be used to determine the distillation loss distance, including but not limited to cosine-distance. An attention loss may be determined based on, for example, bounding-box annotations on a pixel level label of the input image. The attention loss may measure a distance between a mask generated from the pixel level label associated with the input image and an attention map that may be generated by an attention map generator, including but not limited to gradient-weighted class activation mapping (Grad-CAM) or Grad-CAM++. Bounding boxes in the pixel level label may highlight relevant features that may be used to classify the input image, and may indicate the features that the student DNNs 203B-D should learn. An attention map highlights parts of the input image that were used by a DNN to support the current class prediction. The distance between the pixel level label and the attention map may give the attention loss, which is used for backpropagation. Various losses that may be used in

embodiments of systems for teacher and student based DNN training are discussed in further detail below with respect to systems 200A-E of Figs. 2A-E. Embodiments of systems for teacher and student based DNN training may include any appropriate combination of any elements of the various systems 200A-E that are discussed below with respect to Figs. 2A-E.

[0025] Fig. 2 A illustrates an embodiment of a system 200A for teacher and student based DNN training. Teacher DNN 203A and student DNNs 203B-C receive an input image 201 corresponding to a new class that the student DNNs 203B-C are being trained to recognize. The input image 201 has an associated image level label 202. DNNs 203 A-C receive the input image 201 , and the various layers of DNNs 203 A-C determine feature maps 204 A-C based on the input image 201. The respective feature maps 204 A- C are provided to associated classifiers 205 A-C. The classifiers 205 A-C each determine, from a respective predetermined number of classes, a probable class of an object depicted in the input image 201 based on the feature maps 204A-C. Classifier 205A recognizes the old classes of teacher DNN 203 A, while classifiers 205B-C recognize the old classes plus any new classes that student DNNs 203B-C are being trained to recognize.

[0026] Classifier 205 A outputs a probable class determination from the old classes to distillation loss module 207A. Distillation loss module 207A also receives a probable class determination from classifier 205B. Distillation loss module 207A determines a difference between the class determination from the teacher DNN 203 A and the class determination from student DNN 203B, and outputs a distillation loss based on the determined difference to summation module 209.

[0027] Feature maps 204A and 204B are received by autoencoder 206. The outputs of autoencoder 206 are provided to feature vector loss module 207B, which determines a distillation loss based on vector inputs. Autoencoder 206 may encode each of the feature maps 204A-B into n x 1 x 1 feature-code-vectors in some embodiments. Feature vector loss module 207B determines a difference between the feature-code-vector from the teacher DNN 203 A and the feature-code-vector from student DNN 203B, and outputs a vector loss based on the difference to summation module 209. In some embodiments, the vector loss may be calculated by feature vector loss module 207B as an L2 loss (i.e., least square errors loss) between the two feature-code-vectors from autoencoder 206.

[0028] Classifier 205B outputs a probable class determination to CE classification loss module 208. CE classification loss module 208 also receives an image level label 202, which gives the actual class of the input image 201. The CE classification loss module 208 determines a difference between the class determination from classifier 205B and the image level label 202, and outputs a classification loss based on the determined difference to summation module 209. In some embodiments, if the student DNNs 203B- C correctly classified the input image 201 , the classification loss may be zero; however, if the student DNN 203B-C incorrectly classified the input image 201, the classification loss may be greater than zero.

[0029] Classifier 205C provides a class determination to an attention map generator 211. The attention map generator 211 may be Grad-CAM or Grad-CAM++ in some embodiments. The attention map generator 211 also receives decision information 210 from student DNN 203 C regarding any areas of the input image 201 that were used by student DNN 203C to determine the feature maps 204C. The decision information 210 from the student DNN 203 C may be received from a last layer of the student DNN 203 C in some embodiments. The attention map generator 211 determines an attention map 212 of the input image 201, which is input to attention loss module 214. Attention loss module 214 also receives pixel level label 213, which corresponds to input image 201. Pixel level label 213 may include annotations such as bounding boxes indicating relevant features for classification of the object in the input image 201. The attention loss module 214 determines an attention loss based on a difference between the attention map 212 and the pixel level label 213, and outputs the attention loss to summation module 215.

[0030] Summation module 209 determines a loss sum, which may be a weighted sum, of the distillation loss from distillation loss module 207 A, the vector loss from feature vector loss module 207B, and the classification loss from CE classification loss module 208, and outputs the loss sum to summation module 215. The summation module 215 receives the loss sum from summation module 209, and the attention loss from attention loss module 214, and determines gradient backpropagation and weight update signal 216 that is provided to all of the student DNNs 203B-C. Gradient backpropagation and weight update signal 216 may be determined based on a weighted sum in some embodiments. The weights of the student DNNs 203B-C are each updated based on the gradient backpropagation and weight update signal 216. A next input image 201 corresponding to one of the new classes, having a respective corresponding image level label 202 and pixel level label 213, may then be used for further training of the updated student DNNs 203B-C. [0031] It is to be understood that the block diagram of Fig. 2A is not intended to indicate that the system 200 A is to include all of the components shown in Fig. 2 A.

Rather, the system 200A can include any appropriate fewer or additional components not illustrated in Fig. 2A (e.g., additional DNNs, feature maps, classifiers, autoencoders, attention map generators, attention maps, loss modules, summation modules, data inputs, etc.). Further, the embodiments described herein with respect to system 200A may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.

[0032] Fig. 2B illustrates another embodiment of a system 200B for teacher and student based DNN training. System 200B includes a teacher DNN 203A and student DNN 203B as discussed above with respect to Fig. 2A. In system 200B of Fig. 2B, respective attention maps 219A-B are generated for each of teacher DNN 203 A and student DNN 203B by attention maps generators 218A-B based on the outputs of classifiers 205 A-B and decision information 217A-B. The attention map generators 218 A-B may be Grad-CAM or Grad-CAM++ in some embodiments. The decision information 217A-B from the DNNs 203 A-B may be received from a last layer of each of the DNNs 203 A-B in some embodiments. The attention maps 219A-B are used to preserve the recognition performance on old classes by student DNN 203B. Attention distillation loss module 220 receives the attention maps 219A-B, and calculates an attention distillation loss. The attention distillation loss that is determined by attention distillation loss module 220 may be provided to summation module 209, and used to determine the loss sum that is output by summation module 209 to summation module 215 as shown in Fig. 2 A.

[0033] In some embodiments, the attention distillation loss may be calculated by attention distillation loss module 220 based on Equation 1 :

In Eq. 1, a pixel-wise distance between the attention maps 219A-B is summed, the distance is normalized based on the size of the attention maps 219A-B, and weighted based on a class prediction score.

[0034] It is to be understood that the block diagram of Fig. 2B is not intended to indicate that the system 200B is to include all of the components shown in Fig. 2B.

Rather, the system 200B can include any appropriate fewer or additional components not illustrated in Fig. 2B (e.g., additional DNNs, feature maps, classifiers, autoencoders, attention map generators, attention maps, loss modules, summation modules, data inputs, etc.). Further, the embodiments described herein with respect to system 200B may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.

[0035] Fig. 2C illustrates another embodiment of a system 200C for teacher and student based DNN training. System 200C includes a student DNN 203B as discussed above with respect to Fig. 2A-B. The classifier 205B that is associated with student DNN 203B in Fig. 2C outputs a first classification 221 A for the input image 201 corresponding to one of the new classes, and a second classification 221B for the input image 201 corresponding to a highest scoring class of the old classes. Attention map generator 218B generates first attention map 219B based on the new class classification 221 A and decision information 217B. Attention map generator 218C generates a second attention map 219C based on the old class classification 221 B and decision information 217C. The attention map generators 218B-C may be Grad-CAM or Grad-CAM++ in some embodiments. Attention loss confusion reduction module 222 determines an attention loss based on a difference between attention map 219B and attention map 219C, and outputs the determined attention loss to summation module 209, which includes the determined attention loss in the loss sum that is provided to summation module 215.

[0036] The attention loss confusion reduction module 222 may determine a normalized sum over overlapping parts of the two attention maps 219B-C (i.e., AM1 and AM2). The attention map size (i.e., AM_Size, given by the height x width of either of the attention maps 219B-C) may be used as a normalization factor. The sum may be determined by attention loss confusion reduction module 222 by stepping through the pixels of the attention maps 219B-C using coordinates (i,j), as illustrated in Equation 2:

Any confusion with a class that has a second highest prediction score may be reduced by attention loss confusion reduction module 222 in some embodiments; in other embodiments, confusion may be reduced with respect to more than one class. In some embodiments, to incorporate more classes, a weighted sum over all possible attention losses may be calculated by attention loss confusion reduction module 222.

[0037] It is to be understood that the block diagram of Fig. 2C is not intended to indicate that the system 200C is to include all of the components shown in Fig. 2C. Rather, the system 200C can include any appropriate fewer or additional components not illustrated in Fig. 2C (e.g., additional DNNs, feature maps, classifiers, autoencoders, attention map generators, attention maps, loss modules, summation modules, data inputs, etc.). Further, the embodiments described herein with respect to system 200C may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments. [0038] Fig. 2D illustrates another embodiment of a system 200D for teacher and student based DNN training. System 200D includes a teacher DNN 203A and student DNN 203B as discussed above with respect to Figs. 2A-C. In system 200D, attention distillation loss module 220 determines an attention loss as described above with respect to Fig. 2B. System 200D further includes an early-layer attention distillation loss module 223, which determines an early layer attention loss. The attention map generator 218D receives decision information 217D from teacher DNN 203 A. Decision information 217D may be received from an earlier layer in the teacher DNN 203 A than decision information 217A. The attention map generator 218D generates attention map 219D based on classification information from classifier 205 A and decision information 217D, and attention map 209D is provided to early-layer attention distillation loss module 223. The attention map generator 218E receives decision information 217E from student DNN 203B. Decision information 217E may be received from an earlier layer in the student DNN 203B than decision information 217B. The attention map generator 218E generates attention map 219E based on classification information from classifier 205B and decision information 217E, an attention map 219E is provided to early-layer attention distillation loss module 223. Early-layer attention distillation loss module 223 determines a difference between attention map 219D and attention map 219E, and outputs an early-layer attention distillation loss to summation module 209, which includes the early-layer attention distillation loss in the loss sum that is provided to summation module 215.

[0039] In embodiments of teacher and student based DNN training corresponding to Fig. 2D, attention maps, such as attention maps 219A-B and 219D-E, may be generated for any appropriate layers of the student and/or teacher DNNs 203 A-B. A distance may be determined between attention maps from any corresponding layers of the student DNN 203B and teacher DNN 203 A, to determine any appropriate number of early-layer attention distillation losses that may be used to train the student DNN 203B-D. [0040] It is to be understood that the block diagram of Fig. 2D is not intended to indicate that the system 200D is to include all of the components shown in Fig. 2D. Rather, the system 200D can include any appropriate fewer or additional components not illustrated in Fig. 2D (e.g., additional DNNs, feature maps, classifiers, autoencoders, attention map generators, attention maps, loss modules, summation modules, data inputs, etc.). Further, the embodiments described herein with respect to system 200D may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.

[0041] Fig. 2E illustrates an embodiment of a system 200E for teacher and student based DNN training. System 200E includes a teacher DNN 203 A and student DNN 203B as discussed above with respect Figs. 2A-D. System 200E further includes student DNN 203 D, which is initialized and updated in tandem with student DNN 203B. In system 200E, classifier 205A associated with teacher DNN 203A outputs a top confusing old class Z 224 based on input image 201, i.e., the highest scoring of the old classes for the input image 201. An attention map based on the top confusing class Z 224 is used to mask the input image 201 to produce masked input image 225. Masked input image 225 is input to student DNN 203 D, which generates feature maps 204D based on the masked input image 225. The classifier 205D determines a classification score based on feature maps 204D. Because teacher DNN 203 A is trained to identify old patterns in images of new classes, the masked input image 225 may be representative of a prominent pattern of one of the old classes (i.e., class Z) that was predicted by the teacher DNN 203 A. The student DNN 203 D should predict the same class Z based on the masked input image 225. Confusion reduction module 226 determines an inverse score (e.g., l-score(Z)) based on the classification score from classifier 205D, and outputs the inverse score to summation module 215. The inverse score is used by summation module 215 to determine gradient backpropagation and weight update signal 216. [0042] It is to be understood that the block diagram of Fig. 2E is not intended to indicate that the system 200E is to include all of the components shown in Fig. 2E. Rather, the system 200E can include any appropriate fewer or additional components not illustrated in Fig. 2E (e.g., additional DNNs, feature maps, classifiers, autoencoders, attention map generators, attention maps, loss modules, summation modules, data inputs, etc.). Further, the embodiments described herein with respect to system 200E may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.

[0043] Fig. 3 is a process flow diagram of an example method 300 for teacher and student based DNN training. The method 300 can be implemented with any suitable computing device, such as the computer system 100 of Fig. 1. The method 300 may be implemented in conjunction with embodiments of systems including any combination of elements of any of systems 200A-E of Figs. 2A-E. In block 301, a student DNN 203B, as discussed above with respect to Figs. 2A-E, is initialized based on a teacher DNN 203 A. Student DNNs 203 C-D may also be initialized to be the same as student DNN 203B in block 301 of method 300 in some embodiments. The teacher DNN 203 A is trained to recognize objects in images, such as input image 201, corresponding to a predetermined set of old classes. The student DNN 203B may recognize the old classes, and also one or more additional new classes. The student DNN 203B may have a modified last layer as compared to the teacher DNN 203 A. The number of outputs of the last layer of the student DNN 203B may be increased by one to fit each additional class (e.g., the number of outputs may be increased by 5 for training that is adding 5 classes to the student DNN 203 B). The number of outputs of the last layer of the student DNN 203B may be a sum of the number of old classes and new classes. [0044] In block 302, an input image 201 corresponding to a new class that the student DNN 203B is being trained to classify is provided to teacher DNN 203A and student DNN 203B. The DNNs 203 A-B determines feature maps 204A-B of the input image 201, and provides the feature maps 204A-B to classifiers 205 A-B. The classifiers 205A- B each determine a probable class, selected from the predetermined number of classes recognized by each classifier, for the input image 201 based on the feature maps 204 A-B. Classifier 205 A determines a most likely class of the old classes that teacher DNN 203 A is trained to recognize, while classifier 205B, which is associated with student DNN 203B, determine a classification prediction selected from a set including the new classes and the old classes.

[0045] In block 303, a plurality of losses are determined based on the outputs of classifiers 205A-B. The losses may also be determined based on outputs of classifiers 205C-D in various embodiments. The losses that are determined in block 303 may include any appropriate combination of distillation losses, vector losses, attention losses, and/or classification losses, as discussed above with respect to distillation loss module 207A, feature vector loss module 207B, CE classification loss module 208, attention loss module 214, attention distillation loss module 220, attention loss confusion reduction module 222, early-layer attention distillation loss module 223, and/or confusion reduction module 226 of Figs. 2A-E. In block 304, the various losses that were determined in block 303 are summed by summation modules 209 and/or 215 to determine a gradient backpropagation and weight update signal 216 for the input image 201. The sums that are determined in bock 304 may be weighted sums in some embodiments, having any appropriate weight assigned to each loss. In block 305, the weights of the student DNN 203B are updated based on the gradient backpropagation and weight update signal 216. Student DNNs 203C-D may also be updated in tandem with student DNN 203B in embodiments of block 305. Teacher DNN 203 A is not updated in block 305.

[0046] In block 306, blocks 302-305 may be repeated with additional input images 201 until the training of student DNN 203B is determined to be completed. Each additional input image may have a respective image level label 202 and pixel level label 213. The training of student DNN 203B may be determined to be completed based on, for example, the student DNN 203B achieving a classification accuracy threshold corresponding to relatively losses for subsequent input images corresponding to the old classes and the one or more new classes.

[0047] The process flow diagram of Fig. 3 is not intended to indicate that the operations of the method 300 are to be executed in any particular order, or that all of the operations of the method 300 are to be included in every case. Additionally, the method 300 can include any suitable number of additional operations.

[0048] Fig. 4 is a process flow diagram of another example method 400 for teacher and student based DNN training. The method 400 can be implemented with any suitable computing device, such as the computer system 100 of Fig. 1. The method 400 may be implemented in conjunction with embodiments including any appropriate combination of elements of any of systems 200A-E of Figs. 2A-E. In block 401, a DNN is trained to recognize a predetermined set of base (e.g., old) classes. The training of block 401 may be performed in any appropriate manner. The DNN that is trained in block 401 may be used, after the training to recognize the base classes of block 401 is complete, as a teacher DNN such as teacher DNN 203 A. In block 402, a student DNN, such as student DNN 203B, is initialized based on the teacher DNN that was trained in block 401. In block 403, one or more new classes are added to the student DNN that was initialized in block 402.

[0049] In block 404, the student DNN is trained using data (i.e., input images 201 having corresponding image level labels 202 and pixel level labels 213) corresponding to the new classes that were added to the student DNN in block 403. The training of block 404 may correspond to method 300 of Fig. 3, and may be based on any appropriate combination of losses, including but not limited to distillation loss, CE classification loss, and/or attention loss. In block 405, it is determined whether the training of the student DNN is complete for the new classes. The determination of block 405 may be made by determining the classification accuracy of the student DNN for both the new classes and the old classes, and comparing the determined classification accuracy to a threshold. If it is determined that the training of the student DNN is not complete, flow returns to block 404, and the training of the student DNN continues. When it is determined in block 405 that the training of the student DNN is complete, flow proceeds from block 405 to block 406. In block 406, the student DNN that was trained in block 404 replaces the teacher DNN, and flow proceeds from block 406 to block 402. In block 402, a new student DNN is initialized based on the teacher DNN, and, in block 403, an additional set of new classes is added to the new student DNN that was initialized in block 402. Blocks 402- 406 of method 400 may be repeated to incrementally add any desired number of classes to a student DNN; after each iteration of blocks 402-406 of method 400, an additional set of new classes may be classified by the resulting trained DNN.

[0050] In some embodiments of method 400 of Fig. 4 that include segmentation classification, a DNN such as DNNs 203 A-D of Figs. 2A-E may recognize multiple objects of different classes (e.g., sky, a car, a tree) within a single input image such as input image 201. In an example embodiment of method 400 of Fig. 4 including segmentation classification, in block 401, a set of base classes may be randomly selected to be segmented out in images by a DNN. The DNN may be trained to recognize the base classes in any appropriate manner. The DNN that is trained in block 401 becomes a teacher DNN. In block 402, a new student DNN is initialized based on the teacher DNN. In block 403, one or more new classes are added to the student DNN, and, in block 404, the student DNN is trained using data corresponding to the one or more new classes. The training data images may be provided to both the student DNN and teacher DNN in block 404. The training of block 404 may be based on a retainer loss, which may be calculated based on a difference between the student DNN and teacher DNN output for a given input image, and a segmentation loss, which may be determined based on an IOU difference between the predicted segmentation map from the student DNN and ground truth of the training data (e.g., as given by a pixel level label associated with the input image that indicates the actual classifications of objects in the image). During the training of block 404, no training data corresponding to the base classes may be provided to the student DNN in some embodiments.

[0051] Training of the DNN for segmentation classification such as may be performed in embodiments of block 404 of Fig. 4 may be based on a retainer loss that is determined based on attention maps that are generated by student DNN and teacher DNN; the retainer loss ensures that old classes are not forgotten in the process of adding new classes to the student DNN. Attention maps are generated for the student DNN and teacher DNN using, for example, Grad-CAM or Grad-CAM++. A training image may contain objects corresponding to old classes; therefore, the attention map of the old classes from the teacher DNN may focus on regions containing old class objects. A distance between the student DNN and teacher DNN attention maps may be determined to give the retainer loss, and the weights of the student DNN may be updated based on the retainer loss. The segmentation loss may also be used to train the student DNN. The segmentation loss may be determined based on an intersection over union (IOU) difference between the predicted segmentation map from the student DNN and ground truth of the training data (e.g., as given by a pixel level label associated with the input image that indicates the actual classifications of objects in the image). In some embodiments, the total loss (Loss_totai) that is used to determine the update to the weights of the student DNN during the training of block 404 may be given by Equation 3:

Losstotai = Distance(AMstudent, AMteacher) + IOU(predicted, actual) EQ. 3

In Equation 3, AM_studentis the student attention map and AM_teacher is the teacher attention map. The Distance() function may refer to a Ll loss (i.e., least absolute deviations loss), L2 loss, or any other appropriate distance metric in various embodiments. The distance between the two attention maps is a measure of their similarity, and may enforce similarity between the attention maps of the student DNN and teacher DNN.

[0052] In block 405, it is determined whether the training of the student DNN is complete for the new classes. The determination of block 405 may be made by determining a segmentation classification accuracy of the student DNN for both the new classes and the old classes, and comparing the determined segmentation classification accuracy to a threshold. If it is determined that the training of the student DNN is not complete, flow returns to block 404, and the training of the student DNN continues.

When it is determined in block 405 that the training of the student DNN is complete, flow proceeds from block 405 to block 406. In block 406, the student DNN that was trained in block 404 replaces the teacher DNN, and flow proceeds from block 406 to block 402. In block 402, a new student DNN is initialized based on the teacher DNN, and, in block 403, an additional set of new classes is added to the new student DNN that was initialized in block 402. Blocks 402-406 of method 400 may be repeated to incrementally add any desired number of classes to a student DNN; after each iteration of blocks 402-406 of method 400, an additional set of new classes may be classified in a single image by the resulting trained DNN.

[0053] In some embodiments of method 400 of Fig. 4, the teacher DNN may be initially trained to recognize 10 base classes in block 401. In some embodiments, in each iteration of blocks 402-406 of method 400, a set of 5 new classes may be added to the new student DNN in block 403. The base classes and new classes may include any appropriate number of classes in various embodiments. The training of block 404 may be performed using a fixed number of input images in some embodiments.

[0054] The process flow diagram of Fig. 4 is not intended to indicate that the operations of the method 400 are to be executed in any particular order, or that all of the operations of the method 400 are to be included in every case. Additionally, the method 400 can include any suitable number of additional operations.

[0055] Figs. 5A-B are block diagrams of an example system 500 for teacher and student based DNN training including a first branch 500A and a second branch 500B. First branch 500A in Fig. 5 A includes a teacher DNN 503A and student DNN 503B. Teacher DNN 503 A has been trained to classify a set of old classes, and student DNN 503B is being trained to recognize the old classes plus one or more new classes. The student DNN 503B is initialized based on teacher DNN 503 A. Input image 501 , which corresponds to a new class that student DNN 503B is being trained to recognize, is input to teacher DNN 503A and student DNN 503B. Teacher DNN 503 A and student DNN 503B output feature maps 504A-B to respective classifiers 505A-B. Classifier 505A outputs a probable classification for the input image 501 that is selected from the old classes to distillation loss module 507A. Classifier 505B outputs a probable

classification for input image 501 that is selected from a set including the old classes and the one or more new classes to distillation loss module 507A and to CE classification loss module 508A. The feature maps 504A-B are also provided to autoencoder 506, which encodes the feature maps 504A-B into respective vectors, and provides the vectors to feature vector loss module 507B.

[0056] Distillation loss module 507A determines a difference between the inputs from classifier 505A and classifier 505B, and outputs a distillation loss to summation module 509 based on the determined difference. Feature vector loss module 507B determines a vector loss based on a difference between the inputs from autoencoder 506, and outputs the vector loss to summation module 509. CE classification loss module 508A determines a difference between the probable classification from classifier 505B and the image level label 502, which gives an actual classification of the input image 501. CE classification loss module 508A outputs a classification loss to the summation module 509 based on the determined classification difference.

[0057] The summation module 509 sums the losses received from distillation loss module 507A, feature vector loss module 507B, and CE classification loss module 508A, and determines gradient backpropagation and weight update signal 516A, which is provided to the student DNN 503B. The weights of the student DNN 503B are updated based on the gradient backpropagation and weight update signal 516A. A next input image 501 corresponding to one of the new classes, having a respective corresponding image level label 502, may then be used for further training of the updated student DNN 503B in the first branch 500A as shown in Fig. 5A. [0058] Fig. 5B shows second branch 500B, which is trained using the same input image 501 and associated image level label 502 that are shown in first branch 500 A of Fig. 5A. The second branch 500B includes student DNNs 503C-D, which are initialized based on teacher DNN 503 A that was shown in the first branch 500A. Student DNN 503B of Fig. 5A and student DNNs 503C-D of Fig. 5B are being trained to recognize the same one or more new classes in addition to the old classes recognized by teacher DNN 503 A. However, the weights of student DNNs 503C-D in second branch 500B are updated independently from student DNN 503B in the first branch 500 A. In the second branch 500B, the input image 501 is input to student DNNs 503C-D, and student DNNs 503C-D output feature maps 504C-D to classifiers 505C-D. Classifier 505C outputs a probable classification of input image 501 to distillation loss module 507C and to CE classification loss module 508B.

[0059] Distillation loss module 507C also receives a probable classification, selected from the old classes, from classifier 505A in the first branch 500A of Fig. 5A.

Distillation loss module 507C determines a difference between the inputs from classifier 505A and classifier 505C, and outputs a distillation loss based on the determined difference to summation module 515. CE classification loss module 508B determines a difference between the probable classification from classifier 505C and the image level label 502, which gives an actual classification of the input image 501. CE classification loss module 508B outputs a classification loss to the summation module 515 based on the determined classification difference.

[0060] Classifier 505D provides a class determination to an attention map generator 511. The attention map generator 511 may be Grad-CAM or Grad-CAM++ in some embodiments. The attention map generator 511 also receives decision information 510 from student DNN 503D regarding any areas of the input image 501 that were used by student DNN 503D to determine the feature maps 504D. The decision information 510 from the student DNN 503D may be received from a last layer of the student DNN 503D in some embodiments. The attention map generator 511 determines an attention map 512 of the input image 501, which is input to attention loss module 514. Attention loss module 514 also receives pixel level label 513, which corresponds to input image 501. Pixel level label 513 may include annotations, such as bounding boxes, indicating relevant features that should be used for classification of the object in the input image 501. The attention loss module 514 determines an attention loss based on the attention map 512 and the pixel level label 513, and outputs the attention loss to summation module 515. The summation module 515 sums the distillation loss from distillation loss module 507C, the classification loss from CE classification loss module 508B, and the attention loss from attention loss module 514, and determines gradient backpropagation and weight update signal 516B, which is provided to the student DNNs 503C-D. The weights of the student DNNs 503B-C are updated together based on the gradient backpropagation and weight update signal 516B. A next input image 501 corresponding to one of the new classes, having a respective corresponding image level label 502 and pixel level label 513, may then be used for further training of the updated student DNNs 503C-D of second branch 500B, in tandem with the training of student DNN 503B of first branch 500 A. Operation of the first branch 500A and second branch 500B of system 500 of Figs. 5A-B are discussed below in further detail with respect to Fig. 6.

[0061] It is to be understood that the block diagram of Figs. 5A-B is not intended to indicate that the system 500A-B is to include all of the components shown in Figs. 5A-B. Rather, the system 500 A-B can include any appropriate fewer or additional components not illustrated in Figs. 5A-B (e.g., additional DNNs, feature maps, classifiers, autoencoders, attention map generators, attention maps, loss modules, summation modules, data inputs, etc.). Further, the embodiments described herein with respect to system 500A-B may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments. [0062] Fig. 6 is a process flow diagram of another example method 600 for teacher and student based DNN training. The method 600 can be implemented with any suitable computing device, such as the computer system 100 of Fig. 1. The method 600 may be implemented in conjunction with system 500 of Figs. 5A-B, including first branch 500A and second branch 500B. In block 601, a student DNN 503B in a first branch 500 A, and student DNNs 503C-D in a second branch 500B, are each initialized based on a teacher DNN 503 A. Student DNNs 503B-D may have modified last layers as compared to the teacher DNN 503 A based on a number of new classes that are being added to the student DNNs 503B-D by the training.

[0063] In block 602, the first branch 500A, including student DNN 203B, and the second branch 500B, including student DNNs 503C-D, are trained to classify one or more new classes. New classes may be incrementally added to the student DNNs 503B- D during the training of block 602, for example, as described above with respect to method 400 of Fig. 4. The weights of student DNN 503B in first branch 500A and student DNNs 503C-D in second branch 500B are updated independently during the training, as described above with respect to Figs. 5A-B. In block 603, a classification accuracy of the first branch 500A and a classification accuracy of the second branch 500B are determined. In block 604, it is determined whether the classification accuracy of the second branch 500B is less than the classification accuracy of the first branch 500A. If it is determined in block 604 that the classification accuracy of the second branch 500B is not less than the classification accuracy of the first branch 500 A, flow returns to block 602, and the training of first branch 500A and second branch 500B continues. If it is determined in block 604 that the classification accuracy of the second branch 500B is less than the classification accuracy of the first branch 500 A, flow proceeds to block 605. In block 605, the capacity of the DNN is augmented by granting, for example, additional computing resources (e.g., processing and/or memory resources) to the system 500, and flow returns to block 602. [0064] In some embodiments, the first branch 500A may achieve better performance on the old classes than the second branch 500B, whereas the second branch 500B may show better performance on a new class than the first branch 500A. Therefore, at the beginning of the training of block 602, the second branch 500B may have better overall accuracy. As classes are added, the accuracy of the second branch 500B may drop faster than the accuracy of the first branch 500A. In some embodiments, when the accuracy of the second branch 500B is determined to be less than the accuracy of the first branch in block 604, the network architecture of the second branch 500B may be augmented in width to increase the model capacity in block 605.

[0065] The process flow diagram of Fig. 6 is not intended to indicate that the operations of the method 600 are to be executed in any particular order, or that all of the operations of the method 600 are to be included in every case. Additionally, the method 600 can include any suitable number of additional operations.

[0066] Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the

functionality and/or processing capabilities described with respect to a particular system, system component, device, or device component may be performed by any other system, device, or component. Further, while various illustrative implementations and architectures have been described in accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. In addition, it should be appreciated that any operation, element, component, data, or the like described herein as being based on another operation, element, component, data, or the like may be additionally based on one or more other operations, elements, components, data, or the like. Accordingly, the phrase“based on,” or variants thereof, should be interpreted as“based at least in part on.” [0067] The present disclosure may be a system, a method, apparatus, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.

[0068] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

[0069] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

[0070] Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field- programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.

[0071] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions. [0072] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a

programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

[0073] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0074] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, apparatus, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical fimction(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware- based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

[0075] The descriptions of the various embodiments of the present techniques have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

CLAIMS What is claimed is:

1. A system, comprising a processor configured to: initialize a first student deep neural network (DNN) based on a teacher DNN, wherein the teacher DNN is trained to recognize a plurality of old classes; add a first new class to the first student DNN; provide a first input image corresponding to the first new class to the first student DNN and the teacher DNN; determine a first loss based on an output of the first student DNN corresponding to the first input image and an output of the teacher DNN corresponding to the first input image; and update the first student DNN based on the first loss, wherein the teacher DNN is not updated base on the first loss.

2. The system of claim 1, the processor configured to: determine a classification accuracy of the first student DNN for the first new class; and based on the classification accuracy of the first student DNN being above a threshold: initialize a second student DNN based on the first student DNN; add a second new class to the second student DNN; provide a second input image corresponding to the second new class to the first student DNN and the second student DNN; determine a second loss based on an output of the first student DNN corresponding to the second input image and an output of the second student DNN corresponding to the second input image; and update the second student DNN based on the second loss, wherein the first student DNN is not updated base on the second loss.

3. The system of claim 1, the processor configured to: initialize a third student DNN based on the teacher DNN; add the first new class to the third student DNN; provide the first input image corresponding to the first new class to the third student DNN; determine a distillation loss based on the output of the first student DNN corresponding to the first input image and the output of the teacher DNN corresponding to the first input image; update the first student DNN based on the distillation loss; determine an attention loss based on an output of the third student DNN corresponding to the first input image; update the third student DNN based on the attention loss; determine a classification accuracy of the first student DNN and a classification accuracy of the third student DNN; determine whether the classification accuracy of the third student DNN is less than the classification accuracy of the first student DNN; and based on the classification accuracy of the third student DNN being less than the classification accuracy of the first student DNN, augment a capacity of the first student DNN and the third student DNN.

4. The system of claim 1, wherein the first loss comprises a distillation loss, and wherein determining the distillation loss comprises: determining, by the first student DNN, a first probable class of the first input image, the first probable class being selected from a set including the plurality of old classes and the first new class; determining, by the teacher DNN, a second probable class of the first input image, the second probable class being selected from the plurality of old classes; and determining the distillation loss based on a difference between the first probable class and the second probable class.

5. The system of claim 1, wherein the first loss comprises a vector loss, and wherein determining the vector loss comprises: encoding a first feature map from the teacher DNN into a first feature vector; encoding a second feature map from the first student DNN into a second feature vector; and determining the vector loss based on a difference between the first feature vector and the second feature vector.

6. The system of claim 1, wherein the first loss comprises a cross-entropy classification loss, and wherein determining the cross-entropy classification loss comprises: determining, by the first student DNN, a first probable class of the first input image; receiving an image level label corresponding to the first input image; determining the cross-entropy classification loss based on a difference between the image level label and the first probable class.

7. The system of claim 1, wherein the first loss comprises an attention loss, and wherein determining the attention loss comprises: determining, by the first student DNN, a first probable class of the first input image; determining an attention map of the first student DNN corresponding to the first probable class of the first input image; and determining the attention loss based on a difference between the attention map and a pixel level label corresponding to the first input image.

8. A computer-implemented method, comprising: initializing, via a processor, a first student deep neural network (DNN) based on a teacher DNN, wherein the teacher DNN is trained to recognize a plurality of old classes; adding a first new class to the first student DNN; providing a first input image corresponding to the first new class to the first student DNN and the teacher DNN; determining a first loss based on an output of the first student DNN corresponding to the first input image and an output of the teacher DNN corresponding to the first input image; and updating the first student DNN based on the first loss, wherein the teacher DNN is not updated base on the first loss.

9. The computer-implemented method of claim 8, comprising: determining a classification accuracy of the first student DNN for the first new class; and based on the classification accuracy of the first student DNN being above a threshold: initializing a second student DNN based on the first student DNN; adding a second new class to the second student DNN; providing a second input image corresponding to the second new class to the first student DNN and the second student DNN; determining a second loss based on an output of the first student DNN corresponding to the second input image and an output of the second student DNN corresponding to the second input image; and updating the second student DNN based on the second loss, wherein the first student DNN is not updated base on the second loss.

10. The computer-implemented method of claim 8, comprising: initializing a third student DNN based on the teacher DNN; adding the first new class to the third student DNN; providing the first input image corresponding to the first new class to the third student DNN; determining a distillation loss based on the output of the first student DNN corresponding to the first input image and the output of the teacher DNN corresponding to the first input image; updating the first student DNN based on the distillation loss; determining an attention loss based on an output of the third student DNN corresponding to the first input image; updating the third student DNN based on the attention loss; determining a classification accuracy of the first student DNN and a classification accuracy of the third student DNN; determining whether the classification accuracy of the third student DNN is less than the classification accuracy of the first student DNN; and based on the classification accuracy of the third student DNN being less than the classification accuracy of the first student DNN, augmenting a capacity of the first student DNN and the third student DNN.

11. The computer-implemented method of claim 8, wherein the first loss comprises a distillation loss, and wherein determining the distillation loss comprises: determining, by the first student DNN, a first probable class of the first input image, the first probable class being selected from a set including the plurality of old classes and the first new class; determining, by the teacher DNN, a second probable class of the first input image, the second probable class being selected from the plurality of old classes; and determining the distillation loss based on a difference between the first probable class and the second probable class.

12. The computer-implemented method of claim 8, wherein the first loss comprises a vector loss, and wherein determining the vector loss comprises: encoding a first feature map from the teacher DNN into a first feature vector; encoding a second feature map from the first student DNN into a second feature vector; and determining the vector loss based on a difference between the first feature vector and the second feature vector.

13. The computer-implemented method of claim 8, wherein the first loss comprises a cross-entropy classification loss, and wherein determining the cross-entropy classification loss comprises: determining, by the first student DNN, a first probable class of the first input image; receiving an image level label corresponding to the first input image; determining the cross-entropy classification loss based on a difference between the image level label and the first probable class.

14. The computer-implemented method of claim 8, wherein the first loss comprises an attention loss, and wherein determining the attention loss comprises: determining, by the first student DNN, a first probable class of the first input image; determining an attention map of the first student DNN corresponding to the first probable class of the first input image; and determining the attention loss based on a difference between the attention map and a pixel level label corresponding to the first input image.

15. A computer program product comprising: a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method comprising: initializing a first student deep neural network (DNN) based on a teacher DNN, wherein the teacher DNN is trained to recognize a plurality of old classes; adding a first new class to the first student DNN; providing a first input image corresponding to the first new class to the first student DNN and the teacher DNN; determining a first loss based on an output of the first student DNN corresponding to the first input image and an output of the teacher DNN corresponding to the first input image; and updating the first student DNN based on the first loss, wherein the teacher DNN is not updated base on the first loss.

16. The computer program product of claim 15, the method comprising: determining a classification accuracy of the first student DNN for the first new class; and based on the classification accuracy of the first student DNN being above a threshold: initializing a second student DNN based on the first student DNN; adding a second new class to the second student DNN; providing a second input image corresponding to the second new class to the first student DNN and the second student DNN; determining a second loss based on an output of the first student DNN corresponding to the second input image and an output of the second student DNN corresponding to the second input image; and updating the second student DNN based on the second loss, wherein the first student DNN is not updated base on the second loss.

17. The computer program product of claim 15, the method comprising: initializing a third student DNN based on the teacher DNN; adding the first new class to the third student DNN; providing the first input image corresponding to the first new class to the third student DNN; determining a distillation loss based on the output of the first student DNN corresponding to the first input image and the output of the teacher DNN corresponding to the first input image; updating the first student DNN based on the distillation loss; determining an attention loss based on an output of the third student DNN corresponding to the first input image; updating the third student DNN based on the attention loss; determining a classification accuracy of the first student DNN and a classification accuracy of the third student DNN; determining whether the classification accuracy of the third student DNN is less than the classification accuracy of the first student DNN; and based on the classification accuracy of the third student DNN being less than the classification accuracy of the first student DNN, augmenting a capacity of the first student DNN and the third student DNN.

18. The computer program product of claim 15, wherein the first loss comprises a distillation loss, and wherein determining the distillation loss comprises: determining, by the first student DNN, a first probable class of the first input image, the first probable class being selected from a set including the plurality of old classes and the first new class; determining, by the teacher DNN, a second probable class of the first input image, the second probable class being selected from the plurality of old classes; and determining the distillation loss based on a difference between the first probable class and the second probable class.

19. The computer program product of claim 15, wherein the first loss comprises a vector loss, and wherein determining the vector loss comprises: encoding a first feature map from the teacher DNN into a first feature vector; encoding a second feature map from the first student DNN into a second feature vector; and determining the vector loss based on a difference between the first feature vector and the second feature vector.

20. The computer program product of claim 15, wherein the first loss comprises a cross-entropy classification loss, and wherein determining the cross-entropy classification loss comprises: determining, by the first student DNN, a first probable class of the first input image; receiving an image level label corresponding to the first input image; determining the cross-entropy classification loss based on a difference between the image level label and the first probable class.