TW202139070A - Method and apparatus of training networks - Google Patents

Method and apparatus of training networks

Info

Publication number
TW202139070A
TW202139070A (application TW110112201A)
Authority
TW
Taiwan
Prior art keywords
network
training
resolution
image
adversarial network
Prior art date
Application number
TW110112201A
Other languages
Chinese (zh)
Inventor
亞明 赫拉德曼德
任昊宇
哈米 莫斯塔法 伊爾
雙全 王
裵東運
正元 李
Original Assignee
Samsung Electronics Co., Ltd. (South Korea)
Priority date
Filing date
Publication date
Priority claimed from US 17/133,785 (US11790489B2)
Application filed by Samsung Electronics Co., Ltd. (South Korea)
Publication of TW202139070A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
    • G06T 3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting, based on super-resolution, i.e. the output image resolution being higher than the sensor resolution

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus are provided. The method includes generating a dataset for real-world super resolution (SR), training a first generative adversarial network (GAN), training a second GAN, and fusing an output of the first GAN and an output of the second GAN.

Description

Method and apparatus for training networks

The present disclosure relates generally to image super-resolution and, more particularly, to a system and method for designing an efficient super-resolution deep convolutional neural network by means of cascade network training, cascade network trimming, and dilated convolutions.

Super-resolution imaging generates a high-resolution (HR) image from a low-resolution (LR) image. Super-resolution (SR) imaging has wide applicability, from surveillance and face/iris recognition to medical image processing, in addition to directly improving the resolution of images and video. Many algorithms/systems have been proposed for performing SR, ranging from interpolation (Li, Xin and Orchard, Michael, New edge-directed interpolation, IEEE Transactions on Image Processing (TIP), vol. 10, no. 10, pp. 1521-1527 (Oct. 2001), incorporated by reference in its entirety), contour features (Tai, Yu-Wing; Liu, Shuaicheng; Brown, Michael; and Lin, Stephen, Super resolution using edge prior and single image detail synthesis, 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2400-2407, incorporated by reference in its entirety), and statistical image priors (Kim, Kwang In and Kwon, Younghee, Single-image super-resolution using sparse regression and natural image prior, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 32, no. 6, pp. 1127-1133 (Jan. 2010), incorporated by reference in its entirety) to example-based methods that learn from patch dictionaries, such as neighbor embedding (Chang, Hong; Yeung, Dit-Yan; and Xiong, Yimin, Super-resolution through neighbor embedding, 2004 CVPR, pp. 275-282, incorporated by reference in its entirety) and sparse coding (Yang, Jianchao; Wright, John; Huang, Thomas; and Ma, Yi, Image super-resolution via sparse representation, IEEE TIP, vol. 19, no. 11, pp. 2861-2873 (Nov. 2010), incorporated by reference in its entirety).

Recently, convolutional neural networks (CNNs) have provided significant improvements in SR accuracy. See, e.g., Dong, Chao; Loy, Chen Change; He, Kaiming; and Tang, Xiaoou, Learning a deep convolutional network for image super-resolution, 2014 European Conference on Computer Vision (ECCV), pp. 184-199 (hereinafter, "Dong et al. 2014"), incorporated by reference in its entirety. Sometimes referred to as "SRCNN" (i.e., the super-resolution convolutional neural network), its accuracy may be limited by its small structure (e.g., 3 layers) and/or its small context receptive field. In response, researchers have proposed increasing the size of the SRCNN, but most proposals use a very large number of parameters, and many of the SRCNNs discussed cannot be executed in real time. Because of the large proposed network sizes, it can be extremely difficult even to guess appropriate training settings (i.e., learning rate, weight initialization, and weight decay). Consequently, training may fail to converge at all or may fall into a local minimum.

Accordingly, the present disclosure has been made to address at least the problems and/or disadvantages described herein and to provide at least the advantages described below.

According to one embodiment, a method includes generating a dataset for real-world SR, training a first generative adversarial network (GAN), training a second GAN, and fusing an output of the first GAN and an output of the second GAN.

According to one embodiment, an apparatus includes one or more non-transitory computer-readable media and at least one processor which, when executing instructions stored on the one or more non-transitory computer-readable media, performs the steps of generating a dataset for real-world SR, training a first GAN, training a second GAN, and fusing an output of the first GAN and an output of the second GAN.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings. It should be noted that identical elements are designated by the same reference numerals even though they are shown in different drawings. In the following description, specific details such as detailed configurations and components are provided merely to assist in an overall understanding of the embodiments of the present disclosure. Therefore, it should be apparent to those skilled in the art that various changes and modifications of the embodiments described herein may be made without departing from the scope of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. The terms described below are defined in consideration of the functions in the present disclosure and may differ according to users, users' intentions, or custom; therefore, the definitions of the terms should be determined based on the contents throughout this specification.

The present disclosure may have various modifications and various embodiments, among which embodiments are described in detail below with reference to the accompanying drawings. However, it should be understood that the present disclosure is not limited to those embodiments, but includes all modifications, equivalents, and alternatives within the scope of the present disclosure.

Although terms including ordinal numbers such as first and second may be used to describe various elements, the structural elements are not restricted by these terms. The terms are only used to distinguish one element from another element. For example, without departing from the scope of the present disclosure, a first structural element may be referred to as a second structural element. Similarly, the second structural element may also be referred to as the first structural element. As used herein, the term "and/or" includes any and all combinations of one or more associated items.

The terms used herein are merely used to describe various embodiments of the present disclosure and are not intended to limit the present disclosure. Singular forms are intended to include plural forms unless the context clearly indicates otherwise. In the present disclosure, it should be understood that the terms "include" or "have" indicate the existence of a feature, a number, a step, an operation, a structural element, a part, or a combination thereof, and do not exclude the existence or the possibility of adding one or more other features, numbers, steps, operations, structural elements, parts, or combinations thereof.

Unless defined differently, all terms used herein have the same meanings as those understood by a person skilled in the art to which the present disclosure belongs. Terms such as those defined in a generally used dictionary are to be interpreted as having the same meanings as the contextual meanings in the relevant field of art, and are not to be interpreted as having ideal or excessively formal meanings unless clearly defined in the present disclosure.

Various embodiments may include one or more elements. An element may include any structure configured to perform certain operations. Although an embodiment may be described with a limited number of elements in a certain configuration by way of example, the embodiment may include more or fewer elements in alternative configurations as desired for a given implementation. It is worth noting that any reference to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase "one embodiment" (or "an embodiment") in various places in this specification do not necessarily all refer to the same embodiment.

The present disclosure provides a new method or, more accurately, several new techniques for generating an SRCNN. Herein, the term "cascade trained super-resolution convolutional neural network" (CT-SRCNN) may refer collectively to all of the new techniques described herein, or to one or more of them, as should be clear from the context in which the term is used. Unlike existing methods that train all layers from the start with unsupervised weight initialization, CT-SRCNN begins training with a small network (e.g., 3 layers). New layers are gradually inserted into the network when the current network cannot adequately reduce the training error.

With this "cascade training" strategy, convergence is easier, and accuracy keeps improving as more layers are used. However, as the depth increases, the relative complexity of the network does not grow, because of the nature of the new layers. More precisely, all weights of the new layers in CT-SRCNN are randomly initialized, and the learning rate is fixed. This is a great advantage compared with methods that must spend substantial time and resources tuning parameters. In one particular example of a 13-layer CT-SRCNN (shown and discussed further below), the accuracy is competitive with current state-of-the-art image SR networks, while executing more than 5 times faster and using only 1/5 of the parameters.

The present disclosure also describes "cascade network trimming," which further refines the CT-SRCNN model by reducing its storage and computational complexity, as well as another method for further improving the efficiency of super-resolution deep convolutional neural networks by deploying a form of "dilated convolution" in place of the full conventional convolution computation, which can further reduce the complexity of the CT-SRCNN model.

The remainder of this disclosure discusses these three different schemes/features of CT-SRCNN in order:
I. cascade training;
II. cascade network trimming; and
III. dilated convolution.

Although these three methods/techniques are discussed in the context of CT-SRCNN, each method/technique may be applied individually or separately to other SR schemes or CNN networks, as would be understood by one of ordinary skill in the art.

FIG. 1 illustrates an exemplary block diagram of a method for constructing a cascade trained super-resolution convolutional neural network (CT-SRCNN) according to an embodiment.

At step 110, a training set is prepared, meaning a set of low-resolution (LR) images and corresponding high-resolution (HR) images from which the CT-SRCNN "learns" the model it uses when attempting to produce high-resolution images from low-resolution images. In this embodiment, at step 120, each LR image is bicubically upsampled and LR/HR patches are cropped in preparation for training. For more details regarding this step, see, e.g., Dong et al. 2014 and Dong, Chao; Loy, Chen Change; He, Kaiming; and Tang, Xiaoou, Image super-resolution using deep convolutional networks, IEEE TPAMI, vol. 38, no. 2, pp. 295-307 (Feb. 2016) (hereinafter, "Dong et al. 2016a"), incorporated by reference in its entirety. As would be understood by one of ordinary skill in the art, various preparation techniques exist, and the present disclosure is not limited to bicubic upsampling and LR/HR patch cropping as the preparation technique.
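For illustration only, a minimal sketch of this preparation step is given below, assuming OpenCV and NumPy are available; the patch size and stride are hypothetical values chosen for the example, not parameters mandated by the disclosure.

```python
import cv2
import numpy as np

def prepare_training_pairs(lr_img, hr_img, patch=33, stride=14):
    """Bicubically upsample an LR image to the HR size, then crop aligned LR/HR patches."""
    h, w = hr_img.shape[:2]
    # Bicubic upsampling of the LR image to the HR resolution (step 120).
    lr_up = cv2.resize(lr_img, (w, h), interpolation=cv2.INTER_CUBIC)
    lr_patches, hr_patches = [], []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            lr_patches.append(lr_up[y:y + patch, x:x + patch])
            hr_patches.append(hr_img[y:y + patch, x:x + patch])
    return np.asarray(lr_patches), np.asarray(hr_patches)
```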

At step 130, cascade training is performed according to the present disclosure. An embodiment of cascade training according to a specific embodiment of the present disclosure is described below. At step 140, cascade network trimming is performed according to the present disclosure. An embodiment of network trimming according to a specific embodiment of the present disclosure is also described further below. At step 150, the process is complete and the CT-SRCNN system is ready for real-world use.

Although these different processes (i.e., cascade training and cascade network trimming) are described and shown in FIG. 1 as separate and distinct stages/steps, there may be overlap between these functions in actual implementations according to the present disclosure.

I. Cascade training

FIG. 2 illustrates an exemplary diagram of cascade training according to one embodiment. At step 205, the training process begins.

At step 210, training starts at stage i = 1. The new network starts with b layers, and c layers are added at each stage in which the training error converges (220) but remains above a threshold (250). Thus, at each training stage i, a CNN with c*(i-1)+b layers is trained. When stage i = 1, a CNN with the first b layers is trained. After stage i = 1, cascade training begins adding intermediate layers to the b layers as needed, specifically c layers at a time.

At step 220, it is determined whether the network has begun to converge, e.g., whether the training error has stopped decreasing by a certain amount (since the previous stage). If it has (i.e., the CNN is converging), c intermediate layers are added at step 230, and the next iteration begins at step 240 (i = i + 1). During this iterative process, the new layers may be given arbitrary weights, since the intermediate layers do not affect the weight-matrix sizes of the other layers. Indeed, all existing layers inherit their previous weight matrices. This cascade training iteration continues, making the CNN deeper and deeper, until the training error falls below the threshold at step 250, whereupon the CNN model is output at step 255.
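A minimal PyTorch-style sketch of this cascade loop is shown below; the convergence test, optimizer, and hyperparameters are placeholder assumptions standing in for the procedure of FIG. 2, and make_base / insert_layers are hypothetical helpers supplied by the caller (one possible form of insert_layers appears in the architecture sketch later in this section).

```python
import torch
import torch.nn as nn

def train_until_converged(model, loader, lr=1e-4, min_rel_improvement=0.03):
    """Train with MSE until the per-epoch error stops improving by the given fraction."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    prev = float("inf")
    while True:
        total = 0.0
        for x, y in loader:
            opt.zero_grad()
            err = loss_fn(model(x), y)
            err.backward()
            opt.step()
            total += err.item()
        if prev - total < min_rel_improvement * prev:   # convergence test (step 220)
            return total
        prev = total

def cascade_train(make_base, insert_layers, loader, c=2, error_threshold=1.0):
    """Steps 210-255: grow the CNN by c layers whenever it converges above the threshold."""
    model = make_base()                        # stage i = 1: the b-layer starting network
    while True:
        err = train_until_converged(model, loader)
        if err < error_threshold:              # step 250: error small enough, output model
            return model
        model = insert_layers(model, c)        # step 230: add c randomly initialized layers
```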

FIGs. 3A and 3B illustrate some of the differences between cascade training and existing training methods.

FIG. 3A shows an example of the flowchart in FIG. 2. In FIG. 3A, the number b of layers equals three, as shown at the top (310), which represents the first CNN to be trained, and the number c of layers added at each stage is one. Each new layer has its weights set randomly, while each pre-existing layer inherits its weights from the previous stage. With each newly inserted intermediate layer, the CNN becomes deeper. At each stage, the deeper CNN is trained again. Since most of the weights are inherited from the previous stage, this continuous retraining is relatively simple, even with a fixed learning rate.

However, as shown in FIG. 3B, existing methods start with the "complete" set of layers, which must be tuned at the same time. Training all layers at the same time, as shown in FIG. 3B, is much more complicated than the scheme shown in FIG. 3A because of slow convergence, whereas cascade training trains a shallower network until convergence, incrementally inserts layers with random weights while keeping the previously trained layers intact, and retrains the whole network until the deeper network converges. Moreover, cascade training can simply fix the learning rate and generate new layers with random weights.

FIGs. 4A and 4B illustrate the starting CNN and the ending CNN, respectively, of cascade training according to one embodiment.

Let x denote an interpolated LR image and y denote its matching HR image. Given a training set {(x_i, y_i), i = 1, ..., N} with N samples, the goal of CT-SRCNN is to learn a model g that predicts the HR output y^ = g(x). During training, the mean squared error (MSE), (1/N) Σ_{i=1..N} ||g(x_i) - y_i||², is minimized over the training set.

In FIG. 4A, cascade training starts from a 3-layer model (b = 3). The first layer (410) consists of 64 9×9 filters, and the second layer (413) and third layer (415) consist of 32 5×5 filters. All weights (of new layers) are randomly initialized from a Gaussian distribution with σ = 0.001, and all convolutions have stride one. "Stride" is one of the hyperparameters of a convolutional layer and controls how the depth columns around the spatial dimensions (width and height) are allocated; in other words, the stride indicates how the filter convolves around the input volume, i.e., "stride one" indicates that the filter convolves around the input volume one pixel at a time, "stride two" indicates that the filter convolves two pixels at a time, and so on. See, e.g., the definitions in "Convolutional neural network," downloaded on June 5, 2017 from Wikipedia at https://en.wikipedia.org/wiki/Convolutional_neural_network, and "A Beginner's Guide To Understanding Convolutional Neural Networks - Part 2," downloaded on June 5, 2017 from https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/; both documents are incorporated by reference in their entirety.

Returning to FIG. 4A, when the MSE of the current stage stops decreasing significantly, e.g., the error decreases by less than 3% in an epoch, training proceeds to the next stage. See, e.g., step 220 of FIG. 2. To accelerate training in this embodiment, two new layers are inserted into the network at each stage (i.e., c = 2 in step 230 of FIG. 2). Thus, training starts from 3 layers, as shown in FIG. 4A, and then proceeds to 5 layers, 7 layers, ..., and finally to 13 layers after five (5) stages. Each new layer consists of 32 3×3 filters. This size keeps the network small even as the CNN gradually becomes deeper. The new intermediate layers are inserted immediately before the last 32-filter 5×5 layer 415. The weights of any layer that existed in the previous stage are inherited from the previous stage, and the weights of the two new layers are always randomly initialized (Gaussian distribution with σ = 0.001). Since a new convolutional layer would reduce the size of the feature maps, 2 pixels of zero padding are applied in each new intermediate 3×3 layer. As a result, all stages of cascade training have outputs of the same size, so the training samples can be shared.
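The sketch below (PyTorch, illustrative only) builds a 3-layer starting network and inserts two new 32-filter 3×3 layers, zero-padded so the feature-map size is preserved, immediately before the last 5×5 layer at each stage. The 64-32-1 filter counts follow the parameter-count example later in this section; the ReLU activations and the padding of the first two layers are assumptions, not details stated in the disclosure. A parameter-count helper is included for the comparison below.

```python
import torch
import torch.nn as nn

def gaussian_init(layer, sigma=0.001):
    nn.init.normal_(layer.weight, mean=0.0, std=sigma)   # random Gaussian init, sigma = 0.001
    nn.init.zeros_(layer.bias)

def make_base_ct_srcnn():
    """Stage-1 network: 64 9x9 filters, 32 5x5 filters, and a 5x5 output layer (64-32-1)."""
    layers = [nn.Conv2d(1, 64, 9, stride=1, padding=4), nn.ReLU(inplace=True),  # ReLU assumed
              nn.Conv2d(64, 32, 5, stride=1, padding=2), nn.ReLU(inplace=True),
              nn.Conv2d(32, 1, 5, stride=1, padding=2)]
    for m in layers:
        if isinstance(m, nn.Conv2d):
            gaussian_init(m)
    return nn.Sequential(*layers)

def insert_intermediate_layers(model, c=2):
    """Insert c new 32-filter 3x3 layers (zero padding keeps the output size)
    immediately before the last 5x5 layer; existing layers keep their weights."""
    layers = list(model.children())
    new = []
    for _ in range(c):
        conv = nn.Conv2d(32, 32, 3, stride=1, padding=1)
        gaussian_init(conv)                    # only the new layers are re-initialized
        new += [conv, nn.ReLU(inplace=True)]
    return nn.Sequential(*(layers[:-1] + new + layers[-1:]))

def count_parameters(model, ignore_bias=True):
    return sum(p.numel() for n, p in model.named_parameters()
               if not (ignore_bias and n.endswith("bias")))

# The 3-layer base has 1*81*64 + 64*25*32 + 32*25*1 = 57,184 weights (biases ignored).
print(count_parameters(make_base_ct_srcnn()))
```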

As a network becomes deeper, training that converges with existing methods generally becomes more difficult. For example, the SRCNN in Dong et al. 2016a failed to show superior performance with more than three layers. In Kim, Jiwon; Lee, Jung Kwon; and Lee, Kyoung Mu, Accurate image super-resolution using very deep convolutional networks, 2016 CVPR, pp. 1646-1654 (hereinafter, "VDSR"), incorporated by reference in its entirety, a high initial learning rate is tuned and gradually decreased. But when a large, diverse training set is used (e.g., more than 30 million patches from 160,000 images), the high learning rate does not work well. A potential reason for this is that the high learning rate leads to vanishing/exploding gradients.

In CT-SRCNN, only a few weights are randomly initialized at each stage, so convergence is relatively easy. A fixed learning rate of 0.0001, without any decay, is also feasible for all layers in CT-SRCNN. To accelerate training, only the first stage needs to be changed; for example, the learning rate of the first stage may be set to 0.001. In experiments/simulations, a 13-layer CT-SRCNN (such as the one in FIG. 4B) achieved state-of-the-art accuracy while using far fewer parameters than other networks such as VDSR or Kim, Jiwon; Lee, Jung Kwon; and Lee, Kyoung Mu, Deeply-recursive convolutional network for image super-resolution, 2016 CVPR, pp. 1637-1645 (hereinafter, "DRCN"), incorporated by reference in its entirety. In contrast, directly training randomly initialized deeper networks requires substantial parameter-tuning work to ensure the best convergence in these other networks, and experiments have shown that such networks may still fail to converge with acceptable error.

As shown in Table 1 below, when two image-quality metrics are measured (the peak signal-to-noise ratio (PSNR) and the structural similarity measure (SSIM)), it can be seen that CT-SRCNN achieves better quality and faster speed. Moreover, CT-SRCNN recovers more details than VDSR and DRCN.

Given L layers in a CNN, assume the i-th layer has n_{i-1} input channels, a k_i × k_i convolution kernel, and n_i filters. The number of parameters in the i-th layer is then n_{i-1} × k_i × k_i × n_i. Bias terms are ignored in this calculation. The overall number of parameters is therefore the sum of n_{i-1} × k_i × k_i × n_i over i = 1, ..., L. Thus, for example, in a 3-layer CT-SRCNN with 64-32-1 filters in the respective layers, n_0 = 1, n_1 = 64, n_2 = 32, n_3 = 1, k_1 = 9, k_2 = 5, and k_3 = 5, so the overall number of parameters is 1×9×9×64 + 64×5×5×32 + 32×5×5×1 = 57,184.

PSNR/SSIM are used to measure image reconstruction quality. PSNR is the ratio between the maximum possible power of an image pixel and the power of the corrupting noise that affects fidelity. It is calculated as PSNR = 10·log10(255²/MSE), where the MSE is computed between the ground truth and the reconstructed image (the SR output). The larger the PSNR, the better the image quality; the maximum value of PSNR is infinite. See, e.g., the definition of "peak signal-to-noise ratio" downloaded from Wikipedia at https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio on June 27, 2017, incorporated by reference in its entirety.
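A minimal NumPy sketch of this computation, assuming 8-bit images with a maximum pixel value of 255 (SSIM, defined next, is also available, e.g., as skimage.metrics.structural_similarity):

```python
import numpy as np

def psnr(reconstructed, ground_truth):
    """Peak signal-to-noise ratio in dB; larger is better, infinite for a perfect match."""
    mse = np.mean((reconstructed.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)
```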

SSIM is a perception-based model that treats image degradation as a perceived change in structural information while also incorporating luminance masking and contrast masking. It is more consistent with human vision than PSNR. SSIM is calculated as

SSIM(x, y) = (2·μ_x·μ_y + c_1)(2·σ_xy + c_2) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2)),

where x is the reconstructed image, y is the reference image (ground truth), μ_x and μ_y are the means, σ_x² and σ_y² are the variances, σ_xy is the covariance between x and y, c_1 = (k_1·L)², and c_2 = (k_2·L)², with L the dynamic range of the pixel values. SSIM lies in [0, 1]; if x is a perfect copy of y, the SSIM is 1. See, e.g., the definition of "structural similarity" downloaded from Wikipedia at https://en.wikipedia.org/wiki/Structural_similarity on June 27, 2017, incorporated by reference in its entirety.

Table I. Comparison of CT-SRCNN with existing methods

  Method | Number of parameters | PSNR | SSIM | Time per image (seconds)
  VDSR | >600,000 | 29.77 | 0.8314 | 0.17
  DRCN | >1,000,000 | 29.76 | 0.8311 | 4.19
  13-layer SRCNN (cascade training only) | ~150,000 | 29.91 | 0.8324 | 0.03
  13-layer CT-SRCNN with cascade trimming | ~120,000 | 29.91 | 0.8322 | 0.02

II. Cascade network trimming

Most neural networks contain redundancy. Removing such redundancy clearly improves efficiency. In embodiments of the present disclosure, a large number of filters and/or weights can be removed from certain layers with little loss of accuracy.

This technique/method (cascade network trimming) can be used together with the cascade training described above, or it can be used independently of cascade training. Given a deep convolutional neural network with acceptable accuracy or performance, techniques/methods for reducing network size, computational complexity, and/or processing time are always desirable, while keeping the network depth the same and without degrading accuracy.

Similar to cascade training, cascade network trimming also involves an iterative process. In each stage, filters are trimmed from only d layers; this means that, for an L-layer network, layers (L-(i-1)d-1) through (L-id) are trimmed in stage i. For example, when trimming d = 2 layers at a time from a 13-layer CT-SRCNN, the 12th and 11th layers are trimmed in the first stage i = 1, and the network is then fine-tuned. When it converges, the second stage i = 2 begins by trimming the 9th and 10th layers. This procedure is repeated iteratively until all layers are trimmed. Although the 13th layer is ignored in the above procedure, the procedure can also be regarded as trimming the 12th and 13th layers in the first stage, the 10th and 11th layers in the second stage, and so on.
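A small sketch that simply enumerates which layers the stated formula selects at each stage (illustrative only; the top layer is left untouched, matching the first interpretation above):

```python
def trimming_schedule(num_layers, d=2):
    """Stage i trims layers (L-(i-1)d-1) down to (L-i*d), until i*d >= L."""
    i = 1
    while True:
        top = num_layers - (i - 1) * d - 1
        bottom = max(num_layers - i * d, 1)
        if top < 1:
            break
        print(f"stage {i}: trim and fine-tune layers {top} .. {bottom}")
        if i * d >= num_layers:
            break
        i += 1

trimming_schedule(13, d=2)
# stage 1: trim and fine-tune layers 12 .. 11
# stage 2: trim and fine-tune layers 10 .. 9
# ... continuing down to layers 2 .. 1
```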

FIG. 5 illustrates an exemplary diagram of cascade network trimming according to one embodiment. At step 505, the trimming process starts with a trained CNN having L layers.

At step 510, trimming starts at stage i = 1. As mentioned above, only d layers of the L-layer CNN are trimmed in a stage. Accordingly, at step 510, layers (L-(i-1)d-1) through (L-id) are trimmed in stage i. At step 520, fine-tuning is performed. At step 530, it is determined whether the training error has stopped decreasing by a certain amount (since the previous stage). If it has, it is determined at step 540 whether the number of stages multiplied by the number of layers trimmed per stage is greater than or equal to the total number of layers ("(i*d >= L)?"). If the training error has not stopped decreasing at step 530, the method returns to fine-tuning at step 520.

If it is determined at step 540 that the number of stages multiplied by the number of layers trimmed per stage is greater than or equal to the total number of layers ("(i*d >= L)?"), the process ends at step 565 and the trimmed CNN model is output. If it is determined at step 540 that the number of stages multiplied by the number of layers trimmed per stage is less than the total number of layers, the method begins the next stage at step 550 ("i = i + 1").

FIGs. 6A and 6B illustrate some of the differences between network trimming approaches according to one embodiment.

In FIG. 6A, according to an embodiment of the present disclosure, one layer of the CNN is trimmed per stage, with fine-tuning performed between stages. By contrast, all layers of the CNN in FIG. 6B are trimmed and fine-tuned at the same time. Trimming and fine-tuning all layers at the same time, as shown in FIG. 6B, is much more complicated than the scheme shown in FIG. 6A.

Cascade network trimming is performed by trimming whole filters from the layers. To recover any lost accuracy, fine-tuning is applied layer by layer until convergence is reached after each trimmed layer or group of layers.

As shown in FIG. 7, trimming filters also affects the adjacent layers. In FIG. 7, trimming a filter 710 (the dotted block) from the i-th layer, so that n_i becomes n_i - 1, also trims some of the weights 720 (indicated by the dotted lines inside the filters) in the (i+1)-th layer. Trimming filters in the i-th layer thus reduces the computational cost of both the i-th layer and the (i+1)-th layer. In a CNN, the number of input channels of the (i+1)-th layer equals the number of filters (output channels) of the i-th layer.
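A PyTorch-style sketch of this bookkeeping is shown below, using the FIG. 7 configuration discussed next (n_{i-1} = 5, n_i = 4, n_{i+1} = 10); the filter to drop is chosen by the sum-of-squares importance measure of equation (1) defined later in this section. This is an illustrative sketch, not the patented implementation.

```python
import torch
import torch.nn as nn

def trim_one_filter(conv_i, conv_next):
    """Remove the least-important filter of conv_i and the matching input-channel
    weights of conv_next (FIG. 7): n_i shrinks by one, n_{i+1} is unchanged."""
    w = conv_i.weight.data                              # shape: (n_i, n_{i-1}, k, k)
    importance = (w ** 2).sum(dim=(1, 2, 3))            # R_{i,j}: sum of squared weights
    j = int(importance.argmin())                        # filter with the smallest R_{i,j}
    keep = [f for f in range(w.shape[0]) if f != j]

    new_i = nn.Conv2d(conv_i.in_channels, conv_i.out_channels - 1,
                      conv_i.kernel_size, padding=conv_i.padding)
    new_i.weight.data = w[keep].clone()
    new_i.bias.data = conv_i.bias.data[keep].clone()

    new_next = nn.Conv2d(conv_next.in_channels - 1, conv_next.out_channels,
                         conv_next.kernel_size, padding=conv_next.padding)
    new_next.weight.data = conv_next.weight.data[:, keep].clone()  # drop matching input slice
    new_next.bias.data = conv_next.bias.data.clone()
    return new_i, new_next

# FIG. 7 configuration: layer i has 4 filters on 5 input channels, layer i+1 has 10 filters.
conv_i = nn.Conv2d(5, 4, 3, padding=1)
conv_next = nn.Conv2d(4, 10, 3, padding=1)
conv_i, conv_next = trim_one_filter(conv_i, conv_next)
print(conv_i.out_channels, conv_next.in_channels)   # 3 3
```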

In FIG. 7, assume that before trimming there are n_i = 4 filters and n_{i-1} = 5 input channels in the i-th layer, and n_{i+1} = 10 filters and n_i = 4 input channels in the (i+1)-th layer. If filter 710 is trimmed from the i-th layer, the trimmed n_i is reduced to 3, while n_{i+1} remains 10. The blocks 720 in the (i+1)-th layer are the trimmed weights corresponding to the multiplications. As mentioned in the previous section, there are n_{i-1} × k_i × k_i × n_i multiplications (per output location) in the i-th layer and n_i × k_{i+1} × k_{i+1} × n_{i+1} multiplications in the (i+1)-th layer. Since n_i is reduced, the number of multiplications in both the i-th layer and the (i+1)-th layer is also reduced.

Appropriate criteria are used to decide which filters to trim. In this embodiment, a measure of relative importance is used. More precisely, the relative importance R_{i,j} of the j-th filter in the i-th layer is defined by the sum of squares of all the weights in the j-th filter, where W_{i,j} is the weight matrix of the j-th filter in the i-th layer, as shown in equation (1):

R_{i,j} = Σ_{w ∈ W_{i,j}} w²      (1)

Accordingly, the filters with the smallest R_{i,j} are removed. As discussed above, when filter 710 is trimmed from the i-th layer, some of the weights 720 in the (i+1)-th layer are also trimmed, which yields a trimmed weight matrix W̃_{i+1,j}. Therefore, when computing R_{i+1,j}, either the non-trimmed weights W_{i+1,j} are used, as in equation (3) (also called "independent trimming"), or the trimmed weights W̃_{i+1,j} are used, as in equation (2):

R_{i+1,j} = Σ_{w ∈ W̃_{i+1,j}} w²      (2)

R_{i+1,j} = Σ_{w ∈ W_{i+1,j}} w²      (3)

The following algorithm provides an exemplary high-level description of the iterative process for trimming filters from the layers.

Algorithm for trimming filters
Parameters: the filter trimming rate r_filters,i for each layer, i = 1, ..., L
Input: a CT-SRCNN with L layers, in which the i-th layer has M_i filters
1. Repeat for i = 1, 2, ..., L
   1.1 Use (2) or (3) to compute R_{i,j} for all filters in the i-th layer, j = 1, ..., M_i
   1.2 Remove the r_filters,i × M_i filters with the smallest R_{i,j} from the i-th layer
   1.3 If i < L, remove the corresponding weights in the (i+1)-th layer
2. Fine-tune and output the trimmed model

Different trimmed models can be obtained by using different trimming rates/thresholds. Because filter trimming affects the neighboring layers, fine-tuning is needed to recover accuracy in most cases in which filter trimming is used. By contrast, weight pruning has a relatively small effect; at an appropriate trimming rate (e.g., less than 0.2), the accuracy does not decrease even without fine-tuning.

III. Dilated convolution

Dilated convolution (also known as atrous convolution) is a type of convolution that was originally developed for wavelet decomposition (see Holschneider, M.; Kronland-Martinet, R.; Morlet, J.; and Tchamitchian, Ph., A Real-Time Algorithm for Signal Analysis with the Help of the Wavelet Transform, in J. M. Combes et al., eds., WAVELETS: TIME-FREQUENCY METHODS AND PHASE SPACE, pp. 286-297 (1987), incorporated by reference in its entirety), but it has also been applied to semantic segmentation, in particular in order to obtain dense features (see, e.g., Yu, Fisher and Koltun, Vladlen, Multi-scale context aggregation by dilated convolutions, 2016 International Conference on Learning Representations (ICLR) (hereinafter, "Yu et al. 2016"), incorporated by reference in its entirety).

In a pure convolutional network composed of convolutional layers without pooling, the receptive field of a unit can only grow linearly layer by layer, because the feature maps are generated by convolving neighboring pixels of the input. A feasible way to increase the receptive field is to convolve input pixels from a larger region. This is analogous to using a "dilated kernel" in dilated convolution instead of the conventional dense kernel of conventional convolution.
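As a quick illustration (an assumed example, not taken from the disclosure), the receptive field of a stack of stride-one convolutions grows only linearly with depth:

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions: 1 + sum of (k - 1)."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(stacked_receptive_field([9, 5, 5]))                  # 17 for the 3-layer network
print(stacked_receptive_field([9, 5] + [3] * 10 + [5]))    # grows by only 2 per added 3x3 layer
```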

Let F be a discrete function, let k be a convolution kernel, and let the dilated convolution *_d be a generalized version of the typical convolution, as defined by equation (4) below, where d is the dilation factor. Conventional convolution is simply 1-dilated convolution (i.e., the case d = 1).

(F *_d k)(p) = Σ_{s + d·t = p} F(s)·k(t)      (4)

One advantage of applying dilated convolution in a CNN is that the dilated version has a larger receptive field, as illustrated in FIGs. 8A and 8B. The dilated convolution filter is obtained by upsampling the original filter, i.e., by inserting zeros between its elements. Therefore, by design, the dilated filter has a structured pattern of zero elements. Compared with weight pruning, where the zero elements have random patterns and positions, the dilated filter has a structured pattern of zero weights, which is more useful for reducing the computational complexity in hardware and software. Therefore, for super-resolution, embodiments of the present disclosure deploy dilated filters differently from their typical use: the receptive field is kept the same, and the dilated filter is actually used to reduce computational complexity relative to a non-dilated filter with the same receptive field.
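The sketch below (PyTorch, illustrative) checks this relationship numerically: a 2-dilated 5×5 kernel produces the same output as an ordinary convolution with the 9×9 kernel obtained by inserting zeros between its elements, while storing only 25 of the 81 weights.

```python
import torch
import torch.nn.functional as F

k = torch.randn(1, 1, 5, 5)                   # dense 5x5 kernel (25 weights)
expanded = torch.zeros(1, 1, 9, 9)            # same filter written as a 9x9 kernel
expanded[:, :, ::2, ::2] = k                  # zeros inserted between the elements

x = torch.randn(1, 1, 32, 32)
y_dilated = F.conv2d(x, k, dilation=2)        # 2-dilated convolution, receptive field 9x9
y_dense = F.conv2d(x, expanded, dilation=1)   # ordinary convolution with the expanded kernel
print(torch.allclose(y_dilated, y_dense, atol=1e-6))   # True
print(k.numel(), expanded.numel())            # 25 vs 81 weights for the same receptive field
```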

FIGs. 8A and 8B illustrate some of the differences between dilated convolution and conventional convolution, respectively, according to one embodiment. In FIG. 8B, a conventional convolution with stride two is performed, while FIG. 8A shows a 2-dilated convolution with stride one according to an embodiment of the present disclosure (meaning that the multiply-and-accumulate operations of the convolution are applied every 2 pixels rather than every pixel). Although FIGs. 8A and 8B have the same feature-map size (with padding for the dilated version), the receptive field of the 2-dilated feature map is larger than that of the stride-two convolution. In a CNN, the input and output are 2-D feature maps, so FIG. 8A or FIG. 8B shows only the x direction or the y direction.

FIG. 8B illustrates an example of a conventional convolution with kernel size 3 and stride 2, where the input is a 7-pixel signal (represented by 7 circles). In FIG. 8B, every 3 neighboring pixels are convolved with the kernel (as indicated by the connecting lines) to generate an output of the feature map (the squares), starting with the 1st through 3rd pixels (hatched circles) and the first output (hatched square). Because the stride is 2, the next convolution in FIG. 8B is over the 3rd through 5th pixels, and the next output of the feature map (black square) is again composed of 3 elements, with a receptive field of 3.

In contrast, FIG. 8A illustrates an example of a 2-dilated convolution with kernel size 3 and stride 1. In d-dilated convolution, the convolution is applied every d pixels. Thus, the first output of the feature map (hatched square) is generated by convolving the 1st, 3rd, and 5th pixels (hatched circles) with the 3×3 kernel. The next output (black square) is then generated by convolving the 2nd, 4th, and 6th pixels.

In an embodiment in which all of the layers in the CNN are convolutions with stride one, dilated convolution can be applied in a different way. Given a k×k convolution kernel with stride one, the receptive field of the resulting feature map is k×k. If a 2-dilated convolution is used instead, the receptive field of the resulting feature map is (2k-1)×(2k-1). For example, the 9×9 1-dilated layer 410 and the 5×5 1-dilated layer 413 of the CT-SRCNN in FIGs. 4A and 4B can in fact be replaced by a 5×5 2-dilated layer and a 3×3 2-dilated layer, respectively. The resulting network has a receptive field of the same size, but fewer parameters because of the smaller kernel sizes.

Thus, in one embodiment, after a CT-SRCNN having a 9×9 1-dilated layer and two 5×5 1-dilated layers is trained, those layers can be replaced by a 5×5 2-dilated layer and two 3×3 2-dilated layers before fine-tuning is performed. Unlike Yu et al. 2016, the dilated CT-SRCNN according to an embodiment of the present disclosure does not require any zero padding in the dilated layers.

As mentioned above, many researchers have attempted to improve the accuracy and efficiency of SRCNN, for example by using more layers (e.g., VDSR) or deeply-recursive structures (e.g., DRCN). Other researchers have similarly proposed using more complex networks. Wang, Zhaowen; Liu, Ding; Yang, Jianchao; Han, Wei; and Huang, Thomas, Deep networks for image super-resolution with sparse prior, 2015 IEEE International Conference on Computer Vision (ICCV), pp. 370-378, incorporated herein by reference, integrates a sparse-representation prior with a feed-forward network based on the learned iterative shrinkage and thresholding algorithm. VDSR increases the number of layers to 20 and uses small filters and a high learning rate with adjustable gradient clipping; the same group also designed the deeply-recursive CNN with recursive supervision and skip connections in DRCN. Dahl, Ryan; Norouzi, Mohammad; and Shlens, Jonathon, Pixel Recursive Super Resolution, arXiv 1702.00783 [22 Mar. 2017], incorporated herein by reference, combines ResNet with pixel-recursive super-resolution, showing promising results on face and bed SR, in which super-resolution is applied to images of faces and beds.

Others prefer to use a perceptual loss rather than the mean squared error (MSE) as the training error, which is closer to natural texture and human vision. Sønderby, Casper; Caballero, Jose; Theis, Lucas; Shi, Wenzhe; and Huszár, Ferenc, Amortised MAP Inference for Image Super-resolution, arXiv 1610.04490 [21 Feb. 2017], incorporated herein by reference, introduces a method for amortised MAP inference that directly computes the MAP estimate using a CNN. Johnson, Justin; Alahi, Alexandre; and Fei-Fei, Li, Perceptual losses for real-time style transfer and super-resolution, 2016 ECCV, pp. 694-711, incorporated herein by reference, proposes the use of perceptual loss functions for training feed-forward networks for image transformation tasks. Ledig, Christian, et al., Photo-realistic single image super-resolution using a generative adversarial network, arXiv 1609.04802 [13 Apr. 2017], incorporated herein by reference, employs a very deep residual network (ResNet) and further presents a super-resolution generative adversarial network (SRGAN) to obtain images resembling natural texture.

However, although the works listed above improve the accuracy of SR systems, the improved accuracy comes at the cost of more layers/parameters and/or more difficult hyperparameter tuning procedures. In other words, any gain in accuracy is offset by a large increase in complexity.

Other researchers have focused on improving efficiency by extracting feature maps in the LR space and training with upscaling filters. Shi, Wenzhe, et al., Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, 2016 CVPR, pp. 1874-1883, incorporated herein by reference, introduces an efficient sub-pixel convolution layer that learns an array of upscaling filters to upscale the LR feature maps into the HR output. Dong, Chao; Loy, Chen Change; and Tang, Xiaoou, Accelerating the super-resolution convolutional neural network, 2016 ECCV, pp. 391-407, incorporated herein by reference in its entirety (hereinafter, "Dong et al. 2016b"), redesigns SRCNN by adding smaller filters, a deconvolution layer, and feature-space shrinkage to accelerate the network without loss of accuracy.

然而,由於使用超標度層,這些網路的修補大小及內容接收場將相對小。因此,準確性相較於自經上取樣LR空間提取特徵映射相對更低。However, due to the use of overscale layers, the patch size and content receiving field of these networks will be relatively small. Therefore, the accuracy is relatively lower than that of extracting feature maps from the upsampled LR space.

相反地,本文中所描述的CT-SRCNN可變得更深,藉此實現高準確性,而無需參數的大量調諧。CT-SRCNN的網路大小相較於當前最新技術解決方案(諸如以上列出的那些)小得多。CT-SRCNN亦可在單個GPU中處理具有720 × 480的解析度的20至25個圖框/秒的視訊。此效率可藉由網路微調及擴張卷積進一步增強。Conversely, the CT-SRCNN described in this article can be made deeper, thereby achieving high accuracy without requiring extensive tuning of parameters. The network size of CT-SRCNN is much smaller than the current state-of-the-art technical solutions (such as those listed above). CT-SRCNN can also process 20-25 frames per second video with a resolution of 720 × 480 in a single GPU. This efficiency can be further enhanced by network fine-tuning and expanded convolution.

在本揭露內容中,描述級聯訓練方法,所述級聯訓練方法訓練深度CNN以用於具有高準確性及效率兩者的超解析度。級聯訓練確保網路可以相對較小大小不斷變得更深。本文中所描述的網路微調及擴張卷積進一步降低網路複雜度。基準影像及視訊資料集上的實驗結果展示本文中的所揭露方法實現相較於其他當前最新技術解決方案但在高得多的速度下的競爭性效能。In this disclosure, a cascaded training method is described, which trains a deep CNN for super-resolution with both high accuracy and efficiency. Cascading training ensures that the network can become deeper and deeper with a relatively small size. The network fine-tuning and expanded convolution described in this article further reduce network complexity. Experimental results on benchmark images and video data sets show that the method disclosed in this article achieves competitive performance at a much higher speed than other current state-of-the-art technical solutions.
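
As a concrete illustration of the cascade-training loop described above, the following PyTorch-style sketch adds new intermediate 3 × 3 layers to a small SRCNN-like model until the training error falls below a threshold. The layer sizes follow the general description in this disclosure, but the helper structure, learning rate, and threshold are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

def make_base_srcnn():
    # 3-layer baseline: feature extraction, non-linear mapping, reconstruction.
    return nn.Sequential(
        nn.Conv2d(1, 64, 9, padding=4), nn.ReLU(inplace=True),
        nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(inplace=True),
        nn.Conv2d(32, 1, 5, padding=2),
    )

def make_intermediate_layer():
    # New 3x3 layer inserted before the reconstruction layer;
    # zero padding keeps the feature-map size unchanged.
    return nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True))

def train_one_epoch(model, loader, loss_fn, opt):
    total, count = 0.0, 0
    for lr_img, hr_img in loader:
        opt.zero_grad()
        loss = loss_fn(model(lr_img), hr_img)
        loss.backward()
        opt.step()
        total, count = total + loss.item(), count + 1
    return total / max(count, 1)

def cascade_train(model, train_loader, loss_threshold=1e-3, max_stages=5):
    mse = nn.MSELoss()
    for stage in range(max_stages):
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        err = train_one_epoch(model, train_loader, mse, opt)
        if err < loss_threshold:
            break
        # Insert a randomly initialized intermediate layer before the last conv;
        # all existing weights are inherited from the previous stage.
        layers = list(model.children())
        model = nn.Sequential(*layers[:-1], make_intermediate_layer(), layers[-1])
    return model
```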

Although described in the framework of image super-resolution, the techniques described herein are generic and applicable to any type of CNN used for any type of purpose, such as denoising or image restoration.

Figure 9 illustrates an exemplary diagram of the present apparatus according to one embodiment. The apparatus 900 includes at least one processor 910 and one or more non-transitory computer-readable media 920. The at least one processor 910, when executing instructions stored on the one or more non-transitory computer-readable media 920, performs the steps of: training a CNN with three or more layers; performing cascade training on the trained CNN to add one or more intermediate layers until the training error is less than a threshold; and performing network trimming of the CNN output from the cascade training. In addition, the one or more non-transitory computer-readable media 920 store instructions for the at least one processor 910 to perform those steps.

Figure 10 illustrates an exemplary flowchart for manufacturing and testing the present apparatus according to one embodiment.

At step 1050, an apparatus (in this example, the chipset described above) is manufactured that includes at least one processor and one or more non-transitory computer-readable media. When executing instructions stored on the one or more non-transitory computer-readable media, the at least one processor performs the steps of: training a CNN with three or more layers; performing cascade training on the trained CNN to add one or more intermediate layers until the training error is less than a threshold; and performing network trimming of the CNN output from the cascade training. The one or more non-transitory computer-readable media store instructions for the at least one processor to perform those steps.

At step 1060, the apparatus (in this example, the chipset) is tested. Testing 1060 includes testing whether the apparatus has at least one processor which, when executing instructions stored on one or more non-transitory computer-readable media, performs the steps of: training a CNN with three or more layers, performing cascade training on the trained CNN to add one or more intermediate layers until the training error is less than a threshold, and performing network trimming of the CNN output from the cascade training; and testing whether the apparatus has one or more non-transitory computer-readable media that store instructions for the at least one processor to perform those steps.

Experimental Validation

A. Cascade Training

Table A-I. Comparison of cascade training and conventional training on Set14, scale 3

| Network | PSNR | SSIM |
| CT-SRCNN, 5 layers | 29.44 | 0.8232 |
| Non-CT-SRCNN, 5 layers | 29.56 | 0.8258 |
| CT-SRCNN, 7 layers | 29.50 | 0.8245 |
| Non-CT-SRCNN, 7 layers | 29.71 | 0.8287 |
| CT-SRCNN, 9 layers | 29.52 | 0.8250 |
| Non-CT-SRCNN, 9 layers | 29.75 | 0.8299 |
| CT-SRCNN, 13 layers | 29.56 | 0.8265 |
| Non-CT-SRCNN, 13 layers | 29.91 | 0.8324 |

In Table A-I, the PSNR/SSIM of the cascade-trained CNN according to the present disclosure is compared with that of a non-cascade-trained CNN using unsupervised weight initialization from VDSR. It can be seen that, under the same network architecture, the PSNR/SSIM of CT-SRCNN is significantly better than that of non-cascade training.

Figure 11 is an exemplary diagram illustrating the convergence speed of a cascade-trained CNN and a non-cascade-trained CNN according to one embodiment. CT-SRCNN is found to converge faster than non-CT-SRCNN. The accuracy of CT-SRCNN keeps improving as more layers are used. This indicates that cascade network training also trains SRCNN deeper and deeper. Cascade network training performs better than conventional training in both accuracy and convergence speed.

In Table A-II, the number of parameters, the PSNR, the SSIM, and the time per image of CT-SRCNN-13 according to the present disclosure are compared with those of known SR networks at scale 3.

Table A-II. Comparison of the cascade-trained network with existing networks on Set14, scale 3

| Network | Number of parameters | Set14 PSNR | Set14 SSIM | Time per image (s) |
| VDSR | >600,000 | 29.77 | 0.8314 | 0.17 |
| DRCN | >1,000,000 | 29.76 | 0.8311 | 4.19 |
| 13-layer CT-SRCNN-13 | ~150,000 | 29.91 | 0.8324 | 0.03 |

B. Cascade Network Trimming

Table A-III shows that the cascade-trimmed CT-SRCNN (in which 4 of the 13 layers are trimmed) achieves performance similar to the non-trimmed CT-SRCNN, while the network size is reduced by 20%. The cascade network trimming according to the present disclosure was also applied to another network, namely the fast SRCNN (FSRCNN) (see Dong et al. 2016b). This network consists of 7 convolution layers and one deconvolution layer. Similar to the trimming of CT-SRCNN according to an embodiment above, 2 layers of FSRCNN are trimmed in each stage. Table A-III shows that the cascade network trimming according to the present disclosure is also effective for FSRCNN.

Table A-III. Evaluation of cascade-trimmed networks on Set14, scale 3

| Network | Number of parameters | PSNR | SSIM | Time per image (s) |
| CT-SRCNN, 13 layers, no trimming | ~150,000 | 29.91 | 0.8324 | 0.03 |
| Cascade-trimmed 13-layer CT-SRCNN, 4 layers trimmed | ~120,000 | 29.91 | 0.8322 | 0.02 |
| FSRCNN, 8 layers, no trimming | ~12,000 | 29.52 | 0.8246 | 0.009 |
| Cascade-trimmed FSRCNN, 8 layers, 2 layers trimmed | ~8,500 | 29.51 | 0.8244 | 0.008 |
| Cascade-trimmed FSRCNN, 8 layers, 4 layers trimmed | ~6,800 | 29.35 | 0.8228 | 0.007 |
| Cascade-trimmed FSRCNN, 8 layers, 6 layers trimmed | ~4,900 | 29.35 | 0.8208 | 0.006 |
| Cascade-trimmed FSRCNN, 8 layers, 8 layers trimmed | ~3,400 | 29.22 | 0.8189 | 0.005 |
| FSRCNN, official lightweight version | ~3,900 | 29.17 | 0.8175 | 0.006 |

There is a trade-off between the trimming rate and accuracy. If only 2 layers (the 7th and 8th layers) are trimmed, there is almost no loss of accuracy while 30% of the parameters are removed. If all 8 layers are trimmed (cascade-trimmed FSRCNN, 8 layers, 8 layers trimmed), the accuracy is still better than that of the official model (the official lightweight FSRCNN) with a smaller network size (3,400 parameters versus 3,900 parameters).

C. Dilated Convolution

Table A-IV shows the experimental results of a dilated 13-layer CT-SRCNN. Dilation is applied to the first 9 × 9 layer, the second 5 × 5 layer, and the last 5 × 5 layer. In practice, 5 × 5, 3 × 3, and 3 × 3 2-dilated convolution layers are used. It can be seen that the dilated version of CT-SRCNN achieves PSNR/SSIM similar to the non-dilated version, while the network size is significantly reduced.

Table A-IV. Evaluation of dilated CT-SRCNN on Set14, scale 3

| Network | Number of parameters | PSNR | SSIM | Time per image (s) |
| CT-SRCNN, 13 layers | ~150,000 | 29.91 | 0.8324 | 0.03 |
| Dilated CT-SRCNN, 13 layers | ~110,000 | 29.90 | 0.8324 | 0.02 |

Image enhancement techniques include image and video super-resolution, which recovers a high-resolution image from a low-resolution input; image denoising, which generates a clean image from a given noisy input; and compressed image restoration, which improves the image quality of a decoded compressed image. Furthermore, different network architectures can be implemented for the different image enhancement tasks.

Image compression reduces the irrelevance and redundancy of an image so that it can be stored or transmitted at a low bit rate. Image compression is a basic element of the image processing used in imaging devices. Traditional image coding standards (e.g., JPEG, JPEG 2000, Better Portable Graphics (BPG)) attempt to distribute the available bits for every non-zero quantized transform coefficient over the whole image. As the compression ratio increases, the bits per pixel (bpp) decrease as a result of using larger quantization steps, which causes the decoded image to exhibit blocking artifacts or noise. To overcome such problems, post-processing deblocking or denoising methods can be used to improve the quality of the decoded image. Typical methods include post-filtering. However, such post-processing methods are extremely time consuming because finding the optimal solution involves a computationally expensive iterative process, which makes them difficult to use in practical applications.

Image denoising generates a clean image y from a given noisy image x, which follows the image degradation model x = y + n. For the additive white Gaussian noise (AWGN) model, the i-th observed pixel is x_i = y_i + n_i, where n_i is independent and identically distributed (i.i.d.) Gaussian noise with zero mean and variance σ². AWGN has been used to model signal-independent thermal noise and other system imperfections. The degradation caused by low-light shot noise is signal dependent and has typically been modeled with Poisson noise, where x_i ~ P(y_i), so that x_i is a Poisson random variable with mean y_i. However, for a sufficiently large λ, this noise approaches a Gaussian distribution N(λ, λ) for average light conditions. Therefore, the noise caused by capture with an imaging device is better modeled as Poisson noise combined with AWGN, referred to as Poisson-Gaussian noise, such that x_i = α·p_i + n_i for some scalar α > 0, where p_i ~ P(y_i / α) and n_i ~ N(0, σ²).

For image denoising, the input is a noisy image and the output is a clean image. The additional systems disclosed herein can apply a cascaded training network architecture similar to that of the image super-resolution described above, but with the upsampling module removed from the input. The cascaded training network architecture is further applicable to blind denoising, in which the noise level is unknown.
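
The Gaussian, Poisson, and Poisson-Gaussian noise models described above can be synthesized as in the following NumPy sketch. The clean image y may be on any non-negative scale; the peak = 10σ convention follows the experimental settings given later in this disclosure, and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian(y, sigma):
    # x_i = y_i + n_i, with n_i ~ N(0, sigma^2).
    return y + rng.normal(0.0, sigma, y.shape)

def add_poisson(y, peak):
    # Scale the clean image so its maximum equals `peak`, draw Poisson
    # counts, and scale back, so each pixel has mean proportional to y_i.
    y = np.clip(y, 0.0, None)
    scale = peak / max(y.max(), 1e-8)
    return rng.poisson(y * scale) / scale

def add_poisson_gaussian(y, sigma, peak=None):
    # Poisson shot noise followed by AWGN; the text uses peak = 10 * sigma.
    peak = 10.0 * sigma if peak is None else peak
    return add_gaussian(add_poisson(y, peak), sigma)
```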

The systems and methods disclosed herein can train a deep CNN for image denoising. The system receives a noisy image x as input and predicts a clean image as output. Given a training set {(x_i, y_i)}, i = 1, ..., N, with N samples, the system learns a model f(·) that predicts the clean image f(x_i). Training aims to minimize the mean squared error (MSE), (1/N) · Σ_i ||f(x_i) − y_i||², over the training set.

For compressed image restoration, the input is a decoded compressed image and the output is a refined image. The systems and methods described below can apply a cascaded training network architecture similar to that of the image super-resolution described above, but with the upsampling module removed from the input.

In addition, the recovery from a decoded image to the uncompressed image can be viewed as a mapping between two feature maps. The system can apply a neural network to perform this recovery from the decoded image. The system can train a deep CNN from decoded images to the uncompressed ground truth. The CNN receives the decoded image as input and predicts a clean image as output. Given a training set with N samples, the system learns a model that predicts the recovered image. Training aims to minimize the MSE over the training set.

Residual networks (ResNet) have shown considerable performance in computer vision applications such as image classification and super-resolution. The systems and methods provide a denoising residual network (DN-ResNet). DN-ResNet consists of residual blocks (ResBlocks) that are gradually inserted into the network stage by stage during training. This training strategy makes the resulting DN-ResNet converge quickly and makes it computationally more efficient than typical denoising networks. In one embodiment, the system modifies the ResBlock to have a learnable weighted skip connection to provide better denoising performance. DN-ResNet provides a deep CNN trained for blind denoising of Poisson-Gaussian corrupted images. By cascading multiple weighted ResBlocks (e.g., 5), DN-ResNet achieves state-of-the-art performance on three denoising problems (Gaussian, Poisson, and Poisson-Gaussian) for both known noise levels (non-blind denoising) and unknown noise levels (blind denoising). DN-ResNet is many times faster than previous denoising networks. DN-ResNet also handles the related problem of compressed image restoration well, and is therefore applicable to other applications.

Figure 12 is an exemplary diagram of a conventional ResBlock 1200 according to one embodiment. Figure 13 is an exemplary diagram of a simplified ResBlock 1300 according to one embodiment. Figure 14 is an exemplary diagram of a weighted ResBlock 1400 according to one embodiment.

Referring to Figures 12, 13, and 14, DN-ResNet may include as its basic element either the simplified ResBlock 1300 or the weighted ResBlock 1400. Unlike the conventional ResBlock 1200, the batch normalization (BN) layers 1202 and 1204 and the rectified linear unit (ReLU) layer 1206 after the addition are removed, since removing these layers does not hurt the performance of feature-map-based ResNets. In addition, the simplified ResBlock 1300 may be modified, as shown in the weighted ResBlock 1400, to have a learnable weighted skip connection 1402, in which the skip connection of each ResBlock 1400 passes through a scale layer 1404 with C learnable weights w_1, ..., w_C, where C is the number of feature maps at the skip connection.
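
A PyTorch sketch of this building block is given below. The 32-channel, 3 × 3 configuration follows the text, while the exact module layout is an assumption about one reasonable implementation of the simplified and weighted ResBlocks.

```python
import torch
import torch.nn as nn

class WeightedResBlock(nn.Module):
    """ResBlock without BN and without ReLU after the addition;
    the skip connection passes through a per-channel learnable scale."""
    def __init__(self, channels=32, weighted=True):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # One learnable weight per feature map at the skip connection.
        self.skip_scale = nn.Parameter(torch.ones(1, channels, 1, 1)) if weighted else None

    def forward(self, x):
        skip = x if self.skip_scale is None else x * self.skip_scale
        return skip + self.body(x)
```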

As DN-ResNet becomes deeper, training and hyperparameter tuning become increasingly difficult. The system can train the deep neural network by cascading simplified ResBlocks, also referred to as cascade-trained ResNet (CT-ResNet). Cascade training separates the whole training into stages that proceed one by one. The system provides training of CT-ResNet starting from a simple 3-layer CNN model. The first layer may contain 64 9 × 9 filters, the second layer 32 5 × 5 filters, and the last layer one 5 × 5 filter. The convolutions may have stride one, and the weights may be randomly initialized from a Gaussian distribution with, for example, σ = 0.001.

Figure 15 is an exemplary diagram of a cascade training system (CT-ResNet) 1500 according to one embodiment. After training the 3-layer CNN, the system 1500 cascades ResBlocks stage by stage. In each stage, one new ResBlock is inserted. In the example shown, training starts with 3 layers and proceeds to 5 layers, 7 layers, and so on. Each convolution layer in a ResBlock may contain 32 3 × 3 filters, which keeps the network small as it becomes deeper. The new layers are inserted only before the last 5 × 5 layer. The weights of the pre-existing layers are inherited from the previous stage, and the weights of the new ResBlock are randomly initialized. Thus, only a few weights of CT-ResNet are randomly initialized at each stage, so convergence is relatively easy. For example, it is feasible to use a fixed learning rate of 0.0001 for all layers without any decay.

Since new convolution layers would reduce the size of the feature maps, the system zero-pads two pixels for each new 3 × 3 layer. Therefore, all stages in the cascade training have the same output size, so that the training samples can be shared.

Figure 16 is an exemplary diagram of color image decoding according to one embodiment. The system can train CT-ResNet separately on different color channels, such as the red/green/blue (RGB) channels or the luma/blue-difference/red-difference (YCbCr) channels. 2,000 training images were used to generate the training data. In testing, after a compressed image is decoded (e.g., by JPEG 2000 or BPG), the trained CT-ResNet is applied to each channel of the decoded image separated into its RGB channels. The recovered channels are then fused to derive the final output. The present system may compress/decode images using JPEG 2000 (CR = 159) and BPG (QF = 40), and train CT-ResNet on the RGB channels and on the YCbCr channels.
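
A small sketch of this per-channel restoration and fusion is shown below, assuming a trained single-channel restoration model and a decoded image stored as an H × W × 3 array in [0, 1]; the function name and array conventions are illustrative.

```python
import numpy as np
import torch

def restore_decoded_rgb(model, decoded_rgb):
    """Apply a single-channel restoration CNN to each RGB channel of a
    decoded image, then merge the channels back into the final output."""
    model.eval()
    out_channels = []
    with torch.no_grad():
        for c in range(3):
            channel = np.ascontiguousarray(decoded_rgb[..., c], dtype=np.float32)
            x = torch.from_numpy(channel)[None, None]      # shape (1, 1, H, W)
            out_channels.append(model(x)[0, 0].numpy())
    return np.clip(np.stack(out_channels, axis=-1), 0.0, 1.0)
```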

Further denoising performance improvement is provided by using an edge-aware loss function instead of the conventional mean squared error (MSE), and by incorporating depthwise separable ResBlocks (DS-ResBlocks) into DN-ResNet. DN-DS-ResNet can be fine-tuned from DN-ResNet using the cascaded procedure described above: the ResBlocks in DN-ResNet are replaced by DS-ResBlocks stage by stage. This provides a further reduction in complexity and cost at an acceptable loss of accuracy.

Although the network size of the cascade-trained DN-ResNet is relatively small (e.g., about 150K parameters when cascading up to 13 layers), the network size can be further reduced by using a depthwise separable DN-ResNet.

Figure 17 is a diagram of a depthwise separable convolution according to an embodiment. Referring to Figure 17, a standard convolution layer 1702 is decomposed into a depthwise convolution 1704 and a 1 × 1 pointwise convolution 1706. The standard convolution layer 1702 has M input channels and N K × K filters. In the depthwise separable version, the standard convolution over the M input channels is replaced by M depthwise convolution layers, each with one K × K filter, followed by N 1 × 1 convolution layers, each with M input channels.

Therefore, for a W × H feature map, the number of multiplications is reduced from M · N · K · K · W · H to M · K · K · W · H + M · N · W · H, and the reduction in computation is 1/N + 1/K².

Figure 18 is a diagram of ResBlocks according to an embodiment. Referring to Figure 18, a depthwise separable ResBlock (DS-ResBlock) 1802 and a ResBlock 1804 similar to the ResBlock 1300 of Figure 13 in DN-ResNet are shown. The standard convolution layers (Conv) in ResBlock 1804 are replaced by depthwise separable convolution layers (DW-Conv) in DS-ResBlock 1802.

In DN-ResNet, the convolution layers in a ResBlock have 32 3 × 3 filters, and the number of input channels is also 32. In ResBlock 1804, taking a 640 × 480 feature map as an example, the number of multiplications is therefore 32 × 32 × 3 × 3 × 640 × 480.

In DS-ResBlock 1802, the number of multiplications is 32 × 3 × 3 × 640 × 480 + 32 × 32 × 640 × 480.

Therefore, the computational cost of DS-ResBlock 1802 is about 6 times lower than that of ResBlock 1804.
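
A PyTorch sketch of a DS-ResBlock of this kind is given below, factoring each 3 × 3 convolution into a depthwise 3 × 3 convolution (groups equal to the channel count) followed by a 1 × 1 pointwise convolution. The 32-channel setting follows the text; the module layout itself is an illustrative assumption.

```python
import torch.nn as nn

def ds_conv(channels, kernel_size=3):
    # Depthwise KxK convolution followed by a 1x1 pointwise convolution.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size,
                  padding=kernel_size // 2, groups=channels),
        nn.Conv2d(channels, channels, 1),
    )

class DSResBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            ds_conv(channels), nn.ReLU(inplace=True), ds_conv(channels))

    def forward(self, x):
        return x + self.body(x)
```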

It is possible to apply the same cascade training procedure to construct DN-DS-ResNet by cascading DS-ResBlocks. However, since the weights are randomly initialized in cascade training, this would require substantial training time. As disclosed herein, another way to train DN-DS-ResNet based on an existing DN-ResNet is referred to as "cascade evolution."

Figure 19 is a diagram of cascade evolution according to an embodiment. Referring to Figure 19, given a DN-ResNet, one could obtain DN-DS-ResNet by replacing all ResBlocks 1902 with DS-ResBlocks 1904 and fine-tuning the entire network. If this is done in a single run, however, the fine-tuning does not converge well. Instead, the ResBlocks 1902 can be replaced one by one. In each fine-tuning stage (e.g., evolution stage 1, evolution stage 2, evolution stage 3, and so on), only one ResBlock is replaced by a DS-ResBlock, followed by fine-tuning, as shown in Figure 19.

Similar to cascade training, the weights in the new DS-ResBlock are randomly initialized, and the weights in all other layers are inherited. The replacement starts at the end of the network to ensure less impact on the whole network. In each evolution stage, since most of the weights are inherited, convergence is relatively easy.
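
The stage-wise replacement could be sketched as follows, reusing the WeightedResBlock and DSResBlock classes sketched earlier; how the ResBlocks are located inside the network and the fine-tuning helper are assumptions made for illustration.

```python
def cascade_evolution(dn_resnet, finetune_fn):
    """Replace ResBlocks with DS-ResBlocks one at a time, starting from the
    end of the network, fine-tuning after each replacement. `dn_resnet` is
    assumed to be an nn.Sequential; `finetune_fn` runs a short training pass
    on the current network and returns it."""
    positions = [i for i, m in enumerate(dn_resnet)
                 if isinstance(m, WeightedResBlock)]
    for i in reversed(positions):              # start at the end of the network
        dn_resnet[i] = DSResBlock(channels=32)  # new block: random initialization
        dn_resnet = finetune_fn(dn_resnet)      # all other weights are inherited
    return dn_resnet
```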

Denoising networks typically aim to minimize the mean squared error (MSE), (1/N) · Σ_j (f(x)_j − y_j)², over the training set. An edge-aware MSE is provided herein, in which pixels on edges are given higher weight than non-edge pixels. The edge-aware loss function can be given in the form:

L = (1/N) · Σ_j [ (f(x)_j − y_j)² + w · M_j · (f(x)_j − y_j)² ]

where the sum runs over pixels j, M is the edge map, N is the total number of pixels, and w is a constant. The second term adds a constraint to the loss function. A difficulty in image denoising is that edges are harder to recover from a noisy image, especially when the noise level is high. With the above edge-aware loss function, this constraint makes edges less difficult to recover. In addition, since high-frequency information such as edges is more salient to human vision, improving the accuracy of edge pixels with this loss function helps perceptual quality.
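
One way to realize this loss is sketched below in PyTorch: an edge map is computed from the clean target with Sobel filters and used to up-weight the error at edge pixels. Using the gradient magnitude as M and a constant weight w follows the text; the normalization and the single-channel input assumption (tensors of shape [B, 1, H, W]) are illustrative.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)  # [[-1,-2,-1],[0,0,0],[1,2,1]]

def edge_map(y):
    # Gradient magnitude of the clean image, normalized to [0, 1].
    gx = F.conv2d(y, SOBEL_X, padding=1)
    gy = F.conv2d(y, SOBEL_Y, padding=1)
    g = torch.sqrt(gx ** 2 + gy ** 2)
    return g / (g.amax(dim=(2, 3), keepdim=True) + 1e-8)

def edge_aware_mse(pred, target, w=2.0):
    err = (pred - target) ** 2
    m = edge_map(target)
    # Plain MSE plus an extra penalty on edge pixels.
    return err.mean() + w * (m * err).mean()
```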

In experimental trials of image denoising, the PASCAL VOC 2010 dataset was used to generate the training samples. 1,000 test images were used to evaluate the performance of the DN-ResNet described above, while the remaining images were used for training. Random Gaussian/Poisson/Poisson-Gaussian noisy images were generated at different noise levels, and 33 × 33 noisy patches with the corresponding 17 × 17 clean patches were cropped. Different noise variances σ² were considered, with σ ∈ {10, 25, 50, 75}. Before corruption by Poisson or Poisson-Gaussian noise, the input image pixel values were scaled so that the maximum equals the peak value, with peak ∈ {1, 2, 4, 8}. For Poisson-Gaussian noise, σ ∈ {0.1, 0.2, 0.5, 1, 2, 3, 6, 12} and peak = 10σ.

For compressed image restoration, decoded images were obtained, and 33 × 33 decoded patches with the corresponding 17 × 17 restored patches were further extracted from the training set. PSNR was used to evaluate performance. The network was trained on the Y channel, but it can also be applied directly to the RGB channels without losing much quality.

DN-ResNet with 3 to 13 layers was tested on Gaussian, Poisson, and Poisson-Gaussian denoising using the PASCAL VOC dataset. These DN-ResNets were trained by cascading ResBlocks (e.g., ResBlock 1300) at known noise levels, with the MSE loss used for all models. The PSNR keeps increasing as more layers are used; from 3 to 13 layers, the PSNR increases by 0.4 dB to 0.5 dB for all σ and peak values. Although the deepest network shown is the 13-layer DN-ResNet, the accuracy can be further improved by cascading more layers. Cascade training was also compared with one-shot training, in which the 13-layer DN-ResNet is trained from unsupervised weight initialization. One-shot training of DN-ResNet-13 results in a PSNR 0.3 dB lower than cascade training across all tests. Since cascade training can be regarded as "partially supervised initialization," its convergence is easier than that of one-shot training based on unsupervised weight initialization. In Table 2 below, the best result in each row is obtained by the 13-layer cascade-trained DN-ResNet.

Table 2. PSNR (dB) of DN-ResNet with 3 to 13 layers, and of the 13-layer network with one-shot training ("13 layers os")

| Noise | σ/peak | 3 layers | 5 layers | 7 layers | 9 layers | 11 layers | 13 layers | 13 layers os |
| Parameters | | 57,184 | 75,616 | 94,048 | 112,480 | 130,912 | 149,344 | 149,344 |
| Gaussian | 10 | 34.43 | 34.56 | 34.71 | 34.80 | 34.93 | 34.99 | 34.70 |
| | 25 | 29.86 | 30.03 | 30.10 | 30.30 | 30.44 | 30.52 | 30.27 |
| | 50 | 26.86 | 27.05 | 27.22 | 27.29 | 27.38 | 27.50 | 27.14 |
| | 75 | 25.24 | 25.43 | 25.55 | 25.63 | 25.81 | 25.89 | 25.61 |
| Poisson | 1 | 22.51 | 22.66 | 22.74 | 22.88 | 22.95 | 23.06 | 22.80 |
| | 2 | 23.66 | 23.74 | 23.92 | 24.05 | 24.14 | 24.23 | 23.96 |
| | 4 | 24.67 | 24.80 | 24.91 | 25.14 | 25.27 | 25.39 | 25.01 |
| | 8 | 26.01 | 26.24 | 26.35 | 26.55 | 26.64 | 26.77 | 26.49 |
| Poisson-Gaussian | 0.1/1 | 22.11 | 22.27 | 22.36 | 22.50 | 22.65 | 22.73 | 22.30 |
| | 0.2/2 | 22.99 | 23.14 | 23.22 | 23.40 | 23.59 | 23.75 | 23.44 |
| | 0.5/5 | 24.54 | 24.61 | 24.77 | 24.90 | 25.00 | 25.10 | 24.78 |
| | 1/10 | 25.61 | 25.69 | 25.77 | 25.91 | 25.99 | 26.14 | 25.67 |
| | 2/20 | 26.59 | 26.70 | 26.89 | 26.99 | 27.14 | 27.29 | 26.88 |
| | 3/30 | 27.10 | 27.22 | 27.37 | 27.50 | 27.61 | 27.77 | 27.41 |
| | 6/60 | 27.87 | 27.98 | 28.16 | 28.32 | 28.48 | 28.59 | 28.11 |
| | 12/120 | 28.19 | 28.30 | 28.44 | 28.58 | 28.72 | 28.88 | 28.50 |

DN-ResNet was also trained with the different edge-aware loss functions described above and used for blind denoising. In contrast to non-blind denoising, in which separate networks are trained for each noise level, only one DN-ResNet is trained for blind denoising by mixing all Gaussian/Poisson/Poisson-Gaussian noise with different noise levels. As shown in Table 3, blind denoising with DN-ResNet degrades little compared with non-blind denoising. This trade-off is worthwhile because blind denoising does not require time-consuming noise level estimation. In addition, using the edge-aware loss function improves the PSNR by 0.1 dB to 0.15 dB and also enhances perceptual quality. The best configuration directly uses the gradient magnitude generated by the Sobel operator.

The Sobel operator is used in image processing and computer vision, particularly within edge detection algorithms, where it creates an image emphasizing edges. The operator uses two 3 × 3 kernels convolved with the original image to compute approximations of the derivatives, one for horizontal changes and one for vertical changes. If A is defined as the source image, and Gx and Gy are two images that at each point contain the horizontal and vertical derivative approximations respectively, the computations are:

Gx = [ -1 0 +1 ; -2 0 +2 ; -1 0 +1 ] * A  and  Gy = [ -1 -2 -1 ; 0 0 0 ; +1 +2 +1 ] * A

where * denotes the 2-D convolution operation. The final gradient map G can be obtained by

G = sqrt(Gx² + Gy²)

Table 3. PSNR (dB) of the 13-layer DN-ResNet for non-blind denoising, blind denoising, and blind denoising with the edge-aware losses 'e-a' and 'e-b'

| Noise | σ/peak | Non-blind | Blind | Blind + 'e-a' | Blind + 'e-b' |
| Parameters | - | 149,344 | 149,344 | 149,344 | 149,344 |
| Gaussian | 10 | 34.99 | 34.88 | 35.07 | 35.05 |
| | 25 | 30.52 | 30.44 | 30.59 | 30.59 |
| | 50 | 27.50 | 27.44 | 27.58 | 27.52 |
| | 75 | 25.89 | 25.80 | 25.94 | 25.87 |
| Poisson | 1 | 23.06 | 22.99 | 23.14 | 23.07 |
| | 2 | 24.23 | 24.17 | 24.31 | 24.25 |
| | 4 | 25.39 | 25.33 | 25.50 | 25.41 |
| | 8 | 26.77 | 26.72 | 26.88 | 26.81 |
| Poisson-Gaussian | 0.1/1 | 22.73 | 22.61 | 22.74 | 22.69 |
| | 0.2/2 | 23.75 | 23.69 | 23.78 | 23.76 |
| | 0.5/5 | 25.10 | 24.98 | 25.12 | 25.08 |
| | 1/10 | 26.14 | 26.07 | 26.19 | 26.11 |
| | 2/20 | 27.29 | 27.18 | 27.30 | 27.26 |
| | 3/30 | 27.77 | 27.64 | 27.78 | 27.70 |
| | 6/60 | 28.59 | 28.51 | 28.64 | 28.55 |
| | 12/120 | 28.88 | 28.80 | 28.93 | 28.88 |

DN-ResNet was also constructed from different types of ResBlocks for the blind denoising network. As shown in Table 4, compared with the ResBlock-based DN-ResNet, building DN-DS-ResNet from DS-ResBlocks reduces the PSNR by less than 0.1 dB, while the computational cost (e.g., the number of multiply-accumulate operations (MACs)) and the network size are significantly reduced. This indicates the effectiveness of the DS-ResBlock described above in improving network efficiency. In addition, if DN-DS-ResNet is built by one-shot fine-tuning of DN-ResNet, the accuracy drops significantly, which indicates the effectiveness of the cascade evolution described above. Using DS-ResBlocks together with the edge-aware loss function achieves high accuracy at a lower computational cost.

Table 4. Blind denoising networks built from different ResBlock types

| | DN | DN-DS | DN-DS-os | DN + 'e-a' | DN-DS + 'e-a' |
| Parameters | 149,344 | 63,728 | 63,728 | 149,344 | 63,728 |
| MACs (billions) | 45.878 | 19.582 | 19.582 | 45.878 | 19.582 |

The disclosed DN-ResNet and DN-DS-ResNet achieve state-of-the-art performance on Gaussian and Poisson/Poisson-Gaussian denoising, with better efficiency and smaller model size than existing deep CNNs. The disclosed networks are effective for both known and unknown noise levels.

Besides image denoising, the disclosed DN-ResNet can also be applied to compressed image restoration. For all compression methods, including JPEG, JPEG 2000, and BPG, DN-ResNet improves the quality of the decoded images. Gains of 1 dB to 2 dB, 0.5 dB to 1.5 dB, and 0.3 dB to 0.5 dB are observed for JPEG, JPEG 2000, and BPG, respectively.

DN-ResNet for image denoising achieves both high accuracy and high efficiency. Cascade training is efficient and effective for training efficient deep ResNets. Denoising accuracy can be further enhanced by adding learnable weights at the skip connections.

Image SR generates an HR image from a given LR image by attempting to recover the lost information. Recently, deep CNNs have been deployed to solve the image super-resolution problem because they exhibit significant accuracy improvements.

Due to the lack of real-world LR-HR patches, images are typically downsampled bicubically to create LR-HR training pairs. This produces clean, noise-free LR images. Unfortunately, in real-world situations where the image comes directly from a camera, there will always be additional noise or unknown degradations. Therefore, state-of-the-art CNN methods trained only to reconstruct artificially bicubic-downsampled images may produce severe artifacts when applied to real-world images. A method of training a real-world SR system that gives SR output with good perceptual quality is disclosed herein.

Figure 20 illustrates a flowchart 2000 of a method for real-world super-resolution according to an embodiment. At 2002, a dataset for real-world SR is generated. The dataset may be generated either by downsampling low-quality images with a generic degradation model to serve as the LR images and directly using the corresponding high-quality images as the HR images, or by directly using the low-quality images as the LR images and super-resolving high-quality images from them with a generic SR network to serve as the HR images.

To train the generic SR network, an SR dataset may be generated based on multiple degradations arising from image processing artifacts. More specifically, an LR image x is generated from an HR image y according to a degradation model formulated as x = D(y * k) + n, where D is a downsampling operation, k is a blur kernel, and n is noise. The noise need not be additive.

Various downsampling methods are considered, such as nearest-neighbor, bilinear, bicubic, and Lanczos. When generating LR patches, the downsampling method is selected at random.

The blur kernel setting for SR is usually simple. The most commonly used isotropic Gaussian blur kernel, parameterized by its standard deviation, can be used. The standard deviation of the Gaussian kernel may be randomly sampled in the range [0.2, 3], and the kernel size may be fixed at 15 × 15.

Most real-world LR images are noisy due to image processing artifacts. Some real-world noise contains Gaussian, Poisson, or Poisson-Gaussian components. Therefore, when generating LR images, Gaussian, Poisson, or Poisson-Gaussian noise may be selected at random. The parameters are based on the artifacts, with the σ of the Gaussian noise in the range [0, 25] and the peak of the Poisson noise sampled uniformly from [50, 150]. When generating Poisson-Gaussian noise, a similar Poisson peak range may be used, but the Gaussian σ may be reduced to the range [0, 5].
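
A sketch of this generic degradation pipeline, x = D(y * k) + n, is given below: a random downsampling method, an isotropic Gaussian blur with σ sampled in [0.2, 3], and randomly chosen Gaussian/Poisson/Poisson-Gaussian noise. Images are assumed to be H × W × 3 float arrays in [0, 1]; the truncated Gaussian filter stands in for the fixed 15 × 15 kernel, and the library choices are illustrative.

```python
import random
import numpy as np
import cv2
from scipy.ndimage import gaussian_filter

DOWNSAMPLERS = [cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC, cv2.INTER_LANCZOS4]

def degrade(hr, scale=4):
    # Blur: isotropic Gaussian kernel, sigma sampled uniformly from [0.2, 3].
    sigma = random.uniform(0.2, 3.0)
    blurred = gaussian_filter(hr, sigma=(sigma, sigma, 0)).astype(np.float32)
    # Downsample with a randomly chosen method.
    h, w = hr.shape[:2]
    lr = cv2.resize(blurred, (w // scale, h // scale),
                    interpolation=random.choice(DOWNSAMPLERS))
    # Noise: Gaussian (sigma in [0, 25]/255), Poisson (peak in [50, 150]),
    # or Poisson-Gaussian (same peak range, Gaussian sigma in [0, 5]/255).
    kind = random.choice(["gaussian", "poisson", "poisson-gaussian"])
    if kind in ("poisson", "poisson-gaussian"):
        peak = random.uniform(50, 150)
        lr = np.random.poisson(np.clip(lr, 0, 1) * peak) / peak
    if kind in ("gaussian", "poisson-gaussian"):
        s = random.uniform(0, 25 if kind == "gaussian" else 5) / 255.0
        lr = lr + np.random.normal(0, s, lr.shape)
    return np.clip(lr, 0.0, 1.0).astype(np.float32)
```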

In real-world SR, a high-quality dataset of LR-HR images that matches the domain of the target imaging device may be critical to the performance of the SR network. The disclosed method and system provide an effective way of creating a high-quality mobile SR dataset. More specifically, in the absence of an accurate HR-LR degradation model, the disclosed method and system use pairs of images of the same scale to create a new mobile SR dataset. Given an existing set of paired mobile and digital single lens reflex (DSLR) images, which are relatively easy to obtain in the real world, the disclosed method and system super-resolve the DSLR images to create the HR images. The LR images of the dataset are the corresponding mobile patches.

There are multiple ways of super-resolving the DSLR images. Applying simple bicubic upsampling to super-resolve the DSLR patches can already generate high-quality HR patches for training the mobile SR network. The best performance is achieved by using a generic model trained on the generic SR dataset, in which the LR-HR pairs are generated by randomly chosen downsampling/blur/noise that simulates multiple degradations.

At 2004, GANs are trained to generate super-resolved images, where two GANs are trained by using different discriminators and hyperparameters within the enhanced super-resolution GAN (ESRGAN) framework. This provides two SR networks with complementary characteristics. The SR networks may be trained with a GAN (e.g., the ESRGAN framework) while using RCAN rather than ResNet as the generator. During testing, the generator (RCAN) can be used directly to estimate the HR image from a given LR image. To further improve perceptual quality, two RCANs may be trained by using different discriminators and hyperparameters in ESRGAN, giving two SR networks with complementary characteristics. The final SR prediction is a pixel-wise ensemble of these two RCANs.

Figure 21 illustrates a diagram of the SR method according to an embodiment. The SR dataset 2102 is input to two GANs for training. The first GAN 2104 processes the LR image 2106 in an RCAN generator 2108, which produces a fake (estimated) HR image 2110. The real HR image 2112 and the estimated HR image 2110 are processed by a standard discriminator 2114, and GAN 2104 produces a real/fake decision 2116. The second GAN 2120 processes the LR image 2122 in an RCAN generator 2124, which produces a fake (estimated) HR image 2126. The real HR image 2128 and the estimated HR image 2126 are processed by a relativistic discriminator 2130, and GAN 2120 produces a real/fake decision 2132.

Figure 22 illustrates a diagram of the RCAN 2200 according to an embodiment. RCAN may be based on a residual-in-residual (RIR) structure, which contains several residual groups with long skip connections. Each residual group contains a number of residual blocks (ResBlocks) with short skip connections. In each ResBlock, a channel attention mechanism may be used to adaptively rescale channel-wise features by considering the interdependencies among channels. When training the GAN, RCAN may be used as the generator.

Generative adversarial networks (e.g., the super-resolution GAN (SRGAN)) exploit the strength of GANs to model the space of natural images, and use perceptual and adversarial losses to guide the SR network to prefer output images that reside on the manifold of natural images. Several modifications of the perceptually driven GAN-based approach of SRGAN are provided herein. In one embodiment, the ESRGAN framework is used because its relativistic discriminator is able to create sharper edges and more realistic texture details.

The disclosed method and system may train two GANs based on the generated real-world SR dataset. The generators of both GANs may be implemented by RCAN. Two discriminators are used: (1) a relativistic discriminator, which predicts whether the real image (HR image) is more realistic than the fake image (HR output); and (2) a standard discriminator, which simply predicts that a real image is real and that a fake image is fake.

During training, the loss function L_G of the relativistic generator combines an L1 image loss, a perceptual loss L_percep, and an adversarial loss L_adv, and takes the form:

L_G = L_percep + λ · L_adv + η · L_1

L_1 = E[ ||G(x) − y||_1 ] computes the L1 distance between the super-resolved image G(x) produced by the RCAN generator network G and the ground-truth HR image y, where E[·] denotes averaging over all images in the mini-batch. The perceptual loss L_percep computes the feature-map distance between G(x) and y using a pre-trained 19-layer VGG network. The adversarial loss L_adv is based on the relativistic GAN discriminator and is defined as:

L_adv = −E_xr[ log(1 − D_Ra(x_r, x_f)) ] − E_xf[ log(D_Ra(x_f, x_r)) ]

where D_Ra(x_r, x_f) = σ(C(x_r) − E[C(x_f)]) denotes the relativistic GAN discriminator network, C is the non-transformed discriminator output, and σ is the sigmoid function. D_Ra predicts whether a real image is more realistic than a fake image, rather than deciding whether the input image is absolutely real or fake. If the output image from the generator network G is denoted x_f and the corresponding real image x_r, the corresponding discriminator loss function can be defined as:

L_D = −E_xr[ log(D_Ra(x_r, x_f)) ] − E_xf[ log(1 − D_Ra(x_f, x_r)) ]

The hyperparameters λ and η determine the contributions of the different loss components to the final loss function. The parameter η can be increased to reduce the quantitative error of the estimate, while increasing the adversarial loss weight improves the perceptual quality of the result. The other GAN can be trained with the RCAN generator but with a different generator loss based on the standard GAN, which can be written as:

L'_G = L_percep + λ · L'_adv + η · L_1

where L'_adv is the standard GAN-based adversarial loss, whose corresponding discriminator can be described as D(x) = σ(C(x)).

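For concreteness, the two adversarial losses could be written as in the following sketch, with c_real and c_fake denoting the raw (non-transformed) discriminator outputs C(x_r) and C(x_f) on a batch; this follows the relativistic-average formulation referenced above and is illustrative rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def relativistic_adv_losses(c_real, c_fake):
    """Generator and discriminator losses for the relativistic GAN."""
    d_rf = torch.sigmoid(c_real - c_fake.mean())   # D_Ra(x_r, x_f)
    d_fr = torch.sigmoid(c_fake - c_real.mean())   # D_Ra(x_f, x_r)
    eps = 1e-8
    g_loss = -(torch.log(1 - d_rf + eps).mean() + torch.log(d_fr + eps).mean())
    d_loss = -(torch.log(d_rf + eps).mean() + torch.log(1 - d_fr + eps).mean())
    return g_loss, d_loss

def standard_adv_losses(c_real, c_fake):
    """Standard (non-relativistic) GAN losses with D(x) = sigmoid(C(x))."""
    g_loss = F.binary_cross_entropy_with_logits(c_fake, torch.ones_like(c_fake))
    d_loss = (F.binary_cross_entropy_with_logits(c_real, torch.ones_like(c_real))
              + F.binary_cross_entropy_with_logits(c_fake, torch.zeros_like(c_fake)))
    return g_loss, d_loss
```
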
At 2006, the outputs from the trained GANs are fused. The outputs may be fused according to an illuminance threshold. If the illuminance level of the output of the second GAN (the GAN using the relativistic discriminator) is lower than the illuminance threshold, a pixel-wise ensemble of the two GAN outputs is used as the final output. If the illuminance level of the output of the second GAN is higher than the illuminance threshold, the output of the second GAN is used as the final output.

The final SR prediction may be a pixel-wise ensemble of the two RCANs. The SR output generated by the GAN with the relativistic discriminator shows good perceptual quality in high-frequency regions. In contrast, the SR output generated by the GAN with the standard discriminator produces fewer artifacts in the smooth regions of some low-illuminance images. The fusion combines the SR estimates of these two GANs. A selective averaging technique may be applied based on the median luminance of all pixels in the image in order to improve visual quality on low-illuminance images. Let y_Ra be the HR output from the GAN trained with the relativistic GAN loss, and y_std be the HR output from the RCAN model trained with the standard GAN loss function. The fused output image y_fused is derived as:

y_fused = (y_Ra + y_std) / 2,  if m < T
y_fused = y_Ra,  otherwise

where m is the median of the pixel intensity values of all pixels in the Y (luminance) component of the YCbCr color-space representation, and T is a constant in [0, 1]. Fusing two GAN models trained with different adversarial losses, and therefore with complementary effects, keeps the perceptual quality of the two fused images close. This ensures that artifacts can be reduced in some regions of low-illuminance images without sacrificing the overall perceptual quality.
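
A sketch of this selective fusion rule is shown below, assuming both outputs are RGB arrays in [0, 1]; the luminance conversion uses the standard BT.601 weights and the threshold value is illustrative.

```python
import numpy as np

def fuse_outputs(sr_relativistic, sr_standard, threshold=0.5):
    """Pixel-wise ensemble for dark images, relativistic-GAN output otherwise."""
    # Median of the Y (luma) component of the relativistic-GAN output.
    y = (0.299 * sr_relativistic[..., 0]
         + 0.587 * sr_relativistic[..., 1]
         + 0.114 * sr_relativistic[..., 2])
    m = np.median(y)
    if m < threshold:
        return 0.5 * (sr_relativistic + sr_standard)
    return sr_relativistic
```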

圖23說明根據一實施例的融合來自訓練GAN的結果的圖式。影像2302輸入至產生具有較少偽影的平滑輸出的第一GAN 2304中(例如,使用標準抽選器),且輸入至產生具有偽影的銳利輸出的第二GAN 2306中(例如,使用相對抽選器)。兩個輸出融合至逐像素集合2308中以產生輸出2310。Figure 23 illustrates a diagram of fusing results from training a GAN according to an embodiment. The image 2302 is input to the first GAN 2304 that produces a smooth output with less artifacts (for example, using a standard decimator), and input to the second GAN 2306 that produces a sharp output with artifacts (for example, using relative decimation)器). The two outputs are merged into a pixel-by-pixel set 2308 to produce an output 2310.

上文關於本揭露內容的實施例所描述的步驟及/或操作可取決於具體實施例及/或實施方案按不同次序或並行或針對不同時期同時發生,如將由所屬技術領域中具有通常知識者理解。不同實施例可按不同次序或藉由不同方式或手段進行動作。如將由所屬技術領域中具有通常知識者理解,一些圖式為所進行的動作的簡化表示,其在本文中的描述簡化概覽,且真實世界實施方案將複雜得多、需要更多階段及/或組件且亦將取決於特定實施方案的要求而變化。為了簡化表示,這些圖式並未展示其他所需步驟,此是由於這些步驟可能是於所屬技術領域中具有通常知識者已知及理解的,且可不與本發明描述切合及/或並非對本發明描述有幫助。The steps and/or operations described above in the embodiments of the present disclosure may occur in a different order or in parallel or simultaneously for different periods depending on the specific embodiment and/or implementation, such as those with ordinary knowledge in the technical field. understand. Different embodiments may perform actions in different orders or by different methods or means. As will be understood by those with ordinary knowledge in the technical field, some of the diagrams are simplified representations of the actions performed, and their descriptions in this article simplify the overview, and real-world implementations will be much more complicated, require more stages, and/or The components will also vary depending on the requirements of a particular implementation. In order to simplify the representation, these drawings do not show other required steps. This is because these steps may be known and understood by those with ordinary knowledge in the art, and may not be consistent with the description of the present invention and/or are not relevant to the present invention. The description is helpful.

Similarly, some drawings are simplified block diagrams that show only pertinent components, and some of these components merely represent a function and/or operation well known in the field rather than an actual piece of hardware, as will be understood by those of ordinary skill in the art. In such cases, some or all of the components/modules may be implemented or provided in a variety of ways and/or combinations of ways, such as at least partially in firmware and/or hardware, including, but not limited to, one or more application-specific integrated circuits ("ASICs"), standard integrated circuits, controllers executing appropriate instructions (including microcontrollers and/or embedded controllers), field-programmable gate arrays ("FPGAs"), complex programmable logic devices ("CPLDs"), and the like. Some or all of the system components and/or data structures may also be stored as content (e.g., as executable or other machine-readable software instructions or structured data) on a non-transitory computer-readable medium (e.g., a hard disk, a memory, a computer network or cellular wireless network or other data transmission medium, or a portable media article to be read by an appropriate drive or via an appropriate connection, such as a DVD or flash memory device) so as to enable or configure the computer-readable medium and/or one or more associated computing systems or devices to execute or otherwise use or provide the content to perform at least some of the described techniques.

One or more processors, simple microcontrollers, controllers, and the like, whether alone or in a multi-processing arrangement, may be employed to execute sequences of instructions stored on non-transitory computer-readable media to implement embodiments of the present disclosure. In some embodiments, hard-wired circuitry may be used in place of, or in combination with, software instructions. Thus, embodiments of the present disclosure are not limited to any specific combination of hardware circuitry, firmware, and/or software.

The term "computer-readable medium" as used herein refers to any medium that stores instructions which may be provided to a processor for execution. Such a medium may take many forms, including but not limited to non-volatile media and volatile media. Common forms of non-transitory computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, a CD-ROM, any other optical medium, a punch card, paper tape, any other physical medium with patterns of holes, a RAM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, or any other medium on which instructions executable by a processor are stored.

Some embodiments of the present disclosure may be implemented, at least in part, on a portable device. "Portable device" and/or "mobile device" as used herein refers to any portable or movable electronic device having the capability of receiving wireless signals, including, but not limited to, multimedia players, communication devices, computing devices, navigation devices, and the like. Thus, mobile devices include (but are not limited to) user equipment (UE), laptops, tablet computers, portable digital assistants (PDAs), mp3 players, handheld PCs, instant messaging devices (IMDs), cellular telephones, global navigational satellite system (GNSS) receivers, watches, or any such device which can be worn and/or carried on one's person.

In view of the present disclosure, various embodiments of the present disclosure may be implemented in an integrated circuit (IC), also called a microchip, silicon chip, computer chip, or just "chip," as would be understood by those of ordinary skill in the art. Such an IC may be, for example, a broadband and/or baseband modem chip.

While several embodiments have been described, it will be understood that various modifications can be made without departing from the scope of the present disclosure. Thus, it will be apparent to those of ordinary skill in the art that the present disclosure is not limited to any of the embodiments described herein, but rather has a coverage defined only by the appended claims and their equivalents.

110, 120, 130, 140, 150, 205, 210, 220, 230, 240, 250, 255, 505, 510, 520, 530, 540, 550, 565, 1050, 1060, 2002, 2004, 2006: steps
310: top
410: first layer
413: second layer
415: third layer
710: filter
720: weights
900: apparatus
910: processor
920: non-transitory computer-readable medium
1200: conventional ResBlock
1202, 1204: BN layers
1206: ReLU layer
1300: simplified ResBlock
1400: weighted ResBlock
1402: learnable weighted skip connection
1404: scale layer
1500: cascade training system
1702: standard convolution layer
1704: depthwise convolution
1706: pointwise convolution
1802: depthwise separable ResBlock
1804, 1902: ResBlock
1904: DS-ResBlock
2000: flowchart
2102: SR dataset
2104, 2304: first GAN
2106, 2122: LR image
2108, 2124: RCAN generator
2110, 2126: estimated HR image
2112, 2128: real HR image
2114: standard discriminator
2116, 2132: real/fake decision
2120, 2306: second GAN
2130: relativistic discriminator
2200: RCAN
2302: image
2308: pixel-wise ensemble
2310: output

The above and other aspects, features, and advantages of certain embodiments of the present disclosure will be more apparent from the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates an exemplary block diagram of a method for constructing a cascade-trained super-resolution convolutional neural network (CT-SRCNN), according to one embodiment.
FIG. 2 illustrates an exemplary diagram of cascade training, according to one embodiment.
FIGS. 3A and 3B illustrate some of the differences between existing training methods and cascade training according to one embodiment.
FIGS. 4A and 4B illustrate the starting CNN and the final CNN, respectively, after cascade training according to one embodiment.
FIG. 5 illustrates an exemplary diagram of cascade network trimming, according to one embodiment.
FIGS. 6A and 6B illustrate some of the differences between network trimming methods according to one embodiment.
FIG. 7 illustrates an exemplary diagram for performing filter trimming, according to one embodiment.
FIGS. 8A and 8B respectively illustrate some of the differences between dilated convolution and conventional convolution according to one embodiment.
FIG. 9 illustrates an exemplary diagram of the present apparatus, according to one embodiment.
FIG. 10 illustrates an exemplary flowchart for manufacturing and testing the present apparatus, according to one embodiment.
FIG. 11 is an exemplary diagram illustrating the convergence speed of cascade-trained CNNs versus non-cascade-trained CNNs, according to one embodiment.
FIG. 12 is an exemplary diagram of a conventional ResBlock, according to one embodiment.
FIG. 13 is an exemplary diagram of a simplified ResBlock, according to one embodiment.
FIG. 14 is an exemplary diagram of a weighted ResBlock, according to one embodiment.
FIG. 15 is an exemplary diagram of a cascade training system, according to one embodiment.
FIG. 16 is an exemplary diagram of color image decoding, according to one embodiment.
FIG. 17 is an exemplary diagram of a depthwise separable convolution, according to an embodiment.
FIG. 18 is an exemplary diagram of a ResBlock, according to one embodiment.
FIG. 19 is an exemplary diagram of cascade evolution, according to one embodiment.
FIG. 20 illustrates a flowchart of a method for real-world SR, according to an embodiment.
FIG. 21 illustrates a diagram of an SR method, according to an embodiment.
FIG. 22 illustrates a diagram of a residual channel attention network (RCAN), according to an embodiment.
FIG. 23 illustrates a diagram of fusing results from trained GANs, according to an embodiment.

2000: flowchart

2002, 2004, 2006: steps

Claims (20)

1. A method of training networks, comprising:
generating a dataset for real-world super resolution (SR);
training a first generative adversarial network (GAN);
training a second GAN; and
fusing an output of the first GAN and an output of the second GAN.

2. The method of claim 1, wherein generating the dataset comprises:
down-sampling low-quality images into low-resolution (LR) images by a generic degradation model; and
directly using high-quality images corresponding to the LR images as high-resolution (HR) images.

3. The method of claim 1, wherein generating the dataset comprises:
directly using low-quality images as low-resolution (LR) images; and
super-resolving high-quality images from the low-quality images for use as high-resolution (HR) images.

4. The method of claim 1, wherein the first GAN is trained using a standard discriminator.

5. The method of claim 4, wherein the second GAN is trained using a relativistic discriminator.

6. The method of claim 1, wherein the first GAN and the second GAN are trained using a residual channel attention network (RCAN).

7. The method of claim 6, wherein the RCAN is based on a residual-in-residual (RIR) structure.

8. The method of claim 1, wherein the first GAN and the second GAN comprise enhanced super-resolution generative adversarial networks (ESRGANs).

9. The method of claim 1, wherein the output of the first GAN and the output of the second GAN are fused according to an illuminance threshold.

10. The method of claim 9, wherein the second GAN is trained using a relativistic discriminator, and
wherein the output of the first GAN and the output of the second GAN are fused when an illuminance level of the output of the second GAN is lower than the illuminance threshold.
11. An apparatus for training networks, comprising:
one or more non-transitory computer-readable media; and
at least one processor which, when executing instructions stored on the one or more non-transitory computer-readable media, performs the steps of:
generating a dataset for real-world super resolution (SR);
training a first generative adversarial network (GAN);
training a second GAN; and
fusing an output of the first GAN and an output of the second GAN.

12. The apparatus of claim 11, wherein generating the dataset comprises:
down-sampling low-quality images into low-resolution (LR) images by a generic degradation model; and
directly using high-quality images corresponding to the LR images as high-resolution (HR) images.

13. The apparatus of claim 11, wherein generating the dataset comprises:
directly using low-quality images as low-resolution (LR) images; and
super-resolving high-quality images from the low-quality images for use as high-resolution (HR) images.

14. The apparatus of claim 11, wherein the first GAN is trained using a standard discriminator.

15. The apparatus of claim 14, wherein the second GAN is trained using a relativistic discriminator.

16. The apparatus of claim 11, wherein the first GAN and the second GAN are trained using a residual channel attention network (RCAN).

17. The apparatus of claim 16, wherein the RCAN is based on a residual-in-residual (RIR) structure.

18. The apparatus of claim 11, wherein the first GAN and the second GAN comprise enhanced super-resolution generative adversarial networks (ESRGANs).

19. The apparatus of claim 11, wherein the output of the first GAN and the output of the second GAN are fused according to an illuminance threshold.

20. The apparatus of claim 19, wherein the second GAN is trained using a relativistic discriminator, and
wherein the output of the first GAN and the output of the second GAN are fused when an illuminance level of the output of the second GAN is lower than the illuminance threshold.
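Claims 4, 5, 14, and 15 distinguish a standard discriminator from a relativistic one. A minimal sketch of the two discriminator losses is given below, following the relativistic-average formulation commonly used in ESRGAN-style training; it illustrates that general technique under stated assumptions and is not code from the disclosure. The logit tensors are assumed to be raw (pre-sigmoid) discriminator outputs for a batch of real HR images and a batch of generated SR images.

```python
import torch
import torch.nn.functional as F

def standard_discriminator_loss(d_real, d_fake):
    """Standard GAN discriminator loss on raw logits (real -> 1, fake -> 0)."""
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)
    return (F.binary_cross_entropy_with_logits(d_real, ones) +
            F.binary_cross_entropy_with_logits(d_fake, zeros))

def relativistic_discriminator_loss(d_real, d_fake):
    """Relativistic average discriminator loss: each score is compared with the
    mean score of the opposite class instead of an absolute real/fake target."""
    real_rel = d_real - d_fake.mean()
    fake_rel = d_fake - d_real.mean()
    ones, zeros = torch.ones_like(real_rel), torch.zeros_like(fake_rel)
    return (F.binary_cross_entropy_with_logits(real_rel, ones) +
            F.binary_cross_entropy_with_logits(fake_rel, zeros))
```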
TW110112201A 2020-04-07 2021-04-01 Method and apparatus of training networks TW202139070A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063006390P 2020-04-07 2020-04-07
US63/006,390 2020-04-07
US17/133,785 US11790489B2 (en) 2020-04-07 2020-12-24 Systems and method of training networks for real-world super resolution with unknown degradations
US17/133,785 2020-12-24

Publications (1)

Publication Number Publication Date
TW202139070A true TW202139070A (en) 2021-10-16

Family

ID=77749814

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110112201A TW202139070A (en) 2020-04-07 2021-04-01 Method and apparatus of training networks

Country Status (2)

Country Link
DE (1) DE102021103189A1 (en)
TW (1) TW202139070A (en)

Also Published As

Publication number Publication date
DE102021103189A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
US11354577B2 (en) System and method for designing efficient super resolution deep convolutional neural networks by cascade network training, cascade network trimming, and dilated convolutions
TWI748041B (en) Apparatus and method for designing super resolution deep convolutional neural networks
US11790489B2 (en) Systems and method of training networks for real-world super resolution with unknown degradations
CN111028177B (en) Edge-based deep learning image motion blur removing method
Pan et al. Learning dual convolutional neural networks for low-level vision
CN106991646B (en) Image super-resolution method based on dense connection network
Cui et al. Deep network cascade for image super-resolution
CN111127346A (en) Multi-level image restoration method based on partial-to-integral attention mechanism
Li et al. FilterNet: Adaptive information filtering network for accurate and fast image super-resolution
Ren et al. CT-SRCNN: cascade trained and trimmed deep convolutional neural networks for image super resolution
CN113129212B (en) Image super-resolution reconstruction method and device, terminal device and storage medium
Anwar et al. Attention-based real image restoration
Zheng et al. T-net: Deep stacked scale-iteration network for image dehazing
Yang et al. Variation learning guided convolutional network for image interpolation
Cao et al. Single image motion deblurring with reduced ringing effects using variational Bayesian estimation
CN117333387A (en) Unsupervised low-light image enhancement method based on frequency domain sensing and illumination self-adaption
Dong et al. Sparsity fine tuning in wavelet domain with application to compressive image reconstruction
TW202139070A (en) Method and apparatus of training networks
Ren et al. Compressed image restoration via deep deblocker driven unified framework
Cai et al. Real-time super-resolution for real-world images on mobile devices
CN115272113A (en) Image deblurring method based on multi-scale frequency separation network
Zhang et al. CFPNet: A denoising network for complex frequency band signal processing
CN109345453B (en) Image super-resolution reconstruction system and method utilizing standardization group sparse regularization
Zeng et al. Deep graph laplacian regularization
Li Image deblurring based on double blur kernel