CN117875382A - Computing storage device of energy-efficient deep neural network training system - Google Patents

Computing storage device of energy-efficient deep neural network training system

Info

Publication number
CN117875382A
Authority
CN
China
Prior art keywords
training
data
training data
dram
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311309342.7A
Other languages
Chinese (zh)
Inventor
金钟律
凯文·唐
李世举
林炯辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SK Hynix Inc
Original Assignee
SK Hynix Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US18/457,171 external-priority patent/US20240127056A1/en
Application filed by SK Hynix Inc filed Critical SK Hynix Inc
Publication of CN117875382A publication Critical patent/CN117875382A/en
Pending legal-status Critical Current

Landscapes

  • Image Processing (AREA)

Abstract

The present disclosure relates to a training system comprising: a Dynamic Random Access Memory (DRAM) configured to buffer training data; a Central Processing Unit (CPU) coupled to the DRAM and configured to downsample the training data and provide the downsampled training data to the DRAM; a computing storage device including a Solid State Drive (SSD) and a Field Programmable Gate Array (FPGA), and configured to perform dimension reduction on the downsampled training data to generate a training data batch; and a Graphics Processing Unit (GPU) configured to perform training on the training data batch.

Description

Computing storage device of energy-efficient deep neural network training system
Cross Reference to Related Applications
The present application claims the benefit of U.S. provisional patent application Ser. No. 63/415,476, filed on October 12, 2022, and U.S. non-provisional patent application Ser. No. 18/457,171, filed on August 28, 2023, both of which are incorporated herein by reference in their entireties.
Technical Field
Embodiments of the present disclosure relate to a scheme for processing data in a deep neural network.
Background
Deep Neural Networks (DNNs) play a key role in numerous fields such as computer vision, natural language processing, biomedical analysis, and robotics. However, their development and deployment are challenging. When training a DNN model on a large dataset or a dataset containing high-dimensional data, storing all training data in a Graphics Processing Unit (GPU) may become impractical due to the limited memory capacity of the GPU, which results in out-of-memory errors that prevent further training. To overcome this problem, the data may be partitioned and accessed in smaller buffered chunks. Nevertheless, even with data partitioning, limitations remain because the resulting memory-performance improvement is relatively small.
Reading data from memory is slower than processing data in a GPU, which makes accessing data from memory a bottleneck. This may slow down the training process and potentially lead to model convergence problems. The bottleneck is further exacerbated when multiple epochs of training are required or when hyper-parameters must be tuned. In such cases, the same data must be accessed repeatedly, so the slow memory access aggravates the performance bottleneck. This is called the "GPU memory capacity wall". As the size of the dataset and the complexity of the DNN model increase, the amount of memory required to store the data also increases.
To address the memory issues associated with training DNN models, one common approach is to distribute the training of each model across multiple GPUs. A unified architecture that accelerates distributed DNN training in heterogeneous GPU/CPU clusters has been considered. This approach involves splitting the dataset or model variables across GPUs, thereby shortening training time and improving performance. However, this may result in a linear increase in GPU and energy costs. Another recent approach is to use host Central Processing Unit (CPU) memory as a buffer to offload some of the upcoming tensors during training.
Against this background, embodiments of the present invention are proposed.
Disclosure of Invention
Aspects of the invention include a scheme to enhance the performance and energy efficiency of a training system such as a deep neural network.
In one aspect, a training system includes: a Dynamic Random Access Memory (DRAM) configured to buffer training data; a Central Processing Unit (CPU) coupled to the DRAM and configured to downsample the training data and provide downsampled training data to the DRAM; a computing storage device including a Solid State Drive (SSD) and a Field Programmable Gate Array (FPGA), and configured to perform dimension reduction on the downsampled training data to generate a training data batch; and a Graphics Processing Unit (GPU) configured to perform training on the training data batch.
In another aspect, a method of operating a training system includes: buffering training data by a Dynamic Random Access Memory (DRAM); downsampling, by a Central Processing Unit (CPU) coupled to the DRAM, the training data to provide downsampled training data to the DRAM; performing dimension reduction on the downsampled training data by a computing storage device coupled to the DRAM to generate a training data batch; and performing, by a Graphics Processing Unit (GPU), training on the training data batch.
Other aspects of the invention will become apparent from the description below.
Drawings
FIG. 1A is a diagram illustrating a DRAM buffered DNN training system.
FIG. 1B is a diagram illustrating a storage-buffered DNN training system.
FIG. 2 is a diagram illustrating a training system for computing storage buffering according to one embodiment of the invention.
Fig. 3 is a diagram illustrating a computing unit according to another embodiment of the present disclosure.
Fig. 4 is a diagram illustrating a tiled data format according to another embodiment of the present invention.
FIG. 5 is a diagram illustrating a computing storage prototype and training system test platform according to another embodiment of the present invention.
FIG. 6 is a flow chart illustrating operation of a training system for computing storage buffering according to another embodiment of the present invention.
Fig. 7 is a diagram illustrating a workflow for unconstrained scene text recognition according to another embodiment of the invention.
Fig. 8 is a diagram illustrating CNN model training in CDRNN according to another embodiment of the present invention.
Fig. 9 shows a test data set for testing a training system according to another embodiment of the invention.
Fig. 10 shows a workflow of a CDRNN according to another embodiment of the present invention.
FIGS. 11A-11C illustrate the run times of different workloads of various training systems.
Fig. 12A-12C illustrate comparison of workload performance of various training systems.
FIGS. 13A-13C illustrate comparison of workload performance for various training systems at different batch sizes.
FIG. 14 shows the accuracy of models with different training data sizes and workloads.
Fig. 15 shows an example of model prediction results.
Fig. 16 shows RP phase comparisons.
Fig. 17 shows a comparison of average power and energy consumption for different numbers of training samples under all workloads.
FIG. 18 illustrates workload run times for various training systems at different training data sizes.
Detailed Description
Hereinafter, various embodiments of the present disclosure are described in more detail with reference to the accompanying drawings. This disclosure may, however, be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will convey the scope of the disclosure to those skilled in the art. Furthermore, references herein to "an embodiment," "another embodiment," etc., do not necessarily refer to only one embodiment, and different references to any such phrases are not necessarily referring to the same embodiment. The term "embodiment" as used herein does not necessarily refer to all embodiments. Throughout this disclosure, like reference numerals refer to like parts throughout the drawings and detailed description.
The present disclosure may be implemented in numerous ways, including as a process, for example; an apparatus; a system; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor adapted to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these embodiments, or any other form that the invention may take, may be referred to as techniques. In general, the order of the operations of disclosed processes may be altered within the scope of the invention. Unless otherwise indicated, components such as processors or memories described as suitable for performing the tasks may be implemented as general purpose devices or circuit components configured or otherwise programmed to perform the tasks at a given time, or as specific devices or circuit components manufactured or pre-configured or pre-programmed to perform the tasks. As used herein, the term "processor" or the like refers to one or more devices, circuits, and/or processing cores adapted to process data, such as computer program instructions.
The methods, processes, and/or operations described herein may be performed by code or instructions executed by a computer, processor, controller, or other signal processing device. The computer, processor, controller, or other signal processing device may be those described herein or one other than the elements described herein. Because algorithms forming the basis of the methods (or the operation of a computer, processor, controller, or other signal processing apparatus) are described herein, code or instructions for implementing the operations of the method embodiments may transform a computer, processor, controller, or other signal processing apparatus into a dedicated processor for performing any of the methods herein.
If implemented at least in part in software, the controller, processor, device, module, unit, multiplexer, generator, logic, interface, decoder, driver, generator, and other signal generation and signal processing features may include, for example, code or instructions for storage to be executed by, for example, a computer, processor, microprocessor, controller, or other signal processing device.
The following provides a detailed description of various embodiments of the invention, and the accompanying drawings that illustrate aspects of the disclosure. The present disclosure has been described in connection with these embodiments, but is not limited to any particular embodiment. The present disclosure includes many alternatives, modifications, and equivalents of the disclosed embodiments. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. These details are provided for the purpose of illustration; the present invention may be practiced without some or all of these specific details described herein. For the sake of clarity, technical material that is known in the technical fields related to the disclosure has not been described in detail so that the invention is not unnecessarily obscured.
During training, the method of using host Central Processing Unit (CPU) memory as a buffer to offload some of the upcoming tensors (as described above) may in practice result in lower training throughput and severe CPU memory interference. To address the low training throughput and the severe CPU memory interference, and to relieve the CPU during training, one embodiment of the present disclosure does not attack the memory read time at the hardware level, but instead proposes an orthogonal method that preprocesses the training data in a way that speeds up model training while alleviating the cost of reading data from memory.
Table 1 provides definitions of several abbreviations used in the specification.
Table 1:
According to various embodiments of the present disclosure, a computational storage system (hereinafter referred to as the training system of the present invention) is provided that accelerates data preprocessing for DNN training by performing a data preprocessing step (e.g., RP) in a computational storage device near the SSD to minimize overall data movement. As described below, lower end-to-end latency and better energy efficiency are achieved by computing near the SSD. In one embodiment of the present disclosure, the training system of the present invention may reduce training time and improve the accuracy of the DNN model, which not only alleviates the memory-read time problem during DNN training but also ensures lower power consumption.
Embodiments of the present disclosure provide the following contributions:
(1) These embodiments of the training system of the present invention provide a computational storage that accelerates data preprocessing for AI training by integrating dimension reduction into the computational components inside the computational storage (e.g., inside the computational unit 300 in fig. 3). As used herein, "dimension reduction" refers to a technique that preserves the nature of the original data feature set by reducing the dimensions of the data feature set by extracting a subset of the data set that has the most relevant or prominent features.
(2) As described below, such a computational storage device may be used with a generic DNN model, such as an MLP, to reduce training time and power consumption. Experimental results on real-world datasets (detailed below) show a significant difference in training time between workloads with and without such a computational storage device. Using the computational storage device instead of a CPU to perform dimension reduction may reduce power consumption and may improve model accuracy, especially for relatively large datasets.
(3) To take advantage of the near-storage data preprocessing capability of the computational storage device, the present disclosure provides a training system of the present invention that supports large-dataset DNN training and can improve the performance of convolutional recurrent neural network (CRNN) models in, for example, text recognition. Further, experimental results on training large datasets show a clear advantage in performing RP compared to not performing RP. Performing RP using such a computational storage device may achieve accuracy similar to the CRNN model while ensuring lower end-to-end latency and better energy efficiency.
A. Computing storage for DNN preprocessing
The training system may be implemented with a DRAM buffered DNN training system or a storage buffered DNN training system.
The DRAM-buffered DNN training system may include three operations: data loading ((1)), downsampling ((2)), and DNN (or AI) training ((3)). As shown in FIG. 1A, assuming the data source is external to the training server 100 (e.g., at the host), the generic DNN training process typically buffers the data in DRAM 110, downsamples the data with CPU 120, and loads the data into GPU 130 during training. If the input data for training is larger than the GPU memory, GPU 130 reads data from DRAM 110 during each training epoch to load the entire batch of data from DRAM 110 into the GPU memory.
For large-dataset DNN training, a local storage device may be used to buffer the training data, since the input data is too large to fit in DRAM 110. As shown in FIG. 1B, the raw data ((1)) is first sent from the target storage device (e.g., the SSD of FIG. 5) to DRAM 110. The data is then partitioned, downsampled by CPU 120 ((2)), and buffered in local storage 140 ((3)). During training, the buffered data is read by the training batch generator as a plurality of training batches and finally fed into GPU 130 for training ((4)). This process is repeated for each training epoch with negligible input/output (IO) time cost. A simplified sketch of this flow is given below.
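In the sketch (illustration only, not part of the claimed embodiments), cpu_downsample stands in for the CPU 120 downsampling step and the buffered list stands in for local storage 140; the chunk sizes are assumptions.

```python
import numpy as np

def cpu_downsample(chunk):
    # Hypothetical stand-in for CPU-side resizing/augmentation (step (2)).
    return chunk.astype(np.float32)[:, ::4]

def storage_buffered_batches(raw_chunks, batch_size):
    """FIG. 1B flow: partition -> CPU downsample -> buffer in local storage -> yield GPU batches."""
    buffered = []                                   # stands in for local storage 140 (step (3))
    for chunk in raw_chunks:                        # raw data arriving via DRAM 110 (step (1))
        buffered.append(cpu_downsample(chunk))
    data = np.concatenate(buffered)
    for start in range(0, len(data), batch_size):   # training batch generator (step (4))
        yield data[start:start + batch_size]

raw_chunks = [np.random.rand(512, 1024) for _ in range(4)]
for batch in storage_buffered_batches(raw_chunks, batch_size=256):
    pass  # each batch would be fed to GPU 130 for one training step
```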
Training system with computational storage buffering
FIG. 2 illustrates a training system for computing storage buffering according to one embodiment of the present disclosure.
Referring to fig. 2, a training system (or server) 200 of the present invention may include a DRAM 210, a CPU 220, and a GPU 230. Further, the training system 200 of the present invention may include a computational storage 240 to accelerate data preprocessing using techniques described below. The training system of the present invention may minimize data movement in the machine learning workload by performing preprocessing operations near the SSD.
Accordingly, training system 200 may be implemented using computational storage (also referred to as in-storage computing). As used herein, computational storage is a technology that allows data to be stored and processed in place rather than being transferred from disk or other external storage devices to the host for processing. Integrating such near-data computation with DNNs enables data processing to be performed directly where the data resides and significantly reduces the latency and power consumption of DNN operations. In particular, computational storage has become a promising solution for accelerating both CNNs and RNNs.
By using custom hardware designs, quantization methods, pruning techniques, and memory access patterns, the performance of CNNs on computing storage devices such as FPGAs can be significantly improved, enabling CNNs to be deployed on resource-constrained devices and speeding up the use of FPGAs in large-scale applications. By taking advantage of the high parallelism and energy efficiency of FPGAs, the CNN training process is accelerated.
Computing storage has been explored as a potential solution to overcome the computational limitations of conventional CPU and GPU implementations in RNNs. Research has demonstrated the potential of computing storage devices, particularly FPGAs, for achieving high performance, low power consumption and real-time processing of deep neural networks.
Referring back to FIG. 2, in most cases, reading data from storage 240 may be relatively slow. Thus, rather than simply buffering data in DRAM 210 prior to training, the various embodiments of the present disclosure apply Dimension Reduction (DR) as an inline operation, and the dimension-reduced data may be stored in computational storage 240.
High-dimensional data may contain a large proportion of redundant features, increasing space and computation-time requirements and making models prone to overfitting. Dimension reduction is one approach that may be used to address these problems. In particular, Random Projection (RP) is a DR technique in which the original d-dimensional data is projected onto lower k-dimensional data using a random matrix R whose columns have unit length. RP has shown potential in feature extraction. Among the advantages of RP are that it offsets the heavy computational requirements of processing high-dimensional data and that it can meet real-time processing requirements. The simplicity and parallelism of RP enable efficient FPGA implementations, which is particularly useful for high-performance computing systems. A minimal sketch of the projection itself is given below.
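For illustration only, the following Python sketch applies the projection with a Gaussian random matrix; the construction of R (columns of approximately unit length) and the dimensions used are assumptions, not the FPGA kernel itself.

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project the d-dimensional rows of X onto a k-dimensional random subspace.

    X: (n_samples, d) batch of flattened training data.
    Returns the (n_samples, k) reduced batch, i.e. X @ R.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d, k)) / np.sqrt(d)   # columns have (approximately) unit length
    return X @ R

# Example with the case-study dimensions: 552-dimensional features reduced to 80 dimensions.
X = np.random.rand(1024, 552).astype(np.float32)
print(random_projection(X, k=80).shape)  # (1024, 80)
```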
To apply dimension reduction in the training system of the present invention, training server 200 may first load raw data from main memory to working memory (i.e., DRAM 210) on computing storage 240. In some embodiments, computing storage 240 may include a computing unit, such as computing unit 300 in fig. 3. The calculation unit may perform the dimension reduction ((3)), and may store the dimension reduced data in the storage 240. The reduced-dimension data may then be transferred to GPU 230 for DNN (or AI) training ((4)). In some embodiments, the reduced-size data of the compute storage device 240 may be transferred to the GPU 230 through P2P-DMA techniques. By applying dimension reduction to the data writing process for buffering, additional data movement and CPU 220 use to perform dimension reduction is no longer required and may be partially or completely eliminated, and memory space on DRAM 210 may be reserved for storing raw data. In addition, the reduced-dimension data may be transmitted to GPU 230, and as the size of the training model is reduced, the data transfer time from storage 240 to GPU 230 and the training time in GPU 230 may be reduced.
In one embodiment, the training system of the present invention becomes more efficient in handling relatively large dataset DNN training. As described above, embodiments of the present disclosure do not utilize too much memory bandwidth of CPU 220, but rather may use computing storage 240 to perform dimension reduction, and may store reduced-size data for training by GPU 230 ((4)). In some embodiments, the computing storage 240 may be utilized by the training batch generator to locally generate training batches, which avoids consuming host CPU cycles or DRAM bandwidth. As described above, P2P-DMA may enable direct memory access between GPU 230 and compute storage 240 without using a host DRAM buffer to minimize host intervention during SSD reads/writes. Accordingly, embodiments of the present disclosure may take advantage of the computing storage 240 and reduce the burden on the CPU 220.
B. System implementation details
Embodiments of the present disclosure may use SGEMM kernels on a Xilinx Alveo U200 FPGA to perform RP. The kernel may be implemented using Xilinx OpenCL High-Level Synthesis (HLS) programming. Under the HLS development flow, the FPGA can be managed by the Xilinx Runtime (XRT) software, which provides APIs and drivers to reprogram and communicate with the FPGA from a host. The SGEMM accelerator may include a portion running on the U200 FPGA and management code, programmed using OpenCL, running on the host.
SGEMM kernel using Xilinx OpenCL HLS
Embodiments of the present disclosure may implement a tiled single-precision general matrix multiplication (SGEMM) accelerator using multiple Compute Units (CUs). The multiple compute units may compute tiles of the output matrix in parallel. Each of the compute units may be implemented using the structure of the compute unit 300 shown in FIG. 3.
Referring to FIG. 3, the SGEMM kernel (i.e., each compute unit 300) may be used to perform the RP function as C = AB, where A is the raw batch data, B is the RP matrix, and C is the result matrix (or output matrix). As shown, each compute unit 300 may include a DSP unit 320 for performing matrix multiply-add operations and BRAM blocks 311-313 for storing input/output tiles (sub-arrays of matrices A, B, and C). Since the on-chip memory resources of the FPGA are limited compared to external memory, the complete input batch data and matrix may first be transferred to the external DRAM of the FPGA (e.g., a global buffer), and the input tiles (sub-arrays) of the batch data and RP matrix may be loaded into BRAMs 311-312 on the compute unit 300 as needed to perform the matrix multiplication.
In some embodiments, the input data and matrix may be double buffered so that writes from the external DRAM overlap with the reads used to compute output tiles. However, double buffering involves a trade-off because it doubles the kernel's BRAM requirement. Since the on-chip memory of the FPGA is limited, in one embodiment the tile size can be reduced to compensate, which results in higher memory-bandwidth requirements. For this reason, embodiments of the present disclosure may double buffer the input A/B tiles but not the output C tile. The number of accesses per A/B tile grows with matrix size, while the number of accesses per C tile does not. For large matrices, the performance improvement from double buffering the C output matrix is very small compared to the loss associated with reducing the tile size. In the embodiment shown in FIG. 3, BRAM block 311 may buffer input A tiles A0 and A1, BRAM block 312 may buffer input B tiles B0 and B1, and BRAM block 313 may buffer the output C tile corresponding to the result matrix of DSP unit 320.
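For illustration, the tiled computation performed by each compute unit can be modeled in software as follows; the NumPy loops stand in for the BRAM-buffered A/B input tiles and the C output tile, and the tile size of 64 is an assumption.

```python
import numpy as np

def tiled_sgemm(A, B, tile=64):
    """Compute C = A @ B one output tile at a time, mirroring the compute unit of FIG. 3."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):              # output-tile row index
        for j in range(0, N, tile):          # output-tile column index
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):      # stream A/B tiles through the "BRAM" buffers
                a_tile = A[i:i + tile, k:k + tile]   # input A tile (BRAM block 311)
                b_tile = B[k:k + tile, j:j + tile]   # input B tile (BRAM block 312)
                acc += a_tile @ b_tile               # DSP multiply-accumulate (DSP unit 320)
            C[i:i + tile, j:j + tile] = acc          # output C tile (BRAM block 313)
    return C

A = np.random.rand(256, 552).astype(np.float32)   # batch data
B = np.random.rand(552, 80).astype(np.float32)    # RP matrix
assert np.allclose(tiled_sgemm(A, B), A @ B, atol=1e-3)
```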
In some embodiments, to utilize the DRAM memory bandwidth of the FPGA, the data access patterns are sequential. In one embodiment, the Xilinx HLS tools may be employed, which provide two main memory-access optimizations, burst transfers and read/write widening, both of which rely on sequential, regular access patterns. Under a standard row- or column-major matrix layout, tiles may be located in non-contiguous areas of memory, which disables these optimizations. To address this issue, the host may reorder the input matrix into a tiled data format before transmitting it to the SGEMM kernel (i.e., compute unit 300). As shown in FIG. 4, the original row-major layout 410 may be reordered into the data layout 420. The original row-major layout 410 places data in the order of row 0 of every input tile, then row 1 of every input tile, and so on up to row n of every input tile. The data layout 420 places the tiles (sub-matrices of the matrix) in the order of rows 0 through n of the first input tile, then rows 0 through n of the second input tile, and so on up to rows 0 through n of the last input tile, so that under the reordered layout each tile occupies a contiguous area of memory. Thus, although data reordering incurs host memory-bandwidth overhead, this cost reduces the overall run time by allowing the FPGA to read/write tiles in bursts from contiguous areas of memory. A sketch of this reordering is given after this paragraph.
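In the sketch, the 2x2 tile in the usage example is chosen only for readability; real tile dimensions would match the kernel's BRAM tile size.

```python
import numpy as np

def to_tiled_layout(M, tile_rows, tile_cols):
    """Reorder a row-major matrix so that each tile occupies a contiguous block of memory,
    allowing each tile to be read/written as a single burst."""
    rows, cols = M.shape
    assert rows % tile_rows == 0 and cols % tile_cols == 0
    tiles = []
    for i in range(0, rows, tile_rows):
        for j in range(0, cols, tile_cols):
            tiles.append(M[i:i + tile_rows, j:j + tile_cols].ravel())  # one contiguous tile
    return np.concatenate(tiles)

M = np.arange(16, dtype=np.float32).reshape(4, 4)
print(to_tiled_layout(M, 2, 2))
# [ 0.  1.  4.  5.  2.  3.  6.  7.  8.  9. 12. 13. 10. 11. 14. 15.]
```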
OpenCL host application
The host application may provide an API (C++, scikit-learn) for the user to perform matrix multiplication using the U200 FPGA. Internally, the host application may use the OpenCL queue to schedule I/O and kernel operations. Tiles of the output matrix may be grouped into OpenCL workitems and partitioned among CUs to compute results in parallel.
Since the matrix data initially resides outside the FPGA DRAM (in the host DRAM or the SSD), in practice there is an additional cost of loading the data into the FPGA. The latency of a single matrix multiplication operation therefore depends on both the PCIe transfer latency and the kernel computation latency. To hide this latency, embodiments of the present disclosure may implement an asynchronous API and pipeline host-FPGA I/O with kernel computation, as sketched below.
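The sketch shows only the scheduling idea, not the actual OpenCL host code; the transfer and compute callables are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_batches(batches, transfer, compute):
    """Overlap the host-to-FPGA transfer of batch i+1 with the kernel computation on batch i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(transfer, batches[0])      # transfer the first batch
        for nxt in batches[1:]:
            staged = pending.result()                    # wait for the current transfer
            pending = pool.submit(transfer, nxt)         # start transferring the next batch
            results.append(compute(staged))              # compute while the transfer proceeds
        results.append(compute(pending.result()))        # compute on the last batch
    return results

# Toy usage: "transfer" is the identity and "compute" sums each batch.
print(pipelined_batches([[1, 2], [3, 4], [5, 6]], transfer=lambda b: b, compute=sum))  # [3, 7, 11]
```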
P2P-DMA
In a basic input/output system (BIOS) supporting large memory-mapped IO, the U200 FPGA may map its 64GB of DDR into the host memory space, thereby enabling P2P-DMA transfers. When data is to be read from or written to the SSD, PCIe bandwidth can be conserved by enabling P2P-DMA and transferring data directly between the FPGA and the SSD, bypassing buffers in host memory. Embodiments of the present disclosure may use this feature in the output phase to write the reduced matrix directly to the SSD.
DNN training system with computational storage
FIG. 5 illustrates the basic setup of the computational-storage-enabled training system of the present invention. The system may include an object storage server 200a and a training server 200b. In one embodiment, training server 200b may use GPU 230 and computational storage 240, where computational storage 240 employs a Xilinx Alveo U200 FPGA with a 4TB SK hynix SSD. The system may support (1) a C++ API or (2) a scikit-learn API to apply dimension reduction in the computational storage 240 and output the results to the host DRAM 210 (shown in FIG. 2) or to the SSD of the training server 200b via P2P-DMA.
As shown in FIG. 5, the entire training task may be managed and orchestrated by Apache Airflow. Training data may be initially stored in Ceph on storage server 200a (see operation 510 in FIG. 5), may be transferred to computational storage 240 for buffering and preprocessing (520), and may then be copied to GPU 230 for training (530). To enable DNN services, embodiments may use TensorFlow (a machine learning platform) with CUDA (a parallel computing platform and programming model) and cuDNN (a GPU-accelerated primitive library for deep neural networks) for GPU acceleration, and an NVIDIA Tesla P100 GPU with 16GB memory. In one embodiment, the test platform may use a 3.0GHz 48-core processor with DDR4-2666 192GB DRAM, as well as the P100 GPU and the computational storage device prototype. In FIG. 5, Ceph denotes an open-source software-defined storage program designed for block, file, and object storage. TensorFlow denotes a free, open-source software library for machine learning and Artificial Intelligence (AI). Scikit-learn denotes a free machine-learning library for the Python programming language. Airflow denotes open-source workflow orchestration and data pipeline software.
Referring to FIGS. 2 and 5, for a generic DNN training system using CS, the raw data (training, validation, and test datasets) may initially be stored in different storage spaces of Ceph on the storage server. The data may first be transferred to DRAM 210 for buffering, then downsampled by CPU 220, and finally RP-processed by computational storage 240 and saved in computational storage 240. In one embodiment, downsampling may include image resizing, data augmentation, and dimension adjustment, which may reduce the size of the raw data. The reduced-size data may be loaded into GPU 230 during training. For large-scale training tasks, the raw training input data may be too large to fit entirely in DRAM 210. Thus, the training input data may be partitioned, and portions of it buffered into DRAM 210 based on the DRAM size and the training batch size. These buffered data may then be downsampled by CPU 220 and finally RP-processed by CS 240. This process may be repeated until all buffered data has been preprocessed. The end-to-end flow can be sketched as follows.
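In the sketch, cpu_downsample and the in-loop projection are placeholders for the CPU 220 step and the FPGA SGEMM kernel in CS 240, respectively; all sizes are assumptions.

```python
import numpy as np

def cpu_downsample(part):
    # Hypothetical stand-in for image resizing / data augmentation / dimension adjustment.
    return part.astype(np.float32)[:, ::2]

def preprocess_with_cs(raw_parts, rp_dim, batch_size, seed=0):
    """FIG. 2 / FIG. 5 flow: buffer partitions in DRAM -> CPU downsample -> RP (in CS) -> GPU batches."""
    rng = np.random.default_rng(seed)
    R = None                                       # RP matrix, built once the feature width is known
    for part in raw_parts:                         # partitioned raw data buffered in DRAM 210
        x = cpu_downsample(part)                   # CPU 220 step
        if R is None:
            R = rng.standard_normal((x.shape[1], rp_dim)) / np.sqrt(x.shape[1])
        reduced = x @ R                            # dimension reduction, performed by CS 240
        for s in range(0, len(reduced), batch_size):
            yield reduced[s:s + batch_size]        # training batch handed to GPU 230

parts = [np.random.rand(2048, 2000) for _ in range(3)]
for batch in preprocess_with_cs(parts, rp_dim=80, batch_size=256):
    pass  # GPU training step would go here
```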
Fig. 6 is a flowchart illustrating operation 600 of a training system according to one embodiment of the present disclosure. In various embodiments of the present disclosure, operation 600 may be performed by training server 200 in fig. 2, i.e., by Dynamic Random Access Memory (DRAM) 210, central Processing Unit (CPU) 220 coupled to DRAM 210, computing storage 240 coupled to DRAM 210, and Graphics Processing Unit (GPU) 230 coupled to computing storage 240.
Referring to fig. 6, operation 600 may include buffering (610) training data by a DRAM. Operation 600 may include downsampling, by the CPU, training data to provide downsampled training data to the DRAM (620). Operation 600 may include performing, by a computing storage, dimension reduction on the downsampled training data to generate a training data batch (630). Operation 600 may include performing training by the GPU on the training data batch (640).
In another embodiment, the dimension reduction comprises random projection.
In another embodiment, performing the dimension reduction includes providing, by the computing storage, the training data batch to the GPU via peer-to-peer direct memory access (P2P-DMA).
In another embodiment, the computing storage includes a plurality of computing units. In one embodiment, each computing unit performing dimension reduction includes: storing, by the buffer block, an input tile of the downsampled training data and an output tile of the training data batch; and multiplying and/or adding the input tiles by a Digital Signal Processing (DSP) unit to generate output tiles.
In another embodiment, two input tiles are stored in a buffer block.
In another embodiment, the input tiles are double buffered simultaneously by the buffer block.
In another embodiment, the data access pattern of the two input tiles is sequential access.
In another embodiment, the input tiles have a tiled data format in which the input tiles are reordered from the row main layout to the data layout of the input matrix.
In another embodiment, the downsampled training data includes data processed by image resizing, data enhancement, and/or dimension adjustment of the training data.
In another embodiment, buffering the training data includes partitioning the training data and buffering the partitioned training data in the DRAM.
C. Case study and experimental results
This section presents three case studies and experimental results demonstrating the effectiveness of the training system with computational storage of the present invention compared with other baselines, including deep learning models, for large datasets or datasets with high-dimensional data. The performance of the different systems is evaluated based on the following three criteria: AI task run time, training accuracy, and energy cost.
Case study
In this work, a generic DNN training system is applied to two real-world binary classification tasks using a multi-layer perceptron (MLP): pediatric pneumonia chest X-ray classification and RNA-seq cell-type classification. For the first task, the goal is to distinguish pneumonia from normal chests in chest X-ray images. For the second task, binary classification of non-diabetic (ND) and type 2 diabetic (T2D) cell types is performed on each RNA-seq sample using a real transcriptomic dataset from a single-cell RNA-seq study. To demonstrate the performance of the large-dataset DNN training system of the present invention, an unconstrained scene text recognition task is used with MJSynth, a synthetically generated word-image dataset containing 9 million samples, for training and validation. ICDAR 2003 and ICDAR 2013 are used as the two test datasets. All five datasets are summarized in FIG. 9.
The MLP model in the first two tasks has four neural layers: three fully connected (FC) layers with ReLU activation and 0.2 dropout, and a final layer with one neuron and sigmoid activation. The loss function is binary cross-entropy. In task 1, five sets of square images with different pixel resolutions are used. Each set is split into training and validation samples at a 4:1 ratio. The image data is flattened, and RP is applied to these image samples in the computational storage device to reduce the dimensionality. The number of neurons in each FC layer is set equal to the number of pixels. In task 2, the dimension of the input data is 638 x 26616. During preprocessing, the data are divided into training and test samples containing 95% and 5% of the data, respectively. The training samples are further split into training and validation samples at a 3:1 ratio. After applying random projection, the number of features in all samples is reduced to 1000. The batch size is varied to show the robustness of the system's performance during training. A sketch of this MLP is given below.
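The following TensorFlow/Keras sketch (Keras being available on the test platform described above) follows the layer description; the optimizer, metrics, and the hidden width used for task 2 are assumptions.

```python
import tensorflow as tf

def build_mlp(input_dim, hidden_units):
    """Three ReLU fully connected layers, each followed by 0.2 dropout, then a 1-neuron sigmoid output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",              # optimizer choice is an assumption
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Task 2 example: RP-reduced samples with 1000 features (the hidden width of 1000 is an assumption).
model = build_mlp(input_dim=1000, hidden_units=1000)
model.summary()
```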
For task 3, CDRNN using the CS-based large-dataset DNN training system of the present invention is used; its main workflow is shown in FIG. 7. To extract robust features for text recognition, this operation first trains a case-sensitive character classifier using 100,000 image samples (see FIG. 8). The word images are uniformly divided into a plurality of character images based on the length of each word's label, and each character image is assigned the corresponding label. CNN training uses 650,000 input samples. Second, for each word image resized to a height of 32, a sliding window of size 32 is moved across the image, and the captured image patches are converted into a multi-layer CNN feature sequence by passing them through a pre-trained CNN model in the CPU. Specifically, the outputs of the flatten layer and the smallest fully connected layer are extracted and concatenated into a feature sequence of dimension 552. Third, random projection is used to embed the original 552-dimensional features into an 80-dimensional random subspace in the computational storage device. After this roughly 85% dimension reduction, the RNN model is applied in the GPU to recognize the dimension-reduced feature-sequence samples. The RNN model uses two layers of bidirectional Long Short-Term Memory (LSTM), each with 256 nodes. Finally, Connectionist Temporal Classification (CTC) is used to remove duplicate labels and non-character labels in the final output. The Adam optimizer is used with its default learning rate. The workflow is shown in FIG. 10. Specifically, during training of each system, a custom training batch generator is used to generate batches. The raw data is partitioned based on the chosen batch size so that each partition covers exactly one batch. During data partitioning, each batch from the DRAM is written to the local storage. The training performance is compared with a conventional CRNN model (as the baseline for the storage-buffered DNN training system). The batch size used in the CRNN is the same as the input size in the training system of the present invention. Comparing the CDRNN of the present invention with a conventional CRNN, the total model size is reduced from 8.7 million parameters to 3.2 million parameters, where the 3.2 million parameters are obtained by adding the numbers of model parameters in the CNN and the RNN. All systems are tested in a large DNN training environment with three different workloads (0.1M, 1M, and 9M images), meaning that all training data is buffered in local storage or CS before training rather than in DRAM. For each workload, the overall data is split into training and validation data at a 4:1 ratio. Note that although the original size of the 9M-image dataset in memory is 32GB, after downsampling the processed data size in memory grows to 734.4GB, much larger than the DRAM size. A sketch of the RNN stage is given below.
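This sketch covers only the RNN stage (two bidirectional LSTM layers of 256 nodes over the 80-dimensional RP features, trained with CTC); the character-set size and the use of tf.nn.ctc_loss are assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 63  # assumed: 62 case-sensitive alphanumeric characters plus one CTC blank

def build_cdrnn_rnn(feature_dim=80, lstm_units=256, num_classes=NUM_CLASSES):
    """Two bidirectional LSTM layers over RP-reduced feature sequences, emitting per-frame logits for CTC."""
    inputs = tf.keras.Input(shape=(None, feature_dim))          # (time, 80) feature sequence
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_units, return_sequences=True))(inputs)
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_units, return_sequences=True))(x)
    logits = tf.keras.layers.Dense(num_classes)(x)              # per-frame class scores
    return tf.keras.Model(inputs, logits)

def ctc_loss(labels, logits, label_length, logit_length):
    """CTC aligns frame-level outputs with word labels, removing repeated and blank labels."""
    return tf.reduce_mean(tf.nn.ctc_loss(
        labels=labels, logits=logits,
        label_length=label_length, logit_length=logit_length,
        logits_time_major=False, blank_index=-1))

model = build_cdrnn_rnn()
model.summary()
```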
Experimental results
The performance of task 1 is evaluated under different loads. FIGS. 11A to 11C show the run time of each baseline at different input image sizes. It is apparent that, for a fixed input image size, the data loading time is nearly the same for all workloads. As the image size increases, the run time of the baseline without RP preprocessing increases linearly with the square of the pixel dimension. Since the reduced feature sizes are similar for the different input sizes, the training times of the two RP-related systems are nearly identical. The performance differences among the three systems are analyzed next. As shown in FIG. 12A, the larger the input size, the higher the relative accuracy of the model, since more features are retained. Systems employing RP have significant advantages over systems that do not. From FIG. 12B, it can be seen that the RP process may reduce training time by more than 50%. The increase in training time without RP as the input size grows is much larger than the increase in RP time in the systems involving RP. FIG. 12C shows the average power and total energy consumption collected for an input size of 500 x 500. The average power and energy measurements do not include the idle power consumed by the background system. Compared with the system without RP and the system performing RP in the CPU, the training system of the present invention can save about 33% and 26% of average power, respectively, and further reduce total energy consumption by 70% and 16%.
As for the results of task 2, as shown in FIG. 13A, for the system without RP the training accuracy decreases significantly as the batch size increases, whereas for the systems with RP the training accuracy remains almost unchanged even when the batch size changes. As shown in FIG. 13B, the training time of all systems decreases as the batch size increases. When the batch size is small, the RP-based systems have greater advantages in training time and end-to-end run time than the system without RP. The run-time performance of the RP-CPU (i.e., RP performed in the CPU) and the RP-CS (i.e., RP performed in the CS) is very similar. Furthermore, when the batch size exceeds 8, the data loading time occupies more than half of the total end-to-end run time. FIG. 13C shows the average power and total energy consumption. As a general trend, the average power of all systems increases with increasing batch size, while the energy consumption decreases with increasing batch size. Averaging the power and energy values over the four different batch sizes, the training system of the present invention can save about 30% and 12% of average power, respectively, and further reduce total energy consumption by 24% and 4%, compared with the system without RP and the system using the RP-CPU.
Regarding the performance of each system in task 3, FIG. 15 shows sample test results in which the predicted result is displayed on the left above each word image and the ground truth on the right. FIG. 14 compares the performance of the different systems, including the CDRNN systems in which RP preprocessing is performed in the CS and in the CPU, and reports the training accuracy of four different systems, including those performing RP and the one feeding the original CNN features directly to the RNN. In terms of accuracy, the CRNN system is the best. However, as the dataset grows, this advantage shrinks. The accuracy of CDRNN without RP is about 2% higher than that of CDRNN with RP, due to the distortion and information loss caused by RP. However, as the workload size increases, this gap is greatly reduced. The accuracy difference between the RP-CPU and the RP-CS is negligible and is entirely due to the randomness of the transformation matrix. As shown in FIG. 14, when the dataset is small, the accuracy on dataset IC13 is higher than that on dataset IC03 for all systems; for the large dataset, the opposite is true.
Next, the run time of each system is examined. The pipeline comprises four main phases: data loading, downsampling and data partitioning, random projection, and training. The end-to-end latency is the sum of the run times of the phases. The feature extraction step is included in the downsampling step, which consumes a fixed amount of time in the CDRNN-related systems. As shown in FIG. 18, CDRNN with RP is significantly better than CRNN for the different datasets and is also significantly better than CDRNN without RP. FIG. 18 shows that, for the 9M-image dataset, the training time of the training system of the present invention is reduced by 40.3% and 10%, and the end-to-end latency is reduced by 29.3% and 8.2%, respectively, compared with the CRNN system and the CDRNN system without RP.
Finally, the average power and total energy consumption are collected for each system, as shown in FIG. 17. In general, the average power and energy consumption of all systems increase with the dataset size. The results show that the CS-based CDRNN system of the present invention outperforms all other systems. Considering the average power and energy costs for the largest test dataset, the proposed training system can save about 13.2% and 10.7% of average power, respectively, and further reduce total energy consumption by 38.2% and 18%, compared with the CRNN system and the CDRNN system without RP. In particular, the training system of the present invention can save 47.7% and 23.5% of average power, respectively, and further reduce total energy consumption by 57.1% and 17.4%. To show the advantage of performing RP in the CS-based system of the present invention over performing RP in a CPU-based system, the power consumption and CPU time of the RP phase for the 9M dataset are compared directly in FIG. 16. The CPU usage of the RP-CPU is 40.6 times that of the RP-CS, and the energy cost of the RP-CPU is 58.3% higher than that of the RP-CS.
As described above, embodiments of the present disclosure provide a computing storage device for an AI training system. Evaluation shows that the computational storage can be used to improve the training accuracy of the training system and reduce overall power consumption.
Although the foregoing embodiments have been described in some detail for purposes of clarity and understanding, the disclosure is not limited to the details provided. As will be recognized by those skilled in the pertinent art based on the above disclosure, there are numerous alternative ways of implementing the present invention. Accordingly, the disclosed embodiments are illustrative and not restrictive. This disclosure is intended to include all modifications and alterations of the disclosed embodiments. Furthermore, the disclosed embodiments may be combined to form additional embodiments.

Claims (20)

1. A training system, comprising:
a Dynamic Random Access Memory (DRAM) for buffering training data;
a central processing unit, CPU, coupled to the DRAM and downsampling the training data and providing downsampled training data to the DRAM;
a computation storage device comprising a solid state drive, i.e., SSD, and a field programmable gate array, i.e., FPGA, and performing dimension reduction on the downsampled training data to generate a training data batch; and
a graphics processing unit, i.e., GPU, performing training on the training data batch.
2. The training system of claim 1, wherein the dimension reduction comprises random projection.
3. The training system of claim 1, wherein the computing storage provides the training data batch to the GPU via a peer-to-peer direct memory access operation, i.e., P2P-DMA operation.
4. The training system of claim 1, wherein the computing storage comprises a plurality of computing units, each comprising:
a buffer block storing input tiles of the downsampled training data and output tiles of the training data batch; and
a digital signal processing unit, DSP unit, multiplies and adds the input tiles to generate the output tiles.
5. The training system of claim 4, wherein the buffer block stores two of the input tiles.
6. The training system of claim 5, wherein the input tiles are double buffered simultaneously by the buffer block.
7. The training system of claim 5, wherein the data access patterns of two of the input tiles are sequential.
8. The training system of claim 4, wherein the input tiles have a tiled data format and are reordered from a row master layout to a data layout of an input matrix, the input tiles being located in contiguous areas of memory.
9. The training system of claim 4, wherein the downsampled training data includes data processed by image resizing, data enhancement, and/or dimension adjustment of the training data.
10. The training system of claim 4, wherein the training data is partitioned and then buffered in the DRAM.
11. A method for operating a training system, comprising:
buffering training data by a dynamic random access memory, DRAM;
downsampling the training data by a central processing unit, CPU, coupled to the DRAM to provide downsampled training data to the DRAM;
performing dimension reduction on the downsampled training data by a computing storage device coupled to the DRAM to generate a training data batch; and
training is performed on the training data batch by a graphics processing unit, GPU.
12. The method of claim 11, wherein the dimension reduction comprises random projection.
13. The method of claim 11, wherein performing dimension reduction comprises providing, by the computing storage device, the training data batch to the GPU via a peer-to-peer direct memory access operation, i.e., P2P-DMA operation.
14. The method of claim 11, wherein the computing storage comprises a plurality of computing units, the performing, by each computing unit, dimension reduction comprising:
storing, by a buffer block, an input tile of the downsampled training data and an output tile of the training data batch; and
the input tiles are multiplied and added by a digital signal processing unit, i.e. a DSP unit, to generate the output tiles.
15. The method of claim 14, wherein two of the input tiles are stored in the buffer block.
16. The method of claim 15, wherein the input tiles are double buffered simultaneously by the buffer block.
17. The method of claim 15, wherein the data access patterns of two of the input tiles are sequential.
18. The method of claim 14, wherein the input tiles have a tiled data format and are reordered from a row main layout to a data layout of an input matrix, the input tiles being located in contiguous areas of memory.
19. The method of claim 14, wherein the downsampled training data includes data processed by image resizing, data enhancement, and/or dimension adjustment of the training data.
20. The method of claim 14, wherein buffering the training data comprises partitioning the training data and buffering the partitioned training data in the DRAM.
CN202311309342.7A 2022-10-12 2023-10-11 Computing storage device of energy-efficient deep neural network training system Pending CN117875382A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/415,476 2022-10-12
US18/457,171 2023-08-28
US18/457,171 US20240127056A1 (en) 2022-10-12 2023-08-28 Computational storage for an energy-efficient deep neural network training system

Publications (1)

Publication Number Publication Date
CN117875382A true CN117875382A (en) 2024-04-12

Family

ID=90583601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311309342.7A Pending CN117875382A (en) 2022-10-12 2023-10-11 Computing storage device of energy-efficient deep neural network training system

Country Status (1)

Country Link
CN (1) CN117875382A (en)

Similar Documents

Publication Publication Date Title
US10872399B2 (en) Photorealistic image stylization using a neural network model
US10783394B2 (en) Equivariant landmark transformation for landmark localization
US11900253B2 (en) Tiling format for convolutional neural networks
US20190295228A1 (en) Image in-painting for irregular holes using partial convolutions
US20200394458A1 (en) Weakly-supervised object detection using one or more neural networks
US8400458B2 (en) Method and system for blocking data on a GPU
US8359281B2 (en) System and method for parallelizing and accelerating learning machine training and classification using a massively parallel accelerator
CN111465943B (en) Integrated circuit and method for neural network processing
US12008475B2 (en) Transposed sparse matrix multiply by dense matrix for neural network training
DE112020003165T5 (en) Video interpolation using one or more neural networks
CN115066692A (en) Apparatus and method for representing sparse matrices in neural networks
US20200160112A1 (en) Distributed batch normalization using partial populations
DE112020005020T5 (en) POSITION DETERMINATION USING ONE OR MORE NEURAL NETWORKS
WO2021198810A1 (en) Feature reordering based on similarity for improved memory compression transfers during machine learning jobs
US20230153604A1 (en) Performing simulations using machine learning
JP7427001B2 (en) Tiling algorithm for matrix math instruction set
CN117875382A (en) Computing storage device of energy-efficient deep neural network training system
US20240127056A1 (en) Computational storage for an energy-efficient deep neural network training system
US20220067509A1 (en) System and method for learning from partial compressed representation
US11972188B2 (en) Rail power density aware standard cell placement for integrated circuits
US11551090B2 (en) System and method for compressing images for remote processing
Li et al. Computational Storage for an Energy-Efficient Deep Neural Network Training System
US20230297643A1 (en) Non-rectangular matrix computations and data pattern processing using tensor cores
US20240062534A1 (en) Performing visual relational reasoning
Pal et al. Parallel Character Reconstruction Expending Compute Unified Device Architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination