CN114661637B - Data processing system and method for radio astronomical data intensive scientific operation - Google Patents


Publication number
CN114661637B
CN114661637B (application CN202210187329.8A)
Authority
CN
China
Prior art keywords
data
nodes
computing node
memory
storage
Prior art date
Legal status
Active
Application number
CN202210187329.8A
Other languages
Chinese (zh)
Other versions
CN114661637A (en
Inventor
An Tao (安涛)
Current Assignee
Shanghai Astronomical Observatory of CAS
Original Assignee
Shanghai Astronomical Observatory of CAS
Priority date
Filing date
Publication date
Application filed by Shanghai Astronomical Observatory of CAS filed Critical Shanghai Astronomical Observatory of CAS
Priority to CN202210187329.8A priority Critical patent/CN114661637B/en
Publication of CN114661637A publication Critical patent/CN114661637A/en
Application granted granted Critical
Publication of CN114661637B publication Critical patent/CN114661637B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/0897: Caches characterised by their organisation or structure, with two or more cache hierarchy levels
    • G06F 12/084: Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • G06F 12/0842: Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G06F 16/182: Distributed file systems


Abstract

The invention provides a data processing system for radio astronomical data-intensive scientific computing. The system comprises at least one data constellation, each of which is a scalable, integrated data unit installed in one cabinet or in several adjacent cabinets and composed of a scalable distributed storage system, a hybrid heterogeneous computing-node system, and a network system. Each computing node is physically integrated with its own very-large-capacity memory and a flash-based local storage unit; the storage system consists of the local storage unit of each computing node together with a distributed file system formed by the storage nodes. Each data constellation has an independent distributed file system. The invention also provides a corresponding method. By adopting a data-constellation architecture for big data, in which each data constellation has its own independent shared file system, the system meets the computing and storage requirements of astronomical big data while greatly reducing the problems caused by the traditional global file system.

Description

Data processing system and method for radio astronomical data intensive scientific operation
Technical Field
The invention belongs to the fields of radio astronomical data processing, data-intensive scientific computing, big data, and high-performance computing, and in particular relates to a data processing system and method for radio astronomical data-intensive scientific computing.
Background
With the construction and operation of advanced astronomical observation facilities, the astronomy community faces the challenges of ultra-large-scale data and data-intensive scientific computing. For example, the Square Kilometre Array (SKA), built through global collaboration, is the largest astronomical observation facility ever planned by the international astronomical community, and the largest international big-science collaboration in astronomy in which China participates. SKA combines a large number of small-aperture antennas to realize synthetic-aperture radio interferometric imaging; its total collecting area reaches one square kilometre, and compared with the current largest radio telescope its sensitivity is improved 50-fold and its survey speed 10000-fold, offering an unprecedented opportunity for humanity to understand the universe. After the first construction phase (2021-2029), SKA is expected to produce as much as 710 petabytes (1 PB = 1024 TB, i.e., more than one million GB) of scientific data per year. The international SKA regional centers that compute and store these data will need a processing platform with 300 PFlops (3x10^17 floating-point operations per second) of computing power, of which at least 20 PFlops is reserved for subsequent scientific analysis, and data exchange between the regional-center nodes of the member countries will require a stable average network speed of 100 Gbps. By 2029, the total data volume stored across the international SKA regional centers is expected to reach 2 EB (1 EB = 1024 PB). Existing supercomputing platforms cannot achieve this goal, and advanced data processing platforms are therefore being developed under the SKA international organization.
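As a quick sanity check on the figures above (an illustrative sketch, not part of the patent; all variable names are invented), the binary units and annual data rate can be put into code:

```python
# Hypothetical arithmetic on the SKA data-scale figures quoted above,
# using the binary units given in the text (1 PB = 1024 TB, 1 EB = 1024 PB).
TB = 1024            # GB per TB
PB = 1024 * TB       # GB per PB
EB = 1024 * PB       # GB per EB

annual_ingest_gb = 710 * PB      # ~710 PB of scientific data per year
total_2029_gb = 2 * EB           # ~2 EB stored by 2029

# 1 PB is indeed "more than one million GB", as the text notes.
assert PB > 1_000_000

# Average sustained rate needed to move one year's data in one year
# (idealized: protocol overheads and downtime ignored).
seconds_per_year = 365 * 24 * 3600
avg_rate_gbps = annual_ingest_gb * 8 / seconds_per_year  # gigabits per second
# avg_rate_gbps comes out on the order of the 100 Gbps links described.
```

This back-of-the-envelope rate is why the text requires stable 100 Gbps class links between regional-center nodes.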
The analysis and processing of ultra-large-scale (PB-level) data is a common challenge at the interface of astronomy and computer science, and the success of the big-data-driven SKA telescope likewise depends on the ability of its regional centers to solve this world-class problem. The processing of SKA scientific data is a typical data-intensive computing task, and its workload pattern differs greatly from that of traditional supercomputing, which is built around compute-intensive workloads.
Traditional supercomputing platforms have small local storage, limited shared-memory capacity, long data-staging times, and a single, uniform system architecture, and are therefore ill-suited to pipelined processing of the emerging ultra-large-scale data. In addition, traditional supercomputing platforms rely heavily on a shared file system in their storage architecture; at SKA data scales this leads to higher system failure rates and even system-wide outages. The global multi-user application scenario of the SKA project would also severely disturb the work of scientific users on a traditional supercomputing platform.
In the era of big data and artificial intelligence, the shift from compute-intensive to data-intensive workloads is increasingly evident. How to rapidly process massive data with complex structures, diverse types, many dimensions, and large file sizes is the central challenge of data-intensive scientific computing, of which astronomical big data is representative.
Disclosure of Invention
The object of the present invention is to provide a data processing system and method for radio astronomical data-intensive scientific computing, so as to improve the processing speed of data-intensive scientific workloads.
In order to achieve the above object, the present invention provides a data processing system for radio astronomical data-intensive scientific computing, comprising at least one data constellation. Each data constellation is a scalable, integrated data unit installed in one cabinet or in several adjacent cabinets, composed of a scalable distributed storage system, a hybrid heterogeneous computing-node system, and a network system. Each computing node is physically integrated with its own very-large-capacity memory and a flash-based local storage unit; the storage system consists of the local storage unit of each computing node together with a distributed file system formed by the storage nodes. Each data constellation has an independent distributed file system.
The hybrid heterogeneous computing-node system includes at least an x86 CPU architecture, an ARM architecture, and an x86 CPU + GPU architecture.
When a computing node uses the ARM architecture, its total memory-access bandwidth is 80 GB/s; when a computing node uses the CPU + GPU architecture, its memory-access bandwidth is 2 TB/s.
The total capacity of the very-large-capacity memory of each computing node is 1 TB to 2 TB and is adjusted according to the number of CPU cores, such that the memory available per core is not less than 32 GB.
For a compute node with 32 cores, its total memory capacity is at least 1TB.
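The per-core sizing rule above can be expressed as a small helper; this is a sketch for illustration, and the function name is invented, not from the patent:

```python
def meets_memory_rule(total_memory_gb: float, cores: int,
                      min_gb_per_core: float = 32) -> bool:
    """Check the sizing rule above: total memory / core count >= 32 GB."""
    return total_memory_gb / cores >= min_gb_per_core

# A 32-core node needs at least 1 TB (1024 GB) in total, as the text states.
assert meets_memory_rule(1024, 32)
# A traditional node with 68 cores sharing 128 GB falls far short of the rule.
assert not meets_memory_rule(128, 68)
```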
The local storage units use NVMe SSDs, and the storage nodes use HDDs.
The distributed file system adopts a fully distributed, fully symmetric architecture.
The network system comprises a number of IB (InfiniBand) switches connecting all computing nodes and storage nodes; network switches connecting all computing nodes, storage nodes, background storage nodes, and management nodes; and a number of user login nodes connected to the management nodes via the Internet.
In another aspect, the present invention provides a data processing method for radio astronomical data-intensive scientific computing, comprising:
s0: providing a data processing system for radio astronomical data-intensive scientific computing as described above;
s1: the raw data is sent through an IB switch into the very-large-capacity memory of the current computing node, where it serves as a very-large memory cache;
s2: the current computing node processes its task and determines whether the resulting data is intermediate data or final data; if intermediate data, execution continues with step S3; if final data, the final data is stored in the very-large-capacity memory or the local storage unit of the computing node, written back through an IB switch to the distributed file system on the storage nodes, and the process ends;
s3: the current computing node stores the resulting intermediate data, according to its storage requirements, in the very-large-capacity memory or in the flash-based local storage unit, where it serves as a very-large memory cache or a flash (SSD) cache;
s4: a computing node of a different architecture type from the current node reads the memory cache or SSD cache of the current node through the IB switch, so that intermediate data is exchanged between computing nodes; that node then becomes the new current computing node and execution returns to step S2.
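The method's steps S1 to S4 amount to a loop that keeps intermediate products in node-local caches (RAM or NVMe flash) and writes only final products back to the distributed file system. A minimal control-flow sketch under that reading, with all class and function names invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class StageResult:
    data: bytes
    final: bool          # True once a stage has produced a final product
    large: bool = False  # too big for the memory cache -> use the SSD cache

def write_back_to_dfs(data: bytes) -> bytes:
    """Placeholder for the IB-switch write-back to the storage nodes' DFS."""
    return data

def run_pipeline(raw: bytes, stages) -> bytes:
    """Sketch of steps S1-S4: intermediates stay in node-local caches,
    and only final products are written back to the distributed file system."""
    memory_cache = {}   # stands in for the very-large memory cache (S1, S3)
    ssd_cache = {}      # stands in for the NVMe flash cache (S3)
    current = raw       # S1: raw data lands in the current node's memory
    for i, stage in enumerate(stages):
        result = stage(current)                    # S2: process on current node
        if result.final:
            return write_back_to_dfs(result.data)  # S2: final data -> DFS
        # S3: keep the intermediate in the fastest tier that fits
        (ssd_cache if result.large else memory_cache)[i] = result.data
        current = result.data                      # S4: the next (possibly
        # differently architected) node reads the cache and becomes "current"
    return current
```

For example, two toy stages such as `lambda d: StageResult(d + b"-cal", False)` followed by `lambda d: StageResult(d + b"-img", True)` would carry the intermediate through the cache and write back only the final imaging product.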
The data processing system for radio astronomical data-intensive scientific computing of the present invention is designed to cope with astronomical big data. It proposes a data-constellation architecture in which each data constellation has its own independent shared file system; this not only meets the computing and storage requirements of astronomical big data but also greatly reduces the problems brought by a global file system (the traditional supercomputing design). At the same time, this design allows processors of different models to be assigned to different processing tasks, with local storage and the network customized and used efficiently as needed.
(1) The data processing system for radio astronomical data-intensive scientific computing of the present invention is designed for the enormous volume of astronomical big data. It consists of a hybrid heterogeneous computing-node system, a high-performance storage system, and a high-speed network system that are physically installed together in a data-constellation architecture, replacing the traditional supercomputing design in which the three systems are independent. Resources can be flexibly allocated according to the requirements of the computing tasks, and a job can be completed by one data constellation or several, satisfying application scenarios such as multiple scientific data processing pipelines, diverse user needs, different computing scales, and distributed tasks. (2) The large memory capacity of a single node solves the problem of processing a single large data file, avoiding or reducing the time cost of data cutting, data movement, and idle waiting. In addition, the large memory allows files that must be read many times to reside in memory for long periods and be accessed by multiple nodes, greatly reducing the time spent on repeated file reads and accelerating the data processing pipeline. (3) The hybrid heterogeneous computing architecture assigns the compute-intensive, memory-intensive, and data-intensive tasks of a pipeline to the corresponding computing equipment, effectively addressing the challenges of complex astronomical data processing flows, numerous data files, and high parallelism, improving whole-cluster efficiency and saving operating costs.
(4) The multi-level hybrid storage system, comprising SSDs and HDDs, guarantees high-performance reading and writing and satisfies a wide range of applications such as high-performance computing, high data I/O, and multi-load tasks. In addition, the distributed storage architecture provides a high-throughput, high-concurrency, and highly scalable storage mechanism, ensuring near-linear performance growth when the data center is scaled up to serve ever-growing numbers of scientific users. (5) The high-speed network system, designed for the hybrid heterogeneous computing nodes with bandwidth up to 200 Gbps, and its topology connect the computing, storage, and management equipment, addressing the data I/O bottleneck that is the most serious problem in data-intensive computing, guaranteeing smooth data exchange within and between nodes, reducing data-flow latency, and reducing the risk of system crashes caused by network traffic.
Drawings
FIG. 1 is a system architecture diagram of a data processing system for radio astronomical data-intensive scientific computing according to one embodiment of the present invention.
FIG. 2 is a workflow diagram of a data processing system for radio astronomical data-intensive scientific computing according to one embodiment of the present invention.
FIG. 3 is a schematic connection diagram of the network system of a data processing system for radio astronomical data-intensive scientific computing according to an embodiment of the present invention.
FIG. 4 is a workflow diagram of a data processing system for radio astronomical data-intensive scientific computing according to one embodiment of the present invention.
FIG. 5 is a workflow diagram of a data processing system for radio astronomical data-intensive scientific computing according to another embodiment of the present invention.
FIG. 6 is a workflow diagram of a data processing system for radio astronomical data-intensive scientific computing according to yet another embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides a data processing system for radio astronomical data-intensive scientific computing, suitable for radio astronomical data. The system can also be applied to big-data processing in other fields of astronomy and in fields such as genomics, biology, and medicine.
Fig. 1 shows a data processing system for radio astronomical data-intensive scientific computing comprising at least one data constellation, according to one embodiment of the present invention. The scale and configuration of a single data constellation can be adjusted to actual requirements, and several data constellations can be connected and combined into a more powerful cluster.
Each data constellation is a scalable, integrated data unit mounted in one cabinet or in adjacent cabinets, composed of a scalable distributed storage system 100, a hybrid heterogeneous computing-node system 200, and an ultra-high-speed, low-latency network system 300.
As shown in fig. 1, "hybrid heterogeneous" refers to a mixture of heterogeneous computing architectures. Each computing node 200 may adopt one of the x86, ARM, and GPU architectures, or a combination of several of them. The combination can be freely customized to suit different types of astronomical data (e.g., continuum visibility data, time-domain data, and spectral-line data), different computing requirements (e.g., data-intensive, compute-intensive, memory-intensive), and the differing requirements of different steps within the same pipeline, thereby better serving the specific tasks (compute-intensive, data-intensive, and memory-intensive) of each stage of data processing and meeting the needs of astronomical data processing in multiple application scenarios. In this embodiment, the hybrid heterogeneous computing-node system 200 includes at least three system architectures: an x86 CPU architecture, an ARM architecture, and an x86 CPU + GPU architecture (GPU architecture for short). The hybrid heterogeneous technique is a mixed combination of heterogeneous computing nodes of various architecture types with traditional CPU computing nodes.
In this embodiment, the x86-architecture computing nodes are built from Intel Xeon Gold 6132 CPU nodes. Each node has two CPUs with a base frequency of 2.6 GHz and 28 computing cores in total, a theoretical peak of 18 teraflops, and is configured with 1 TB of memory and 4 TB of flash storage (SSD), an ideal choice for large-scale data imaging tasks. The ARM-architecture computing nodes consist of 10 Huawei Kunpeng 920 nodes; each has 96 cores at 2.6 GHz, a theoretical peak of 10 teraflops, 1 TB of memory in total, and 600 GB of flash storage, suitable for highly parallel processing of many spectral-line channels. The CPU + GPU computing node is a heterogeneous system of x86 CPUs and GPUs, configured with 36 CPU cores and 8 Nvidia Tesla V100 cards, a theoretical peak of 62.4 teraflops, 1 TB of memory, and 7.7 TB of flash storage, suitable for digital beamforming of time-domain data, source-finding steps in the imaging pipeline, AI-related scientific applications, and the like. When a computing node uses the ARM architecture, its total memory-access bandwidth is 80 GB/s; when it uses the CPU + GPU architecture, its memory-access bandwidth is 2 TB/s.
Each data constellation has a corresponding distributed storage system 100 installed in its cabinet, storing the required raw files. The distributed storage systems 100 of different data constellations are independent of each other and are interconnected by the ultra-high-speed, low-latency network system 300.
Each computing node is physically integrated with its own very-large-capacity memory and flash (SSD) local storage unit, and the storage system 100 is composed of the local storage unit 101 of each computing node together with the distributed file system 102 formed by all storage nodes; each data constellation has an independent distributed file system 102. The local storage units and the distributed file system of the storage system 100 thus cooperate with the very-large-capacity memory on each computing node (a single node's total memory is 1 TB to 2 TB, with total memory divided by core count greater than 32 GB; for example, a node with 32 cores has more than 1 TB of memory) to deliver full data I/O bandwidth, making data interaction between computing nodes efficient and reducing the data-interaction bottleneck between the computing nodes and the storage system 100. Data interaction here refers to the very-high-bandwidth data streams that occur between data constellations, among the computing nodes within a data constellation, and within a single computing node. The local storage unit on each computing node is an NVMe flash device (SSD) used for fast data exchange and caching within the node, while the storage nodes use conventional HDDs. According to the bandwidth requirements of different stages and the characteristics of the data being processed, the distributed file system adopts a fully distributed, fully symmetric architecture with high-speed read/write performance and large-scale expandability.
As shown in fig. 3, the network system 300 consists of two parts: an InfiniBand network (IB switches for short; bandwidth 100 Gbps, reaching 200 Gbps between some nodes) connecting all computing nodes and storage nodes, and an Ethernet (maximum bandwidth 10 Gbps) connecting to data centers on other continents, which comprises network switches connected to all computing nodes, storage nodes, background storage nodes, and management nodes, together with a number of user login nodes connected to the management nodes via the Internet.
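The two bandwidth tiers make the data-movement cost concrete. An idealized transfer-time calculation (an illustrative sketch that ignores protocol overhead and latency; the function name is invented):

```python
def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Idealized time to move size_gb gigabytes over a bandwidth_gbps link
    (1 GB = 8 gigabits; protocol overhead and latency ignored)."""
    return size_gb * 8 / bandwidth_gbps

# A 100 GB raw file over the 100 Gbps IB fabric takes on the order of seconds,
t_ib = transfer_seconds(100, 100)   # 8.0 s
# while the same file over the 10 Gbps Ethernet takes ten times as long.
t_eth = transfer_seconds(100, 10)   # 80.0 s
```

This is why the fast IB fabric carries the node-to-node and node-to-storage traffic, while the Ethernet is reserved for management and long-haul exchange with other data centers.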
The computing nodes and storage nodes resemble the individual stars of a constellation: any two of them are connected through the network system 300.
The workflow of the data processing system for radio astronomical data-intensive scientific computing of the present invention, and its advantages over a traditional supercomputer, are described in detail below.
In a traditional supercomputer, the computing nodes, the storage system, and the network system are physically separated: for example, one row of cabinets holds computing nodes, another the storage system, and another the network system. This design suits the compute-intensive business model, but for data-intensive workloads it becomes very inefficient: the data volume is enormous, data movement becomes the biggest bottleneck, and while data is not yet in place the computing nodes sit idle, wasting a great deal of operating cost.
The present invention adopts the data constellation, which changes the traditional supercomputing design in which the three systems are physically separated. The idea behind the data-constellation design is to integrate the three systems organically, bringing computation closer to the data and minimizing the cost of data movement. Resources can be flexibly allocated according to the requirements of the computing tasks, and a job can be completed by one data constellation or by several, satisfying application scenarios such as multiple scientific data processing pipelines, diverse user needs, different computing scales, and distributed tasks. With this design, most data processing can be completed within a single data constellation, avoiding the massive data exchange between independent computing nodes and independent external storage nodes of a traditional supercomputing system; this greatly reduces energy consumption (saving roughly one third to one half of the data-operation cost compared with a traditional supercomputing platform), saves some network equipment, and lowers the overall construction cost of the data center.
Key programs in the radio astronomical data processing pipeline must operate on data held in memory, so memory capacity determines the performance of the data processing flow. Data read into the system includes, but is not limited to, raw data from SKA telescope observations, typically in the standard astronomical FITS (Flexible Image Transport System) format or MS (Measurement Set) format, although other astronomical data formats may be used. A single raw data file from an SKA precursor telescope typically has a size of tens of GB (1 GB = 1024 MB) or even tens of TB (1 TB = 1024 GB).
A computing node of a traditional compute-oriented supercomputing center generally has only 64 GB or 128 GB of memory, so the system obviously cannot read such observation data in at once. The single-core serial processing procedures customary in radio astronomy are likewise unsuitable for such volumes of observation data, especially raw data with up to 65000 frequency channels, as produced by an SKA telescope: the data of each channel is processed independently, the results are combined and written to a file, and the visibility data is then converted into images by inverse Fourier transform. This process performs a large amount of data interaction in memory. In traditional supercomputing, data processing on a single computing node typically uses shared memory, while processing across multiple computing nodes uses distributed memory. However, only a limited amount can be read in at a time; when a single data file exceeds a certain size, the computation cannot be completed in one pass, and in practice astronomers resort to data cutting: the data is first cut into several blocks in time order, and each block is then read in and processed in turn. Experience shows that cutting the data and loading the slices into memory consumes a significant share of the runtime; the fewer the slices, the lower the total time. Nor can the problem be solved simply by using distributed memory and deploying the computation on many nodes in parallel, because distributing the task over many computing nodes means the data must be transmitted or copied to all of them, costing still more time to move data between nodes.
In summary, traditional computing platforms designed for compute-intensive application scenarios sacrifice data-access time to balance cost and benefit, which is extremely time-consuming and labor-intensive for data-intensive application scenarios in astronomy.
The present invention designs a memory scheme for the characteristics of data-intensive scientific computing such as astronomical observation data. Specifically, a large memory capacity is allocated to every computing node within each data constellation, so that a data file can, as far as possible, be processed in one pass, or the number of data slices minimized. In this embodiment, the total memory of a single computing node is about 1 TB to 2 TB, so radio astronomical data of about 100 GB can be processed in one pass on one computing node, and even data files of several hundred GB need only be cut into a few blocks.
The memory scheme of the present invention lies not only in the large total memory of each computing node but also in the large amount of memory allocatable to each single core of the node's x86/ARM/GPU processors, i.e., total node memory divided by core count greater than 32 GB; for example, a node with 32 cores requires more than 1 TB in total. Big-data processing for modern telescopes is almost always accelerated by parallel processing: each thread is handled by one core of an x86/ARM/GPU processor, and the amount of data a thread can process is limited by the memory allocated to it. Traditional supercomputing platforms, with computation as their main business, pursue as many computing cores as possible and, because their data volumes are small, do not allocate much memory: the cores of a node usually share memory, the standard configuration being 64 GB or 128 GB for nodes with as many as 44 cores (e.g., Intel Xeon Gold 6152), 68 cores (e.g., Intel Xeon Phi 7250), or even 96 cores (e.g., Huawei Kunpeng 920), giving each core an average of as little as 1.88 GB. However, as described above, when running data-intensive tasks the data cannot be sliced too thin (otherwise the total processing time increases), which simply sacrifices the advantage of many computing cores. For example, if a 100 GB file is divided into 5 parts in time order and assigned to such a traditional computing node, only 20 of its 68 cores (at 1 GB of memory per thread) can be used. In other words, although the node has 68 cores, in practice more than half of them sit idle, and the performance of the system is not used efficiently.
To solve this problem, the invention adopts a design that considers both the total memory capacity of a compute node and the average memory per core of that node. According to one embodiment of the invention, Intel Xeon Gold 6132 processors are used in at least one compute node, where the 28 compute cores share 1TB of memory in total, an average of 36.6GB per core. A 100GB file then needs no cutting at all and can use all 28 compute cores (the input data occupies only about 10% of the memory allocated to each core), which is well suited to parallel acceleration of the data flow.
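The sizing argument above can be sketched numerically. The function below is an illustrative model, not part of the patent: it checks whether an input file, split evenly across all cores of a node, fits within each core's memory share while leaving working headroom. The 50% headroom fraction is an assumption made here for illustration.

```python
def fits_across_cores(file_gb, total_mem_gb, n_cores, min_free_frac=0.5):
    """True if an equal slice of the file fits in each core's memory
    share while leaving at least min_free_frac of that share free as
    working space (the headroom fraction is an assumed figure)."""
    mem_per_core = total_mem_gb / n_cores      # e.g. 1024/28 ~= 36.6GB
    slice_gb = file_gb / n_cores               # one slice per core
    return slice_gb <= mem_per_core * (1.0 - min_free_frac)

# Large-memory node of the embodiment: 28 cores sharing 1TB.
assert fits_across_cores(100, 1024, 28)        # 100GB file needs no cutting
# Conventional supercomputing node: 68 cores sharing 128GB.
assert not fits_across_cores(100, 128, 68)     # the same file cannot fit
```

With the large-memory configuration, each core's slice (about 3.6GB) uses only a small fraction of its 36.6GB share, which matches the roughly 10% figure quoted above.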
In short, the large memory capacity of a single compute node solves the problem of processing a single large data file, avoids or reduces the time cost of data cutting, data movement, and idle waiting, and greatly improves the operating efficiency of a data constellation.
One key limiting factor of data-intensive computing is data I/O. Data-intensive computing differs fundamentally from compute-intensive high-performance computing (HPC) in that the former is oriented toward the storage, management, acquisition, and processing of big data, with most of the processing time spent on I/O and on moving and copying data. Parallel processing of data-intensive computing tasks typically dices the data into small blocks, with each unit independently executing the same application; the entire data processing system must therefore be specially designed so that the degree of parallelism can scale as the data volume grows.
FIG. 2 is a workflow diagram of a data processing system for radio astronomical data-intensive scientific operation according to one embodiment of the present invention. As shown in FIG. 2, the data processing method for radio astronomical data-intensive scientific operation includes:
step S0: providing a data processing system for radio astronomical data-intensive scientific operation as described above;
step S1: sending the raw data through an IB switch into the ultra-large-capacity memory of the current compute node to serve as an ultra-large memory cache (Cache);
step S2: the current compute node processes its task and determines whether the resulting data is intermediate data or final data; if it is intermediate data, execution continues with step S3; if it is final data, the final data is stored in the ultra-large-capacity memory or flash-type local storage unit of the compute node, written back through the IB switch to the distributed file system of the storage nodes for storage, and the process ends;
step S3: the current compute node stores the intermediate data, according to the storage requirement, in the ultra-large-capacity memory or the flash-type local storage unit, serving as an ultra-large memory cache (Cache) or a flash cache (SSD Cache);
According to the different data-I/O requirements of the different stages of a data processing pipeline, the storage system uses the ultra-large-capacity memory and the flash-type local storage unit as a multi-level hybrid storage medium. It offers security, high-speed reads, and fast recoverable reconstruction, realizes the storage, management, retrieval, and reuse of data over its whole life cycle, and meets application requirements such as high-performance computing, high data I/O, and multi-load tasks.
step S4: a compute node of a different architecture type than the current compute node reads the ultra-large memory cache or SSD cache of the current compute node through the IB switch to exchange intermediate data between compute nodes, then serves as the new current compute node, and execution returns to step S2.
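Steps S1 to S4 amount to a loop in which data moves between heterogeneous compute nodes through memory or SSD caches until final data is produced. The sketch below is a hypothetical illustration of that control flow; the stage functions, architecture labels, and cache-tier names are placeholders, not the patent's actual software interfaces.

```python
def run_pipeline(raw_data, stages):
    """stages: list of (architecture, process_fn, cache_tier) tuples.
    Each stage plays the role of the 'current compute node' (S2); its
    output is cached in the given tier, 'memory' or 'ssd' (S3), and
    read by the next, differently architected node (S4)."""
    data = raw_data                     # S1: raw data lands in memory cache
    hops = []
    for arch, process, tier in stages:
        data = process(data)            # S2: process the task on this node
        hops.append((arch, tier))       # S3: record where the result is cached
    return data, hops                   # final data written back to storage

# Toy three-stage flow, loosely modeled on the heterogeneous handoffs above.
stages = [
    ("ARM",     lambda d: d + ["calibrated"], "ssd"),
    ("x86",     lambda d: d + ["imaged"],     "memory"),
    ("CPU+GPU", lambda d: d + ["mosaicked"],  "ssd"),
]
final, hops = run_pipeline(["raw"], stages)
assert final == ["raw", "calibrated", "imaged", "mosaicked"]
```

The point of the design is visible even in this toy form: intermediate results never touch the distributed file system; only the final product is written back.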
In this data processing system, the I/O constraint is addressed through the jointly optimized design of the ultra-large-capacity memory of the compute nodes, the high-performance storage system, and the high-speed, low-latency network system.
1) First, the ultra-large-capacity memory of 1TB to 2TB is the component accessed and interacted with most frequently; on that basis, memory-access bandwidth is the primary criterion when selecting compute nodes. An ARM-architecture compute node, for example, integrates 8-channel DDR4 and a PCIe-attached 100G Ethernet card, with a total memory-access bandwidth of up to 80GB/s. For a CPU + GPU architecture compute node, the memory-access bandwidth can be extended to 2TB/s. Through this design of ultra-large-capacity memory and high-performance (i.e., high memory-access bandwidth) processors, the memory bandwidth is increased by 46% and the total I/O bandwidth by 66% compared with a commercial server not optimized for data-I/O constraints.
2) Second, the storage system adopts high-performance distributed storage, using a hybrid storage medium of SSDs in the local storage units of the compute nodes and HDDs in the storage nodes, balancing performance against cost, and builds a shared distributed cache (Cache) resource pool used in common by all business systems. A layered read-cache mechanism (memory cache as the first layer, SSD cache as the second) shortens data access time and keeps the average latency of typical 4K reads and writes at about 1ms. By combining a DHT (Distributed Hash Table) algorithm with high-performance hardware (in an all-NVMe-SSD configuration), the invention shortens data read/write times and raises the maximum number of concurrent users from 400 to 1000, fully meeting the broad application requirements of high-performance computing, high data I/O, and multi-load tasks.
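As a hedged illustration of how a DHT maps data onto storage nodes, the minimal consistent-hashing sketch below places a block key on a hash ring of node names. The hashing scheme and node names are generic illustrations, not the patent's specific algorithm.

```python
import hashlib

def _h(s):
    """Position a string on the hash ring (MD5 used only for spread)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def dht_locate(block_key, nodes):
    """Return the storage node owning block_key: the first node
    clockwise from the key's position on the hash ring."""
    ring = sorted((_h(n), n) for n in nodes)
    key = _h(block_key)
    for point, node in ring:
        if key <= point:
            return node
    return ring[0][1]                   # wrap around the ring

nodes = ["store-01", "store-02", "store-03"]
owner = dht_locate("visibility-block-42", nodes)
assert owner in nodes
# Consistent hashing remaps only the keys of a removed node, so a
# block's placement survives the removal of any *other* node.
others = [n for n in nodes if n != owner]
assert dht_locate("visibility-block-42", [owner] + others[:1]) == owner
```

This locality under membership change is what lets the distributed cache pool grow or shrink without rebalancing most of the stored data.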
3) Finally, as shown in FIG. 3, the invention designs a high-speed network system of up to 200Gbps for the hybrid heterogeneous compute node system, the distributed storage system, and their topology. High-throughput, low-latency IB switches form the internal interconnect for compute nodes, storage nodes, and management equipment, addressing the most serious bottleneck in data-intensive computing, namely data transmission; this keeps data exchange smooth both within and between compute nodes, reduces data-flow latency, and reduces the risk of system crashes caused by network traffic. Inside a compute node, components are connected through internal interfaces, for example an M.2 interface for NVMe storage. The internal data-exchange bandwidth of a node is then bounded only by the bandwidth of the NVMe SSD cache, further improving I/O performance substantially. At present, the measured maximum I/O throughput of a single node is 7.4GB/s against a theoretical peak of 8.5GB/s, an I/O utilization as high as 94%, exceeding the usual standard of high-performance computing servers. In addition, the whole system connects directly to other SKA data centers around the world through an intercontinental Ethernet link (up to 10Gbps) of the highest grade in Chinese scientific research, supporting the transmission, computation, and management of astronomical big data to the greatest extent.
The core difference between the storage system 100 of the present invention and conventional SAN-based storage lies in expansion capability. Traditional SAN storage expands by stacking controllers: two or more controllers, up to dozens, are stacked, and capacity and performance are improved by vertically stacking disk shelves behind the controllers (scale-up). After a certain scale, however, a bottleneck appears: adding hard disks still increases total capacity, but the controller architecture prevents performance from growing linearly any further. In the invention, the distributed file system of the storage system 100 adopts a fully distributed architecture supporting expansion to 288 storage nodes (hundreds of PB of storage capacity), with high-speed read/write performance growing linearly with the number of nodes, fully meeting the ever-growing data requirements of radio astronomical telescopes (such as the Square Kilometre Array). The distributed storage architecture provides a high-throughput, highly concurrent, and highly scalable storage mechanism that ensures storage performance grows nearly linearly as the system scales out. In addition, a storage system with a distributed architecture ensures both high scalability and the safety and reliability of the data. Because the data processing system records astronomical data continuously, and that data must not only be processed promptly but also remain in the file system for ongoing reuse and data mining over a period of time, the storage devices must combine excellent performance with extremely high reliability, around-the-clock accessibility, and a degree of fault tolerance.
The distributed file system of the storage system also adopts a fully symmetric distributed architecture; it not only provides a single file system of ultra-large capacity but also provides mechanisms for redundant backup and rapid data reconstruction, maximizing reliability. When data is stored, the same file is scattered and sliced across different hard disks on different storage nodes, and one or more redundant copies are generated for the sliced data to ensure reliability. Users can flexibly configure the redundancy according to the importance and performance requirements of the data. In case of a failure, the system automatically reconstructs the data from the redundant backups, reaching a speed of reconstructing 2TB of data within 1 hour, without affecting users at all.
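The slice-and-replicate idea can be illustrated with a toy placement policy (assumed here for illustration; the patent does not disclose its exact algorithm): each slice gets a primary node plus replicas on distinct nodes, so losing any single node leaves every slice readable and reconstructible.

```python
def place_slices(n_slices, n_replicas, nodes):
    """Round-robin placement sketch: slice s lives on nodes[s % N]
    plus n_replicas further nodes, each guaranteed distinct because
    the node count exceeds the replica count."""
    assert len(nodes) > n_replicas
    return {s: [nodes[(s + r) % len(nodes)] for r in range(n_replicas + 1)]
            for s in range(n_slices)}

def readable_after_failure(placement, failed_node):
    """Every slice still has a surviving copy after one node fails."""
    return all(any(n != failed_node for n in owners)
               for owners in placement.values())

plan = place_slices(n_slices=6, n_replicas=1, nodes=["n1", "n2", "n3"])
assert all(len(set(owners)) == 2 for owners in plan.values())
assert all(readable_after_failure(plan, f) for f in ["n1", "n2", "n3"])
```

Real systems weight placement by disk load and failure domains, but the invariant checked by `readable_after_failure` is the same one that lets the system rebuild data from redundant copies after a node failure.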
The workflow of the data processing system for radio astronomical data-intensive scientific operation according to one embodiment of the invention, namely the processing flow for the GLEAM-X continuum survey data of an SKA pilot telescope, is shown in FIG. 4; this flow uses all three architecture types of compute nodes as required. The total data volume of the project is large (2PB), and file transfer accounts for a large fraction of the processing time (40%). During processing, each whole-day scan observation occupies 100GB of storage space, including visibility data, images, and metadata. Given the scale of the GLEAM-X survey, the memory cache required for data processing reaches 3TB, and data processing operations run directly in memory, avoiding frequent data movement.
Specifically, in step S1, the raw data is sent through the IB switch into the ultra-large-capacity memory of several ARM-architecture compute nodes to serve as an ultra-large memory cache (Cache). In steps S2 and S3, the current compute node completes the compute-intensive tasks of the flow using the ultra-large memory cache and stores the resulting intermediate data in the flash-type local storage unit as a flash cache (SSD Cache). In step S4, several x86-architecture compute nodes, of a different architecture type than the current compute node, read the SSD cache of the current compute node through the IB switch to exchange intermediate data between compute nodes; they then become the new current compute nodes, return to step S2 to complete the data-intensive and memory-intensive tasks, and store the resulting intermediate data in the ultra-large-capacity memory as the ultra-large memory cache. Finally, one or more CPU + GPU architecture compute nodes, of a different architecture type than the current compute node, read the ultra-large memory cache of the current compute node through the IB switch to exchange intermediate data; as the new current compute nodes, they complete the extremely compute-intensive task to obtain the final data, store it in the flash-type local storage unit of the compute node, and write it back through the IB switch to the distributed file system of the storage nodes for storage.
Prototype-platform tests with 10 ARM-architecture compute nodes, 23 x86-architecture compute nodes, and 2 CPU + GPU architecture compute nodes verified that, when imaging the massive continuum data generated by the SKA pilot telescope, the performance of the hybrid heterogeneous compute node system improves by a factor of 2.23 over a traditional platform.
FIG. 5 shows the workflow of a data processing system for radio astronomical data-intensive scientific operation according to another embodiment of the present invention, namely an SKA pilot telescope spectral-line data imaging flow, which mainly uses 7 x86-architecture compute nodes to process and image the data of 7 frequency channels in parallel; this is a typical distributed processing task in astronomical scientific computing and a data-intensive computing task. Specifically, in step S1, the raw data is sent through the IB switch into the ultra-large-capacity memory of the 7 x86-architecture compute nodes to serve as an ultra-large memory cache (Cache). Subsequently, the current compute node completes the compute-intensive tasks of the flow using the ultra-large memory cache; the resulting final data is stored in the flash-type local storage unit as a flash cache (SSD Cache) and written back through the IB switch to the distributed file system of the storage nodes for storage.
FIG. 6 shows the workflow of a data processing system for radio astronomical data-intensive scientific operation according to yet another embodiment of the present invention, namely the MWA pulsar-search pipeline of an SKA pilot telescope. The MWA (Murchison Widefield Array, a wide-field radio telescope array in Australia) pulsar-search project is a typical astronomical time-domain data processing flow: the raw data is small (TB scale), while the intermediate data is large with a huge number of files (PB scale, millions of files) and does not need to be stored (it is too large and would consume too much storage); exchanging this data through distributed storage would create a serious I/O bottleneck.
The workflow of the data processing system for this radio astronomical data-intensive scientific operation is as follows. The raw data is sent through the IB switch into the ultra-large-capacity memory of several CPU + GPU architecture compute nodes to serve as an ultra-large memory cache (Cache). The current compute node uses the ultra-large memory cache to perform digital beamforming on the raw data (digital beamforming is an extremely compute-intensive task); the intermediate data generated is a mass of beam files, with each node producing roughly one to two thousand directories, each directory containing about two hundred files and occupying about 300GB. The intermediate data is cached through the local storage unit of the current compute node as an SSD Cache. Several ARM-architecture compute nodes, of a different architecture type than the current compute node, read the directories of the current compute node's SSD cache in parallel through the IB switch, realizing intermediate-data exchange between compute nodes, and search the directories in parallel to complete the compute-intensive task. The final data obtained from the search is held in the memory cache of the ARM compute nodes; since it is small (TB scale), it is written back through the IB switch to the distributed file system of the storage nodes for storage.
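The parallel per-directory search can be sketched with a thread pool that fans directories out to workers, mirroring how multiple ARM nodes each pull directories from the SSD cache. The directory names and the search function here are placeholders, not the pipeline's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

def search_directories(directories, search_fn, n_workers=8):
    """Apply search_fn to each beam-file directory in parallel and
    collect the results (small, TB-scale in the real pipeline)."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(search_fn, directories))

# Placeholder search: pretend each directory yields one candidate count.
dirs = [f"beam_dir_{i:04d}" for i in range(16)]
results = search_directories(dirs, lambda d: (d, 1))
assert len(results) == 16
```

Because each directory is independent, the degree of parallelism scales directly with the number of reading nodes and workers, which is exactly the property that lets the search avoid the distributed-storage I/O bottleneck described above.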
Therefore, the hybrid heterogeneous computing platform is suited to different scientific application scenarios and to different steps within a single workflow; it is a scheme worth referencing not only in radio astronomy but also in other fields of scientific computing, and it accumulates operational experience for the future large-scale expansion of data centers in the radio astronomy field.
The above embodiments are merely preferred embodiments of the present invention, which are not intended to limit the scope of the present invention, and various changes may be made in the above embodiments of the present invention. All simple and equivalent changes and modifications made according to the claims and the content of the specification of the present application fall within the scope of the claims of the present patent application. The invention has not been described in detail in order to avoid obscuring the invention.

Claims (8)

1. A data processing method for radio astronomical data intensive scientific arithmetic is characterized by comprising the following steps:
step S0: providing a data processing system aiming at radio astronomical data intensive scientific operation; the data processing system for the radio astronomical data intensive scientific operation comprises at least one data constellation, each data constellation is an extensible comprehensive data unit which is arranged on one cabinet or a plurality of adjacent cabinets, and each data constellation consists of an extensible distributed storage system, a mixed heterogeneous computing node system and a network system;
each computing node is physically integrated with a corresponding super-large-capacity memory and a flash memory type local storage unit, and the storage system consists of the local storage unit corresponding to each computing node and a distributed file system consisting of the storage nodes; each data constellation has an independent distributed file system;
step S1: the original data is sent to the super-large-capacity memory of the current computing node through an IB switch to serve as super-large memory cache;
step S2: the current computing node processes the task and judges whether the obtained data is intermediate data or final data; if the data is the intermediate data, continuing to execute the step S3; if the data is the final data, storing the final data to a super-large-capacity memory or a local storage unit of the computing node, writing the final data back to a distributed file system of the storage node through an IB switch for storage, and ending the process;
and step S3: the current computing node stores the obtained intermediate data into a local storage unit of a super-large-capacity memory or flash memory type according to storage requirements to serve as a super-large memory cache or a flash memory cache;
and step S4: and the computing nodes with different architecture types from the current computing node read the super memory cache or the flash memory cache of the current computing node through the IB switch to perform intermediate data interaction between the computing nodes, and then the computing nodes are used as new current computing nodes and return to the step S2.
2. The data processing method for radio astronomical data intensive scientific arithmetic of claim 1, wherein the hybrid heterogeneous compute node system comprises compute nodes of at least an x86 CPU architecture, an ARM architecture and an x86 CPU + GPU architecture.
3. The data processing method for radio astronomical data intensive scientific arithmetic according to claim 2, wherein when the compute node is an ARM architecture, the total access bandwidth is 80GB/s; when the computing node is a CPU + GPU architecture, the access bandwidth of the computing node is 2TB/s.
4. The data processing method for the radio astronomical data intensive scientific arithmetic of claim 1, wherein the total memory capacity of the ultra-large memory of each compute node is 1TB to 2TB, the total memory capacity of the ultra-large memory of each compute node is adjusted correspondingly according to the number of cores of the CPU, and the memory capacity corresponding to each core is not lower than 32GB.
5. The method of claim 4, wherein the total memory capacity of the compute node with 32 cores is at least 1TB.
6. The data processing method for radio astronomical data intensive scientific operations according to claim 1, wherein said local storage unit employs an NVMe SSD, and said storage nodes employ HDDs.
7. The data processing method for radio astronomical data-intensive scientific arithmetic of claim 1, wherein the distributed file system employs a fully distributed architecture and a fully symmetric distributed architecture.
8. The data processing method for radio astronomical data intensive scientific arithmetic of claim 1, wherein the network system comprises a plurality of IB switches connected to all the compute nodes and storage nodes, a network switch connected to all the compute nodes, storage nodes, background storage nodes and management nodes, and a plurality of user login nodes connected to the management nodes through the internet.
CN202210187329.8A 2022-02-28 2022-02-28 Data processing system and method for radio astronomical data intensive scientific operation Active CN114661637B (en)

Publications (2)

Publication Number  Publication Date
CN114661637A (en)   2022-06-24
CN114661637B (en)   2023-03-24
