Grace Supercomputer

TAMU supercomputers

What are supercomputers?

Supercomputers are highly advanced, high-performance computers designed to handle complex calculations and data processing tasks at extremely fast speeds. They are typically used for scientific research and simulations, weather forecasting, and solving complex problems in finance, engineering, and medicine.

These types of computers are significantly more powerful than regular computers, with processing speeds that can reach several petaflops. They are also equipped with vast amounts of memory and storage capacity, allowing them to process and store large amounts of data quickly and efficiently.

Typical workloads combine heavy data processing with complex calculations: simulating physical processes, analyzing large data sets, or running time-critical jobs such as real-time weather forecasting and large-scale financial analysis.

Architecturally, supercomputers combine many central processing units grouped into compute nodes. These nodes communicate with one another and solve complex problems by means of parallel processing.

Some examples of problems that can be solved with supercomputers include problems that require large amounts of memory, core count or node count, problems that scale well with more CPU cores or memory, and single-threaded problems with millions of permutations. The performance of such computers is measured in floating-point operations per second (FLOPS).

At the time of Grace's deployment, the world’s fastest supercomputer reached a speed of 442 petaflops; the first supercomputer, the CDC 6600, was built in 1964. Supercomputers generate a great deal of heat and therefore need a powerful cooling system.

They are typically used by large organizations, such as research institutions and government agencies, as well as by businesses that require high-performance computing capabilities. Scientists and engineers also use them to conduct research and simulations in fields such as climate modeling, astrophysics, and biotechnology.

Supercomputers are expensive and require specialized knowledge and expertise to operate and maintain. They are normally kept in secure, temperature-controlled environments and are typically connected to high-speed networks to allow for fast data transfer and collaboration.

Texas A&M University uses supercomputers in a variety of research and academic programs. These computers are advanced and powerful, with processing speeds that can reach several petaflops.

An example of the use of supercomputers at Texas A&M University is in the field of engineering. The university’s Department of Mechanical Engineering uses supercomputers to simulate and analyze complex engineering processes, such as the flow of fluids and the behavior of materials. This allows researchers to design and test new products and processes, improving their efficiency and performance.

In addition to these applications, supercomputers at Texas A&M University are also used in a range of other research areas, including biotechnology, astrophysics, and climate modeling. This allows researchers to conduct complex simulations and analyses, providing valuable insights into these fields.

Overall, the use of supercomputers at Texas A&M University greatly enhances the university’s research and academic programs, allowing researchers and students to tackle complex challenges and advance their fields of study. These powerful computers are an essential tool in driving innovation and advancing knowledge.

Users mainly submit batch jobs to a scheduler system, which governs access to the compute nodes and resources. The queue system that manages these requests should maximize utilization of the supercomputer while remaining fair to all users. Below we can see a cluster diagram.
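The clusters described below use the Slurm scheduler, so a batch job is typically described in a short shell script whose #SBATCH comment lines carry the resource request. A minimal sketch, with illustrative job name and resource values:

```shell
#!/bin/bash
# Minimal sketch of a batch job script for a Slurm-style scheduler
# (job name, time limit, and resource values are illustrative)
#SBATCH --job-name=demo          # name shown in the queue
#SBATCH --ntasks=4               # total number of tasks (processes)
#SBATCH --time=00:10:00          # wall-clock limit
#SBATCH --mem=2G                 # memory per node
#SBATCH --output=demo.%j.out     # stdout file; %j expands to the job ID

# Everything below runs on the allocated compute node(s); the #SBATCH
# lines are plain comments to bash but directives to the scheduler,
# so the script also runs unmodified outside the cluster.
node=$(hostname)
echo "Running on $node with ${SLURM_NTASKS:-4} tasks"
```

Because the directives are comments, the same script can be tested locally before being handed to the scheduler.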


Grace Supercomputer

Built with Dell Technologies, the Grace supercomputer is roughly 20 times faster than the 337-teraflop system it replaces at Texas A&M's High Performance Research Computing (HPRC), enabling more groundbreaking research.

With the Grace supercomputer, Texas A&M researchers will be able to advance high-performance computing, artificial intelligence (AI),
and data science at the university while also preparing a workforce for exascale computing, which is expected to handle orders of magnitude more calculations per second than today's systems.

The integrated computational research platform combines Dell EMC PowerEdge servers with Intel Xeon Scalable processors, NVIDIA A100, T4, and RTX 6000 GPUs, an NVIDIA Mellanox HDR100 InfiniBand network, NVMe-based local storage, and high-performance DDN EXAScaler ES7990X storage.
The Grace cluster comprises 800 regular compute nodes, 100 double-precision NVIDIA A100 GPU compute nodes, eight large-memory (3 TB) compute nodes, eight single-precision NVIDIA T4 GPU compute nodes, nine single-precision NVIDIA RTX 6000 GPU compute nodes, five login nodes, and six management nodes, all powered by Intel Xeon Scalable processors.

The Grace cluster contains an NVIDIA low-latency HDR InfiniBand interconnect and 5.12 petabytes of high-performance DDN storage running the EXAScaler parallel filesystem. “Each regular and GPU compute node is equipped with two 2nd Gen Intel Xeon Scalable 24-core 3.0GHz processors and 384GB DDR4 3200MHz memory, while each of eight large memory nodes has four 2nd Gen Intel Xeon Scalable 20-core 2.5 GHz processors and 3.072 terabytes of DDR4 3200MHz memory.”

Table 1. Details of Compute Nodes

Node type          Total nodes  Processor*  Sockets/node  Cores/node  Memory/node  Accelerator(s)
General 384GB      800          A           2             48          384 GB       none
GPU A100           100          A           2             48          384 GB       2x NVIDIA A100 40GB GPU
GPU RTX 6000       9            A           2             48          384 GB       2x NVIDIA RTX 6000 24GB GPU
GPU T4             8            A           2             48          384 GB       4x NVIDIA T4 16GB GPU
Large Memory 3TB   8            B           4             80          3 TB         none

*Processor A: Intel Xeon 6248R (Cascade Lake), 3.0 GHz, 24-core; Processor B: Intel Xeon 6248 (Cascade Lake), 2.5 GHz, 20-core. All memory is DDR4, 3200 MHz. Every node uses a Mellanox HDR100 InfiniBand interconnect and has 1.6 TB of NVMe local disk (/tmp) plus a 480 GB SSD.

Terra Supercomputer

Terra entered production in 2017 and is located in the Teague Data Center. It has 320 compute nodes: 256 nodes with 64 GB of memory and 48 nodes with 128 GB of memory, each a dual-socket server with two 14-core Intel Xeon ("Broadwell") processors, plus 16 Intel Knights Landing nodes with 96 GB of memory and either 68 or 72 cores per node.

Terra uses the Intel Omni-Path Architecture for its interconnecting fabric. Key features of this Omni-Path fabric include adaptive routing, dispersive routing, traffic flow optimization, packet integrity protection, and dynamic lane scaling.
This Intel x86-64 cluster is managed by Slurm, an open-source, highly scalable cluster-management and job-scheduling system for Linux clusters. Slurm allocates access to compute nodes and memory to users for a specified amount of time and provides a framework for starting, executing, and monitoring work on the allocated nodes.
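In practice, users drive Slurm through a handful of client commands on the login nodes. A sketch of the typical cycle, with a hypothetical job ID, guarded so the snippet also runs on machines without Slurm installed:

```shell
# Typical Slurm workflow on a login node (job ID 123456 is hypothetical).
# The 'command -v' guard lets this sketch run even where Slurm is absent.
if command -v sbatch >/dev/null 2>&1; then
    sbatch myjob.slurm     # submit the batch script; Slurm prints the job ID
    squeue -u "$USER"      # list your pending and running jobs
    scancel 123456         # cancel a job by its ID
    slurm_found=yes
else
    slurm_found=no         # e.g. on a workstation without the client tools
fi
echo "Slurm client tools found: $slurm_found"
```

On the cluster itself, only the three Slurm commands are needed; the guard exists purely so the sketch degrades gracefully elsewhere.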

All nodes in Terra have access to 7.4 PB of storage capacity, provided by a Lenovo DSS-G260 storage appliance. DSS-G offers a high-performance, scalable building-block approach to storage needs.

Terra's nodes run the CentOS 7 Linux distribution, and the cluster uses IBM's General Parallel File System (GPFS). The login nodes are used for small-to-medium code development and processing. Each user gets one hour per login session and can use a maximum of eight cores concurrently.

Users typically submit work through job files, which specify resource requests and contain the commands and scripting to run; the file is then submitted to the batch system. In the batch script, the user can specify the number of tasks, the number of tasks per compute node, and the desired memory per node.
Writing a batch script can be automated with tamubatch. Users can also request data transfers, and the login nodes are suitable for small-to-medium-sized transfers. rsync is the suggested tool because it supports resuming transfers, which helps when the CPU process-time limit interrupts a session.
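The resumable behavior that makes rsync attractive here comes from its --partial flag, which keeps partially transferred files so a re-run picks up where it stopped. A local sketch of the idea, with made-up file names; on the cluster, the destination would be a remote login-node path instead of a local directory:

```shell
# Local sketch of a resumable rsync transfer. On the cluster, 'dst'
# would be a remote path on a login node rather than a local directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/src" "$workdir/dst"
echo "simulation output" > "$workdir/src/run1.dat"

# -a archive mode; --partial keeps partially transferred files so an
# interrupted transfer can resume by re-running the same command
if command -v rsync >/dev/null 2>&1; then
    rsync -a --partial "$workdir/src/" "$workdir/dst/"
else
    cp -r "$workdir/src/." "$workdir/dst/"   # fallback if rsync is absent
fi
cat "$workdir/dst/run1.dat"
```

Re-running the same rsync command after an interruption transfers only what is missing, which is exactly why it suits time-limited login sessions.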

FASTER Supercomputer

FASTER, located in the West Campus Data Center, entered production in 2021. It has 184 compute nodes with an InfiniBand HDR100 interconnect. The login and compute nodes are based on the Intel Xeon 8352Y Ice Lake processor. FASTER uses a PCIe fabric, the common language for all core building blocks of the data center, which enables all core data center resources to live on a common bus.

This is about three times faster than the Terra supercomputer. The global disk provides 5 PB of storage via DDN Lustre appliances. Each node has 2 sockets with 32 cores per socket, for 64 cores per node. There are 200 T4 16GB GPUs, 40 A100 40GB GPUs, 8 A10 24GB GPUs, 4 A30 24GB GPUs, and 8 A40 48GB GPUs that are composable to the compute nodes.

There are three login nodes with a local disk space of 3.84 TB, which users reach over SSH. All nodes run CentOS 7 Linux. FASTER has two transfer nodes that use Globus Connect, which dramatically increases data-transfer speeds over SCP and other transfer tools and automatically suspends transfers when the computer sleeps.

FASTER uses the Lustre parallel file system, which allows clients to communicate directly with the storage servers and separates data and metadata into distinct services. Users have one hour per login session and can use up to eight cores concurrently, which suits small-to-medium code development and processing.

FASTER has two file systems: one for small-to-modest processing tasks such as small software installations, compiling, and editing, which is backed up nightly, and a high-performance one intended to temporarily hold large files. Like Terra, FASTER supports scp, sftp, and rsync on its login nodes, and users can use Globus Connect to transfer large amounts of data.
Users submit job requests, and Slurm manages resources based on the job files. For example, users can specify the desired number of nodes, the number of tasks per node, and the desired memory per node. Jobs on FASTER can take several forms, such as single-node single-core, multi-node, and single- or multi-GPU jobs. FASTER also allows users to save a job's state (a checkpoint), which helps with CPU time limits and other interruptions, so users do not have to restart their jobs from the beginning.
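A single-GPU job on a Slurm-managed cluster like FASTER can be sketched as below; the GRES string, resource values, and job name are assumptions for illustration, so the cluster documentation should be checked for the exact spellings:

```shell
#!/bin/bash
# Sketch of a single-GPU batch job (all values are illustrative)
#SBATCH --job-name=gpu-demo
#SBATCH --ntasks=1               # one task driving the GPU
#SBATCH --cpus-per-task=4        # CPU cores for that task
#SBATCH --mem=32G                # memory per node
#SBATCH --time=02:00:00          # wall-clock limit
#SBATCH --gres=gpu:a100:1        # request one A100 GPU (GRES name assumed)

# The #SBATCH lines are comments to bash but directives to Slurm,
# so this script also runs unmodified outside the scheduler.
msg="GPU job placeholder on $(hostname)"
echo "$msg"
```

Multi-node or multi-GPU variants follow the same pattern by raising --ntasks, adding --nodes, or increasing the GPU count in the --gres request.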