ARCH Terminology

This is a high-level overview of core terms and architecture relevant to cluster users at ARCH.

Hardware Terminology

Node: A single physical server (a “computer in a box”) within a cluster. Each node typically contains CPUs, memory (RAM), and networking components.

Cluster: A group of nodes connected by a high-speed network that work together to run jobs in parallel or concurrently.

CPU (Central Processing Unit): The traditional processing unit of a computer. Modern CPUs have multiple cores, each of which can process independent tasks.

Socket: A physical slot on a motherboard where a CPU is installed. A node may have multiple sockets, each with its own cores and memory access paths.

Core: An individual execution unit within a CPU. Multiple cores can run different tasks at the same time.

NUMA (Non-Uniform Memory Access): A memory architecture where memory is distributed across multiple CPUs (sockets), and each CPU can access its own local memory faster than memory on other CPUs.

Figure: Multicore CPU (NUMA system)
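To see how sockets, cores, and NUMA nodes are laid out on a particular machine, standard Linux tools can help. The sketch below assumes a Linux node with GNU coreutils (nproc) and, optionally, util-linux (lscpu). The output differs from node to node, so none is shown here.

```shell
# Number of CPU cores available to the current process.
nproc

# Socket, core, thread, and NUMA summary, if lscpu is installed.
if command -v lscpu >/dev/null 2>&1; then
  lscpu | grep -E 'Socket|Core|Thread|NUMA' || true
fi
```

Inside a job, nproc usually reports only the cores that were allocated to the job, not every core on the node.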

GPU (Graphics Processing Unit): A specialized processor with many small cores designed for massive parallelism. Commonly used in AI, ML, simulations, and image processing.

Memory (RAM): Temporary storage used by programs while they are running. Jobs request and consume memory on the compute nodes.

Interconnect: The high-speed internal network (e.g., InfiniBand) used for communication between nodes in a cluster.

Software and Scheduling

HPC (High Performance Computing): The use of powerful computing systems to solve complex problems requiring significant processing power and parallelism.

Slurm: An open-source, fault-tolerant, and scalable job scheduler used to allocate resources and schedule jobs. See the Slurm overview.

Job: A user-submitted request to run a script or set of commands on the cluster. A job consists of one or more tasks and is managed by Slurm.

Task: A single unit of computation, typically run on one core. A job may consist of one or many tasks.

Process: A running instance of a program. A process contains one or more threads. A single-threaded process runs on one core at a time, while a multithreaded process can use several cores at once.

Thread: A lightweight execution unit within a process. Threads can run concurrently and share memory.

Job Script: A shell script submitted to Slurm, containing job directives (prefixed with #SBATCH) that define resource needs, walltime, and the job’s commands.
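As a rough illustration of that structure, a minimal job script might look like the sketch below. The job name and resource values are arbitrary examples, not ARCH recommendations, and <partition_name> and <package_name> are placeholders. The lines containing placeholders are disabled so the script also runs as a plain shell script outside of Slurm.

```shell
#!/bin/bash
# Job name shown in the queue (arbitrary example).
#SBATCH --job-name=example
# One task on one node, with 4 CPU cores for that task.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# Memory per node and walltime limit (HH:MM:SS).
#SBATCH --mem=8G
#SBATCH --time=01:00:00
# Placeholder: fill in a real partition and remove one '#' to enable.
##SBATCH --partition=<partition_name>

# Load the software the job needs (placeholder package name).
# module load <package_name>

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 1 so the script also runs outside of Slurm.
cores="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(uname -n) with ${cores} core(s)"
```

A script like this would be submitted with sbatch, for example sbatch job.sh, where job.sh is a hypothetical file name.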

Walltime: The elapsed real ("wall clock") time a job runs. Each job requests a maximum walltime; if the job exceeds that limit, the scheduler terminates it automatically.

Partition: A grouping of nodes with shared characteristics (e.g., GPU-enabled, high-memory). Jobs must be submitted to a specific partition to access those resources.

Interactive Session: A real-time login session on a compute node, often used for debugging, exploratory work, or running graphical tools. Requested using srun or the interact command.

Module System (Lmod): Environment modules used to load and manage software packages. Users list the available software and load a package with:

module avail
module load <package_name>

Data and Storage

File System: The organizational structure for storing and accessing data on a cluster. Common file systems include /home, /data, and /scratch.

Scratch Space: Temporary high-performance storage intended for intermediate data. Files not accessed for 30 days are automatically purged. Not backed up.
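To get a rough idea of which files may be at risk of purging, access times can be checked with find. This is only a sketch: the function name list_stale_files is made up for illustration, the actual purge mechanism may use different criteria, and recorded access times depend on how the file system is mounted.

```shell
# List regular files under a directory whose last access was more than
# 30 days ago. find's -atime +30 counts whole 24-hour periods, so this
# matches files last accessed at least 31 days ago.
list_stale_files() {
  find "$1" -type f -atime +30
}
```

It would be called with a directory of your own, for example list_stale_files /scratch/<your_directory>, where the path is a placeholder.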

Data Space: Longer-term shared group storage for high-value research outputs. Files stored in /data are not automatically deleted, but they are also not backed up.

Quota: A limit on the amount of storage or number of files a user or group can consume on a given file system. View usage with the quotas.py tool.
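When it is unclear which directory is consuming a quota, standard tools can give a quick estimate. The helper names dir_usage and count_files below are made up for illustration. The numbers are approximations and may not match what the quota system reports, since quotas often count directories and other entries as well as regular files.

```shell
# Total size of a directory tree in human-readable units.
dir_usage() {
  du -sh "$1" | cut -f1
}

# Number of regular files under a directory.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}
```

Both functions take a directory path as their only argument, for example count_files "$HOME".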

Throughput vs. Latency:
- Throughput refers to how much data can be moved over time (e.g., MB/s).
- Latency is the time it takes to start a transfer.
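A simplified model shows why the distinction matters: if every file pays the startup latency once, the total time is roughly count × (latency + size / throughput). The sketch below uses made-up numbers (10 ms latency, 100 MB/s throughput) that are not measurements of any ARCH file system, and the function name transfer_time is hypothetical.

```shell
# Estimated seconds to move `count` files of `size_mb` MB each, paying
# `latency_s` seconds of startup per file at `throughput_mbs` MB/s.
# Arguments: count size_mb latency_s throughput_mbs
transfer_time() {
  awk -v n="$1" -v size="$2" -v lat="$3" -v tp="$4" \
    'BEGIN { printf "%.2f\n", n * (lat + size / tp) }'
}

transfer_time 1000 1 0.01 100   # 1000 files of 1 MB each: prints 20.00
transfer_time 1 1000 0.01 100   # one 1000 MB file:        prints 10.01
```

Under these assumptions the same 1000 MB takes about twice as long when split into many small files, because latency is paid once per file.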

Checkpointing: The practice of periodically saving a job’s state so it can be resumed after a failure or timeout. Useful for long-running simulations.
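The idea can be sketched in a few lines of shell. This is a toy illustration, not how any particular application implements checkpointing: the "state" is just a step counter, and the function and file names are hypothetical.

```shell
# Run `total_steps` steps of work, recording progress in a checkpoint
# file so that a rerun continues from the last completed step.
# Arguments: checkpoint_file total_steps
run_with_checkpoint() {
  ckpt_file="$1"
  total_steps="$2"
  step=0

  # Resume from the saved step if a checkpoint already exists.
  if [ -f "$ckpt_file" ]; then
    step="$(cat "$ckpt_file")"
  fi

  while [ "$step" -lt "$total_steps" ]; do
    step=$((step + 1))
    # ... one unit of real work would go here ...

    # Write to a temporary file and rename it, so an interruption
    # never leaves a half-written checkpoint behind.
    echo "$step" > "$ckpt_file.tmp" && mv "$ckpt_file.tmp" "$ckpt_file"
  done

  echo "finished at step $step"
}
```

If a job running this loop is stopped at its walltime limit, resubmitting the same job picks up from the step stored in the checkpoint file instead of starting over.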