ARCH Terminology

This is a high-level overview of core terms and architecture relevant to cluster users at ARCH.

Hardware Terminology 

Node: A single physical server (a “computer in a box”) within a cluster. Each node typically contains CPUs, memory (RAM), and networking components.

Cluster: A group of nodes connected by a high-speed network that work together to run jobs in parallel or concurrently.

CPU (Central Processing Unit): The traditional processing unit of a computer. Modern CPUs have multiple cores, each of which can process independent tasks.

Socket: A physical slot on a motherboard where a CPU is installed. A node may have multiple sockets, each with its own cores and memory access paths.

Core: An individual execution unit within a CPU. Multiple cores can run different tasks at the same time.

NUMA (Non-Uniform Memory Access): A memory architecture where memory is distributed across multiple CPUs (sockets), and each CPU can access its own local memory faster than memory on other CPUs.

GPU (Graphics Processing Unit): A specialized processor with many small cores designed for massive parallelism. Commonly used in AI, ML, simulations, and image processing.

Memory (RAM): Temporary storage used by programs while they are running. Jobs request and consume memory on the compute nodes.

Interconnect: The high-speed internal network (e.g., InfiniBand) used for communication between nodes in a cluster.

Software and Scheduling 

HPC (High Performance Computing): The use of powerful computing systems to solve complex problems requiring significant processing power and parallelism.

Slurm: An open-source, fault-tolerant, and scalable job scheduler used to allocate resources and schedule jobs. See the Slurm overview.

Job: A job is a user-submitted request to run a script or set of commands on the cluster. Jobs may include one or more tasks and are managed by Slurm.

Task: A single unit of computation, typically run on one core. A job may consist of one or many tasks.

Process: A running instance of a program. A process may include one or more threads and executes independently on a core.

Thread: A lightweight execution unit within a process. Threads can run concurrently and share memory.

Job Script: A shell script submitted to Slurm, containing job directives (prefixed with #SBATCH) that define resource needs, walltime, and the job’s commands.

Walltime: The maximum amount of time a job is allowed to run. If a job exceeds its walltime, it will be automatically terminated by the scheduler.

Partition: A grouping of nodes with shared characteristics (e.g., GPU-enabled, high-memory). Jobs must be submitted to a specific partition to access those resources.

Interactive Session: A real-time login session on a compute node, often used for debugging, exploratory work, or running graphical tools. Requested using srun or the interact command.

Module System (Lmod): Environment modules used to load and manage software packages. Users load software with:

module avail
module load <package_name>

Data and Storage 

File System: The organizational structure for storing and accessing data on a cluster. Common file systems include /home, /data, /scratch.`

Scratch Space: Temporary high-performance storage intended for intermediate data. Files not accessed for 30 days are automatically purged. Not backed up.

Data Space: Longer-term shared group storage for high-value research outputs. Files stored in /data are not automatically deleted but also not backed up.

Quota: A limit on the amount of storage or number of files a user or group can consume on a given filesystem. View usage with the quotas.py tool.

Throughput vs. Latency: - Throughput refers to how much data can be moved over time (e.g., MB/s) - Latency is the time it takes to start a transfer

Checkpointing: The practice of periodically saving a job’s state so it can be resumed after a failure or timeout. Useful for long-running simulations.

ARCH Terminology

Hardware Terminology

Software and Scheduling

Data and Storage

Hardware Terminology 

Software and Scheduling 

Data and Storage 