ARCH Terminology
################

This is a high-level overview of core terms and architecture relevant to cluster users at ARCH.

.. contents::
   :local:
   :depth: 1

Hardware Terminology
********************

**Node:** A single physical server (a “computer in a box”) within a cluster. Each node typically contains CPUs, memory (RAM), and networking components.

**Cluster:** A group of nodes connected by a high-speed network that work together to run jobs in parallel or concurrently.

**CPU (Central Processing Unit):** The traditional processing unit of a computer. Modern CPUs have multiple cores, each of which can process independent tasks.

**Socket:** A physical slot on a motherboard where a CPU is installed. A node may have multiple sockets, each with its own cores and memory access paths.

**Core:** An individual execution unit within a CPU. Multiple cores can run different tasks at the same time.

**NUMA (Non-Uniform Memory Access):** A memory architecture in which memory is distributed across multiple CPUs (sockets), and each CPU can access its own local memory faster than memory attached to other CPUs.

.. image:: ../images/multi_core.png
   :width: 800
   :alt: Multicore CPU (NUMA system)

**GPU (Graphics Processing Unit):** A specialized processor with many small cores designed for massive parallelism. Commonly used in AI, ML, simulations, and image processing.

**Memory (RAM):** Temporary storage used by programs while they are running. Jobs request and consume memory on the compute nodes.

**Interconnect:** The high-speed internal network (e.g., InfiniBand) used for communication between nodes in a cluster.

Software and Scheduling
***********************

**HPC (High Performance Computing):** The use of powerful computing systems to solve complex problems requiring significant processing power and parallelism.

**Slurm:** An open-source, fault-tolerant, and scalable job scheduler used to allocate resources and schedule jobs. See the Slurm overview.

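
The hardware terms defined earlier (socket, core, NUMA) correspond to values that standard Linux tools report. The following is a minimal sketch, assuming a Linux node with GNU coreutils (`nproc`) and, optionally, util-linux (`lscpu`). The exact output fields vary between systems, and the variable name ``cores_available`` is illustrative only.

.. code-block:: bash

   #!/usr/bin/env bash
   # Number of processing units available to the current process.
   cores_available=$(nproc)
   echo "Processing units available: ${cores_available}"

   # Socket, core, thread, and NUMA layout, if lscpu is installed.
   if command -v lscpu >/dev/null 2>&1; then
       lscpu | grep -E 'Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|NUMA node\(s\)' || true
   fi

On a login node these values describe the login node itself; compute nodes may have a different layout.
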
**Job:** A user-submitted request to run a script or set of commands on the cluster. Jobs may include one or more tasks and are managed by Slurm.

**Task:** A single unit of computation, typically run on one core. A job may consist of one or many tasks.

**Process:** A running instance of a program. A process has its own memory and contains one or more threads, which the operating system schedules onto cores.

**Thread:** A lightweight execution unit within a process. Threads in the same process can run concurrently and share memory.

**Job Script:** A shell script submitted to Slurm, containing job directives (prefixed with `#SBATCH`) that define resource needs and walltime, followed by the job’s commands.

**Walltime:** The maximum amount of time a job is allowed to run. If a job exceeds its walltime, the scheduler terminates it automatically.

**Partition:** A grouping of nodes with shared characteristics (e.g., GPU-enabled, high-memory). Jobs must be submitted to a specific partition to access those resources.

**Interactive Session:** A real-time login session on a compute node, often used for debugging, exploratory work, or running graphical tools. Requested using `srun` or the `interact` command.

**Module System (Lmod):** Environment modules used to load and manage software packages. Users list and load software with:

.. code-block:: console

   module avail
   module load

Data and Storage
****************

**File System:** The organizational structure for storing and accessing data on a cluster. Common file systems include `/home`, `/data`, and `/scratch`.

**Scratch Space:** Temporary high-performance storage intended for intermediate data. Files not accessed for 30 days are automatically purged. Scratch space is not backed up.

**Data Space:** Longer-term shared group storage for high-value research outputs. Files stored in `/data` are not automatically deleted, but they are also not backed up.

**Quota:** A limit on the amount of storage or number of files a user or group can consume on a given file system. View usage with the `quotas.py` tool.

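
Because scratch files are purged after 30 days without access, it can help to see which files are approaching that limit. The following is a rough sketch, not the purge tool itself: the function name ``list_stale_files`` and the example path are hypothetical, and it assumes the purge is based on the access time that `find` reports.

.. code-block:: bash

   #!/usr/bin/env bash
   # list_stale_files DIR [DAYS]
   # Print regular files under DIR whose last access is more than DAYS days old.
   # DAYS defaults to 30.
   list_stale_files() {
       local dir=$1
       local days=${2:-30}
       find "$dir" -type f -atime +"$days" -print
   }

   # Hypothetical usage; replace the path with your own scratch directory.
   # list_stale_files "/scratch/$USER" 30

Note that `find` counts whole 24-hour periods, so ``-atime +30`` matches files last accessed at least 31 days ago. Treat the result as an approximation of what the purge would select.
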
**Throughput vs. Latency:**

- **Throughput** refers to how much data can be moved over time (e.g., MB/s).
- **Latency** is the time it takes to start a transfer.

**Checkpointing:** The practice of periodically saving a job’s state so it can be resumed after a failure or timeout. Useful for long-running simulations.
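
The checkpointing idea can be illustrated with a small shell sketch. It is a toy example under stated assumptions: the function name ``run_with_checkpoint`` is hypothetical, the "state" is only a step counter, and the work done in each step is a placeholder. Real applications usually save much richer state through their own checkpoint mechanisms.

.. code-block:: bash

   #!/usr/bin/env bash
   # run_with_checkpoint CHECKPOINT_FILE TOTAL_STEPS
   # Resume from the step stored in CHECKPOINT_FILE (if it exists), record
   # progress after every step, and print the last completed step.
   run_with_checkpoint() {
       local checkpoint=$1
       local total_steps=$2
       local step=0

       # Resume from the last completed step if a checkpoint exists.
       if [ -f "$checkpoint" ]; then
           step=$(cat "$checkpoint")
       fi

       while [ "$step" -lt "$total_steps" ]; do
           step=$((step + 1))
           # Placeholder for one unit of real work.
           : "work for step $step"
           # Write to a temporary file, then rename it, so an interrupted
           # write does not leave a truncated checkpoint behind.
           printf '%s\n' "$step" > "${checkpoint}.tmp"
           mv "${checkpoint}.tmp" "$checkpoint"
       done
       echo "$step"
   }

If the job is terminated at its walltime, resubmitting it with the same checkpoint file continues from the last recorded step instead of starting over.
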