Cloudmesh AI HPC

Authors: Gregor von Laszewski (laszewski@gmail.com)

Cloudmesh AI HPC is a CLI tool that simplifies access to and management of resources on High Performance Computing (HPC) clusters, with primary support for the University of Virginia (UVA) Rivanna cluster. It wraps Slurm job submission, Apptainer image management, and remote cluster interaction in a single set of commands.

Getting Started

1. Installation

Recommended: Using pipx

For the best experience with CLI tools, use pipx to install cloudmesh-ai-hpc in an isolated environment.

pipx install cloudmesh-ai-hpc
# Or from a local directory:
pipx install .

Using pip

pip install cloudmesh-ai-hpc
# Or from a local directory:
pip install .

2. First Steps

Start by checking your current configuration and available HPC partitions:

cmc hpc info

Set your default host and partition to avoid specifying them in every command:

cmc hpc set-default --host uva --partition a100

Usage Guide

Connectivity & Remote Access

VPN Management

Many HPC resources require a VPN. Cloudmesh AI HPC integrates VPN control directly:

cmc hpc vpn on       # Connect to the HPC VPN
cmc hpc vpn status   # Check if VPN is active
cmc hpc vpn off      # Disconnect

Interactive Login

You can SSH into an interactive node. Use the --ui flag to open a visual partition selector built with Textual:

# Interactive selection via Textual UI
cmc hpc login --ui

The Interactive UI features:

  • Real-time Monitoring: The table automatically updates "Idle Nodes" and "GPU Usage" every 30 seconds.
  • Dynamic GRES Adjustment: Use + and - keys to increase or decrease the requested GPU count directly in the table.
  • Resource Verification: Before final login, the tool verifies actual resource availability and displays a confirmation banner with the exact command to be executed.

Direct login to a specific partition:

cmc hpc login a100

Remote Execution & Editing

Run a command without entering a full shell, or edit a remote file using your preferred editor:

# Run a one-off command
# ($USER inside double quotes is expanded by your local shell before the command is sent)
cmc hpc run "df -h /scratch/$USER"

# Edit a remote file with vim (the default editor is emacs)
cmc hpc edit my_script.py --editor vim

Slurm Job Management

Job Submission Workflow

  1. Generate a template:
     cmc hpc slurm template a100 > my_job.sh
  2. Edit your script and add your logic.
  3. Submit the job:
     cmc hpc slurm submit my_job.sh
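To make the template step more concrete, the following is a minimal Python sketch of how a partition entry could be turned into sbatch directives. It is not the tool's actual template code: the function name render_template, the job name, and the dictionary layout are assumptions made for illustration, and the values are taken from the a100-80gb row of the partition table in the appendix. The real output of cmc hpc slurm template may differ.

```python
def render_template(entry: dict, job_name: str = "my_job") -> str:
    """Render a minimal sbatch script from a partition entry (hypothetical layout)."""
    lines = ["#!/bin/bash", f"#SBATCH --job-name={job_name}"]
    # Emit one directive for each field that is present in the entry.
    for key in ("partition", "account", "gres", "constraint", "reservation"):
        if key in entry:
            lines.append(f"#SBATCH --{key}={entry[key]}")
    lines += ["", "# Add your commands below", 'echo "Running on $(hostname)"']
    return "\n".join(lines) + "\n"


# Values copied from the a100-80gb row of the appendix table.
a100_80gb = {
    "partition": "gpu",
    "account": "bii_dsc_community",
    "gres": "gpu:a100:1",
    "constraint": "a100_80gb",
}
script = render_template(a100_80gb)
print(script)
```

Fields that are missing from the entry (here, reservation) simply produce no directive, which keeps one function usable for every row of the table.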

Advanced Submission

You can override or add Slurm parameters at submission time using the key:val format:

cmc hpc slurm submit my_job.sh --sbatch "time:01:00:00,mem:16G"
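One detail of the key:val format is worth spelling out: values such as 01:00:00 contain colons themselves, so each item can only be split at its first colon. The sketch below shows one way such a string might be mapped to #SBATCH directives. The function name parse_sbatch is hypothetical and the tool's real parser may behave differently.

```python
def parse_sbatch(spec: str) -> list[str]:
    """Turn 'key:val,key:val' into '#SBATCH --key=val' lines (illustrative only)."""
    directives = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split at the FIRST colon only, so 'time:01:00:00' keeps its full value.
        key, sep, value = item.partition(":")
        if not sep or not value:
            raise ValueError(f"expected key:val, got {item!r}")
        directives.append(f"#SBATCH --{key}={value}")
    return directives


print(parse_sbatch("time:01:00:00,mem:16G"))
# → ['#SBATCH --time=01:00:00', '#SBATCH --mem=16G']
```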

Monitoring & Maintenance

cmc hpc slurm list            # List all your active jobs
cmc hpc slurm status 123456   # Quick status of a specific job
cmc hpc slurm job-info 123456 # Detailed scontrol output
cmc hpc slurm wait 123456     # Block until job completes
cmc hpc slurm logs 123456     # Read output logs
cmc hpc slurm logs 123456 --tail # Tail output logs in real-time
cmc hpc slurm cancel 123456   # Cancel a job
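The wait command blocks until the job finishes. A blocking wait of this kind is usually a polling loop, and the sketch below shows the general pattern under stated assumptions: the function wait_for_job, the get_state callback, and the chosen set of terminal states are illustrative and not taken from the tool's source. The state names themselves are standard Slurm job states.

```python
import time
from typing import Callable

# A subset of Slurm job states after which a job will not run again.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def wait_for_job(
    get_state: Callable[[], str],
    interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() every `interval` seconds until a terminal state appears."""
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(interval)


# Simulated job: pending, then running, then done.
states = iter(["PENDING", "RUNNING", "COMPLETED"])
final = wait_for_job(lambda: next(states), interval=0.0)
print(final)  # → COMPLETED
```

Passing the state lookup in as a callback keeps the loop independent of how the state is obtained, so it can be exercised without a cluster.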

Cluster Monitoring & Reporting

Get high-level insights into cluster health and usage:

# Node information and cluster summary
cmc hpc sinfo --output summary

# Get the current job queue, filtered by a node pattern
cmc hpc squeue --search "node[01-10]"

# Detailed usage reports for users, accounts, or partitions
cmc hpc sreport --stat

# Check GPU usage for a specific node or reservation
cmc hpc slurm gpu-usage a100-node-01
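The pattern node[01-10] in the squeue example resembles Slurm's bracketed hostlist notation. The source does not say whether --search treats the pattern as a hostlist or as a regular expression, so the following is only a sketch of what a hostlist reading would mean. The helper expand_hostlist is hypothetical and handles a single numeric range only.

```python
import re


def expand_hostlist(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'node[01-03]' (illustrative only)."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range present: treat the pattern as a single host name.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, so 01 stays two digits wide
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(int(start), int(end) + 1)]


print(expand_hostlist("node[01-03]"))
# → ['node01', 'node02', 'node03']
```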

Image & Storage Management

Apptainer Images

Build container images directly on the HPC to ensure environment consistency:

cmc hpc image build my_env.def

Storage Checks

Quickly check your disk usage or quota:

cmc hpc storage info /home/user/data
cmc hpc slurm quota

Jupyter Notebooks

Launch a Jupyter server on the cluster:

cmc hpc jupyter --port 8888

Note: This requires an active VPN connection and an SSH tunnel (the command output will provide the exact tunnel string).
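For orientation, an SSH tunnel for a remote Jupyter server typically uses local port forwarding (ssh -L). The sketch below builds such a command string. The user name, login host, and compute node are placeholders, and the function tunnel_command is hypothetical. Always prefer the exact string printed by cmc hpc jupyter.

```python
import shlex


def tunnel_command(user: str, login_host: str, compute_node: str, port: int = 8888) -> str:
    """Build an ssh local port forward to a Jupyter server (illustrative only)."""
    args = [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{port}:{compute_node}:{port}",  # local port -> compute node port
        f"{user}@{login_host}",
    ]
    return shlex.join(args)


# Placeholder names, not real cluster hosts.
print(tunnel_command("alice", "login.example.edu", "node01"))
# → ssh -N -L 8888:node01:8888 alice@login.example.edu
```

With a tunnel like this in place, the notebook would be reachable in a local browser at localhost on the chosen port.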

System Info & Support

cmc hpc config     # View hardware and queue specifications
cmc hpc tutorial   # Open HPC tutorials in browser
cmc hpc ticket     # Open support request form

Configuration & Customization

Cloudmesh AI HPC uses a two-tier configuration system:

  1. Base Config: Packaged partitions.yaml containing standard cluster definitions.
  2. Local Overrides: Located at ~/.cloudmesh/hpc.yaml.
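A two-tier setup like this is commonly implemented as a recursive merge in which local values win over packaged ones. The sketch below illustrates that idea with plain dictionaries. Whether the tool merges nested keys exactly this way is an assumption, and the function deep_merge and the sample data are hypothetical.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` values take precedence over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or new keys: the override replaces the base value.
            merged[key] = value
    return merged


# Stand-in for the packaged base config.
base = {"partition": {"uva": {"a100": {"partition": "gpu", "gres": "gpu:a100:1"}}}}
# Stand-in for the local overrides file.
local = {
    "default": {"host": "uva", "partition": "a100"},
    "partition": {"uva": {"my-custom-partition": {"partition": "gpu", "account": "my_account"}}},
}
config = deep_merge(base, local)
print(sorted(config["partition"]["uva"]))
# → ['a100', 'my-custom-partition']
```

With a recursive merge, adding a custom partition locally does not hide the packaged ones. A shallow merge would replace the whole partition section instead.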

Local Configuration Example

You can define your own default host, partition, and aliases for complex sbatch parameters.

Create ~/.cloudmesh/hpc.yaml:

cloudmesh:
  ai:
    default:
      host: uva
      partition: a100
    # Aliases allow you to use a short name for a set of parameters
    aliases:
      heavy_gpu: "gres:gpu:a100:1,mem:80G,time:24:00:00"
      light_gpu: "gres:gpu:v100:1,mem:16G,time:02:00:00"
    # You can also add custom partitions or override existing ones
    partition:
      uva:
        my-custom-partition:
          partition: gpu
          account: my_account
          gres: gpu:a100:1

Using an alias in a command:

cmc hpc slurm submit my_job.sh --sbatch "heavy_gpu"
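Conceptually, an alias is a lookup that happens before the parameter string is parsed. The sketch below shows that order of operations using the two aliases from the example configuration. The function resolve_sbatch is hypothetical, and the fallback of treating an unknown name as a literal key:val string is an assumption.

```python
ALIASES = {
    "heavy_gpu": "gres:gpu:a100:1,mem:80G,time:24:00:00",
    "light_gpu": "gres:gpu:v100:1,mem:16G,time:02:00:00",
}


def resolve_sbatch(value: str, aliases: dict) -> dict:
    """Expand an alias if one matches, then parse the key:val list (illustrative only)."""
    # Unknown names fall through and are parsed as a literal parameter string.
    spec = aliases.get(value, value)
    params = {}
    for item in spec.split(","):
        # Split at the first colon only, because values may contain colons.
        key, _, val = item.strip().partition(":")
        params[key] = val
    return params


print(resolve_sbatch("heavy_gpu", ALIASES))
# → {'gres': 'gpu:a100:1', 'mem': '80G', 'time': '24:00:00'}
```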

Command Reference

| Command | Description | Key Options |
|---|---|---|
| info | Show config, available hosts, and partitions | [key] |
| config | Show hardware and queue specifications | |
| login | SSH into an interactive node | --ui, --host, --sbatch |
| run | Execute a one-off remote command | "command" |
| slurm template | Generate a .sbatch boilerplate | [key] |
| slurm submit | Upload and submit a Slurm job | --key, --sbatch |
| slurm job-info | Detailed job metadata | <job_id> |
| slurm status | Current job state (R, PD, etc.) | <job_id> |
| slurm list | List all active jobs for current user | |
| slurm wait | Block until job finishes | <job_id>, --interval |
| slurm logs | Read or tail job output logs | <job_id>, --tail |
| slurm quota | Check disk quota | |
| slurm nodes | Check node availability in partition | --partition |
| sinfo | Get node information and cluster summary | --output summary, --search |
| squeue | Get Slurm queue information | --search, --output |
| sreport | Get Slurm usage reports | --start, --end, --stat |
| slurm gpu-usage | Check GPU usage for node/reservation | <target> |
| slurm search-jobs | Find jobs by node regex | <node_regex> |
| slurm cancel | Terminate a Slurm job | <job_id> |
| set-default | Set default host/partition in config | --host, --partition |
| image build | Build Apptainer image from .def file | <file> |
| storage info | Get directory size on HPC | <dir> |
| edit | Edit remote file via SSH | <file>, --editor |
| vpn on/off | Toggle VPN connection | |
| vpn status | Check VPN connectivity | |
| jupyter | Set up Jupyter notebook on cluster | --port |
| tutorial | Open HPC tutorials in browser | [keyword] |
| ticket | Open support request form | |

Debugging & Troubleshooting

Using Debug Mode

Almost every command supports the --debug flag. When enabled, the tool prints the exact SSH/Shell commands it intends to execute without actually running them. This is invaluable for verifying the generated Slurm directives.

cmc hpc slurm submit my_job.sh --debug
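The behaviour described here is a dry-run pattern: build the command, then either print it or execute it. A minimal sketch of that pattern follows. The function run and the sample ssh command (including the host alias uva) are hypothetical and do not show what the tool actually generates.

```python
import shlex
import subprocess


def run(args: list[str], debug: bool = False) -> str:
    """Execute a command, or only print it when debug is set (illustrative only)."""
    command = shlex.join(args)
    if debug:
        # Dry run: show the exact command line and stop before executing it.
        print(f"[debug] would run: {command}")
        return command
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout


run(["ssh", "uva", "sbatch", "my_job.sh"], debug=True)
# prints: [debug] would run: ssh uva sbatch my_job.sh
```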

Common Issues

  • Permission Denied (publickey): Ensure your SSH keys are added to the HPC cluster and your local ssh-agent is running.
  • VPN Connection Failed: Check your network and ensure you have the necessary credentials for the VPN service.
  • Job Pending (PD): Use cmc hpc slurm job-info <id> to see why a job is pending (e.g., waiting for resources or priority).
  • Interactive UI not loading: Ensure your terminal supports TUI applications (Textual).

Appendix: UVA Partition Table

| Key | Partition | Account | GRES / Constraint / Reservation |
|---|---|---|---|
| parallel | parallel | bii_dsc_community | nodes: 2, ntask-per-node: 4 |
| v100 | bii-gpu | bii_dsc_community | gpu:v100:1 |
| a100 | gpu | bii_dsc_community | gpu:a100:1 |
| a100-80gb | gpu | bii_dsc_community | gpu:a100:1, constraint: a100_80gb |
| a100-dgx | bii-gpu | bii_dsc_community | gpu:a100:1, reservation: bi_fox_dgx |
| k80 | gpu | bii_dsc_community | gpu:k80:1 |
| p100 | gpu | bii_dsc_community | gpu:p100:1 |
| a6000 | gpu | bii_dsc_community | gpu:a6000:1 |
| a100-pod | gpu | bii_dsc_community | gpu:a100:1, constraint: gpupod |
| rtx2080 | gpu | bii_dsc_community | gpu:rtx2080:1 |
| rtx3090 | gpu | bii_dsc_community | gpu:rtx3090:1 |