Welcome to Cradle Documentation!

Here you will find some useful references for working with the Cradle HPC cluster at UTRGV. If you have any questions regarding the use of Cradle, look under the Working on Cradle sidebar menu.

Requesting Access +

Requesting Access to Work on Cradle

Go to UTRGV IT's Get Access! portal.

Click on "Request Access".

Fill in the information:

  • Category: IT Resource
  • Resource: Cradle HPC
  • Description: Provide project description, project supervisor (must be a faculty member), expected duration of job, expected number of nodes per job, and data storage requirements.

Click on the "User" button (NOT "Super User").

Once this is complete, your request will be submitted and reviewed for approval. Once approved, you will receive a username and password for logging into Cradle.

Login Information +

Logging In

Once you have received your user account information, you can log in to Cradle using your credentials from a terminal/command-line prompt.

To get started, you will first need to download and install a VPN client to connect to the UTRGV network.

Next, use ssh to log in as follows:

  • For all of the commands below, replace username with your own username

$ ssh username@login.cradle.utrgv.edu

From here, you will be prompted for your password:

username@login.cradle.utrgv.edu's password:

Once you enter your password successfully, your command prompt will change to show that you are now working on the cluster:

[username@login001 ~]$

You are now ready to work!
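If you plan to log in often, you can optionally add a host alias to your local SSH configuration. This is a minimal sketch using the standard OpenSSH config file; the alias name cradle is just an example:

```shell
# Append a host alias to your local OpenSSH config (~/.ssh/config).
# Replace "username" with your actual Cradle username.
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host cradle
    HostName login.cradle.utrgv.edu
    User username
EOF

# After this, "ssh cradle" is equivalent to
# "ssh username@login.cradle.utrgv.edu".
```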

Getting Started +

Intro to Working on Cradle

The Cradle cluster currently uses a command-line interface for all user operations; the operating system is Linux-based.

If you are not too familiar with working in the terminal, you can check here for some useful command-line operations on Linux.
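To complement that reference, here is a short tour of the most common commands you will use on the cluster (the file and directory names are just examples):

```shell
# Print the current working directory
pwd

# Create a directory for your work (-p: no error if it already exists)
mkdir -p myProject

# Create a small text file inside it
echo "hello from Cradle" > myProject/notes.txt

# Copy a file, then rename (move) the copy
cp myProject/notes.txt myProject/backup.txt
mv myProject/backup.txt myProject/notes_backup.txt

# Print a file's contents
cat myProject/notes.txt

# List the directory's contents with details
ls -l myProject

# Remove a file
rm myProject/notes_backup.txt
```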


Scheduler

The cluster accepts jobs through a scheduler called SLURM. Instead of running Python scripts directly, the details of the job are submitted to the scheduler and placed into a queue to receive resources from the cluster. As soon as the resources are available, your job will begin and be carried out.

Unlike running scripts interactively, once the job is submitted its output is collected in output files instead of being displayed in the terminal. Additionally, users are not required to remain logged in for jobs to begin, run, and complete. You can simply log back in and check the status at any time.

See the Sample Job Submission Script section for a sample script that submits a simple machine learning job with TensorFlow.

Transferring Files to Cradle +

Transferring Files to/from Cradle and Your Local Directory

To transfer files to and from the cluster, it is recommended to use one of the following methods.

scp Examples:

Transferring a local file to your directory on the cluster

scp localFileInMyDirectory.txt yourUserName@login.cradle.utrgv.edu:~/destinationForYourFile/

Transferring a file from a directory on the cluster to your local folder

scp yourUserName@login.cradle.utrgv.edu:~/destinationForYourFile/localFileInMyDirectory.txt localFileInMyDirectory.txt
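scp copies single files by default; to transfer an entire directory, add the -r (recursive) flag. The directory names below are placeholders:

```shell
# Upload a whole local directory to the cluster
scp -r myLocalDirectory/ yourUserName@login.cradle.utrgv.edu:~/destinationForYourFiles/

# Download a whole directory from the cluster into the current local directory
scp -r yourUserName@login.cradle.utrgv.edu:~/directoryOnTheCluster/ .
```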

Note on working within a directory on the terminal

Note that in both of the cases shown above, localFileInMyDirectory.txt is the name of the file in your current directory in your terminal. Usually you can tell which directory you are in by looking at your terminal prompt:

# This indicates that my current directory is ~/Desktop/test
[username@mycomputer:~/Desktop/test]$

Or by using a command such as pwd on Linux/Mac or cd (with no arguments) on Windows:

# This should print something like
# /home/username/Desktop/test
$ pwd

GitHub Repository Cloning Example:

You can also simply clone a repository that you have uploaded to GitHub with the following:

git clone https://github.com/emenriquez/testTFForSlurm

Sample Job Submission Script +

Sample Script for SLURM Job

The script below will request resources from the cluster and carry out a test job with TensorFlow:

#!/bin/bash
### Sets the job's name.
#SBATCH --job-name=myFirstJob

### Sets the job's output file and path.
#SBATCH --output=myFirstJob.out.%j

### Sets the job's error output file and path.
#SBATCH --error=myFirstJob.err.%j

### Requested number of nodes for this job. Can be a single number or a range.
#SBATCH -N 1

### Requested partition (group of nodes, i.e. compute, bigmem, gpu, etc.) for the resource allocation.
#SBATCH -p kimq

### Requested number of gpus
#SBATCH --gres=gpu:1

### Limit on the total run time of the job allocation.
#SBATCH --time=1:00:00

echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"

echo "Activating TensorFlow-2.6.2 environment"
source /shared/tensorflow-2.6.2/tf_env/bin/activate

echo "Running testTF.py"
python3 ~/testTFForSlurm/testTF.py

echo "Deactivating TensorFlow-2.6.2 environment"
deactivate

echo "Done."

  • You will need the file testTF.py downloaded for this script to work. If your file is located in another directory, you can provide its full path instead.

For example:

Instead of

echo "Running Test Script File"
python3 ~/testTFForSlurm/testTF.py

Your file call will be

echo "Running Test Script File"
python3 /path/to/your/file.py

Submitting a Job +

Submitting a Job to SLURM

Once you have a script prepared with your job, submission is fairly straightforward.

Simply submit the job to the scheduler using the command below, where slurm_sample.sh can be replaced with your job script file name:

sbatch slurm_sample.sh

You may need to specify the path to the script file if you are working in another directory, for example:

sbatch ~/testTFForSlurm/slurm_sample.sh
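If you are unsure which partitions (the -p option in the sample script) are available to you, SLURM's sinfo command lists them; kimq here is the partition used in the sample script:

```shell
# List all partitions, their state, and their nodes
sinfo

# Show only a specific partition
sinfo -p kimq
```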

Checking Job Status

Once you submit your job, there are two simple ways to check its progress. First, you can see your job's position in the scheduler queue:

squeue

This will output all jobs currently in the scheduler, including your job if it is pending or running.
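By default, squeue lists every job on the cluster. Two useful follow-up commands, shown here with a placeholder username and job number:

```shell
# Show only your own jobs
squeue -u username

# Cancel a job you no longer need (replace 813 with your job's number)
scancel 813
```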

NOTE: Once your job completes, it will disappear from this list, so be sure to check your job's log files before resubmitting, in case the job failed or has already completed.

To check the logs (output) of your job, you will go to the filename specified in --output in your slurm job script file. For example, if the line in your script reads:

### Sets the job's output file and path.
#SBATCH --output=myFirstJob.out.%j

Then your output file will appear, with the job number appended, in the directory from which you submitted the job, with the filename myFirstJob.out.813 (in this example, the scheduler's job number is 813; each job is assigned a unique job number).

To check the contents of the logs, you can use cat. For example:

cat myFirstJob.out.813
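For long-running jobs, cat prints the whole file each time; tail is often more convenient. The filename below assumes the example job number 813:

```shell
# Show only the last 20 lines of the log
tail -n 20 myFirstJob.out.813

# Follow the log live as the job writes to it (press Ctrl+C to stop)
tail -f myFirstJob.out.813
```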

Preinstalled Software +

Preinstalled Software

  • TensorFlow 2.6.2
  • PyTorch 1.10.2
  • gcc 11.2.1
  • OpenMPI 4.1.3
  • Intel OneAPI
  • CUDA 11.6
  • cuDNN 8

Preconfigured Python Environments

The following Python virtual environments are available to be used (or copied and then modified) for easy access on the cluster.

  • TensorFlow 2.6.2
    Located in /shared/tensorflow-2.6.2/tf_env/
  • PyTorch 1.10.2
    Located in /shared/pytorch-1.10.2/pytorch_env

There are also gcc compilers available for C/C++ applications:

  • gcc 11.2.1
    Located in /shared/openmpi/gcc-11.2.1
  • gcc 8.5.0
    Located in /shared/openmpi/gcc-8.5.0

Loading Environments/modules for your job:

To load a module for your job, activate the virtual environment by including the following lines in your SLURM script before your code is executed. An example is shown below for the TensorFlow environment, but it can be replaced with any of the available environments in the /shared/ directory:

### excerpt SLURM Script (options not shown)

echo "Activating TensorFlow-2.6.2 environment"
# This activates your virtual environment or module
source /shared/tensorflow-2.6.2/tf_env/bin/activate

echo "Running testTF.py"
# Then you can run your code
python3 ~/testTFForSlurm/testTF.py

echo "Deactivating TensorFlow-2.6.2 environment"
# Once your code is done, deactivate your environment
deactivate

Configure Virtual Environment +

Add to an Existing Shared Environment

First, install the virtualenv-clone tool for your user account:

pip3 install --user virtualenv-clone

Next, clone the shared TensorFlow environment (it is preconfigured for the cluster and runs much faster than a default installation):

virtualenv-clone /shared/tensorflow-2.6.2/tf_env/ ~/myEnvs/tf_env

This may take a couple of minutes. Once it completes, you have a copy of the original environment in your user account.


Add new modules to your environment:

Now you can add new packages by activating your environment and then installing anything needed inside it.

Example: to install pandas in the copied tf_env, first activate the environment:

source ~/myEnvs/tf_env/bin/activate

Next, install the package:

pip3 install pandas
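With the environment still active, you can verify the installation before using it in a job; this sketch simply imports the package and prints its version:

```shell
# Confirm pandas is importable from the active environment
python3 -c "import pandas; print(pandas.__version__)"

# Or list all packages installed in the environment
pip3 list
```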

Make sure to activate your new environment in the job script


Change the following lines in your job submission script:

echo "Activating TensorFlow-2.6.2 environment"
source /shared/tensorflow-2.6.2/tf_env/bin/activate

To point to your new environment:

echo "Activating My custom environment"
source ~/myEnvs/tf_env/bin/activate