Using RCDS infrastructure for AI research

Logging in

For this workshop we’ll be using the Open OnDemand interface. Use Chrome; Safari doesn’t seem to work currently. Here is the link:

OnDemand

This interface uses the UI campus authentication (Single Sign On), so use your UI credentials.

To access a terminal (command line), click on Clusters in the navigation bar. There are several standalone servers available. If you have a separate RCDS/CRC password, you’ll use that to access the terminals.

Type ‘exit’ and then close the tab when you’re done.

Create a cluster job and submit

For this workshop, we’re going to use GPU resources located on the cluster nodes, so we’ll use the Job Composer. Click on Jobs, then Job Composer.

Select New Job > From Template

Choose ‘Basic Python 3.8.1 with GPU’

On the right, click Create new Job (you can rename the job if you like, but do not change the Cluster)

Click on the ‘mygpuscript.py’ link/button; a new tab with an editor will open. Delete the contents of the file and replace them with the training script below. Click Save, then close the tab.

Click on the Open Editor button in the Submit Script box. In the new editor tab, change the --time parameter to 10 (from 1) and change the name of the job to something you can identify. Optionally, add ‘-j $SLURM_JOBID’ to the python3 line. Save the file and close the tab.
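After the edits, the relevant lines of the submit script should look roughly like this (other template lines omitted; the exact contents of your template may differ):

#SBATCH --time=10

python3 mygpuscript.py -j $SLURM_JOBID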

Submit the job on the previous tab.
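You can also watch the job from a terminal; a quick sketch using standard Slurm commands (assuming they are on your PATH on the standalone servers):

squeue -u $USER                                 # list your pending and running jobs
sacct -j <jobid> --format=JobID,State,Elapsed   # accounting info once it finishes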

Dataset

We’re using a stock dataset from TensorFlow Datasets: IMDB movie reviews.
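If you want to peek at the data before training, here is a short sketch (it works on any machine with tensorflow and tensorflow_datasets installed; note that without a data_dir argument the first call will download the dataset rather than use the shared copy referenced in the script below):

import tensorflow_datasets as tfds

dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
print(info.features)  # a text string and a label (0 = negative, 1 = positive)
for text, label in dataset['train'].take(1):
    print(text.numpy()[:200], label.numpy())  # first 200 bytes of one review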

Python script to train the model

Save this as ‘sentiment.train.py’ or ‘mygpuscript.py’. It follows the TensorFlow text classification example here.

#!/bin/python

import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
import argparse
import os
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; cluster nodes have no display
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser(description="Trains and saves a tensorflow-keras model for sentiment analysis")
parser.add_argument("-j","--jobid",help="the slurm jobid or other unique number",required=False,default="00000")
args = parser.parse_args()

tfds.disable_progress_bar()

dataset, info = tfds.load('imdb_reviews', data_dir='/mnt/lfs2/data/tensorflow_datasets', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

VOCAB_SIZE = 1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))


model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])


model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=3,
                    validation_data=test_dataset,
                    validation_steps=30)

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

model.save_weights("sentiment.ckpt"+args.jobid)

#export a plot of the training

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

plt.savefig(os.getcwd()+"/training_results.png", format='png', dpi=150)

#this saves, but is buggy and can't load the model again:
#model.save('sentiment.'+args.jobid)

exit()

And here’s the Slurm script to submit:

#!/bin/bash
#SBATCH -p gpu-long
#SBATCH --gres=gpu:1
#SBATCH -C ceph
#SBATCH --time=10

cd $SLURM_SUBMIT_DIR

hostname

nvidia-smi -L
source /usr/modules/init/bash

module load python/3.8.1 openmpi/1.10.2 cuda/11.2

START=$(date +%s)

python sentiment.train.py -j $SLURM_JOBID

let RUNTIME=$(date +%s)-$START
echo "Training time: $RUNTIME"

echo "*--done--*"

Inference

To use our trained model, we’ll create a new script that rebuilds the same model and then loads our saved weights.

From the navbar choose Jobs -> Job Composer

Click on your job in the list, then in the lower right corner choose Open Dir, which will open a new tab.

Click the New File button and enter the name ‘inference.py’. The new file will show up in the list; click the three-dots button, then Edit.

Copy/paste all this into the new editor tab that appears:

#!/bin/python

import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
import argparse
import code

parser = argparse.ArgumentParser(description="Loads a saved tensorflow-keras sentiment analysis model and opens an interactive inference session")
parser.add_argument("-j","--jobid",help="the slurm jobid or other unique number",required=False,default="00000")
args = parser.parse_args()

tfds.disable_progress_bar()

dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

VOCAB_SIZE = 1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))


model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])


model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])


model.load_weights("sentiment.ckpt"+args.jobid)

print("#--------------------------------------------------------------------#")
print("\nUse the infer function to analyze some text.  For example:\ninfer('this is the text to analyze sentiment in',model) \n negative numbers indicate negative sentiment, positive numbers positive sentiment\n")
def infer(thetext, mdl):
    predicts = mdl.predict(np.array([thetext]))
    print("sentiment: "+str(predicts[0]))


code.interact(local=locals())

exit()

Save the file and close the tab.

Now, back in the Open Dir tab, click the dropdown next to Open in Terminal and pick one of the standalone servers. A new tab will open. Type ‘ls’ to see the files:

boswald@ford2 ~/ondemand/data/sys/myjobs/projects/default/5 >  ls
checkpoint  inference.py  main_job.sh  mygpuscript.py  sentiment.ckpt1098505.data-00000-of-00001  sentiment.ckpt1098505.index  slurm-1098504.out  slurm-1098505.out  training_results.png
boswald@ford2 ~/ondemand/data/sys/myjobs/projects/default/5 >

Load the python/3.8.1 module and then run the inference.py script with the jobid as a command-line parameter (so the script knows the name of your checkpoint files):

boswald@ford2 ~/ondemand/data/sys/myjobs/projects/default/5 >  module load python/3.8.1
boswald@ford2 ~/ondemand/data/sys/myjobs/projects/default/5 >  python3 inference.py -j 1098505

This will run the script and then put you into an interactive Python session. Use the ‘infer’ function to evaluate text, e.g.:

boswald@ford2 ~/ondemand/data/sys/myjobs/projects/default/5 >  python3 inference.py -j 1098505
2021-10-13 09:33:18.570931: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/opt/
modules/devel/gcc/7.2.0/lib64:/opt/modules/devel/python/3.8.1/lib
2021-10-13 09:33:18.570984: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-10-13 09:33:20.679133: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/opt/modules/de
vel/gcc/7.2.0/lib64:/opt/modules/devel/python/3.8.1/lib
2021-10-13 09:33:20.679183: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-10-13 09:33:20.679213: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ford2.ibest.uidaho.edu): /proc/driver/nvidia/version does not exist
2021-10-13 09:33:20.679791: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-13 09:33:20.980669: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
#--------------------------------------------------------------------#
Use the infer function to analyze some text.  For example:
infer('this is the text to analyze sentiment in',model) 
 negative numbers indicate negative sentiment, positive numbers positive sentiment
Python 3.8.1 (default, Aug 20 2021, 14:17:07) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> infer('some text to analyze here',model)
sentiment: [-0.26127574]
>>> infer('happy day, a good movie, fun for all')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: infer() missing 1 required positional argument: 'mdl'
>>> infer('happy day, a good movie, fun for all',model)
sentiment: [0.7799665]
>>> infer('i hate apples, they taste like sand',model)
sentiment: [-0.15291533]
>>> exit()
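The numbers that infer prints are raw logits from the final Dense(1) layer (the model was compiled with from_logits=True), not probabilities. If you’d rather see a probability of positive sentiment, pass the logit through a sigmoid; a minimal sketch:

import numpy as np

def logit_to_prob(logit):
    # the sigmoid maps a raw logit to P(positive sentiment)
    return 1.0 / (1.0 + np.exp(-logit))

print(logit_to_prob(0.7799665))    # ~0.69, leaning positive
print(logit_to_prob(-0.26127574))  # ~0.44, slightly negative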

Type ‘exit’ when you’re done in the terminal, then close the tab.

References:

Saving and loading models
TensorFlow Datasets
TensorFlow Datasets API

PyTorch

Here’s an example script:

import torch
import math


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# Create random input and output data
x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)
y = torch.sin(x)

# Randomly initialize weights
a = torch.randn((), device=device, dtype=dtype)
b = torch.randn((), device=device, dtype=dtype)
c = torch.randn((), device=device, dtype=dtype)
d = torch.randn((), device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(2000):
    # Forward pass: compute predicted y
    y_pred = a + b * x + c * x ** 2 + d * x ** 3

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of a, b, c, d with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_a = grad_y_pred.sum()
    grad_b = (grad_y_pred * x).sum()
    grad_c = (grad_y_pred * x ** 2).sum()
    grad_d = (grad_y_pred * x ** 3).sum()

    # Update weights using gradient descent
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    c -= learning_rate * grad_c
    d -= learning_rate * grad_d


print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')

From this tutorial:

https://pytorch.org/tutorials/beginner/transformer_tutorial.html
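For comparison, the same polynomial fit can be written so that autograd computes the gradients instead of deriving them by hand; a minimal sketch with the same setup and learning rate as above:

import torch
import math

x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# requires_grad=True asks autograd to track operations on these weights
a, b, c, d = (torch.randn((), requires_grad=True) for _ in range(4))

learning_rate = 1e-6
for t in range(2000):
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    loss = (y_pred - y).pow(2).sum()
    loss.backward()              # fills in a.grad, b.grad, c.grad, d.grad
    with torch.no_grad():        # the updates themselves should not be tracked
        for p in (a, b, c, d):
            p -= learning_rate * p.grad
            p.grad = None        # clear the gradient for the next iteration

print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')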