RoseTTAFold All-Atom

RoseTTAFold All-Atom (RFAA), developed by the Baker Lab, is an inference pipeline for protein structure prediction.

The version that was current on 4 March 2025 is installed on Myriad and is available as the environment module rfaa/20250304.

Important

RFAA depends on SignalP 6.0h, which is licensed under an academic licence that explicitly forbids commercial use. You therefore may not, under any circumstances, use the RFAA pipeline as presently installed for commercial workloads. If there is any ambiguity about the non-commercial nature of your work, you must stop using RFAA immediately and contact rc-support to discuss paths forward. Violating software licences is a violation of the UCL Computer terms of use and may result in removal of access to all IT systems.

Using RFAA

Important

RFAA contains an embedded miniforge3 install and so conflicts with other versions of Python.

When run, RFAA relies on some very large databases (about 2.5 TiB) as well as the model weights from the original paper. It expects these and other files to be laid out in a predictable structure in the directory from which you run the code. To make this easier, we have placed the reference data in a central location and provided a shell script, prepare_rfaa_input_directory, which creates the appropriate file-system links in the current working directory or, if given a directory as an argument, in that directory:

$ module load rfaa/20250304

Please note that the license of SignalP 6.0h which is used
in RFAA ONLY allows its use for non-commercial work.
This means you may only use RFAA for non-commercial work.

$ cd ~/Scratch/
$ mkdir rfaa_input
$ prepare_rfaa_input_directory rfaa_input/
Preparing rfaa_input/ as an RFAA input directory.

Linking databases to rfaa_input/ directory...
/shared/ucl/apps/RoseTTAFold-All-Atom_db/bfd <- rfaa_input//bfd
/shared/ucl/apps/RoseTTAFold-All-Atom_db/pdb100_2021Mar03 <- rfaa_input//pdb100_2021Mar03
/shared/ucl/apps/RoseTTAFold-All-Atom_db/UniRef30_2020_06 <- rfaa_input//UniRef30_2020_06

Linking weights to rfaa_input/ directory...
/shared/ucl/apps/RoseTTAFold-All-Atom_db/RFAA_paper_weights.pt <- rfaa_input//RFAA_paper_weights.pt

Done.

Assuming you have the environment module correctly loaded, you can now run RFAA from inside this directory.

$ cd rfaa_input/
$ ls -lah
total 20K
drwx------  2 uccaoke uccapc3 4.0K Mar 13 16:15 .
drwxr-xr-x 16 uccaoke uccapc3  12K Mar 13 16:15 ..
lrwxrwxrwx  1 uccaoke uccapc3   44 Mar 13 16:15 bfd -> /shared/ucl/apps/RoseTTAFold-All-Atom_db/bfd
lrwxrwxrwx  1 uccaoke uccapc3   57 Mar 13 16:15 pdb100_2021Mar03 -> /shared/ucl/apps/RoseTTAFold-All-Atom_db/pdb100_2021Mar03
lrwxrwxrwx  1 uccaoke uccapc3   62 Mar 13 16:15 RFAA_paper_weights.pt -> /shared/ucl/apps/RoseTTAFold-All-Atom_db/RFAA_paper_weights.pt
lrwxrwxrwx  1 uccaoke uccapc3   57 Mar 13 16:15 UniRef30_2020_06 -> /shared/ucl/apps/RoseTTAFold-All-Atom_db/UniRef30_2020_06
$ 
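Before starting a run, it can be worth confirming that every link resolves, because a dangling link would otherwise only show up later as a missing database or weights file. The following is a minimal sketch, assuming the four link names shown in the listing above. The function name check_rfaa_links is made up for this example and is not part of the rfaa module.

```shell
# check_rfaa_links is a hypothetical helper, not shipped with the rfaa module.
# It checks that the links made by prepare_rfaa_input_directory exist and resolve.
check_rfaa_links() {
    local dir="${1:-.}"
    local missing=0
    local name
    # Link names taken from the directory listing above.
    for name in bfd pdb100_2021Mar03 UniRef30_2020_06 RFAA_paper_weights.pt; do
        # -e follows symbolic links, so a dangling link fails this test.
        if [ ! -e "${dir}/${name}" ]; then
            echo "missing or broken: ${dir}/${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

For example, `check_rfaa_links ~/Scratch/rfaa_input && echo "input directory looks complete"` prints the message only if all four entries are present.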

You then need to prepare your input files, which are YAML files. We will use the protein example.

cp /shared/ucl/apps/rfaa/20250304/RoseTTAFold-All-Atom/rf2aa/config/inference/protein.yaml .
cp /shared/ucl/apps/rfaa/20250304/RoseTTAFold-All-Atom/rf2aa/config/inference/base.yaml .

If we look at protein.yaml, it depends on base.yaml:

defaults:
  - base

job_name: "7u7w_protein"
protein_inputs: 
  A:
    fasta_file: examples/protein/7u7w_A.fasta

We also need the FASTA file, so we copy it into the current directory and edit protein.yaml so that fasta_file points to it.

$ cp /shared/ucl/apps/rfaa/20250304/RoseTTAFold-All-Atom/examples/protein/7u7w_A.fasta .

The modified protein.yaml then reads:

defaults:
  - base

job_name: "7u7w_protein"
protein_inputs: 
  A:
    fasta_file: 7u7w_A.fasta
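For your own sequences, the same configuration can be generated rather than edited by hand. This is a minimal sketch, assuming a single protein chain labelled A as in the example and using only the keys that appear in protein.yaml above. The function name write_protein_config is hypothetical.

```shell
# write_protein_config is a hypothetical helper for illustration only.
# Usage: write_protein_config JOB_NAME FASTA_FILE > my_config.yaml
write_protein_config() {
    local job_name="$1"
    local fasta_file="$2"
    # Same layout as the example protein.yaml; base.yaml must sit in the
    # same directory as the file this output is written to.
    cat <<EOF
defaults:
  - base

job_name: "${job_name}"
protein_inputs:
  A:
    fasta_file: ${fasta_file}
EOF
}
```

Running `write_protein_config my_protein my_protein.fasta > my_protein.yaml` would produce a file that could then be selected with `--config-name my_protein`. The names my_protein and my_protein.fasta are placeholders.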

If we are on a compute node (preferably with a GPU) booked interactively with qrsh, we can then run the pipeline directly:

python3 -m rf2aa.run_inference --config-path=$(pwd) --config-name protein

Note that we specify where to find the config files with the --config-path option and give it a configuration name to run with --config-name.
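In the command above, `--config-name protein` refers to the file protein.yaml in the directory given by `--config-path`, which in turn pulls in base.yaml from the same directory. A small pre-flight check can catch a misspelt name before the pipeline starts. This is only a sketch, and rfaa_config_ready is a hypothetical name, not part of RFAA.

```shell
# rfaa_config_ready is a hypothetical pre-flight check, not part of RFAA.
# It succeeds only if CONFIG_DIR holds both NAME.yaml and the base.yaml
# that the example configs list under "defaults".
rfaa_config_ready() {
    local config_dir="$1"
    local config_name="$2"
    [ -f "${config_dir}/${config_name}.yaml" ] && [ -f "${config_dir}/base.yaml" ]
}
```

One way to use it is `rfaa_config_ready "$(pwd)" protein || echo "config files not found"` just before the python3 command.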

$ python3 -m rf2aa.run_inference --config-path=$(pwd) --config-name protein
/shared/ucl/apps/rfaa/20250304/miniforge3/envs/RFAA/lib/python3.10/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'protein': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Using the cif atom ordering for TRP.
make_msa.sh 7u7w_A.fasta 7u7w_protein/A 4 64  pdb100_2021Mar03/pdb100_2021Mar03
Predicting: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.52s/sequences]
Running HHblits against UniRef30 with E-value cutoff 1e-10
- 16:46:14.500 INFO: Input file = 7u7w_protein/A/hhblits/t000_.1e-10.a3m

- 16:46:14.500 INFO: Output file = 7u7w_protein/A/hhblits/t000_.1e-10.id90cov75.a3m

- 16:46:14.727 WARNING: Maximum number 100000 of sequences exceeded in file 7u7w_protein/A/hhblits/t000_.1e-10.a3m

- 16:46:48.226 INFO: Input file = 7u7w_protein/A/hhblits/t000_.1e-10.a3m

- 16:46:48.226 INFO: Output file = 7u7w_protein/A/hhblits/t000_.1e-10.id90cov50.a3m

- 16:46:48.451 WARNING: Maximum number 100000 of sequences exceeded in file 7u7w_protein/A/hhblits/t000_.1e-10.a3m

Running PSIPRED
Running hhsearch
$ ls
7u7w_A.fasta  7u7w_protein  7u7w_protein_aux.pt  7u7w_protein.pdb  base.yaml  bfd  outputs  pdb100_2021Mar03  protein.yaml  RFAA_paper_weights.pt  UniRef30_2020_06

When the process has run, we should find in our current working directory both a PDB file (7u7w_protein.pdb) with the structure and a PyTorch file (7u7w_protein_aux.pt) with some statistical information about the run, as per the documentation.
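In a script it can help to check for these two files explicitly before any post-processing. The sketch below assumes that the output files are named after the job_name value in the config, as they are in the listing above. The function name rfaa_outputs_present is hypothetical.

```shell
# rfaa_outputs_present is a hypothetical helper, not part of RFAA.
# It checks for the two result files named after job_name in the config.
rfaa_outputs_present() {
    local job_name="$1"
    local dir="${2:-.}"
    local status=0
    local f
    for f in "${job_name}.pdb" "${job_name}_aux.pt"; do
        # -s is true only for files that exist and are not empty.
        if [ ! -s "${dir}/${f}" ]; then
            echo "missing or empty: ${dir}/${f}" >&2
            status=1
        fi
    done
    return "${status}"
}
```

For the example run, `rfaa_outputs_present 7u7w_protein` returns success once both 7u7w_protein.pdb and 7u7w_protein_aux.pt are present and non-empty in the current directory.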

Writing a job script

Assuming we have our input YAML and FASTA files in a directory called rfaa_input inside our home directory, and the configuration is input.yaml, a job script for RFAA looks like this:

#!/bin/bash -l

# Batch script to run RoseTTAFold All-Atom on Myriad.

# Request one GPU
#$ -l gpu=1

# Request 18 cores (half a node)
#$ -pe smp 18

# Request two hours of wallclock time (format hours:minutes:seconds).
#$ -l h_rt=2:0:0

# Request 5 gigabytes of RAM per core.
#$ -l mem=5G

# Set the name of the job.
#$ -N RFAA

# Set the working directory to the current working directory.
# You may wish to change this to a specific directory.
#$ -cwd

module load rfaa/20250304

prepare_rfaa_input_directory # sets up current directory.

# Copy input files into our directory.
# Assume we have our input.yaml and fasta files in ~/rfaa_input - amend.
cp ${HOME}/rfaa_input/*.yaml ${HOME}/rfaa_input/*.fasta .

python3 -m rf2aa.run_inference --config-path=$(pwd) --config-name input
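If ~/rfaa_input holds several configurations, one possibility is to run them one after another in the same job instead of naming a single config. The sketch below is an assumption about how you might organise this, not a feature of RFAA. The helper list_rfaa_configs is a hypothetical name, and it skips base.yaml because that file is only included through the defaults list.

```shell
# list_rfaa_configs is a hypothetical helper for illustration.
# It prints the config name (file name without .yaml) of every YAML file
# in a directory, skipping base.yaml.
list_rfaa_configs() {
    local dir="${1:-.}"
    local path name
    for path in "${dir}"/*.yaml; do
        # If the glob matched nothing it stays literal, so skip it.
        [ -e "${path}" ] || continue
        name="$(basename "${path}" .yaml)"
        [ "${name}" = "base" ] || echo "${name}"
    done
}
```

In the job script, the final line could then be replaced by a loop such as `for name in $(list_rfaa_configs .); do python3 -m rf2aa.run_inference --config-path="$(pwd)" --config-name "${name}"; done`. This assumes the config file names contain no spaces and that each config has its own job_name. The requested wallclock time would need to cover all of the runs.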