
Samantha J. Klasfeld, Ph.D. · Tutorial · 5 min read

Biobank Intro Series: Hardware Settings

Hardware setup lessons for UK Biobank Research Analysis Platform and All of Us Workbench

In 2026, the most expensive resource isn’t compute time or storage. It’s the time you spend rerunning failed analyses. After months of working across the UK Biobank RAP and the All of Us Researcher Workbench, I’ve collected some hard-won lessons about resource management.

This post is more technical. Consider yourself warned.

Tip #1: Each Platform Has Its Own Command-Line Interface (CLI)

Two islands labeled "Your Workspace" and "Data Storage" are connected by a tiny, rickety wooden footbridge. Crossing the bridge is a terminal carrying a folder.
The CLI: a small bridge between two very different worlds. Image generated by Gemini AI.

Think of your workspace and the platform’s data storage as two separate floating islands. Your code lives on one island. The massive biobank files live on the other. The CLI is the bridge between them, and each platform has its own.

UK Biobank RAP: The dx toolkit

The dx CLI is your friend for navigating the RAP filesystem:

# List files in a directory within the platform's data storage
dx ls

# Stream file contents (don't download!) from data storage
dx cat file-xxxx | bcftools view

# Upload local files from your workspace to data storage
dx upload local_file.txt --path /file/path/in/data/storage/

# Download files (only if absolutely necessary) to your workspace
dx download file-xxxx --output local_file.txt

All of Us: gsutil for Google Cloud Storage

All of Us data lives in Google Cloud Storage (GCS) buckets. You use gsutil to list, stream, and move data between your workspace and these buckets.

# List files in the controlled data-repository (CDR) bucket
gsutil ls gs://fc-aou-datasets-controlled/

# Find VCF files in the CDR bucket
gsutil ls gs://fc-aou-datasets-controlled/v7/wgs/short_read/snpindel/

# Stream directly (the right way)
gsutil cat gs://path/to/file.vcf.gz | bcftools view

Important: Add a -u flag to gsutil commands to attribute the operation to your project for proper billing and access control:

gsutil -u $GOOGLE_PROJECT [command]

Use the environment variables $GOOGLE_PROJECT and $WORKSPACE_BUCKET to avoid hardcoding paths:

# Upload local file from your workspace to your storage bucket
gsutil -u $GOOGLE_PROJECT cp local_file.txt $WORKSPACE_BUCKET/

# Download files (only if absolutely necessary) to your workspace
gsutil -u $GOOGLE_PROJECT cp gs://path/to/file.txt local_file.txt
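If typing the -u flag every time feels error-prone, a small wrapper function can add it for you. This is only a sketch: aou_gsutil is a hypothetical name of my own, not something either platform ships, and it assumes $GOOGLE_PROJECT is set in your environment as described above.

```shell
# Hypothetical helper (not part of gsutil): always pass the project via -u.
aou_gsutil() {
  # Refuse to run if the project variable is missing, instead of sending
  # a request without the -u flag.
  if [ -z "${GOOGLE_PROJECT:-}" ]; then
    echo "GOOGLE_PROJECT is not set" >&2
    return 1
  fi
  gsutil -u "$GOOGLE_PROJECT" "$@"
}

# Example usage (commented out so nothing runs on paste):
# aou_gsutil ls "$WORKSPACE_BUCKET/"
# aou_gsutil cp local_file.txt "$WORKSPACE_BUCKET/"
```

Everything after the function name is passed through unchanged, so any gsutil subcommand you already use should work the same way.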

Tip #2: Don’t Bring the Cloud Home With You

Comparison of downloading vs streaming biobank data: wrong way (slow download) versus right way (fast streaming).
Don't download massive biobank files — stream and filter directly on the platform.

Now that you know how to upload and download files, I must restate that you should not use those download commands on biobank data files. Yes, dx download and gsutil cp exist, but the data is already where it needs to be. Your job is to meet it there, not drag it to you.

Don’t do this:

# UK Biobank: Copying a 500GB VCF locally
dx download file-xxxx

# All of Us: Same mistake, different platform
gsutil cp gs://path/to/huge.vcf.gz .

Do this instead:

# UK Biobank: Stream with dx
dx cat file-xxxx | bcftools view | your_analysis

# All of Us: Stream with gsutil
gsutil cat gs://path/to/file.vcf.gz | bcftools view | your_analysis

The data already sits in the cloud, on fast storage, ready to be streamed. Copying it wastes time, burns through storage quotas, and risks running out of disk space mid-analysis. Both platforms are designed for streaming access. Use it.
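If you move between both platforms, one way to keep the streaming habit is a single helper that picks the right command for you. The sketch below is an assumption on my part rather than a feature of either toolkit: stream_file is a hypothetical name, and it simply treats anything starting with gs:// as an All of Us object and everything else as a RAP file ID or path.

```shell
# Hypothetical helper: choose the streaming command from the identifier.
stream_file() {
  case "$1" in
    gs://*) gsutil -u "$GOOGLE_PROJECT" cat "$1" ;;  # All of Us: GCS object
    *)      dx cat "$1" ;;                            # UK Biobank RAP: file ID or path
  esac
}

# Example usage (your_analysis is a placeholder, as in the snippets above):
# stream_file file-xxxx | bcftools view | your_analysis
# stream_file gs://path/to/file.vcf.gz | bcftools view | your_analysis
```

Nothing is written to local disk here; the bytes go straight from storage into the pipe.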

Tip #3: Know Your Tools and Your Files

Hail is prominently featured in All of Us documentation, which makes it tempting to reach for first. Resist that instinct: choose the tool that fits your actual problem, not the first one you find or the most impressive-sounding one.

Why avoid Hail when possible:

  • Requires expensive Spark clusters
  • Memory-intensive operations that crash your instance
  • Adds complexity when simpler tools work fine

If you can use standard tools (pandas, bcftools, plink), do that instead. Save Hail for genuinely distributed computing tasks.

Regardless of which tool you choose, make sure your files are indexed (.tbi, .csi) before querying them. Without an index, tools like bcftools have to read the entire file from start to finish, so a region query is no faster than loading everything.
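Here is a rough sketch of an index check you could run before a region query. The function names has_index and ensure_index are hypothetical, and the example assumes a bgzip-compressed VCF at a path bcftools can seek in (an index cannot help with data piped in on stdin). The file name and region in the usage comment are placeholders.

```shell
# Hypothetical helper: succeed if a .tbi or .csi index sits next to the file.
has_index() {
  [ -f "$1.tbi" ] || [ -f "$1.csi" ]
}

# Build a tabix (.tbi) index only when no index exists yet.
ensure_index() {
  has_index "$1" || bcftools index --tbi "$1"
}

# Example usage (commented out so nothing runs on paste):
# ensure_index cohort.vcf.gz && bcftools view -r chr1:1000000-2000000 cohort.vcf.gz
```

Checking first means you pay the indexing cost once, not on every run.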

Tip #4: Don’t Let Long Jobs Catch You Off Guard

Picture this: You start a 2-hour variant annotation job, grab lunch, and return to… nothing. Just a terminated instance.

Three habits will save you from rerunning everything from scratch.

First, check your idle timeout limit before running any long job. By default, All of Us shuts down after 15 minutes of inactivity.

  • All of Us: Go to workspace settings → increase idle timeout to 8 hours (or your preferred duration)
  • UK Biobank RAP: Check instance auto-pause settings

Note: nohup or screen/tmux can keep a job running after you disconnect, but the job won’t survive an instance shutdown, so adjusting your timeout is still necessary.
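For the nohup route, a minimal sketch looks like the following. run_detached is a hypothetical helper name and annotate_variants.sh is a placeholder script; the only real tool involved is nohup itself. As the note above says, this protects you from a dropped session, not from the instance shutting down.

```shell
# Hypothetical helper: start a command under nohup, send stdout and stderr
# to a log file, and print the PID so you can check on the job later.
run_detached() {
  log_file="$1"
  shift
  nohup "$@" > "$log_file" 2>&1 &
  echo "$!"
}

# Example usage (commented out so nothing runs on paste):
# pid=$(run_detached annotate.log ./annotate_variants.sh)
# tail -f annotate.log                        # watch progress
# kill -0 "$pid" && echo "still running"      # check whether the job is alive
```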

Second, filter as early in your pipeline as possible. The less data you’re carrying through each step, the faster and cheaper each step is.

Third, checkpoint intermediate results. Save outputs at meaningful stages so a crash at step 5 doesn’t send you back to step 1.

None of these take more than a minute to set up. The rerun will.
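Checkpointing is the habit that benefits most from a concrete pattern, so here is one possible sketch. run_step is a hypothetical helper, not a feature of either platform: it skips a step whose output file already exists, and it writes to a temporary name first so a crash cannot leave a half-written file that looks like a finished checkpoint. It assumes each step writes its result to stdout.

```shell
# Hypothetical helper: run a step only if its output file does not exist yet.
# Usage: run_step OUTPUT_FILE COMMAND [ARGS...]
run_step() {
  step_out="$1"
  shift
  if [ -e "$step_out" ]; then
    echo "skip: $step_out already exists" >&2
    return 0
  fi
  # Write to a temporary name, then rename only after the command succeeds.
  if "$@" > "$step_out.partial"; then
    mv "$step_out.partial" "$step_out"
  else
    rm -f "$step_out.partial"
    return 1
  fi
}

# Example usage (commented out; file names are placeholders):
# run_step step1_filtered.vcf sh -c 'dx cat file-xxxx | bcftools view'
# run_step step2_summary.txt  your_analysis step1_filtered.vcf
```

Rerunning the same script after a crash then picks up at the first step whose output is missing.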

The Bottom Line

When things crash:

  • Check whether you ran out of memory
  • Filter earlier in your pipeline
  • Checkpoint intermediate results

When things are slow:

  • Check if files are indexed (.tbi, .csi)
  • Use region queries instead of full chromosomes
  • Stream instead of copying

Set yourself up for success: learn your CLIs, stream your data, pick the right tool for the job, and make sure your long jobs have safety nets.
