· Samantha J. Klasfeld, Ph.D. · Tutorial · 3 min read · Original Source
Biobank Intro Series: You Have What You Need
A wrap-up of the series and a checklist before you start your analysis

If you’ve made it here, you’ve covered a lot of ground in UK Biobank and All of Us:
- The research environments
- The observational data
- The genetic data
This series was never meant to be exhaustive, but I hope it compressed your learning curve.
Before You Start Analysis
Use this as a gut check. If you can answer yes to everything here, you’re ready.
Platform
- Identified the compute model you’re working in (DNAnexus vs. Google Cloud)
- A plan is made for saving outputs that won’t disappear when your session ends
Phenotypes and Covariates
- Identified the right fields or concepts for your phenotypes of interest and covariates
- Checked for data quality warnings and understand what they mean for your analysis
- Know which timepoints or visits have data and which one you’re using (instances on UKB, enrollment/EHR timing on AoU)
- Confirmed the sample size is sufficient for your question
Genetic Data
- Located the genotype data format you need (unphased WGS, phased WGS, or exome sequencing)
- Confirmed files are indexed (.tbi or .csi) before running any region queries
- Subset to your region or variant list
- Confirmed sample overlap between your phenotype and genotype data
What Comes Next
These posts will update if I find mistakes. If something looks off, feel free to reach out to me.
This series ends here, but, as the cliché goes, the journey is only beginning. Depending on your question, you might be headed toward GWAS, PheWAS, rare variant burden testing, or cross-biobank comparisons. Topics that may appear in a future series.
For now: you have the data, you know where it lives, and you know enough to not be too surprised by what you find. That’s further than a lot of people get before they start. Good luck!
Reflections
None of these posts would have happened without a few people.
Thank you to Sneha Grandhi who helped me recognize my strengths in this field. The biobank work I did at Genscience with her is what this series is built on.
Almost a year ago, Lina Faller’s dedication to sharing her data management work on the Boston Women in Bioinformatics website was the push I needed to finally write something of my own.
Thank you to Mitibketa Ilboudo and Sharvari Narendra for editing. Taking time to give feedback on a stranger’s blog posts is not a small thing, and it’s the reason I could post these with confidence.
Thank you to Lance and my family for the blind support. They may never understand what it is that I do, but I still love them.



