Important note
This GWAS pipeline is written as a Nextflow workflow, with
independent parts specified by a nextflow configuration file, which is
generated using a genemapgwas.sh script available in the
workflow project directory.
Main QC steps and key details
- Check and remove duplicate samples based on sample ID (FID and
IID).
This ensures QC is performed with unique samples only. Therefore,
certain ID forms such as those with underscores can have duplicate
samples appearing as unique because parts of the ID are different. For
example, the same sample processed in different wells will have same
sample ID but different well numbers.
- Remove duplicate variants based on variant ID.
- Check related individuals.
- Check for discordant sex information.
- Check sample missingness (heteorzygosity against missing genotype
proportion).
- Check poor quality variants based on MAF, genotype call rate, HWE,
Differential missingness between cases and controls or between
batches.
- Exclude palindromic SNPs [A/T] and [C/G] which are usually
problematic to phase.
- Compute and plot PCA of QCed data
- Prune population outliers and recompute and plot PCA.
The steps look something like this when completed.

Get nextflow configuration file
Since the current working working direction is qc, we
have to step two directory backwards to run the
genemapgwas.sh script
Cameroon
#!bin/bash
$gwasdir/genemapgwas.sh \
qc local,singularity,hg19 \
--bfile /data/awonkam1/kesoh/gwas/cmr/raw/bed/AW2018-2019_hg19 \
--out AW2018-2019-hg19 \
--output_dir /data/awonkam1/kesoh/gwas/cmr/ \
--maf 0.01 \
--geno 0.05 \
--mind 0.10 \
--prune_outliers \
--threads 6
Ghana
#!bin/bash
$gwasdir/genemapgwas.sh \
qc local,singularity,hg19 \
--bfile /vast/awonkam1/DATA/from-external/sicklegen/ghana/vcf/sicklegen-gha \
--out sicklegen-gha-hg19 \
--output_dir /data/awonkam1/kesoh/gwas/gha/ \
--maf 0.01 \
--geno 0.05 \
--mind 0.10 \
--prune_outliers \
--threads 6
Tanzania
#!bin/bash
$gwasdir/genemapgwas.sh \
qc local,singularity,hg19 \
--bfile /data/awonkam1/kesoh/gwas/tzn/bed/tzscdgwas.eagle2 \
--out tzn-hg19 \
--output_dir /data/awonkam1/kesoh/gwas/tzn/ \
--maf 0.01 \
--geno 0.05 \
--mind 0.10 \
--prune_outliers \
--threads 6
SITT
#!bin/bash
$gwasdir/genemapgwas.sh \
qc local,singularity,hg18 \
--bfile /vast/awonkam1/DATA/from-external/sitt/genotypes/sitt-unclean-sampleids-updated \
--out sitt-hg18 \
--output_dir /data/awonkam1/kesoh/gwas/sitt/ \
--maf 0.01 \
--geno 0.05 \
--mind 0.10 \
--prune_outliers \
--threads 6
Run QC using config file created above
- This runs the gwas nextflow workflow using a local profile.
- The config file will be read from the current directory where it was
saved by the previous command
- Nextflow is already loaded in the Rscript SBATCH script
- The workflow is pulled from the esohkevin/ei-gwas
github repo, the
dev branch.
Cameroon
#!/bin/bash
work='/scratch4/awonkam1/nfworkdirs/gwas/cmr/'
nextflow \
-c AW2018-2019-hg19-qc.config \
run esohkevin/ei-gwas/qc.nf \
-profile local,singularity,hg19 \
-w ${work} \
-r dev \
-resume
Ghana
#!/bin/bash
work='/scratch4/awonkam1/nfworkdirs/gwas/gha/'
nextflow \
-c sicklegen-gha-hg19-qc.config \
run esohkevin/ei-gwas/qc.nf \
-profile local,singularity,hg19 \
-w ${work} \
-r dev \
-resume
Tanzania
#!/bin/bash
work='/scratch4/awonkam1/nfworkdirs/gwas/tzn/'
nextflow \
-c tzn-hg19-qc.config \
run esohkevin/ei-gwas/qc.nf \
-profile local,singularity,hg19 \
-w ${work} \
-r dev \
-resume
SITT
SITT genomic coordinates were in hg18 and were lifted over to hg19
and hg38 after QC using CrossMap.
#!/bin/bash
work='/scratch4/awonkam1/nfworkdirs/gwas/sitt/'
nextflow \
-c sitt-hg18-qc.config \
run esohkevin/ei-gwas/qc.nf \
-profile local,singularity,hg18 \
-w ${work} \
-r dev \
-resume
QC Reports
Missingness
- The vertical dashed lines are two standard deviations of mean
heterozygosity (mean+-2SD)
- The horizontal dashed line is missing genotype threshold
- All samples on the outside of these lines (colored red) are
excluded
Cameroon

Ghana

Tanzania

SITT

PCA All Cohorts
We first merged the datasets across cohorts as follows:
- Get datasets of SNPs intersecting exactly all four studies:
isec -n=4
- Merge datasets of intersect SNPs.
bcftools \
isec \
-n=4 \
-p . \
-Oz \
--threads 32 \
/data/awonkam1/kesoh/gwas/cmr/aligned/combined/cms-aligned.vcf.gz \
/data/awonkam1/kesoh/gwas/tzn/aligned/tz_scd_aligned.vcf.gz \
/data/awonkam1/kesoh/gwas/gha/aligned/combined/sicklegen-gha-hg19_aligned.vcf.gz \
/data/awonkam1/kesoh/gwas/sitt/aligned/combined/sitt-pass-qc-crossmap-hg19_aligned.vcf.gz
# number of snps intersecting studies
wc -l sites
# 190635 sites.txt
# merge datasets
bcftools \
merge 000{0..3}.vcf.gz \
--threads 32 \
-Oz \
-o cmr-tzn-gha-sitt-scd-mergedset.vcf.gz
- Get a population labels file for PCA plotting
for i in 000{0..3}.vcf.gz; do
bcftools \
query \
-l $i | \
awk \
-v pop="${i/.vcf*/}POP" \
'{print $1, pop}'
done | \
sed 's/0000POP/CMR/g; s/0001POP/TZN/g; s/0002POP/GHA/g; s/0003POP/SITT/g' > cmr-tzn-gha-sitt-scd-pops.txt
head -3 cmr-tzn-gha-sitt-scd-pops.txt
# 202637450131_R05C01 CMR
# 202637450131_R07C01 CMR
# 202637450156_R01C01 CMR
- Convert VCF file to PLINK binary format
plink2 \
--double-id \
--make-bed \
--out cmr-tzn-gha-sitt-scd-mergedset \
--threads 32 \
--vcf cmr-tzn-gha-sitt-scd-mergedset.vcf.gz
- Perform QC using the Nextflow workflow with the following
parameters
...
bfile = '/data/awonkam1/kesoh/gwas/mergedset/raw/cmr-tzn-gha-sitt-scd-mergedset'
pheno_file = 'NULL'
out = 'cmr-tzn-gha-sitt-scd-mergedset'
output_dir = '/data/awonkam1/kesoh/gwas/mergedset/'
hetlower = 'NULL'
hetupper = 'NULL'
maf = 0.01
geno = 0.05
mind = 0.10
keep_related = false
keep_palindrome = false
keep_outliers = false
s_param = 6
threads = 48
njobs = 1
...
N E X T F L O W ~ version 22.10.0-RC1
Launching `genemapgwas/qc.nf` [cranky_dubinsky] DSL2 - revision: 86e0e2d8e2
GeneMAP GWAS QC WORKFLOW
executor > local (17)
[55/e4fe66] process > checkDuplicateSampleIds (checking duplicate samples...) [100%] 1 of 1 ✔
[e6/0ff877] process > removeDuplicateVars (romoving duplicate variants) [100%] 1 of 1 ✔
[fc/96e7d3] process > getKingFormatFile (formatting data for KING...) [100%] 1 of 1 ✔
[4b/04fe52] process > checkDuplicateAndRelatedIndivs (checking related and duplicate individuals...) [100%] 1 of 1 ✔
[2e/00ae0d] process > checkDiscordantSex (checking discordant sex...) [100%] 1 of 1 ✔
[fa/d5f71b] process > computeSampleMissingnesStats (calculating missingness statistics...) [100%] 1 of 1 ✔
[09/0e8207] process > checkSamplesMissingess (checking sample missingness...) [100%] 1 of 1 ✔
[fc/957c8a] process > removePoorQualitySamples (excluding poor quality samples...) [100%] 1 of 1 ✔
[8b/28245d] process > checkPoorQualityVariants (checking poor quality snps...) [100%] 1 of 1 ✔
[0c/362121] process > checkPalindromes (checking palindromic snps...) [100%] 1 of 1 ✔
[d5/1ac14c] process > removePoorQualityVariants (removing poor quality snps...) [100%] 1 of 1 ✔
[4c/2e53b3] process > performPca (running PCA...) [100%] 1 of 1 ✔
[93/603078] process > plotPca (plotting PCA...) [100%] 1 of 1 ✔
[fe/a82480] process > getPed (processing cmr-tzn-gha-sitt-scd-mergedset-pass-qc...) [100%] 1 of 1 ✔
[b7/64d69c] process > getParams (preparing parameter file for cmr-tzn-gha-sitt-scd-mergedset-pass-qc...) [100%] 1 of 1 ✔
[b5/df295e] process > getEigenstratgeno (converting data for cmr-tzn-gha-sitt-scd-mergedset-pass-qc...) [100%] 1 of 1 ✔
[45/0a58c7] process > prunePopOutliers (pruning outliers for cmr-tzn-gha-sitt-scd-mergedset-pass-qc...) [100%] 1 of 1 ✔
[d0/35ebaf] process > getPrunedData (removing poor quality snps...) [100%] 1 of 1 ✔
Waiting for file transfers to complete (7 files)
Pipeline execution summary
---------------------------
Completed at: 2026-03-17T13:21:08.189728-04:00
Duration : 52m 36s
Success : true
workDir : /scratch4/awonkam1/nfworkdirs/gwasqc/
outputDir : /data/awonkam1/kesoh/gwas/mergedset/
exit status : 0
Completed at: 17-Mar-2026 13:21:08
Duration : 52m 36s
CPU hours : 27.7 (1.1% cached)
Succeeded : 5
Cached : 13