Back

Important note

This GWAS pipeline is written as a Nextflow workflow, with independent parts specified by a nextflow configuration file, which is generated using a genemapgwas.sh script available in the workflow project directory.


Main QC steps and key details

  • Check and remove duplicate samples based on sample ID (FID and IID).

This ensures QC is performed with unique samples only. Therefore, certain ID forms such as those with underscores can have duplicate samples appearing as unique because parts of the ID are different. For example, the same sample processed in different wells will have same sample ID but different well numbers.

  • Remove duplicate variants based on variant ID.
  • Check related individuals.
  • Check for discordant sex information.
  • Check sample missingness (heteorzygosity against missing genotype proportion).
  • Check poor quality variants based on MAF, genotype call rate, HWE, Differential missingness between cases and controls or between batches.
  • Exclude palindromic SNPs [A/T] and [C/G] which are usually problematic to phase.
  • Compute and plot PCA of QCed data
  • Prune population outliers and recompute and plot PCA.

The steps look something like this when completed.


Get nextflow configuration file

Since the current working working direction is qc, we have to step two directory backwards to run the genemapgwas.sh script

Cameroon

#!bin/bash

$gwasdir/genemapgwas.sh \
  qc local,singularity,hg19 \
  --bfile /data/awonkam1/kesoh/gwas/cmr/raw/bed/AW2018-2019_hg19 \
  --out AW2018-2019-hg19 \
  --output_dir /data/awonkam1/kesoh/gwas/cmr/ \
  --maf 0.01 \
  --geno 0.05 \
  --mind 0.10 \
  --prune_outliers \
  --threads 6

Ghana

#!bin/bash

$gwasdir/genemapgwas.sh \
  qc local,singularity,hg19 \
  --bfile /vast/awonkam1/DATA/from-external/sicklegen/ghana/vcf/sicklegen-gha \
  --out sicklegen-gha-hg19 \
  --output_dir /data/awonkam1/kesoh/gwas/gha/ \
  --maf 0.01 \
  --geno 0.05 \
  --mind 0.10 \
  --prune_outliers \
  --threads 6

Tanzania

#!bin/bash

$gwasdir/genemapgwas.sh \
  qc local,singularity,hg19 \
  --bfile /data/awonkam1/kesoh/gwas/tzn/bed/tzscdgwas.eagle2 \
  --out tzn-hg19 \
  --output_dir /data/awonkam1/kesoh/gwas/tzn/ \
  --maf 0.01 \
  --geno 0.05 \
  --mind 0.10 \
  --prune_outliers \
  --threads 6

SITT

#!bin/bash

$gwasdir/genemapgwas.sh \
  qc local,singularity,hg18 \
  --bfile /vast/awonkam1/DATA/from-external/sitt/genotypes/sitt-unclean-sampleids-updated \
  --out sitt-hg18 \
  --output_dir /data/awonkam1/kesoh/gwas/sitt/ \
  --maf 0.01 \
  --geno 0.05 \
  --mind 0.10 \
  --prune_outliers \
  --threads 6

Run QC using config file created above

  • This runs the gwas nextflow workflow using a local profile.
  • The config file will be read from the current directory where it was saved by the previous command
  • Nextflow is already loaded in the Rscript SBATCH script
  • The workflow is pulled from the esohkevin/ei-gwas github repo, the dev branch.

Cameroon

#!/bin/bash

work='/scratch4/awonkam1/nfworkdirs/gwas/cmr/'

nextflow \
  -c AW2018-2019-hg19-qc.config \
  run esohkevin/ei-gwas/qc.nf \
  -profile local,singularity,hg19 \
  -w ${work} \
  -r dev \
  -resume

Ghana

#!/bin/bash

work='/scratch4/awonkam1/nfworkdirs/gwas/gha/'

nextflow \
  -c sicklegen-gha-hg19-qc.config \
  run esohkevin/ei-gwas/qc.nf \
  -profile local,singularity,hg19 \
  -w ${work} \
  -r dev \
  -resume

Tanzania

#!/bin/bash

work='/scratch4/awonkam1/nfworkdirs/gwas/tzn/'

nextflow \
  -c tzn-hg19-qc.config \
  run esohkevin/ei-gwas/qc.nf \
  -profile local,singularity,hg19 \
  -w ${work} \
  -r dev \
  -resume

SITT

SITT genomic coordinates were in hg18 and were lifted over to hg19 and hg38 after QC using CrossMap.

#!/bin/bash

work='/scratch4/awonkam1/nfworkdirs/gwas/sitt/'

nextflow \
  -c sitt-hg18-qc.config \
  run esohkevin/ei-gwas/qc.nf \
  -profile local,singularity,hg18 \
  -w ${work} \
  -r dev \
  -resume

QC Reports

Missingness

  • The vertical dashed lines are two standard deviations of mean heterozygosity (mean+-2SD)
  • The horizontal dashed line is missing genotype threshold
  • All samples on the outside of these lines (colored red) are excluded

Cameroon


Ghana


Tanzania


SITT


Relatedness

Cameroon


Ghana


Tanzania


SITT


PCA

Cameroon


Ghana


Tanzania


SITT


PCA All Cohorts

We first merged the datasets across cohorts as follows:

  • Get datasets of SNPs intersecting exactly all four studies: isec -n=4
  • Merge datasets of intersect SNPs.

bcftools \
  isec \
    -n=4 \
    -p . \
    -Oz \
    --threads 32 \
    /data/awonkam1/kesoh/gwas/cmr/aligned/combined/cms-aligned.vcf.gz \
    /data/awonkam1/kesoh/gwas/tzn/aligned/tz_scd_aligned.vcf.gz \
    /data/awonkam1/kesoh/gwas/gha/aligned/combined/sicklegen-gha-hg19_aligned.vcf.gz \
    /data/awonkam1/kesoh/gwas/sitt/aligned/combined/sitt-pass-qc-crossmap-hg19_aligned.vcf.gz

# number of snps intersecting studies
wc -l sites

# 190635 sites.txt


# merge datasets
bcftools \
  merge 000{0..3}.vcf.gz \
  --threads 32 \
  -Oz \
  -o cmr-tzn-gha-sitt-scd-mergedset.vcf.gz
  • Get a population labels file for PCA plotting
for i in 000{0..3}.vcf.gz; do 
  bcftools \
    query \
      -l $i | \
    awk \
      -v pop="${i/.vcf*/}POP" \
      '{print $1, pop}'
done | \
sed 's/0000POP/CMR/g; s/0001POP/TZN/g; s/0002POP/GHA/g; s/0003POP/SITT/g' > cmr-tzn-gha-sitt-scd-pops.txt

head -3 cmr-tzn-gha-sitt-scd-pops.txt

# 202637450131_R05C01 CMR
# 202637450131_R07C01 CMR
# 202637450156_R01C01 CMR
  • Convert VCF file to PLINK binary format

plink2 \
  --double-id \
  --make-bed \
  --out cmr-tzn-gha-sitt-scd-mergedset \
  --threads 32 \
  --vcf cmr-tzn-gha-sitt-scd-mergedset.vcf.gz
  • Perform QC using the Nextflow workflow with the following parameters
...
   bfile            = '/data/awonkam1/kesoh/gwas/mergedset/raw/cmr-tzn-gha-sitt-scd-mergedset'
   pheno_file       = 'NULL'
   out              = 'cmr-tzn-gha-sitt-scd-mergedset'
   output_dir       = '/data/awonkam1/kesoh/gwas/mergedset/'
   hetlower         = 'NULL'
   hetupper         = 'NULL'
   maf              = 0.01
   geno             = 0.05
   mind             = 0.10
   keep_related     = false
   keep_palindrome  = false
   keep_outliers    = false
   s_param          = 6
   threads          = 48
   njobs            = 1
...
N E X T F L O W  ~  version 22.10.0-RC1
Launching `genemapgwas/qc.nf` [cranky_dubinsky] DSL2 - revision: 86e0e2d8e2

GeneMAP GWAS QC WORKFLOW

executor >  local (17)
[55/e4fe66] process > checkDuplicateSampleIds (checking duplicate samples...)                            [100%] 1 of 1 ✔
[e6/0ff877] process > removeDuplicateVars (romoving duplicate variants)                                  [100%] 1 of 1 ✔
[fc/96e7d3] process > getKingFormatFile (formatting data for KING...)                                    [100%] 1 of 1 ✔
[4b/04fe52] process > checkDuplicateAndRelatedIndivs (checking related and duplicate individuals...)     [100%] 1 of 1 ✔
[2e/00ae0d] process > checkDiscordantSex (checking discordant sex...)                                    [100%] 1 of 1 ✔
[fa/d5f71b] process > computeSampleMissingnesStats (calculating missingness statistics...)               [100%] 1 of 1 ✔
[09/0e8207] process > checkSamplesMissingess (checking sample missingness...)                            [100%] 1 of 1 ✔
[fc/957c8a] process > removePoorQualitySamples (excluding poor quality samples...)                       [100%] 1 of 1 ✔
[8b/28245d] process > checkPoorQualityVariants (checking poor quality snps...)                           [100%] 1 of 1 ✔
[0c/362121] process > checkPalindromes (checking palindromic snps...)                                    [100%] 1 of 1 ✔
[d5/1ac14c] process > removePoorQualityVariants (removing poor quality snps...)                          [100%] 1 of 1 ✔
[4c/2e53b3] process > performPca (running PCA...)                                                        [100%] 1 of 1 ✔
[93/603078] process > plotPca (plotting PCA...)                                                          [100%] 1 of 1 ✔
[fe/a82480] process > getPed (processing cmr-tzn-gha-sitt-scd-mergedset-pass-qc...)                      [100%] 1 of 1 ✔
[b7/64d69c] process > getParams (preparing parameter file for cmr-tzn-gha-sitt-scd-mergedset-pass-qc...) [100%] 1 of 1 ✔
[b5/df295e] process > getEigenstratgeno (converting data for cmr-tzn-gha-sitt-scd-mergedset-pass-qc...)  [100%] 1 of 1 ✔
[45/0a58c7] process > prunePopOutliers (pruning outliers for cmr-tzn-gha-sitt-scd-mergedset-pass-qc...)  [100%] 1 of 1 ✔
[d0/35ebaf] process > getPrunedData (removing poor quality snps...)                                      [100%] 1 of 1 ✔
Waiting for file transfers to complete (7 files)

Pipeline execution summary
---------------------------
Completed at: 2026-03-17T13:21:08.189728-04:00
Duration    : 52m 36s
Success     : true
workDir     : /scratch4/awonkam1/nfworkdirs/gwasqc/
outputDir   : /data/awonkam1/kesoh/gwas/mergedset/
exit status : 0

Completed at: 17-Mar-2026 13:21:08
Duration    : 52m 36s
CPU hours   : 27.7 (1.1% cached)
Succeeded   : 5
Cached      : 13

QC Reports

Missingness

PCA Post-QC

PCA Post-Outlier Pruning