Run¶
Creating a genome STR index (optional)¶
Creates a bed file of large STR regions in the reference genome. This step is performed automatically as part of strling extract. However, when running multiple samples, it is more efficient to do it once, then pass the file to strling extract using the -g option.
strling index $reference_fasta
Single sample¶
Extract informative pairs to a binary format
strling extract -f $reference_fasta /path/to/$sample.cram $sample.bin
Output file: $sample.bin - a binary file describing STR-containing reads
Call strs on the extract binary data
mkdir -p str-results/
strling call --output-prefix str-results/$sample -f $reference_fasta /path/to/$sample.cram $sample.bin
Output files: $sample-bounds - STR loci interrogated in that sample $sample-genotype.txt - Locus size estimates and other per-locus information (see outputs) $sample-unplaced.txt - Counts of unplaced STR reads that could not be assigned to a specific locus
Joint calling¶
Extract informative read pairs to a binary format for each sample
This step is the same as above, you can use the same bin files.
strling extract -f $reference_fasta /path/to/$sample2.cram $sample1.bin
strling extract -f $reference_fasta /path/to/$sample2.cram $sample2.bin
Output file(s): $sample1.bin, $sample2.bin, … - binary files describing STR-containing reads
Joint call str loci across all samples
Requires minimum read evidence from at least one sample.
mkdir -p str-results/
strling merge --output-prefix str-results/joint -f $reference_fasta $sample1.bin $sample2.bin
Output file: joint-bounds.txt - positions of STR loci found by combining across all individuals, used for the call stage when joint calling
Peak memory usage for joint calling increases linearly by ~63 MB per 30-40X human WGS. This can be prohibitively high for large cohorts. To reduce memory requirements, analysis can be parallelized by chromosome using –chromosome then subsequently merged. See the workflows for examples. This reduces memory usage to ~5 MB/sample.
Call genotypes/estimate allele sizes for all loci in each sample
strling call --output-prefix str-results/$sample1 -b str-results/joint-bounds.txt -f $reference_fasta /path/to/$sample1.cram $sample1.bin
strling call --output-prefix str-results/$sample2 -b str-results/joint-bounds.txt -f $reference_fasta /path/to/$sample2.cram $sample2.bin
Output files as above.
Find outliers
Finds loci that are expanded in one individual relative to other individuals in the joint called cohort.
strling-outliers.py --genotypes *-genotype.txt --unplaced *-unplaced.txt
Output files: STRs.tsv - a single file with all loci in all samples and their outlier p-values (see outputs) $sample.STRs.tsv - the same data, filtered to a single individual