hg002-qc-snakemake - HG002 QC Snakemake

Overview

A Snakemake workflow for quality control of HG002 whole-genome sequencing data: read length/quality stats, hifiasm assembly metrics, small and structural variant benchmarking, phasing stats, and coverage summaries.

To Run

Resources and data are specified within the snakefile (hg002QC.smk) for simplicity. Tested with snakemake v6.15.3.

Warning: Several steps of this workflow require minimum coverage. It's recommended that this workflow not be run when the yield in base pairs is insufficient to produce at least 15X coverage (i.e. yield/3099922541 >= 15).
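As a quick pre-flight check, this arithmetic can be scripted; a minimal Python sketch (the yield value shown is hypothetical):

# estimate coverage from total yield in base pairs
GENOME_SIZE = 3099922541  # genome size used in the calculation above

def has_min_coverage(yield_bp, min_coverage=15):
    return yield_bp / GENOME_SIZE >= min_coverage

print(has_min_coverage(50_000_000_000))  # ~16X coverage -> True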

# clone repo
git clone --recursive https://github.com/PacificBiosciences/pb-human-wgs-workflow-snakemake.git workflow

# make necessary directories
mkdir cluster_logs

# create conda environment
conda env create --file workflow/environment.yaml

# activate conda environment
conda activate pb-human-wgs-workflow

# submit job
sbatch workflow/run_hg002QC.sh

Plots

A list of important stats from target files that would be good for plotting.

targets = [f"conditions/{condition}/{filename}"
                    for condition in ubam_dict.keys()
                    for filename in ["smrtcell_stats/all_movies.read_length_and_quality.tsv",
                                    "hifiasm/asm.p_ctg.fasta.stats.txt",
                                    "hifiasm/asm.a_ctg.fasta.stats.txt",
                                    "hifiasm/asm.p_ctg.qv.txt",
                                    "hifiasm/asm.a_ctg.qv.txt",
                                    "truvari/summary.txt",
                                    "pbsv/all_chroms.pbsv.vcf.gz",
                                    "deepvariant/deepvariant.vcf.stats.txt",
                                    "whatshap/deepvariant.phased.tsv",
                                    "happy/all.summary.csv",
                                    "happy/all.extended.csv",
                                    "happy/cmrg.summary.csv",
                                    "happy/cmrg.extended.csv",
                                    "mosdepth/coverage.mosdepth.summary.txt",
                                    "mosdepth/mosdepth.M2_ratio.txt",
                                    "mosdepth/gc_coverage.summary.txt",
                                    "mosdepth/coverage.thresholds.summary.txt"]]
  • smrtcell_stats/all_movies.read_length_and_quality.tsv
    • outputs 3 columns (read name, read length, read quality)
    • boxplots of read length and quality (see the pandas sketch after this list)
  • hifiasm/asm.p_ctg.fasta.stats.txt (primary) + hifiasm/asm.a_ctg.fasta.stats.txt (alternate)
    • all stats below should be collected for both primary (p_ctg) and alternate (a_ctg) assemblies
    • assembly size awk '$1=="SZ" {print $2}' <filename>
    • auN (area under the Nx curve) awk '$1=="AU" {print $2}' <filename>
    • NGx - line plot of NG10 through NG90 awk '$1=="NL" {print $2, $3}' <filename> ($2 is x-axis, $3 y-axis) like this: example plot (see the NGx sketch after this list)
  • hifiasm/asm.p_ctg.qv.txt + hifiasm/asm.a_ctg.qv.txt
    • adjusted assembly quality awk '$1=="QV" {print $3}' <filename> for primary and alternate assemblies
  • truvari/summary.txt (see the JSON sketch after this list)
    • structural variant recall jq .recall <filename>
    • structural variant precision jq .precision <filename>
    • structural variant f1 jq .f1 <filename>
    • number of calls jq '."call cnt"' <filename>
    • FP jq .FP <filename>
    • TP-call jq '."TP-call"' <filename>
    • FN jq .FN <filename>
    • TP-base jq '."TP-base"' <filename>
  • pbsv/all_chroms.pbsv.vcf.gz
    • counts of each type of variant bcftools query -i 'FILTER=="PASS"' -f '%INFO/SVTYPE\n' <filename> | awk '{A[$1]++}END{for(i in A)print i,A[i]}'
    • can also do size distributions of indels bcftools query -i 'FILTER=="PASS" && (INFO/SVTYPE=="INS" | INFO/SVTYPE=="DEL")' -f '%INFO/SVTYPE\t%INFO/SVLEN\n' <filename>
  • deepvariant/deepvariant.vcf.stats.txt
    • several values in lines starting with 'SN' awk '$1=="SN"' <filename>
      • number of SNPS
      • number INDELs
      • number of multi-allelic sites
      • number of multi-allelic SNP sites
    • ratio of transitions to transversions awk '$1=="TSTV" {print $5}' <filename>
    • can monitor substitution types awk '$1=="ST"' <filename>
    • SNP heterozygous : non-ref homozygous ratio awk '$1=="PSC" {print $6/$5}' <filename>
    • SNP transitions : transversions awk '$1=="PSC" {print $7/$8}' <filename>
    • Number of heterozygous insertions : number of homozygous alt insertions awk '$1=="PSI" {print $8/$10}' <filename>
    • Number of heterozygous deletions : number of homozygous alt deletions awk '$1=="PSI" {print $9/$11}' <filename>
    • Total INDEL heterozygous:homozygous ratio awk '$1=="PSI" {print ($8+$9)/($10+$11)}' <filename>
  • whatshap/deepvariant.phased.tsv
    • phase block N50 awk '$2=="ALL" {print $22}' <filename>
    • bp_per_block_sum (total number of phased bases) awk '$2=="ALL" {print $18}' <filename>
  • whatshap/deepvariant.phased.blocklist
    • calculate phase block sizes (to - from) and reverse-sort them (awk 'NR>1 {print $5-$4}' <filename> | sort -nr), then plot as a cumulative line graph, N0 to N90, like for the assembly: example plot (see the phase block sketch after this list)
  • happy/all.summary.csv + happy/cmrg.summary.csv
    • stats should be collected for both all variants and the CMRG (Challenging Medically Relevant Genes) benchmark (see the pandas sketch after this list)
      • SNP recall awk -F, '$1=="SNP" && $2=="PASS" {print $10}' <filename>
      • SNP precision awk -F, '$1=="SNP" && $2=="PASS" {print $11}' <filename>
      • SNP F1 awk -F, '$1=="SNP" && $2=="PASS" {print $13}' <filename>
      • INDEL recall awk -F, '$1=="INDEL" && $2=="PASS" {print $10}' <filename>
      • INDEL precision awk -F, '$1=="INDEL" && $2=="PASS" {print $11}' <filename>
      • INDEL F1 awk -F, '$1=="INDEL" && $2=="PASS" {print $13}' <filename>
  • happy/all.extended.csv + happy/cmrg.extended.csv
    • there are many stratifications that can be examined, and Aaron Wenger might have opinions on which are most important. The below commands are just for one stratification, "GRCh38_lowmappabilityall.bed.gz".
    • SNP GRCh38_lowmappabilityall recall awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
    • SNP GRCh38_lowmappabilityall precision awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
    • SNP GRCh38_lowmappabilityall F1 awk -F, '$1=="SNP" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
    • INDEL GRCh38_lowmappabilityall recall awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $8}' <filename>
    • INDEL GRCh38_lowmappabilityall precision awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $9}' <filename>
    • INDEL GRCh38_lowmappabilityall F1 awk -F, '$1=="INDEL" && $2=="*" && $3=="GRCh38_lowmappabilityall.bed.gz" && $4=="PASS" {print $11}' <filename>
  • mosdepth/coverage.mosdepth.summary.txt
    • mean aligned coverage: 4th column of the final row (can grep 'total_region')
  • mosdepth/mosdepth.M2_ratio.txt
    • outputs single value: ratio of chr2 coverage to chrM coverage
    • bar chart of M2 ratio
  • mosdepth/gc_coverage.summary.txt
    • outputs 5 columns: GC percentage bin, q1, median, q3, count
    • q1, median, q3 columns are coverage statistics at different GC percentages (e.g. median coverage at 30% GC)
    • "count" refers to # of 500 bp windows that fall in that bin
    • can pick a couple of key GC coverage bins and make box plots out of them
  • mosdepth/coverage.thresholds.summary.txt
    • outputs 10 columns corresponding to % of genome sequenced to minimum coverage depths (1X - 10X)
    • maybe a line chart comparing the different coverage thresholds among conditions
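Below are a few minimal plotting sketches for the stats above; all assume pandas/matplotlib and a hypothetical condition named cond_A. First, the read length and quality boxplots:

import pandas as pd
import matplotlib.pyplot as plt

# three unnamed columns: read name, read length, read quality
# (cond_A is a hypothetical condition directory)
df = pd.read_csv("conditions/cond_A/smrtcell_stats/all_movies.read_length_and_quality.tsv",
                 sep="\t", names=["read_name", "read_length", "read_quality"])

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(df["read_length"])
ax1.set_title("Read length")
ax2.boxplot(df["read_quality"])
ax2.set_title("Read quality")
fig.savefig("read_length_and_quality.png")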
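The NGx line plot, built from the same 'NL' rows of the hifiasm assembly stats file that the awk command above prints:

import matplotlib.pyplot as plt

xs, ys = [], []
with open("conditions/cond_A/hifiasm/asm.p_ctg.fasta.stats.txt") as fh:
    for line in fh:
        fields = line.split()
        if fields and fields[0] == "NL":
            xs.append(int(fields[1]))  # percentile: 10, 20, ..., 90
            ys.append(int(fields[2]))  # contig length at that percentile

plt.plot(xs, ys, marker="o")
plt.xlabel("NG(x)")
plt.ylabel("Contig length (bp)")
plt.savefig("ngx.png")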
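The truvari summary is JSON, so the jq calls above translate directly to the json stdlib:

import json

with open("conditions/cond_A/truvari/summary.txt") as fh:
    summary = json.load(fh)

for key in ("recall", "precision", "f1", "call cnt", "FP", "TP-call", "FN", "TP-base"):
    print(key, summary[key])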
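The phase block cumulative plot, using the same to/from columns as the awk command above (the exact N(x) convention plotted here is an assumption):

import matplotlib.pyplot as plt

sizes = []
with open("conditions/cond_A/whatshap/deepvariant.phased.blocklist") as fh:
    next(fh)  # skip header row (awk NR>1)
    for line in fh:
        fields = line.split()
        sizes.append(int(fields[4]) - int(fields[3]))  # to - from

sizes.sort(reverse=True)
total = sum(sizes)
cumulative, xs, ys = 0, [], []
for size in sizes:
    cumulative += size
    xs.append(100 * cumulative / total)  # percent of phased bases
    ys.append(size)                      # block size at that point

plt.plot(xs, ys)
plt.xlabel("N(x)")
plt.ylabel("Phase block size (bp)")
plt.savefig("phase_block_nx.png")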
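The hap.py summary metrics with pandas instead of awk (column names assumed from hap.py's standard summary.csv header; they correspond to fields 10, 11, and 13 used above):

import pandas as pd

df = pd.read_csv("conditions/cond_A/happy/all.summary.csv")
passing = df[df["Filter"] == "PASS"]
for vtype in ("SNP", "INDEL"):
    row = passing[passing["Type"] == vtype]
    print(vtype)
    print(row[["METRIC.Recall", "METRIC.Precision", "METRIC.F1_Score"]].to_string(index=False))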
Owner

Juniper A. Lake, Bioinformatics Scientist