Automatic Assembly For The Fungi (AAFTF) pipeline

Source:

https://github.com/stajichlab/AAFTF

Goal:

Assemble the genome of two M. alpina strains (Mortierella alpina and Mortierella sp.) sequenced at QUT.
Identify DAG (Diacyl-glycerol) and Phopholipase (PLA, PLB, PLC and/or PLD) genes

Method

De novo genome assembly of yeast genomes using the AAFTF pipeline. More information:

https://github.com/stajichlab/AAFTF

(pending) FunGAP pipeline for yeast genome annotation:

https://github.com/CompSynBioLab-KoreaUniv/FunGAP

Preliminary Results

https://wiki.qut.edu.au/display/erbts/M.+alpina+de+novo+genome+assemblies

Resources

Using conda environments:

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html

Methodology

Installing tools and dependencies

Create a conda environment called ‘aaftf’ and install several tools needed by the workflow:

conda create -n aaftf "python>=3.6" bbmap trimmomatic bowtie2 bwa pilon sourmash blast minimap2 spades megahit novoplasty biopython -c bioconda

Activate the conda environment:

conda activate aaftf

Install the aaftf pipeline as follows:

python -m pip install git+https://github.com/stajichlab/AAFTF.git

To deactivate the environment do:

conda deactivate

2. Running AAFTF pipeline

To run jobs on the HPC, we use PBS Pro submission scripts that will submit jobs to an scheduler that will then send jobs to the queue. Use the PBS Pro script below to run 1) quality trimming and filtering; 2) de novo assembly; and 3) check contaminant sequences using the AAFTF pipeline:

#!/bin/bash -l
#PBS -N AAFTF 
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=8:mem=16gb

#QUT eResearch, R.A.Barrero; Modified: 26-Apr-2022
#Usage: qsub launch_AAFTF_pipeline.pbs

cd $PBS_O_WORKDIR

#activate the AAFTF pipeline environment
conda activate aaftf

###################
###  VARIABLES  ###
###################

MEM=16 # amount of memory; can higher for example 32gb
BASE=Mort # prefix name for assembly
READSDIR=/work/speight_team/projects/yeast_genomes/data/Mort_Nextera-357721696
TRIMREAD=reads_trimmed # name of output folder for quality trimmed reads
CPU=8 # number of cpu cores to use


###################
###  PIPELINE   ###
###################

############################################
### STEP1: Quality trimming of raw reads ###
############################################

#make output directory for QCed reads
mkdir -p $TRIMREAD

#run the AAFTF pipeline
AAFTF trim --method bbduk --memory $MEM -c $CPU \
  --left $READSDIR/${BASE}_R1.fastq.gz --right $READSDIR/${BASE}_R2.fastq.gz \
  -o $TRIMREAD/${BASE}  --trimmomatic_adaptors nextera_adaptor.fa

# filter out possible contamination sequences. This step make take a lot of memory depending on how many filtering libraries you use
AAFTF filter -c $CPU --memory $MEM --aligner bbduk \
          -o $TRIMREAD/${BASE} --left $TRIMREAD/${BASE}_1P.fastq.gz --right $TRIMREAD/${BASE}_2P.fastq.gz
          
############################################
### STEP2: De novo genome assembly       ###
############################################

# Additional variables

LEFT=$TRIMREAD/${BASE}_filtered_1.fastq.gz
RIGHT=$TRIMREAD/${BASE}_filtered_2.fastq.gz

WORKDIR=working_AAFTF
OUTDIR=genomes
ASMFILE=$OUTDIR/${BASE}.spades.fasta

#make work and output directories
mkdir -p $WORKDIR $OUTDIR

# run the de novo genome assembly
AAFTF assemble -c $CPU --mem $MEM \
          --left $LEFT --right $RIGHT  \
           -o $ASMFILE -w $WORKDIR/spades_$BASE

############################################
### STEP3: Check and remove vector seqs  ###
############################################

# Additional variables
ASMFILE=$OUTDIR/${BASE}.spades.fasta
VECTRIM=$OUTDIR/${BASE}.vecscreen.fasta

#mkdir -p $WORKDIR $OUTDIR

#run trimming of vector sequences
AAFTF vecscreen -c $CPU -i $ASMFILE -o $VECTRIM

The above template is to run the AAFTF pipeline for the Mort strain, to use it for the Alpine strain we need to modify the following two variables in the above script:

BASE=Alpina
READSDIR=/work/speight_team/projects/yeast_genomes/data/Alpina_Nextera-357932578

Once change the above variables then you can submit the launch_AAFTF_pipeline.pbs script to the scheduler as follows:

qsub launch_AAFTF_pipeline.pbs

Check the progress of the pipeline:

qjobs

#alternatively use
qstats -u USERNAME

List of public genomes

https://www.ncbi.nlm.nih.gov/genome/browse/#!/eukaryotes/11236/

Fetch for phospholipase proteins:

#step1: decompress file
gzip -d GCA_015679415.1_UCR_MalpAD072_1.0_protein.faa.gz

#step2: wrap sequence to one line
python2.7 /work/speight_team/projects/yeast_genomes/scripts/extract_seqs.py GCA_015679415.1_UCR_MalpAD072_1.0_protein.faa 0 | sed 's/lcl|//' > GCA_015679415.1_UCR_MalpAD072_1.0_protein.mod.faa

#step3" check file
less -S GCA_015679415.1_UCR_MalpAD072_1.0_protein.mod.faa

#use grep to fetch name of interest
grep -A 1 "Phospholipase" GCA_015679415.1_UCR_MalpAD072_1.0_protein.mod.faa | sed '/^--$/d' > Phospholipase_proteins.fasta