Background

A tertiary care hospital in Madrid has reported a sudden spike in severe lung infections among immunocompromised patients. Despite standard antibiotics and antifungal therapies, the infections persist, suggesting an unusual pathogen is at play. A mycology research unit within the hospital’s pathology department has isolated a fungal specimen from several patients that, while resembling Penicillium species in morphology, exhibits atypical pathogenic behaviors.

Traditional diagnostic methods, including microscopic examination, culture characteristics, and basic serological tests, have proved inadequate for classifying this isolate. Given the urgency of the situation, the hospital's genetics lab has used rapid nanopore sequencing to analyze the genome of the fungus in the hope of identifying its origins and potential virulence factors, but the lab lacks the expertise to perform a thorough phylogenomic analysis.

The hospital has just contacted you as a leading expert in phylogenomics. You are in the middle of a course, but who cares! This looks important. You open your laptop, get a cup of coffee, and start looking at the data that the hospital has sent you.

The inferred proteome of the fungal isolate is at isolate/isolate.pep.fa. You pick some random entries and run a few online BLAST searches, which confirm that the specimen has close relatives in the genus Talaromyces. However, the sequences appear to be quite divergent from all known species, with about 85-90% protein identity. Based on these clues, your team has prepared the ground for a phylogenomic survey of this new fungal species by performing the following two preliminary steps for you:

Data prep 1. Data preparation

Data prep 2. Clustering gene families

Time to analyze data!


PART I - Phylogenetic characterization of the fungal isolate

Task 1: Identify gene family clusters composed solely of single-copy orthologs that are present in at least 104 of the genomes/species.
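
One possible way to approach the filtering in Task 1 is sketched below. It assumes, purely for illustration, that the clusters have already been loaded into a dictionary and that every sequence id has the form species|gene; the real cluster file format and id convention from the data prep steps may differ, and the names species_of and single_copy_clusters are hypothetical.

```python
from collections import Counter


def species_of(seq_id, sep="|"):
    """Return the species prefix of a sequence id (assumed 'species|gene' format)."""
    return seq_id.split(sep, 1)[0]


def single_copy_clusters(clusters, min_species):
    """Keep clusters in which every species occurs exactly once
    and at least `min_species` species are represented."""
    selected = {}
    for cluster_id, members in clusters.items():
        counts = Counter(species_of(m) for m in members)
        if len(counts) >= min_species and all(n == 1 for n in counts.values()):
            selected[cluster_id] = sorted(members)
    return selected


# Toy input (hypothetical ids); the real threshold in Task 1 is 104 species.
toy = {
    "fam1": ["spA|g1", "spB|g7", "spC|g3"],            # single copy, 3 species
    "fam2": ["spA|g2", "spA|g9", "spB|g4", "spC|g5"],  # spA is duplicated
    "fam3": ["spA|g6", "spB|g8"],                      # only 2 species
}
print(sorted(single_copy_clusters(toy, min_species=3)))  # ['fam1']
```

For the actual data, min_species would be set to 104 and the dictionary would be built from the clustering output.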

Task 2: Build a phylogenetic tree for each of the identified single-copy gene families, using mafft to reconstruct the alignment and FastTree to infer the tree.
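
As a rough sketch of Task 2, the per-family pipeline can be driven from Python. This assumes that mafft and FastTree are on the PATH under exactly those names (some installations call the binary fasttree), that each family is stored in its own protein FASTA file, and that default FastTree settings are acceptable; the function names and the output file naming are illustrative choices, not part of the assignment.

```python
import subprocess
from pathlib import Path


def mafft_cmd(fasta):
    """mafft writes the alignment to stdout; --auto lets it pick a strategy."""
    return ["mafft", "--auto", str(fasta)]


def fasttree_cmd(alignment):
    """FastTree treats input as protein by default and writes Newick to stdout."""
    return ["FastTree", str(alignment)]


def build_gene_tree(fasta, out_dir):
    """Align one gene family and infer its tree; returns (alignment, tree) paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(fasta).stem
    aln = out_dir / f"{stem}.aln.fa"
    tree = out_dir / f"{stem}.nwk"
    with open(aln, "w") as fh:
        subprocess.run(mafft_cmd(fasta), stdout=fh, check=True)
    with open(tree, "w") as fh:
        subprocess.run(fasttree_cmd(aln), stdout=fh, check=True)
    return aln, tree
```

Calling build_gene_tree once per selected family (for example in a loop over the FASTA files) leaves one alignment and one Newick tree per family in the output directory.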

Task 3: Using the software ASTRAL, infer a species tree from all the gene trees inferred in Task 2.
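
ASTRAL takes a single file with one gene tree per line and expects leaf names that identify the species. The sketch below shows one way to prepare such input, assuming leaf labels of the form species|gene and a Java build of ASTRAL run as a jar with -i and -o; the jar path and the helper names are placeholders, and other ASTRAL distributions may be invoked differently.

```python
import re

# A leaf label directly follows "(" or ","; keep only the part before "|".
LEAF = re.compile(r"(?<=[(,])([^|(),:;]+)\|[^(),:;]+")


def relabel_to_species(newick):
    """Replace 'species|gene' leaf labels with 'species' in a Newick string."""
    return LEAF.sub(r"\1", newick)


def astral_cmd(jar, gene_trees, out_tree):
    """Command line for a jar-based ASTRAL (jar location is an assumption)."""
    return ["java", "-jar", str(jar), "-i", str(gene_trees), "-o", str(out_tree)]


example = "((spA|g1:0.1,spB|g7:0.2)0.95:0.05,spC|g3:0.3);"
print(relabel_to_species(example))
# ((spA:0.1,spB:0.2)0.95:0.05,spC:0.3);
```

The relabelled trees can be written one per line to a single file, which is then passed to ASTRAL with subprocess.run(astral_cmd(...), check=True).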

Task 4: Using the multiple sequence alignments reconstructed in Task 2, build a concatenated (supermatrix) species tree.
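
The concatenation step of Task 4 can be sketched as follows. It assumes each alignment has already been parsed into a dictionary keyed by species name (not by gene id), and it pads species missing from a family with gap characters, which is one common convention rather than the only option. The function name concatenate is illustrative.

```python
def concatenate(alignments, gap="-"):
    """Join per-family alignments ({species: aligned_seq}) into one supermatrix.

    Species absent from a family are filled with gaps of that family's width.
    """
    species = sorted({sp for aln in alignments for sp in aln})
    parts = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in an alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            parts[sp].append(aln.get(sp, gap * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


fam1 = {"spA": "MK-L", "spB": "MKAL"}
fam2 = {"spA": "GG", "spC": "GA"}
print(concatenate([fam1, fam2]))
# {'spA': 'MK-LGG', 'spB': 'MKAL--', 'spC': '----GA'}
```

The resulting supermatrix can be written out as FASTA and given to a tree inference program such as FastTree to obtain the concatenated species tree.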

Task 5: Compare gene trees and species trees
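
One common way to compare two trees is the Robinson-Foulds (RF) distance, which counts the bipartitions found in one tree but not in the other. The assignment does not prescribe a metric, so this is only one option. The minimal sketch below uses a deliberately simple Newick reader and assumes both trees have exactly the same leaf set and plain, unquoted labels; gene trees with missing species would first need the species tree pruned to their leaves, and a dedicated tree library would be more robust in practice.

```python
import re


def clades(newick):
    """Return the leaf set below every internal node of a Newick tree."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, found, prev = [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.add(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in (",", ";") and prev != ")" and stack:
            # A leaf label; text right after ")" is a support/length label.
            stack[-1].add(tok.split(":")[0].strip())
        prev = tok
    return found


def bipartitions(newick):
    """Non-trivial splits of the unrooted tree, each stored as the side
    that does not contain a fixed reference leaf."""
    found = clades(newick)
    leaves = max(found, key=len)  # the root clade holds every leaf
    ref = min(leaves)
    splits = set()
    for clade in found:
        side = leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:
            splits.add(frozenset(side))
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


print(rf_distance("((A,B),(C,D),E);", "((A,C),(B,D),E);"))       # 4
print(rf_distance("(((A,B),C),(D,E));", "((A,B),(C,(D,E)));"))   # 0
```

The second call returns 0 because the two strings describe the same unrooted topology with different rootings.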

Task 6: Estimate per-clade gene tree support in the species trees
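
A simple reading of per-clade support is the fraction of gene trees that contain each clade of the species tree. The sketch below takes that definition as an assumption; it expects the clades to be already extracted as frozensets of species names and ignores the complication of gene trees with missing species. The name clade_support is illustrative.

```python
def clade_support(species_clades, gene_tree_clades):
    """Fraction of gene trees containing each species-tree clade.

    species_clades: iterable of frozensets (clades of the species tree)
    gene_tree_clades: list with one set of frozensets per gene tree
    """
    if not gene_tree_clades:
        raise ValueError("need at least one gene tree")
    total = len(gene_tree_clades)
    return {
        clade: sum(clade in tree for tree in gene_tree_clades) / total
        for clade in species_clades
    }


AB, DE = frozenset("AB"), frozenset("DE")
gene_trees = [
    {AB, DE},
    {AB, frozenset("CD")},
    {frozenset("AC"), DE},
    {AB, frozenset("CE")},
]
support = clade_support([AB, DE], gene_trees)
print(support[AB], support[DE])  # 0.75 0.5
```

Clades with low values are those where many gene trees disagree with the species tree, which is where the ASTRAL and supermatrix topologies are most worth inspecting.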