General files

International WGS pipeline validation
To validate our pipeline, we compared our results against standardised pipelines developed by public health TB institutions. These TB reference laboratories include the Research Center Borstel (Germany), Public Health England (UK), and the National Institute for Public Health and the Environment (RIVM, The Netherlands). In all comparisons, we obtained similar SNP calls and similar predictions of transmission clusters. Furthermore, the parameters used in our pipeline have been discussed and approved by the TB community (Meehan, CJ. et al. 2019. Nat. Rev. Microbiol. doi:10.1038/s41579-019-0214-5).
Pipeline available at: https://gitlab.com/tbgenomicsunit/ThePipeline
For a detailed flowchart of the pipeline, please see this document.

MTB Inferred Ancestor Sequence
Sequence for the inferred MTB ancestor from Comas et al. 2010.

SNP panel for lineage classification
We use this SNP panel for lineage typing purposes. The list has been constructed from different sources. Initially, the SNPs came from the Coll et al. publication. Later, we modified the L2 classification using the SNPs from the work of Shitikov et al., while following the lineage nomenclature of Rutaihwa et al.
The L4 nomenclature and SNPs were also updated based on the ones proposed by Stucki et al.
Regarding the animal-adapted strains, we have calculated two lineage-defining SNPs for M. bovis and M. caprae from our own collection of samples. In addition, we have calculated one lineage-defining SNP for each of the A3, A2 and A1 lineages (as defined by Brites et al.).
Because we used the MTBC inferred ancestor sequence as the mapping reference in our analyses, the reference allele in our list matches the allele in the ancestral genome. Please bear this in mind when classifying lineage 4 and lineage 4.10 strains from genomic sequences mapped against the H37Rv reference genome: H37Rv is itself a lineage 4 strain, so the SNPs that define its own lineages are not called as variants in such mappings.
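As an illustration of how a panel of this kind can be applied, here is a minimal Python sketch. It assumes the variants were called against the inferred ancestor, so a lineage is assigned when the sample carries the derived allele at a panel position. The positions, alleles and the `classify` function are hypothetical placeholders, not values or code taken from the real panel or pipeline.

```python
# Hypothetical panel: position -> (ancestral allele, derived allele, lineage).
# Positions and alleles are placeholders, NOT values from the real SNP panel.
PANEL = {
    1000: ("G", "A", "L2"),
    2000: ("C", "T", "L4"),
    3000: ("A", "G", "L4.10"),
}


def classify(sample_calls, panel=PANEL):
    """Return the sorted lineages whose derived allele is present in the sample.

    sample_calls maps a genome position to the allele called in the sample,
    for variants called against the inferred ancestor. A position absent
    from sample_calls is assumed to carry the ancestral allele.
    """
    hits = set()
    for pos, (ancestral, derived, lineage) in panel.items():
        if sample_calls.get(pos, ancestral) == derived:
            hits.add(lineage)
    return sorted(hits)


# Example: a sample carrying the derived allele at two panel positions,
# plus one variant (4500) that is not part of the panel.
calls = {2000: "T", 3000: "G", 4500: "C"}
print(classify(calls))
```

With calls made against H37Rv instead, a sample sharing the H37Rv allele at these positions would show no variant there, and this sketch would return no lineage 4 assignment. In that case the comparison would have to be inverted for the affected positions.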

Genomic regions masked in phylogenetic and epidemiological studies
In our phylogenetic and epidemiological studies, we filter out regions that are prone to accumulating mapping errors (leading to false-positive SNP calls) or homoplastic variants. Previously, we filtered out genes annotated as PE/PPE, phages or repeats, together with the intergenic regions flanking them. However, we have since revised the set of regions to mask based on other studies and on our own analyses. We now filter out the regions marked as ‘DISCARD’ in the ‘FILTER’ column of the attached TXT file. See the rationale followed for defining these regions here.
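The masking step can be sketched in Python as follows. Only the ‘FILTER’ column and the ‘DISCARD’ value come from the description above; the tab-separated layout, the `START` and `END` column names, the `KEEP` value and all coordinates are assumptions made for illustration and may not match the attached file.

```python
import csv
import io

# Stand-in for the TXT file. The START/END column names, the tab separator,
# the 'KEEP' value and the coordinates are assumptions for illustration only.
EXAMPLE_TABLE = (
    "START\tEND\tFILTER\n"
    "100\t200\tDISCARD\n"
    "300\t400\tKEEP\n"
    "500\t650\tDISCARD\n"
)


def load_masked_regions(handle):
    """Read (start, end) pairs for rows whose FILTER column is 'DISCARD'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (int(row["START"]), int(row["END"]))
        for row in reader
        if row["FILTER"] == "DISCARD"
    ]


def is_masked(pos, regions):
    """True if pos falls inside any masked region (inclusive coordinates)."""
    return any(start <= pos <= end for start, end in regions)


regions = load_masked_regions(io.StringIO(EXAMPLE_TABLE))
snp_positions = [50, 150, 350, 600, 700]
kept = [p for p in snp_positions if not is_masked(p, regions)]
print(kept)
```

To run this on a real file, `io.StringIO(EXAMPLE_TABLE)` would be replaced by an open file handle, after checking the actual column names and coordinate convention.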



Sorted by author

Álvaro Chiner-Oms

Galo Goig Serrano