Performance comparison of SNP discovery programs for bacterial genomes (papers)

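SNP discovery proceeds in several steps: selecting a suitable reference genome sequence, mapping the reads, and variant calling. As partly introduced earlier, a very large number of program combinations can be used for these steps, and their performance differs. Studies that benchmark them have mostly targeted the human genome. The papers below instead compare SNP calling methods for bacteria. The supplementary material accompanying each paper contains the commands needed to generate and analyze simulated data, which makes it a useful reference.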

Evaluation of SNP calling methods for closely related bacterial isolates and a novel high-accuracy pipeline: BactSNP. Microbial Genomics (2019) https://doi.org/10.1099/mgen.0.000261

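When very closely related isolates are compared, as in an outbreak investigation, an excessive number of false-positive SNPs is produced.
Existing SNP callers had difficulty solving this problem, so a new pipeline, BactSNP, was developed.
BactSNP can use both assembly and mapping information, and it can run the analysis even when the reference genome sequence is only at draft level or is not available at all.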
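
To make the false-positive issue concrete, here is a minimal sketch of how called SNPs could be scored against a simulated truth set. It is not taken from the BactSNP paper; the function name `evaluate_snp_calls`, the `(contig, position, alt_base)` tuple representation, and the toy coordinates are all assumptions made for illustration.

```python
def evaluate_snp_calls(truth, called):
    """Score called SNPs against a simulated truth set.

    Both arguments are sets of (contig, position, alt_base) tuples.
    This representation is an assumption made for this sketch.
    """
    tp = len(truth & called)   # SNPs that were called and are real
    fp = len(called - truth)   # SNPs that were called but are not real
    fn = len(truth - called)   # real SNPs that were missed
    # Undefined ratios are reported as 0.0 here (a choice made for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Hypothetical toy data: three true SNPs, four calls.
truth = {("contig1", 10, "A"), ("contig1", 20, "T"), ("contig1", 30, "G")}
called = {("contig1", 10, "A"), ("contig1", 20, "T"),
          ("contig1", 40, "C"), ("contig1", 50, "G")}

result = evaluate_snp_calls(truth, called)
print(result)
```

Between closely related isolates the number of true SNPs is small, so even a few false positives lower precision sharply, as the toy data shows.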

Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism–calling pipelines. GigaScience (2020) https://doi.org/10.1093/gigascience/giaa007

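209 SNP calling pipelines were applied to simulated data.
Regardless of the pipeline, choosing the right reference genome is very important (the lower the Mash distance between the reference sequence and the reads, the better).
SNP calling accuracy drops as intraspecies diversity increases.
When the sequenced genome itself is used as the reference, Novoalign/GATK is the most accurate.
When mapping to a divergent genome, NextGenMap and SMALT (aligners) and LoFreq, mpileup, and Strelka (variant callers) are the most accurate.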
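
The point about Mash distance can be illustrated with a small sketch that ranks candidate references by an estimated Mash distance, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index estimated from bottom-s MinHash sketches. This is a simplified illustration, not the Mash implementation: it skips canonical k-mers, uses `hashlib.blake2b` instead of Mash's hash function, and works on toy strings. The function names, the parameters `k` and `s`, and the sequences are all assumptions.

```python
import hashlib
import math


def bottom_sketch(seq, k, s):
    """Return the s smallest 64-bit hashes of the k-mers in seq."""
    seq = seq.upper()
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers
    }
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash distance for k-mer size k."""
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


def pick_reference(query, candidates, k=7, s=50):
    """Return (name, distance) of the candidate closest to the query."""
    q = bottom_sketch(query, k, s)
    distances = {
        name: mash_distance(jaccard_estimate(q, bottom_sketch(seq, k, s), s), k)
        for name, seq in candidates.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical toy sequences; real use would sketch reads and genomes.
query = "ATGGCGTACGTTAGCCGATAGGCTTACGATCGGATCCGTAAGCTAGGCTA"
candidates = {
    "ref_same": query,
    "ref_far": "TTTTTTTTTTCCCCCCCCCCGGGGGGGGGGAAAAAAAAAA",
}
best_name, best_dist = pick_reference(query, candidates)
print(best_name, best_dist)
```

In practice the distances would come from the Mash tool itself with much larger k-mer and sketch sizes. The sketch only shows why a reference that shares more k-mers with the sample ends up with a lower distance.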

Standardized phylogenetic and molecular evolutionary (PhaME) analysis applied across the microbial tree of life. Scientific Reports (2020) https://doi.org/10.1038/s41598-020-58356-1

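A single workflow that performs phylogenetic analysis by processing heterogeneous bacterial data, such as sequencing reads and draft assemblies, in a standardized way regardless of whether the data have been assembled.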