계통수 작성하기

Roary 또는 다른 multiple sequence alignment(MSA) 소프트웨어로 만들어진 다중 염기서열 정렬(MSA) 결과물을 trimAl로 트리밍한 뒤 FastTree로 처리하면 신속하게 approximately-maximum likelihood tree를 얻을 수 있다. 진짜 maximum likelihood(ML) tree를 얻으려면 PhyML이나 RAxML을 사용한다. FastTree는 매우 유용한 프로그램이지만 진정한 의미의 ML tree를 만들어주지 않고, 생성하는 support value 역시 널리 이용되는 bootstrap value가 아니라 Shimodaira-Hasegawa test에 의한 값이라는 한계점이 있다.

Alignment로부터 불량한 곳을 제거하려면 trimAl 이외에도 Gblocks를 쓸 수 있다. ClipKIT는 MSA로부터 계통발생학적으로 의미 있는 위치만을 남기고 나머지를 제거하는 도구이다. TrimAl 패키지에 포함된 readAl은 sequence alignment file의 포맷 전환용 유틸리티이다(예: FASTA → Phylip). MSA의 편집과 시각화 및 분석에는 Jalview가 유용할 것이다. SeaView('Multiplatform GUI for molecular phylogeny')도 MSA 자료의 시각화를 위한 프로그램이다. 작은 용량의 MSA를 시각화하려면 'trimal -htmlout filename.html' 명령어를 써도 좋다.

정렬된 MSA 자료의 조작에는 FASconCAT이나 Phylommand(tree 자료도 다룰 수 있음) 등의 유틸리티를 써도 좋을 것이다.

Species level을 벗어나는 genome의 MSA를 얻으려면 - ezTree

Roary는 하나의 species에 속하는 균주 유전체를 대상으로 core gene group을 동정한 뒤 이로부터 MSA를 생성해 낸다. 만약 분석 대상 미생물이 여러 species에 걸쳐 있다면 core gene이 잘 형성되지 않으므로 원하는 결과를 얻기 어렵다. 나는 이러한 상황에서는 ezTree를 애용한다. 인용도 많이 된다거나 지속적인 업데이트가 이루어지는 인기 있는 소프트웨어는 아니지만, 작동 원리가 간단하기 때문이다. ezTree는 pre-defined single-copy marker에 의존하지 않는다. 입력 유전체 서열(gene prediction을 자체적으로 실시하여 proteome set로 전환하여 사용)로부터 PFAM profile annotation을 거쳐 직접 single-copy marker gene set을 찾아낸다. 상세한 원리는 2018년도에 발표된 논문을 사용하기 바란다. ezTree 원본은 PFAM 검색 작업이 병렬화되지 않아서 계산에 많은 시간이 소요된다. 따라서 병렬화가 가능하도록 수정한 스크립트를 만들게 되었다. 이 파일 묶음(zipped)를 풀어서 나오는 Perl script 2개를 작업 디렉토리에 복사한 뒤 실행 권한을 주고 다음 설명대로 활용하기 바란다. ezTree 실행에 필요한 다른 dependency의 설치는 ezTree GitHub 웹사이트를 참조하라. 아마도 나는 conda base environment에 필요한 프로그램을 설치해 둔 것 같다.

51개 Paenibacillus polymyxa 관련 균주 유전체 자료 파일(zipped)에 포함된 accessions.txt에 해당하는 유전체 염기서열 파일을 다운로드한 뒤 이를 전부 fasta_230812.51라는 디렉토리에 두었다고 가정하자.

#FASTA file의 목록을 먼저 만든다.
$ find fasta_230812.51 -type f > list_230812.51
$ cat list_230812.51
fasta_230812.51/GCF_000507205.3.fna
fasta_230812.51/GCF_001272655.2.fna
fasta_230812.51/GCF_001955935.1.fna
...
#30개 스레드를 사용한다.
#트리 모델은 JTT를 기본으로 한다.
# Ryzen server: /data/apps/ezTree
(base) $ nohup ./ezTree_hyjeong -list list_230812.51 -out run_230812 -thread 30 &
$ ls -l | grep run_230812
-rw-rw-r-- 1 hyjeong hyjeong 7941159  8월 12 15:49 run_230812.aln
-rw-rw-r-- 1 hyjeong hyjeong    2526  8월 12 15:56 run_230812.nwk
-rw-rw-r-- 1 hyjeong hyjeong   17618  8월 12 15:49 run_230812.pfam
drwxrwxr-x 2 hyjeong hyjeong   24576  8월 12 15:49 run_230812.work

Newick 파일에서 각 leaf label은 최초에 투입한 FASTA file명(예: GCF_000507205.3.fna)이 그대로 쓰이고 있을 것이다. vim으로 run_230812.nwk 파일을 열어서 '.fna' 문자열을 제거하면 더욱 바람직할 것이다. Non-greedy match를 원한다면 Stack Overflow의 How can I make my match non greedy in vim?을 참조하라.

MSA로부터 newick file 만들어 보기

Roary가 생성한 MSA(core_gene_alignment.aln)로부터 간단하게 계통수를 만드는 방법은 다음과 같다.

$ trimal -in core_gene_alignment.aln -out core_gene_alignment.aln.trim -automated1
$ fasttree -nt -gtr core_gene_alignment.aln.trim > my_tree.newick

세균의 유전체 진화에서는 horizontal gene transfer(HGT)가 매우 빈번히 일어난다. HGT로 유입된 영역을 제거하지 않으면 정확한 계통수를 작성하기가 어려워질 수 있다. 이를 위해 gubbins와 같은 도구를 이용하여 sequence alignment file을 처리하는 것이 바람직하다. Gubbins의 사용에 관해서는 Snippy를 이용한 간편한 변이 탐색 항목을 참조하기 바란다.

Newick 포맷의 파일은 FigTree, Archaeopteryx, Dendroscope 또는 iTOL server 등을 통해서 시각화하면 된다. 특히 iTOL server는 각종 annotation 자료를 트리와 같이 표현할 수 있다는 점에서 유용하다.

R에서 트리 자료 다루기

R에서 tree file을 다루려면 ape package를 사용하면 된다.

> library(ape)
> myTree = ape::read.tree("gubbins2.final_tree.tre")
> myTree

Phylogenetic tree with 11 tips and 10 internal nodes.

Tip labels:
	NCTC2916, Ibaraki2007, Iwate2008, Iwate2007, Fukuoka2010, Miyagi2006, ...

Rooted; includes branch lengths.
> myTree$tip.label
[1] "NCTC2916"    "Ibaraki2007" "Iwate2008"   "Iwate2007"   "Fukuoka2010"
[6] "Miyagi2006"  "Tochigi2008" "Af650"       "Reference"   "Okayama2011"
[11] "Aichi2011"
> plot(myTree) # tree 구조가 그림으로 표시된다.
# Newick format의 텍스트를 직접 
> myTree2 <- ape::read.tree(text='((A:1, B:1):2, ((C:1, D:1):2, (E:1, F:1):2):4);')

R에서 트리 자료를 본격적으로 다루려면 phytools 패키지를 권장한다. 계통수 자료의 처리에 입문하려면 Introduction to phylogenies in R을 학습할 것을 권장한다.

IQ-TREE로 maximum likelihood를 이용한 계통분류학적 추론하기

IQ-TREE는 새로운 전략을 구사하여 ML tree를 좀 더 빠르고 정확하게 추론하기 위하여 개발된 소프트웨어 패키지이다. 2020년 발표된 IQ-TREE 2(Mol Biol Evol 논문)에서는 새로운 모델을 추가하고 계산방법도 더욱 개선되었다. 자세한 내용은 웹사이트에 게시된 튜토리얼 및 매뉴얼 PDF 파일을 참조하라. IQ-TREE의 입력 파일은 Phylip 포맷을 따라야 하므로, MSA 프로그램의 결과물이 만약 다른 종류의 포맷이라면 바로 위에서 소개한 readAl을 이용하여 전환하도록 한다. 그러나 최신 버전의 IQ-TREE는 aligned FASTA 파일을 자동으로 인식하여 처리하므로 phylip 포맷으로 일부러 바꿀 필요는 없다. 다음은 간략한 실행 사례이다.

1. Infer maximum-likelihood tree from a sequence alignment (example.phy)
   with the best-fit model automatically selected by ModelFinder:
     iqtree -s example.phy

2. Perform ModelFinder without subsequent tree inference:
     iqtree -s example.phy -m MF
   (use '-m TEST' to resemble jModelTest/ProtTest)

3. Combine ModelFinder, tree search, ultrafast bootstrap and SH-aLRT test:
     iqtree -s example.phy --alrt 1000 -B 1000

4. Perform edge-linked proportional partition model (example.nex):
     iqtree -s example.phy -p example.nex
   (replace '-p' by '-Q' for edge-unlinked model)

5. Find best partition scheme by possibly merging partitions:
     iqtree -s example.phy -p example.nex -m MF+MERGE
   (use '-m TESTMERGEONLY' to resemble PartitionFinder)

6. Find best partition scheme followed by tree inference and bootstrap:
     iqtree -s example.phy -p example.nex -m MFP+MERGE -B 1000

7. Use 4 CPU cores to speed up computation: add '-T 4' option

8. Polymorphism-aware model with HKY nucleotide model and Gamma rate:
     iqtree -s counts_file.cf -m HKY+P+G

9. PoMo mixture with virtual popsize 5 and weighted binomial sampling:
     iqtree -s counts_file.cf -m "MIX{HKY+P{EMP},JC+P}+N5+WB"

Genome Informatics Laboratory at KRIBB

Table of Contents

계통수 작성하기

Species level을 벗어나는 genome의 MSA를 얻으려면 - ezTree

MSA로부터 newick file 만들어 보기

R에서 트리 자료 다루기

IQ-TREE로 maximum likelihood를 이용한 계통분류학적 추론하기