Differences

This shows you the differences between two versions of the page.

--- bioinfo:계통수_작성하기 [2023/08/11 09:43] – [계통수 작성하기] hyjeong
+++ bioinfo:계통수_작성하기 [2025/02/23 12:57] (current) – [Species level을 벗어나는 genome의 MSA를 얻으려면 - ezTree] hyjeong
@@ Line 1: / Line 1: @@
 ====== 계통수 작성하기 ======
-Roary 또는 다른 multiple sequence alignment(MSA) 소프트웨어로 만들어진 염기서열 정렬 결과물을 [[http://trimal.cgenomics.org/|trimAl]]로 트리밍한 뒤 [[http://www.microbesonline.org/fasttree/|FastTree]]로 처리하면 신속하게 approximately-maximum likelihood tree를 얻을 수 있다. 진짜 maximum likelihood(ML) tree를 얻으려면 [[https://github.com/stephaneguindon/phyml|PhyML]]이나 [[https://raxml-ng.vital-it.ch/|RAxML]]을 사용한다. FastTree는 매우 유용한 프로그램이지만 진정한 의미의 ML tree를 만들어주지 않고, 생성하는 support value 역시 널리 이용되는 bootstrap value가 아니라 Shimodaira-Hasegawa test에 의한 값이라는 한계점이 있다.
+Roary 또는 다른 multiple sequence alignment(MSA) 소프트웨어로 만들어진 다중 염기서열 정렬(MSA) 결과물을 [[http://trimal.cgenomics.org/|trimAl]]로 트리밍한 뒤 [[http://www.microbesonline.org/fasttree/|FastTree]]로 처리하면 신속하게 approximately-maximum likelihood tree를 얻을 수 있다. 진짜 maximum likelihood(ML) tree를 얻으려면 [[https://github.com/stephaneguindon/phyml|PhyML]]이나 [[https://raxml-ng.vital-it.ch/|RAxML]]을 사용한다. FastTree는 매우 유용한 프로그램이지만 진정한 의미의 ML tree를 만들어주지 않고, 생성하는 support value 역시 널리 이용되는 bootstrap value가 아니라 Shimodaira-Hasegawa test에 의한 값이라는 한계점이 있다.
 Alignment로부터 불량한 곳을 제거하려면 trimAl 이외에도 Gblocks를 쓸 수 있다. [[https://github.com/JLSteenwyk/ClipKIT|ClipKIT]]는 MSA로부터 계통발생학적으로 의미 있는 위치만을 남기고 나머지를 제거하는 도구이다. TrimAl 패키지에 포함된 readAl은 sequence alignment file의 포맷 전환용 유틸리티이다(예: FASTA -> Phylip). MSA의 편집과 시각화 및 분석에는 [[https://www.jalview.org/|Jalview]]가 유용할 것이다. [[https://doua.prabi.fr/software/seaview|SeaView]]('Multiplatform GUI for molecular phylogeny')도 MSA 자료의 시각화를 위한 프로그램이다. 작은 용량의 MSA를 시각화하려면 'trimal -htmlout //filename.html//' 명령어를 써도 좋다.
@@ Line 7: / Line 7: @@
 정렬된 MSA 자료의 조작에는 [[https://www.zfmk.de/en/research/research-centres-and-groups/fasconcat|FASconCAT]]이나 [[https://github.com/RybergGroup/phylommand|Phylommand]](tree 자료도 다룰 수 있음) 등의 유틸리티를 써도 좋을 것이다.
-MSA로부터 간단하게 계통수를 만드는 방법은 다음과 같다.
+===== Species level을 벗어나는 genome의 MSA를 얻으려면 - ezTree =====
+Roary는 하나의 species에 속하는 균주 유전체를 대상으로 core gene group을 동정한 뒤 이로부터 MSA를 생성해 낸다. 만약 분석 대상 미생물이 여러 species에 걸쳐 있다면 core gene이 잘 형성되지 않으므로 원하는 결과를 얻기 어렵다. 나는 이러한 상황에서는 [[https://github.com/yuwwu/ezTree|ezTree]]를 애용한다. 인용도 많이 된다거나 지속적인 업데이트가 이루어지는 인기 있는 소프트웨어는 아니지만, 작동 원리가 간단하기 때문이다. ezTree는 pre-defined single-copy marker에 의존하지 않는다. 입력 유전체 서열(gene prediction을 자체적으로 실시하여 proteome set로 전환하여 사용)로부터 PFAM profile annotation을 거쳐 직접 single-copy marker gene set을 찾아낸다. 상세한 원리는 2018년도에 발표된 [[https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4327-9|논문]]을 사용하기 바란다. ezTree 원본은 PFAM 검색 작업이 병렬화되지 않아서 계산에 많은 시간이 소요된다. 따라서 병렬화가 가능하도록 수정한 스크립트를 만들게 되었다. 이 파일 묶음({{:bioinfo:eztree_modified.zip|zipped}})를 풀어서 나오는 Perl script 2개를 작업 디렉토리에 복사한 뒤 실행 권한을 주고 다음 설명대로 활용하기 바란다. ezTree 실행에 필요한 다른 dependency의 설치는 [[https://github.com/yuwwu/ezTree|ezTree GitHub 웹사이트]]를 참조하라. 아마도 나는 conda base environment에 필요한 프로그램을 설치해 둔 것 같다.
-  $ trimal –in core_gene_alignment.aln –out core_gene_alignment.aln.trim –automated1
+개 //Paenibacillus polymyxa// 관련 균주 유전체 자료 파일({{:bioinfo:ani_r_exercise.zip|zipped}})에 포함된 accessions.txt에 해당하는 유전체 염기서열 파일을 다운로드한 뒤 이를 전부 fasta_230812.51라는 디렉토리에 두었다고 가정하자.
-  $ fasttree –nt –gtr core_gene_alignment.aln.trim > my_tree.newick
+  #FASTA file의 목록을 먼저 만든다.
+  $ find fasta_230812.51 -type f > list_230812.51
+  $ cat list_230812.51
+  fasta_230812.51/GCF_000507205.3.fna
+  fasta_230812.51/GCF_001272655.2.fna
+  fasta_230812.51/GCF_001955935.1.fna
+  ...
+  #30개 스레드를 사용한다.
+  #트리 모델은 JTT를 기본으로 한다.
+  # Ryzen server: /data/apps/ezTree
+  (base) $ nohup ./ezTree_hyjeong -list list_230812.51 -out run_230812 -thread 30 &
+  $ ls -l | grep run_230812
+  -rw-rw-r-- 1 hyjeong hyjeong 7941159  8월 12 15:49 run_230812.aln
+  -rw-rw-r-- 1 hyjeong hyjeong    2526  8월 12 15:56 run_230812.nwk
+  -rw-rw-r-- 1 hyjeong hyjeong   17618  8월 12 15:49 run_230812.pfam
+  drwxrwxr-x 2 hyjeong hyjeong   24576  8월 12 15:49 run_230812.work
+Newick 파일에서 각 leaf label은 최초에 투입한 FASTA file명(예: GCF_000507205.3.fna)이 그대로 쓰이고 있을 것이다. vim으로 run_230812.nwk 파일을 열어서 '.fna' 문자열을 제거하면 더욱 바람직할 것이다. Non-greedy match를 원한다면 Stack Overflow의 [[https://stackoverflow.com/questions/1305853/how-can-i-make-my-match-non-greedy-in-vim|How can I make my match non greedy in vim?]]을 참조하라.
+===== MSA로부터 newick file 만들어 보기 =====
+Roary가 생성한 MSA(core_gene_alignment.aln)로부터 간단하게 계통수를 만드는 방법은 다음과 같다.
+  $ trimal -in core_gene_alignment.aln -out core_gene_alignment.aln.trim -automated1
+  $ fasttree -nt -gtr core_gene_alignment.aln.trim > my_tree.newick
 세균의 유전체 진화에서는 horizontal gene transfer(HGT)가 매우 빈번히 일어난다. HGT로 유입된 영역을 제거하지 않으면 정확한 계통수를 작성하기가 어려워질 수 있다. 이를 위해 [[https://github.com/nickjcroucher/gubbins|gubbins]]와 같은 도구를 이용하여 sequence alignment file을 처리하는 것이 바람직하다. Gubbins의 사용에 관해서는 [[bioinfo:microbial_varaint_calling#snippy를 이용한 간편한 변이 탐색|Snippy를 이용한 간편한 변이 탐색]] 항목을 참조하기 바란다.
-Newick 포맷의 파일은 [[http://tree.bio.ed.ac.uk/software/figtree/|FigTree]], [[https://sites.google.com/site/cmzmasek/home/software/archaeopteryx-js|Archaeopteryx]], [[http://dendroscope.org/|Dendroscope]] 또는 [[https://itol.embl.de/|iTOL server]] 등을 통해서 시각화하면 된다. 특히 iTOL server는 각종 annotation 자료를 트리와 같이 표현할 수 있다는 점에서 유용하다.
+Newick 포맷의 파일은 [[http://tree.bio.ed.ac.uk/software/figtree/|FigTree]], [[https://sites.google.com/site/cmzmasek/home/software/archaeopteryx-js|Archaeopteryx]], [[https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/dendroscope/|Dendroscope]] 또는 [[https://itol.embl.de/|iTOL server]] 등을 통해서 시각화하면 된다. 특히 iTOL server는 각종 annotation 자료를 트리와 같이 표현할 수 있다는 점에서 유용하다.
+===== R에서 트리 자료 다루기 =====
+R에서 tree file을 다루려면 ape package를 사용하면 된다.
+  > library(ape)
+  > myTree = ape::read.tree("gubbins2.final_tree.tre")
+  > myTree
+  Phylogenetic tree with 11 tips and 10 internal nodes.
+  Tip labels:
+  	NCTC2916, Ibaraki2007, Iwate2008, Iwate2007, Fukuoka2010, Miyagi2006, ...
+  Rooted; includes branch lengths.
+  > myTree$tip.label
+  [1] "NCTC2916"    "Ibaraki2007" "Iwate2008"   "Iwate2007"   "Fukuoka2010"
+  [6] "Miyagi2006"  "Tochigi2008" "Af650"       "Reference"   "Okayama2011"
+  [11] "Aichi2011"
+  > plot(myTree) # tree 구조가 그림으로 표시된다.
+  # Newick format의 텍스트를 직접
+  > myTree2 <- ape::read.tree(text='((A:1, B:1):2, ((C:1, D:1):2, (E:1, F:1):2):4);')
+R에서 트리 자료를 본격적으로 다루려면 [[https://cran.r-project.org/web/packages/phytools/phytools.pdf|phytools]] 패키지를 권장한다. 계통수 자료의 처리에 입문하려면 [[http://www.phytools.org/Cordoba2017/ex/2/Intro-to-phylogenies.html|Introduction to phylogenies in R]]을 학습할 것을 권장한다.
-===== R에서 트리 그리기 =====
 ===== IQ-TREE로 maximum likelihood를 이용한 계통분류학적 추론하기 =====
+[[http://www.iqtree.org/|IQ-TREE]]는 새로운 전략을 구사하여 ML tree를 좀 더 빠르고 정확하게 추론하기 위하여 개발된 소프트웨어 패키지이다. 2020년 발표된 IQ-TREE 2(Mol Biol Evol 논문)에서는 새로운 모델을 추가하고 계산방법도 더욱 개선되었다. 자세한 내용은 웹사이트에 게시된 [[http://www.iqtree.org/doc/iqtree-doc.pdf|튜토리얼 및 매뉴얼 PDF 파일]]을 참조하라. IQ-TREE의 입력 파일은 Phylip 포맷을 따라야 하므로, MSA 프로그램의 결과물이 만약 다른 종류의 포맷이라면 바로 위에서 소개한 readAl을 이용하여 전환하도록 한다. 그러나 최신 버전의 IQ-TREE는 aligned FASTA 파일을 자동으로 인식하여 처리하므로 phylip 포맷으로 일부러 바꿀 필요는 없다. 다음은 간략한 실행 사례이다.
+. Infer maximum-likelihood tree from a sequence alignment (example.phy)
+     with the best-fit model automatically selected by ModelFinder:
+       iqtree -s example.phy
+. Perform ModelFinder without subsequent tree inference:
+       iqtree -s example.phy -m MF
+     (use '-m TEST' to resemble jModelTest/ProtTest)
+. Combine ModelFinder, tree search, ultrafast bootstrap and SH-aLRT test:
+       iqtree -s example.phy --alrt 1000 -B 1000
+. Perform edge-linked proportional partition model (example.nex):
+       iqtree -s example.phy -p example.nex
+     (replace '-p' by '-Q' for edge-unlinked model)
+. Find best partition scheme by possibly merging partitions:
+       iqtree -s example.phy -p example.nex -m MF+MERGE
+     (use '-m TESTMERGEONLY' to resemble PartitionFinder)
+. Find best partition scheme followed by tree inference and bootstrap:
+       iqtree -s example.phy -p example.nex -m MFP+MERGE -B 1000
+. Use 4 CPU cores to speed up computation: add '-T 4' option
+. Polymorphism-aware model with HKY nucleotide model and Gamma rate:
+       iqtree -s counts_file.cf -m HKY+P+G
+. PoMo mixture with virtual popsize 5 and weighted binomial sampling:
+       iqtree -s counts_file.cf -m "MIX{HKY+P{EMP},JC+P}+N5+WB"