Differences

This shows you the differences between two versions of the page.

--- bioinfo:계통수_작성하기 [2023/08/11 09:59] – [IQ-TREE로 maximum likelihood를 이용한 계통분류학적 추론하기] hyjeong
+++ bioinfo:계통수_작성하기 [2025/02/23 12:57] (current) – [Species level을 벗어나는 genome의 MSA를 얻으려면 - ezTree] hyjeong
@@ Line 1: / Line 1: @@
 ====== 계통수 작성하기 ======
-Roary 또는 다른 multiple sequence alignment(MSA) 소프트웨어로 만들어진 염기서열 정렬 결과물을 [[http://trimal.cgenomics.org/|trimAl]]로 트리밍한 뒤 [[http://www.microbesonline.org/fasttree/|FastTree]]로 처리하면 신속하게 approximately-maximum likelihood tree를 얻을 수 있다. 진짜 maximum likelihood(ML) tree를 얻으려면 [[https://github.com/stephaneguindon/phyml|PhyML]]이나 [[https://raxml-ng.vital-it.ch/|RAxML]]을 사용한다. FastTree는 매우 유용한 프로그램이지만 진정한 의미의 ML tree를 만들어주지 않고, 생성하는 support value 역시 널리 이용되는 bootstrap value가 아니라 Shimodaira-Hasegawa test에 의한 값이라는 한계점이 있다.
+Roary 또는 다른 multiple sequence alignment(MSA) 소프트웨어로 만들어진 다중 염기서열 정렬(MSA) 결과물을 [[http://trimal.cgenomics.org/|trimAl]]로 트리밍한 뒤 [[http://www.microbesonline.org/fasttree/|FastTree]]로 처리하면 신속하게 approximately-maximum likelihood tree를 얻을 수 있다. 진짜 maximum likelihood(ML) tree를 얻으려면 [[https://github.com/stephaneguindon/phyml|PhyML]]이나 [[https://raxml-ng.vital-it.ch/|RAxML]]을 사용한다. FastTree는 매우 유용한 프로그램이지만 진정한 의미의 ML tree를 만들어주지 않고, 생성하는 support value 역시 널리 이용되는 bootstrap value가 아니라 Shimodaira-Hasegawa test에 의한 값이라는 한계점이 있다.
 Alignment로부터 불량한 곳을 제거하려면 trimAl 이외에도 Gblocks를 쓸 수 있다. [[https://github.com/JLSteenwyk/ClipKIT|ClipKIT]]는 MSA로부터 계통발생학적으로 의미 있는 위치만을 남기고 나머지를 제거하는 도구이다. TrimAl 패키지에 포함된 readAl은 sequence alignment file의 포맷 전환용 유틸리티이다(예: FASTA -> Phylip). MSA의 편집과 시각화 및 분석에는 [[https://www.jalview.org/|Jalview]]가 유용할 것이다. [[https://doua.prabi.fr/software/seaview|SeaView]]('Multiplatform GUI for molecular phylogeny')도 MSA 자료의 시각화를 위한 프로그램이다. 작은 용량의 MSA를 시각화하려면 'trimal -htmlout //filename.html//' 명령어를 써도 좋다.
@@ Line 7: / Line 7: @@
 정렬된 MSA 자료의 조작에는 [[https://www.zfmk.de/en/research/research-centres-and-groups/fasconcat|FASconCAT]]이나 [[https://github.com/RybergGroup/phylommand|Phylommand]](tree 자료도 다룰 수 있음) 등의 유틸리티를 써도 좋을 것이다.
-MSA로부터 간단하게 계통수를 만드는 방법은 다음과 같다.
+===== Species level을 벗어나는 genome의 MSA를 얻으려면 - ezTree =====
+Roary는 하나의 species에 속하는 균주 유전체를 대상으로 core gene group을 동정한 뒤 이로부터 MSA를 생성해 낸다. 만약 분석 대상 미생물이 여러 species에 걸쳐 있다면 core gene이 잘 형성되지 않으므로 원하는 결과를 얻기 어렵다. 나는 이러한 상황에서는 [[https://github.com/yuwwu/ezTree|ezTree]]를 애용한다. 인용도 많이 된다거나 지속적인 업데이트가 이루어지는 인기 있는 소프트웨어는 아니지만, 작동 원리가 간단하기 때문이다. ezTree는 pre-defined single-copy marker에 의존하지 않는다. 입력 유전체 서열(gene prediction을 자체적으로 실시하여 proteome set로 전환하여 사용)로부터 PFAM profile annotation을 거쳐 직접 single-copy marker gene set을 찾아낸다. 상세한 원리는 2018년도에 발표된 [[https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-4327-9|논문]]을 사용하기 바란다. ezTree 원본은 PFAM 검색 작업이 병렬화되지 않아서 계산에 많은 시간이 소요된다. 따라서 병렬화가 가능하도록 수정한 스크립트를 만들게 되었다. 이 파일 묶음({{:bioinfo:eztree_modified.zip|zipped}})를 풀어서 나오는 Perl script 2개를 작업 디렉토리에 복사한 뒤 실행 권한을 주고 다음 설명대로 활용하기 바란다. ezTree 실행에 필요한 다른 dependency의 설치는 [[https://github.com/yuwwu/ezTree|ezTree GitHub 웹사이트]]를 참조하라. 아마도 나는 conda base environment에 필요한 프로그램을 설치해 둔 것 같다.
-  $ trimal –in core_gene_alignment.aln –out core_gene_alignment.aln.trim –automated1
+개 //Paenibacillus polymyxa// 관련 균주 유전체 자료 파일({{:bioinfo:ani_r_exercise.zip|zipped}})에 포함된 accessions.txt에 해당하는 유전체 염기서열 파일을 다운로드한 뒤 이를 전부 fasta_230812.51라는 디렉토리에 두었다고 가정하자.
-  $ fasttree –nt –gtr core_gene_alignment.aln.trim > my_tree.newick
+  #FASTA file의 목록을 먼저 만든다.
+  $ find fasta_230812.51 -type f > list_230812.51
+  $ cat list_230812.51
+  fasta_230812.51/GCF_000507205.3.fna
+  fasta_230812.51/GCF_001272655.2.fna
+  fasta_230812.51/GCF_001955935.1.fna
+  ...
+  #30개 스레드를 사용한다.
+  #트리 모델은 JTT를 기본으로 한다.
+  # Ryzen server: /data/apps/ezTree
+  (base) $ nohup ./ezTree_hyjeong -list list_230812.51 -out run_230812 -thread 30 &
+  $ ls -l | grep run_230812
+  -rw-rw-r-- 1 hyjeong hyjeong 7941159  8월 12 15:49 run_230812.aln
+  -rw-rw-r-- 1 hyjeong hyjeong    2526  8월 12 15:56 run_230812.nwk
+  -rw-rw-r-- 1 hyjeong hyjeong   17618  8월 12 15:49 run_230812.pfam
+  drwxrwxr-x 2 hyjeong hyjeong   24576  8월 12 15:49 run_230812.work
+Newick 파일에서 각 leaf label은 최초에 투입한 FASTA file명(예: GCF_000507205.3.fna)이 그대로 쓰이고 있을 것이다. vim으로 run_230812.nwk 파일을 열어서 '.fna' 문자열을 제거하면 더욱 바람직할 것이다. Non-greedy match를 원한다면 Stack Overflow의 [[https://stackoverflow.com/questions/1305853/how-can-i-make-my-match-non-greedy-in-vim|How can I make my match non greedy in vim?]]을 참조하라.
+===== MSA로부터 newick file 만들어 보기 =====
+Roary가 생성한 MSA(core_gene_alignment.aln)로부터 간단하게 계통수를 만드는 방법은 다음과 같다.
+  $ trimal -in core_gene_alignment.aln -out core_gene_alignment.aln.trim -automated1
+  $ fasttree -nt -gtr core_gene_alignment.aln.trim > my_tree.newick
 세균의 유전체 진화에서는 horizontal gene transfer(HGT)가 매우 빈번히 일어난다. HGT로 유입된 영역을 제거하지 않으면 정확한 계통수를 작성하기가 어려워질 수 있다. 이를 위해 [[https://github.com/nickjcroucher/gubbins|gubbins]]와 같은 도구를 이용하여 sequence alignment file을 처리하는 것이 바람직하다. Gubbins의 사용에 관해서는 [[bioinfo:microbial_varaint_calling#snippy를 이용한 간편한 변이 탐색|Snippy를 이용한 간편한 변이 탐색]] 항목을 참조하기 바란다.
@@ Line 16: / Line 42: @@
 Newick 포맷의 파일은 [[http://tree.bio.ed.ac.uk/software/figtree/|FigTree]], [[https://sites.google.com/site/cmzmasek/home/software/archaeopteryx-js|Archaeopteryx]], [[https://uni-tuebingen.de/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/algorithms-in-bioinformatics/software/dendroscope/|Dendroscope]] 또는 [[https://itol.embl.de/|iTOL server]] 등을 통해서 시각화하면 된다. 특히 iTOL server는 각종 annotation 자료를 트리와 같이 표현할 수 있다는 점에서 유용하다.
-===== R에서 트리 그리기 =====
+===== R에서 트리 자료 다루기 =====
 R에서 tree file을 다루려면 ape package를 사용하면 된다.