Differences

This shows you the differences between two versions of the page.

--- bioinfo:유전체_주석화_genome_annotation [2023/06/29 10:59] – [개별 사용을 위하여 PGAP을 실행하기] hyjeong
+++ bioinfo:유전체_주석화_genome_annotation [2025/05/23 14:49] (current) – [PGAP 사용하기] hyjeong
@@ Line 10: / Line 10: @@
 ===== Using PGAP =====
-Prokaryotic Genome Annotation Pipeline (PGAP; [[https://www.ncbi.nlm.nih.gov/genome/annotation_prok/|NCBI]] or [[https://github.com/ncbi/pgap|GitHub]]) is the program NCBI officially uses for automated annotation of bacterial genomes. Prokka is very convenient when many genome sequences must be annotated quickly, but PGAP, which annotates against large reference databases, yields higher-quality results. PGAP was originally used only internally to annotate RefSeq genomes and was later released in a form that anyone can install. A [[https://youtu.be/pNn_-_46lpI|video]] on YouTube shows how to annotate your own genome directly with PGAP. For detailed installation and usage instructions, see the [[https://github.com/ncbi/pgap/wiki/Quick-Start|Quick-Start]] page of the PGAP wiki. Installation seems to have become much simpler than when the standalone version first came out. PGAP version numbers follow the format 'YYYY-MM-DD.build####'. The input-2023-05-17.build6771 version, installed on June 22, 2023, takes up about 32 GB after installation.
+Prokaryotic Genome Annotation Pipeline (PGAP; [[https://www.ncbi.nlm.nih.gov/genome/annotation_prok/|NCBI]] or [[https://github.com/ncbi/pgap|GitHub]]) is the program NCBI officially uses for automated annotation of bacterial genomes. Prokka is very convenient when many genome sequences must be annotated quickly, but PGAP, which annotates against large reference databases, yields higher-quality results. PGAP was originally used only internally to annotate RefSeq genomes and was later released in a form that anyone can install. A [[https://youtu.be/pNn_-_46lpI|video]] on YouTube shows how to annotate your own genome directly with PGAP. For detailed installation and usage instructions, see the [[https://github.com/ncbi/pgap/wiki/Quick-Start|Quick-Start]] page of the PGAP wiki. Installation seems to have become much simpler than when the standalone version first came out. PGAP version numbers follow the format 'YYYY-MM-DD.build####'. On May 23, 2025, the installation was updated to the latest version, 2025-05-06.build7983. conda is not needed to install or use PGAP.
   # Check the latest PGAP version currently being distributed
   $ curl --silent "https://api.github.com/repos/ncbi/pgap/releases/latest" | grep -Po '"tag_name": "\K.*?(?=")' > VERSION
   $ cat VERSION
-  2023-05-17.build6771
+  2025-05-06.build7983
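+Before a release tag is used in a script, it can be checked against the 'YYYY-MM-DD.build####' pattern described above. The following is only a minimal sketch: the helper name is_pgap_version is hypothetical and is not part of PGAP, and the build number is assumed to be one or more digits.

```shell
#!/usr/bin/env bash
# Return success if the argument looks like a PGAP release tag (YYYY-MM-DD.build####).
# is_pgap_version is an illustrative helper, not something shipped with PGAP.
is_pgap_version() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}\.build[0-9]+$'
}

# Example: check a tag such as the one written to the VERSION file.
tag="2025-05-06.build7983"
if is_pgap_version "$tag"; then
    echo "valid tag: $tag"
else
    echo "unexpected tag format: $tag" >&2
fi
```

+A check like this catches an empty or truncated VERSION file, for example when the GitHub API request fails.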
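-PGAP runs in a Docker environment, so the user must be an administrator or have sudo privileges. The following shows how to annotate the sample genome sequence included in the PGAP distribution. The pgap.py script is assumed to be in /data/apps/pagp.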
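+PGAP runs in a Docker environment, so the user must be an administrator or have sudo privileges. The following shows how to annotate the sample genome sequence included in the PGAP distribution. The pgap.py script is assumed to be in /data/apps/pagp. Files are installed under $HOME/.pgap by default, but the location can be changed with the PGAP_INPUT_DIR environment variable.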
   # Check that docker is running
@@ Line 242: / Line 242: @@
   all	7
+The remaining steps are carried out in R. Collect all *.tsv.txt files, merge them into one table, and save it to results.txt.
+  rm(list=ls())
+  list.filenames = list.files(pattern="faa\\.tsv\\.txt$")
+  # create an empty list
+  list.data=list()
+  length(list.filenames)
+  # create a loop to read in your data
+  for (i in 1:length(list.filenames))
+  {
+  list.data[[i]] = read.table(list.filenames[i],sep="\t",header=F)
+  colnames(list.data[[i]]) = c("markers",list.filenames[i])
+  }
+  # full outer join
+  for (i in 1:(length(list.filenames)-1))
+  {
+  df = merge(x=list.data[[i]],y=list.data[[i+1]],by="markers",all=TRUE)
+  list.data[[i+1]] = df
+  }
+  df = list.data[[length(list.filenames)]]
+  df[is.na(df)] = 0
+  rownames(df) = df[,1]
+  df = df[,-1]
+  colnames(df) = gsub(".faa.tsv.txt","",list.filenames,fixed=TRUE)
+  # final check
+  dim(df)
+  View(df)
+  # replace accessions with species names
+  data = read.table("acc_species_strain",sep="\t",stringsAsFactors=F)
+  key = data[,1]
+  names(key) = data[,2]
+  x = c()
+  for (i in names(df)){
+  temp = names(key)[key==i]
+  x = append(x, temp)
+  }
+  df = rbind(df, x)
+  rownames(df)[nrow(df)] = "species"
+  View(df)
+  # sort the rows and transpose
+  sorted = c("species","IPR039697","IPR014182","IPR002347","IPR012079","IPR012394","all")
+  df = df[sorted,]
+  df.2 = as.data.frame(t(df))
+  df.2[,2:7] = sapply(df.2[,2:7],as.numeric)
+  str(df.2)
+  write.table(df.2,"results.txt",sep="\t",quote=F)
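+For a quick look before starting R, the same full outer join can be sketched on the command line with awk. This is only an illustration of the merge step, assuming every input file has two tab-separated columns (marker, count). The function name merge_counts and the sample file names GCF_A and GCF_B are made up. Unlike merge() in R, the sketch keeps markers in first-seen order, and empty input files produce no column.

```shell
#!/usr/bin/env bash
# Build a marker-by-sample count matrix from two-column TSV files (marker<TAB>count).
# Missing marker/sample combinations are filled with 0, as in the R code.
merge_counts() {
    awk -F'\t' -v OFS='\t' '
        FNR == 1 {                          # first line of a new file: register a sample column
            sample = FILENAME
            sub(/.*\//, "", sample)         # drop any directory part
            sub(/\.faa\.tsv\.txt$/, "", sample)
            samples[++n] = sample
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; markers[++m] = $1 }
            count[$1, n] = $2
        }
        END {
            line = "markers"
            for (j = 1; j <= n; j++) line = line OFS samples[j]
            print line
            for (i = 1; i <= m; i++) {
                line = markers[i]
                for (j = 1; j <= n; j++) {
                    key = markers[i] SUBSEP j
                    line = line OFS ((key in count) ? count[key] : 0)
                }
                print line
            }
        }' "$@"
}

# Demo with two tiny made-up input files.
tmp=$(mktemp -d)
printf 'IPR039697\t2\nall\t7\n' > "$tmp/GCF_A.faa.tsv.txt"
printf 'IPR014182\t1\nall\t3\n' > "$tmp/GCF_B.faa.tsv.txt"
merge_counts "$tmp"/GCF_A.faa.tsv.txt "$tmp"/GCF_B.faa.tsv.txt
```

+In real use the call would be merge_counts *.faa.tsv.txt in the directory holding the per-genome count files.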
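+Let us try various transformations and explorations with results.txt. The dplyr package is very useful here. For each species, what is the maximum number of genes matching the InterPro entries defined for the family? Which strain or accession does it belong to? The following R code shows how to find out.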
+  # Assume the R session was ended and restarted,
+  # or run rm(list=ls()) to clear the variables.
+  # reading from pre-existing file
+  df = read.table("results.txt",sep="\t")
+  # extract the accessions and insert them as the first column.
+  accession = rownames(df)
+  df = cbind(accession, df)
+  write.table(df,"results_with_accession.txt",sep="\t",row.names=F,quote=F)
+  # For each species, find the strain (assembly accession; row) with the most genes assigned to the family.
+  # Another column such as IPR039697 can be selected instead of all.
+  # [1] a simple method
+  df.agg = aggregate(all ~ species, df, max)
+  df.max = merge(df.agg, df)
+  # [2] or a method using dplyr
+  library(dplyr)
+  x = df %>% group_by(species) %>% filter(all==max(all)) %>% arrange(species)
+  x # save to a file if needed.
+  # quick descriptive information  https://ademos.people.uic.edu/Chapter11.html
+  with(df,summary(species))
+  with(df,summary(IPR039697))
+  with(df,summary(all))
+  df[df$species=="Lactobacillus brevis",]
+  df[df$species=="Lactobacillus brevis",]$all
+  summary(df[df$species=="Lactobacillus brevis",])
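+  # dplyr makes exploratory analysis of a data frame easy.
+  # Converting the data frame to a tibble makes it a little easier to handle.
+  # as_tibble() replaces tbl_df(), which is deprecated in current dplyr.
+  df_df = as_tibble(df)
+  select(df_df, accession, species, all) # print the selected columns
+  filter(df_df, all > 10) # filter() prints the rows that match the condition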
+  df_df %>% group_by(species) %>% summarise(number_of_strains=n()) %>% arrange(species)
+  df_df %>% group_by(species) %>% summarise(max(all)) %>% arrange(species)
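+The per-species maximum can also be cross-checked outside R. The sketch below assumes a tab-separated file with an aligned header line containing accession, species and the chosen count column, which is the layout of results_with_accession.txt. The function name max_by_species and the demo rows are made up for illustration.

```shell
#!/usr/bin/env bash
# For each species, print the maximum of one numeric column and the accession(s) reaching it.
max_by_species() {
    # $1: TSV file with a header line; $2: name of the numeric column (default: all)
    awk -F'\t' -v OFS='\t' -v col="${2:-all}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) idx[$i] = i      # map column names to positions
            if (!("species" in idx) || !("accession" in idx) || !(col in idx)) {
                print "missing required column" > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            sp = $(idx["species"]); acc = $(idx["accession"]); v = $(idx[col]) + 0
            if (!(sp in best) || v > best[sp]) { best[sp] = v; who[sp] = acc }
            else if (v == best[sp])            { who[sp] = who[sp] "," acc }   # keep ties
        }
        END { for (sp in best) print sp, best[sp], who[sp] }' "$1" | LC_ALL=C sort
}

# Demo with made-up rows.
demo=$(mktemp)
{
    printf 'accession\tspecies\tall\n'
    printf 'GCF_1\tLactobacillus brevis\t5\n'
    printf 'GCF_2\tLactobacillus brevis\t9\n'
    printf 'GCF_3\tLactobacillus plantarum\t4\n'
    printf 'GCF_4\tLactobacillus plantarum\t4\n'
} > "$demo"
max_by_species "$demo" all
```

+Ties are kept as a comma-separated list, which mirrors the behaviour of filter(all==max(all)) in the dplyr version.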
+Finally, let us aggregate the detection results for the InterPro entries defined for the family by species and draw a bar plot with error bars (standard deviation). The ggplot2 library must be loaded.
+  library("ggplot2")
+  rm(list=ls())
+  # https://stackoverflow.com/questions/32984974/add-error-bars-to-a-barplot
+  # reading from pre-existing file
+  data = read.table("results.txt",sep="\t")
+  nrow(data)
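+  # factor() is needed because read.table() no longer returns factors by default (R >= 4.0)
+  species_list = levels(factor(data$species))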
+  ipr_entries = colnames(data)
+  ipr_entries = ipr_entries[-1]
+  myData = aggregate(data[,2:7],by=list(data$species),FUN=function(x) c(mean=mean(x), sd=sd(x)))
+  df = data.frame()
+  j=1
+  for (i in species_list) {
+  x = matrix(data=as.vector(unlist(myData[j,2:7])),ncol=2,byrow=T)
+  df.tmp = data.frame(species=rep(i,length(ipr_entries)),ipr=ipr_entries,mean=x[,1],sd=x[,2])
+  df = rbind(df,df.tmp)
+  j = j + 1
+  }
+  write.table(df,"species_mean_sd.txt",sep="\t",quote=F)
+  # https://www.r-graph-gallery.com/4-barplot-with-error-bar.html
+  # https://www.bioinformatics.babraham.ac.uk/training/ggplot_course/Introduction%20to%20ggplot.pdf
+  # Setting geom_bar(color="black") draws a black outline around the bars.
+  plot1 = ggplot(df,aes(species,mean,fill=ipr)) + geom_bar(position="dodge",stat="identity",size=1)
+  # Rotate the axis labels to vertical.
+  plot1 = plot1 + theme(axis.text.x=element_text(angle=90,hjust=1)) + geom_hline(yintercept=28, linetype="dashed", size=0.3)
+  plot1
+  plot2 = plot1 + geom_errorbar(aes(ymin=mean-sd,ymax=mean+sd),width=0.5,position=position_dodge(width=0.9),size=0.2)
+  plot2
+  plot3 = plot2 + xlab("Species") + ggtitle("Distribution of ADH/ALDH across probiotic bacterial species") + theme(plot.title=element_text(hjust=0.5))
+  pdf("final_plot.pdf",width=11,height=7)
+  plot3
+  dev.off()
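+The mean and standard deviation behind the error bars can be verified independently of R. The following sketch computes the per-species mean and sample standard deviation (the n-1 denominator used by sd() in R) with awk. It assumes a tab-separated file with an aligned header such as results_with_accession.txt. The function name mean_sd_by_species and the demo rows are made up.

```shell
#!/usr/bin/env bash
# Per-species mean and sample standard deviation of one numeric column.
mean_sd_by_species() {
    # $1: TSV file with a header line; $2: name of the numeric column (default: all)
    LC_ALL=C awk -F'\t' -v col="${2:-all}" '
        NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }
        {
            sp = $(idx["species"]); v = $(idx[col]) + 0
            n[sp]++; sum[sp] += v; sumsq[sp] += v * v
        }
        END {
            for (sp in n) {
                mean = sum[sp] / n[sp]
                if (n[sp] > 1) {
                    var = (sumsq[sp] - n[sp] * mean * mean) / (n[sp] - 1)
                    if (var < 0) var = 0            # guard against rounding error
                    sd = sprintf("%.3f", sqrt(var))
                } else {
                    sd = "NA"                       # sd() of a single value is NA in R
                }
                printf "%s\t%.3f\t%s\n", sp, mean, sd
            }
        }' "$1" | LC_ALL=C sort
}

# Demo with made-up rows.
demo_sd=$(mktemp)
{
    printf 'accession\tspecies\tall\n'
    printf 'GCF_1\tLactobacillus brevis\t5\n'
    printf 'GCF_2\tLactobacillus brevis\t9\n'
    printf 'GCF_3\tLactobacillus plantarum\t4\n'
    printf 'GCF_4\tLactobacillus plantarum\t4\n'
    printf 'GCF_5\tLactobacillus plantarum\t7\n'
    printf 'GCF_6\tLactobacillus casei\t3\n'
} > "$demo_sd"
mean_sd_by_species "$demo_sd" all
```

+Species represented by a single genome get NA for the standard deviation, so no meaningful error bar can be drawn for them.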
 ===== Functional annotation based on orthology assignment with eggNOG-mapper =====
 eggNOG ([[https://academic.oup.com/nar/article/51/D1/D389/6833261|v6.0]]) is a database of orthology relationships, gene evolutionary histories, and functional annotations. On the [[http://eggnog6.embl.de/|website]] you can search the database itself or submit query sequences for annotation. The tool that performs functional annotation by orthology assignment is eggNOG-mapper [[https://academic.oup.com/mbe/article/38/12/5825/6379734?login=false|v2.0]]. The [[http://eggnog-mapper.embl.de/|eggNOG-mapper website]] supports annotating a single sequence or batch annotation via file upload, and for faster runs eggNOG-mapper can also be installed on a local server. DIAMOND, a very fast BLAST-compatible search program, is required to run eggNOG-mapper. For details on the latest version, eggNOG-mapper v2, see the [[https://github.com/eggnogdb/eggnog-mapper/wiki/|wiki]]. If there are more than 100M query proteins, splitting the FASTA file (single-line FASTA) into smaller chunks is recommended ([[https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.11#user-content-Setting_up_large_annotation_jobs|'Setting up large annotation jobs']]).
+Basic usage of eggNOG-mapper is shown below. If the number of CPUs is not specified, 2 are used.
+  $ python /data/apps/eggnog-mapper/emapper.py -i INFILE.faa --output INFILE_eggNOG -m diamond --cpu 16
+  …
+  Done
+     INFILE_eggNOG.emapper.seed_orthologs
+     INFILE_eggNOG.emapper.annotations
+  Total time: 1551.87 secs
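+For very large inputs, the recommended splitting of a single-line FASTA file can be sketched with split. This assumes GNU split (for the -d and -a options) and a FASTA file in which every record occupies exactly two lines (header and sequence). The function name split_fasta, the chunk prefix and the demo sequences are made up, and the emapper loop is left as a comment because it only repeats the options used above.

```shell
#!/usr/bin/env bash
# Split a single-line FASTA (each record = one header line + one sequence line)
# into chunks holding a fixed number of records. Relies on GNU split (-d, -a).
split_fasta() {
    # $1: input FASTA, $2: records per chunk, $3: output prefix
    split -l "$(( $2 * 2 ))" -a 3 -d "$1" "$3"
}

# Demo with a tiny made-up file: 5 records, 2 records per chunk -> 3 chunk files.
workdir=$(mktemp -d)
for i in 1 2 3 4 5; do
    printf '>prot%s\nMKTAYIAKQR\n' "$i"
done > "$workdir/INFILE.faa"
split_fasta "$workdir/INFILE.faa" 2 "$workdir/INFILE.chunk_"
ls "$workdir"

# Each chunk could then be annotated with the same options as above, e.g.
# for f in "$workdir"/INFILE.chunk_*; do
#     python /data/apps/eggnog-mapper/emapper.py -i "$f" --output "$(basename "$f")" -m diamond --cpu 16
# done
```

+Multiplying the record count by two keeps every header together with its sequence, which is why the input must be single-line FASTA.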
+Annotations for the query sequences are written to the .annotations file, which has 22 columns. The columns are described below.
+  - query_name
+  - seed eggNOG ortholog
+  - seed ortholog evalue
+  - seed ortholog score
+  - Predicted taxonomic group
+  - Predicted protein name
+  - Gene Ontology terms
+  - EC number
+  - KEGG_ko
+  - KEGG_Pathway
+  - KEGG_Module
+  - KEGG_Reaction
+  - KEGG_rclass
+  - BRITE
+  - KEGG_TC
+  - CAZy
+  - BiGG Reaction
+  - tax_scope: eggNOG taxonomic level used for annotation
+  - eggNOG OGs
+  - bestOG (deprecated, use smallest from eggnog OGs)
+  - COG Functional Category
+  - eggNOG free text description
+Extracting the KEGG ko numbers (KEGG Orthology) from the annotation file produced by eggNOG-mapper gives a gene list file that can be uploaded to [[https://www.genome.jp/kegg/mapper/reconstruct.html|KEGG Mapper – Reconstruct]]. The key step is to split multiple ko numbers assigned to one protein and print them on separate lines. The following awk one-liner does this.
+  $ awk -F"\t" -vOFS="\t" '$9~/^ko:/{id = "\n"$1"\t"; sub("ko:","",$9); gsub(",ko:",id,$9); print $1, $9}' eggNOG.emapper.annotations > gene_list_for_kegg_mapper.txt
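+To see what the one-liner does, it helps to run it on a tiny input. The sketch below wraps the same awk logic in a function and applies it to three made-up rows in which only the first and ninth columns matter. The function name emapper_to_kegg_list, the protein IDs and the ko numbers are illustrative.

```shell
#!/usr/bin/env bash
# Same logic as the one-liner: one output line per (protein, ko number) pair.
emapper_to_kegg_list() {
    awk -F"\t" -v OFS="\t" '$9 ~ /^ko:/ {
        id = "\n" $1 "\t"          # separator that starts a new "protein<TAB>" line
        sub("ko:", "", $9)         # drop the first ko: prefix
        gsub(",ko:", id, $9)       # turn each following ",ko:" into a new line
        print $1, $9
    }' "$1"
}

# Made-up input: protA has two ko numbers, protB has none, protC has one.
sample=$(mktemp)
{
    printf 'protA\t-\t-\t-\t-\t-\t-\t-\tko:K00001,ko:K00002\n'
    printf 'protB\t-\t-\t-\t-\t-\t-\t-\t-\n'
    printf 'protC\t-\t-\t-\t-\t-\t-\t-\tko:K01234\n'
} > "$sample"
emapper_to_kegg_list "$sample"
# Output (tab-separated):
# protA   K00001
# protA   K00002
# protC   K01234
```

+Rows without a ko assignment, including the comment lines in the .annotations file, are skipped because their ninth column does not start with ko:.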
 ===== Genomic island prediction =====
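 Upload the GenBank file of an annotated genome to the [[https://www.pathogenomics.sfu.ca/islandviewer/upload/|IslandViewer website]] to run the prediction. Translation information is required inside each CDS primary tag; both prokka and dfast_core produce GenBank files that satisfy this. Prediction from a draft genome sequence was not originally recommended, but the feature was added at users' request. In that case, the contigs are first aligned against a user-supplied reference genome to build a pseudochromosome, and the analysis starts from there.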
 ===== Prediction of antibiotic resistance genes (AMR determinants) =====
+===== Using ResFinder =====
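+Representative antibiotic resistance gene databases and analysis tools are [[https://card.mcmaster.ca/|The Comprehensive Antibiotic Resistance Database (CARD)]] from McMaster University and [[https://cge.cbs.dtu.dk/services/ResFinder/|ResFinder]] from the Center for Genomic Epidemiology (CGE) at the Technical University of Denmark. This section shows how to predict resistance caused by acquired genes by running the ResFinder database and search tool (ResFinder) locally, without the web service. PointFinder, also from the Technical University of Denmark, finds resistance caused by spontaneous mutations in chromosomal genes, as in drug-resistant Mycobacterium tuberculosis. When running resfinder.py, all resistance databases are used unless the -d <DATABASES> option is given; several databases can be specified as a comma-separated list. The default values of threshold and min_coverage are 0.9 and 0.6, respectively.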
+  $ mkdir out_all out
+  $ resfinder.py -i test.fsa -p /nas/DB/resfinder_db -o out_all
+  $ resfinder.py -i test.fsa -p /nas/DB/resfinder_db -d aminoglycoside,macrolide -o out
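+The CGE website serves the resfinder.pl script. It takes different options (for example, the output directory does not need to be created beforehand), uses blastall, and cannot search all resistance databases. Options passed directly to resfinder.pl take precedence over the blastall options built into the script (-p blastn -a 5 -F F). However, changing anything other than the -a <num_threads> option is not recommended.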
 ==== Predicting antibiotic resistance and virulence factor genes with ABRicate ====
+[[https://github.com/tseemann/abricate|ABRicate]] screens contigs or genome sequences against a range of databases, including NCBI, CARD, ARG-ANNOT, Resfinder, MEGARES, EcOH, PlasmidFinder, Ecoli_VF and VFDB, to predict antibiotic resistance and virulence factor genes. It cannot process raw sequencing reads (FASTQ). ABRicate is included in the TORMES pipeline discussed later, but TORMES takes raw sequencing reads as input and starts from preprocessing (such as trimming) and assembly. If you already have contig sequences assembled by other means, it is therefore more convenient to use ABRicate directly.
-===== Prediction of secondary metabolite biosynthetic gene clusters (BGCs) =====
+  $ abricate --list
+  DATABASE	SEQUENCES	DBTYPE	DATE
+  ecoh	597	nucl	2018-Oct-20
+  card	2237	nucl	2018-Oct-20
+  ncbi	4579	nucl	2018-Oct-20
+  vfdb	2597	nucl	2018-Oct-20
+  plasmidfinder	263	nucl	2018-Oct-20
+  resfinder	3021	nucl	2018-Oct-20
+  ecoli_vf	2701	nucl	2018-Oct-20
+  argannot	1749	nucl	2018-Oct-20
+  $ abricate -db card contigs.fa
+  Using nucl database card:  2237 sequences -  2018-Oct-20
+  #FILE	SEQUENCE	START	END	GENE	COVERAGE	COVERAGE_MAP	GAPS	%COVERAGE	%IDENTITY	DATABASE	ACCESSION	PRODUCT
+  Processing: contigs.fa
+  Found 5 genes in contigs.fa
+  contigs.fa	Enterococcus	514111	515397	efmA	1-1287/1287	===============	0/0	100.00	100.00	card	AB467372.1:285-1572	efmA is an MFS transporter permease in  E. faecium.
+  contigs.fa	Enterococcus	1922156	1923199	efrB	45-1086/1086	========/======	6/6	95.76	75.05	card	HG970103.1:1-1087	efrB is a part of the EfrAB efflux pump and both efrA and efrB are necessary to confer multidrug resistance.
+  contigs.fa	Enterococcus	1923857	1925292	efrA	1-1434/1457	========/======	6/8	98.22	75.12	card	HG970100.1:1-1458	efrA is a part of the EfrAB efflux pump and both efrA and efrB are necessary to confer drug resistance.
+  contigs.fa	Enterococcus	2040199	2040747	AAC(6')-Ii	1-549/549	===============	0/0	100.00	99.82	card	L12710:1-550	AAC(6')-Ii is a chromosomal-encoded aminoglycoside acetyltransferase in Enterococcus spp.
+  contigs.fa	Enterococcus	2426678	2428156	msrC	1-1479/1479	===============	0/0	100.00	94.46	card	AF313494:1-1480	msrC is a chromosomal-encoded ABC-efflux pump expressed in Enterococcus faecium that confers resistance to erythromycin and other macrolide and streptogramin B antibiotics.
+If ABRicate has been run on several samples, the abricate --summary option can be used to build a gene presence/absence matrix. The results may be kept as separate files, one per contig file searched, or a single file may hold the results for several samples.
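+The idea behind such a matrix can be illustrated with a short awk sketch. This is not a reimplementation of abricate --summary, whose actual output format may differ. It only assumes the tab-separated report layout shown above, with the file name in column 1, GENE in column 5 and %COVERAGE in column 9. The function name presence_absence and the demo rows are made up.

```shell
#!/usr/bin/env bash
# Build a file-by-gene matrix from ABRicate-style reports:
# each cell holds %COVERAGE if the gene was found in that file, "." otherwise.
presence_absence() {
    LC_ALL=C awk -F'\t' -v OFS='\t' '
        /^#/ { next }                      # skip header lines (repeated in concatenated reports)
        {
            if (!($1 in fseen)) { fseen[$1] = 1; files[++nf] = $1 }
            if (!($5 in gseen)) { gseen[$5] = 1; genes[++ng] = $5 }
            cov[$1, $5] = $9               # a later hit for the same gene overwrites an earlier one
        }
        END {
            line = "#FILE"
            for (j = 1; j <= ng; j++) line = line OFS genes[j]
            print line
            for (i = 1; i <= nf; i++) {
                line = files[i]
                for (j = 1; j <= ng; j++) {
                    key = files[i] SUBSEP genes[j]
                    line = line OFS ((key in cov) ? cov[key] : ".")
                }
                print line
            }
        }' "$@"
}

# Demo with two made-up reports (only the first ten columns are filled in).
rep=$(mktemp -d)
hdr='#FILE\tSEQUENCE\tSTART\tEND\tGENE\tCOVERAGE\tCOVERAGE_MAP\tGAPS\t%%COVERAGE\t%%IDENTITY\n'
{
    printf "$hdr"
    printf 'a.fa\tctg1\t1\t1287\tefmA\t1-1287/1287\t===============\t0/0\t100.00\t100.00\n'
    printf 'a.fa\tctg1\t2000\t3478\tmsrC\t1-1479/1479\t===============\t0/0\t100.00\t94.46\n'
} > "$rep/a.tab"
{
    printf "$hdr"
    printf 'b.fa\tctg7\t10\t1280\tefmA\t1-1268/1287\t===============\t0/0\t98.50\t99.10\n'
} > "$rep/b.tab"
presence_absence "$rep/a.tab" "$rep/b.tab"
```

+Because header lines are skipped wherever they occur, the same function works on separate report files and on a single concatenated file.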
+===== Prediction of secondary metabolite biosynthetic gene clusters (BGCs) =====
+Upload a GenBank file to [[https://antismash.secondarymetabolites.org/#!/start|antiSMASH bacterial version]] to run the analysis. Because the web server limits the number of jobs that can run concurrently, installing it on a local server is also a good option. To search precomputed BGC information for publicly available microbial genomes, use the [[https://antismash-db.secondarymetabolites.org/#!/start|antiSMASH database]].