2022년 미생물 유전체 분석 강좌 계획안

2022년 미생물 유전체 분석 강좌 계획안

문서 최종 수정일: — Haeyoung Jeong 2022/03/15 19:09

이 문서는 수강생 여러분이 아니라 저를 위해 만드는 것입니다. 설명이 친절하지 않아도 양해를 부탁드립니다^^

Windows에 설치할 것

Java 실행 환경

OpenJDK 추천함.

'AdoptOpenJDK' OpenJDK 11(LTS) https://adoptium.net/releases.html?variant=openjdk11 에서 jdk-11.0.4+1(LTS) JRE .msi 파일 받아서 설치하면 됨
또는 https://java.com/ko/download/ 2022년 2월 26일 현재 Version 8 Update 321 ⇒ artemis를 실행할 때 JNI error가 발생한다. artemis는 AdoptOpenJDK와 궁합이 잘 맞는 것 같다. Java의 모든 버전을 다운로드할 수 있는 저장소(로그인 필요)에서 jdk-11.0.13_windows-x64_bin.exe 파일을 받아서 설치하면.
알집이 .jar 파일을 압축 파일로 인식하지 못하게 해야 한다. 윈도우 설정 → 앱 → 기본 앱 → 앱별 기본값 설정 → ALZip → .jar 파일에 연결하는 앱을 변경, 그런데 여기에서 Oracle JAVA JDK는 보이질 않는다. 차라리 알집을 지우고 반디집을 설치하는 것이 편리하다.

Artemis 웹사이트에서 발췌 The old v17.0.1 version of the Artemis software required Java version 1.8 to run. All recent releases from v18.0.0 onwards require a minimum of Java 9 and ideally Java 11. This must be installed first. The easiest way to install a non-commercial open source Java version is from AdoptOpenJDK - just select the OpenJDK version and Hotspot options for the relevant platform. See the user manual for further options. A Java installation is not required for the Bioconda or Docker options.

Java application

Java 실행 환경(JDK or JRE)을 설치한 것은 Windows에서 다음의 응용프로그램을 실행하기 위한 것이다. 뒤에서 설명할 WSL과는 무관하다.

artemis http://sanger-pathogens.github.io/Artemis/Artemis/
gepard https://cube.univie.ac.at/gepard
figtree https://github.com/rambaut/figtree/releases

Windows Terminal

WSL(Windows Subsystem for Linux) 활용

WSL 및 Windows 터미널 설치

https://webdir.tistory.com/541

https://docs.microsoft.com/ko-kr/windows/wsl/install

https://docs.microsoft.com/ko-kr/windows/terminal/install

기타 프로그램: Notepad++

배포한 'myUbuntu'의 재작

별도의 위키문서 myUbuntu distro 제작 및 재설치 과정에 기록하였음.

실습 내용

Fastq 파일 입수(~50x)

Bacillus velezensis GB03(3.9 Mb): PRJNA227787 SRR1034787 GCF_000508125.1 (LHCC01) GCF_000508125.2(complete)

(base) fastq-dump --split-files -A SRR1034787 -X 1000000 #202,000,000 bases
(base) reformat.sh in=SRR1034787_1.fastq in2=SRR1034787_2.fastq
No output stream specified.  To write to stdout, please specify 'out=stdout.fq' or similar.
Set INTERLEAVED to false
Input is being processed as paired
Input:                          2000000 reads           202000000 bases
Output:                         2000000 reads (100.00%)         202000000 bases 
(100.00%)

Time:                           8.217 seconds.
Reads Processed:       2000k    243.40k reads/sec
Bases Processed:        202m    24.58m bases/sec

bbmap 패키지의 reformat.sh은 원래 염기서열 파일의 형식을 바꾸거나 트리밍 등의 QC 또는 전처리를 하는 목적으로 쓰이지만, 출력을 지정하지 않으면 염기서열 파일의 read 수를 표시한다.

수강생이 Illumina read를 직접 준비한 경우에는 seqtk sample을 이용하여 적당히 일부를 취하도록 한다. 물론 reformat.sh을 이용하여 subsampling을 할 수도 있다. 'samplereadstarget=100000' x 2에 해당하는 분량이 나올 것이다.

(base) reformat.sh in=SRR1034787_1.fastq in2=SRR1034787_2.fastq samplereadstarget=100000 out1=sample_1.fq out2=sample_2.fq
Set INTERLEAVED to false
Input is being processed as paired
Input:                          2000000 reads           202000000 bases
Output:                         200000 reads (10.00%)   20200000 bases (10.00%)

De novo assembly & annotation (zga)

Dell Inspiron 3668(Intel Core i5-7400 @ 3GHz, 8 GB memory)에서 30분 정도 소요됨

(zga)  zga -1 SRR1034787_1.fastq -2 SRR1034787_2.fastq --threads 2 -o my_assembly
2022-02-26 20:27:22,648 - INFO - ZGA ver. 0.0.9post2
2022-02-26 20:27:22,648 - INFO - Full log location: /home/hyjeong/data/my_assembly/zga.log
2022-02-26 20:27:22,648 - INFO - Checking input files.
2022-02-26 20:27:22,649 - INFO - Read quality control started
2022-02-26 20:28:21,244 - INFO - Reads processing started
2022-02-26 20:28:21,245 - INFO - Trimming and filtering paired end reads
2022-02-26 20:29:32,562 - INFO - Merging paired-end reads.
2022-02-26 20:29:52,008 - INFO - Read processing finished
2022-02-26 20:29:52,008 - INFO - Assembling started
2022-02-26 20:40:42,071 - INFO - Assembling finished
2022-02-26 20:40:42,125 - INFO - Assembly length: 3844865
2022-02-26 20:40:42,127 - INFO - Contig count: 19
2022-02-26 20:40:42,127 - INFO - N50: 593033
2022-02-26 20:40:42,127 - INFO - Checking genome quality
2022-02-26 20:40:48,296 - INFO - Bacteria marker set will be used for CheckM
2022-02-26 20:41:31,876 - INFO - Genome completeness: 98.28%
2022-02-26 20:41:31,876 - INFO - Genome contamination: 1.72%
2022-02-26 20:41:31,877 - INFO - Genome heterogeneity: 0.0%
2022-02-26 20:41:31,877 - INFO - Genome annotation started
2022-02-26 20:41:32,146 - INFO - No locus tag provided. Generating it as MD5 hash of genome
2022-02-26 20:41:32,154 - INFO - Locus tag generated: SMIJAZ
2022-02-26 20:57:03,639 - INFO - Workflow finished!

Artemis에서 genome 확인하기(Windows)

Multi-record 이루어진 GenBank 파일을 artemis에서 열면 첫 번째 염기서열과 그에 부속된 featre만 나타난다. 첫 번째로는 csplit 유틸리티를 이용하여 GenBank file을 각 염기서열에 따라 개별 파일로 분리한 다음(구분자는 '//'), merge-gbk-records로 병합하여 하나의 pseudomolecule 형태로 만들어서 여는 방법이 있다. spacer의 기본 단위가 최소 1 kb 단위라서 너무 길다고 생각하면 python3.8/site-packages/merge_gbk_records/main.py 파일에서 다음 두 줄을 100으로 고쳐라.

# merge-gbk-records release 0.1.2
70         spacer_seq = Seq('N' * 100)
90         spacer_feature = SeqFeature(FeatureLocation(0, length * 100, 0),

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/262/045/GCF_000262045.1_KCTC_13613_01/GCF_000262045.1_KCTC_13613_01_genomic.gbff.gz # Bacillus siamensis
csplit --prefix=test GCF_000262045.1_KCTC_13613_01_genomic.gbff '/^\/\/$/' '{*}'
for f in $(ls test*)
> do
> echo '//' >> $f
> done
list=$(ls test* | paste -sd' ')
(base) merge-gbk-records --length 1 $list > merged.gbk # 1-kb spacer (minimum)
# stop codon으로 spacer를 대신함. 그러나 너무 긴 spacer(20 kb!)
(base) merge-gbk-records --spacer stop $list > merged.gbk

혹은 artemis에서 권장하는 방법을 써 보자. 아직은 썩 잘 되지 않는다. GFF 파일의 위치에 대한 정렬을 제대로 하려면 https://github.com/billzt/gff3sort 참고하라

mv GCF_000262045.1_KCTC_13613_01_genomic.gbff GCF_000262045.1.gbk
seqret GCF_000262045.1.gbk -outseq GCF_000262045.1.fna
bp_genbank2gff3.pl GCF_000262045.1.gbk
(zga) samtools faidx GCF_000262045.1.fna
gff3sort.pl GCF_000262045.1.gbk.gff | bgzip > sorted.gff.gz
(zga) tabix -p gff sorted.gff.gz

Long read assembly

Canu documentation 웹사이트에 게시된 대장균의 long read data를 다운로드하여 조립한 뒤 gepard에서 양 말단의 겹침 여부를 확인해 본다. Albertson lab에서 제공하는 nanopore sequencing raw data를 이용해도 좋다. 만약 사용자가 직접 long read를 준비하였다면, Filtlong을 이용하여 적당히 분량을 줄이는 것이 좋을 것이다.

zga에서는 '-a {spades,unicycler,flye}' 옵션으로 assembler를 선택한다. default assembler는 unicycler이다. short read를 동반하지 않으면 spades는 사용할 수 없다. flye를 택한 경우 'ERROR: Looks like the system ran out of memory' 메시지와 함께 중단될 것이다.

curl -L -o pacbio.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq
# 자료 점검하기
echo '9bb4c10c41c5442d630af8b504042334  pacbio.fastq' > md5sum.txt
md5sum -c md5sum.txt # 'pacbio.fastq: 성공'이 출력될 것이다.
flye --pacbio-raw pacbio.fastq --out-dir flye_assembly # Oops, out of memory!
unicycler -l pacbio.fastq -o unicycler_assembly # Successful!
(zga) zga --pacbio pacbio.fastq --threads 2 -o my_assembly_lr # unicycler가 쓰임

canu는 현재 설치되어 있지 않다. unicycler를 실행하면 dnaA 또는 repA gene에 의한 circularization까지 완료된다. canu보다는 훨씬 편리하다. 조립된 contig의 circularity를 간편하게 확인하려면 sprai assembler에 포함된 check_circularity.pl 스크립트의 shebang line을 수정한 뒤 활용하라. sprai는 anaconda package를 풀어서 활용할 것. 자, 그런데 설치된 long read assembler는 전부 자동적으로 circularize를 하므로 check_circularity.pl을 실행할 기회가 없다.

https://blog.genoglobe.com/2022/03/contig-circularity-checkcircularitypl.html

Post-assembly analysis

16S rRNA-based identification

https://www.ezbiocloud.net/identify (login required)

16개의 cutoff 이내 hit 확인

k-mer based identification

https://cge.cbs.dtu.dk/services/KmerFinder/

결과: NZ_CP045926.1 Bacillus velezensis strain AL7 chromosome, complete genome

Alignment to reference genomes (complete)

ANI analysis

16S rRNA 검색에 의해 확인된 closest species의 type strain genome을 전부 다운로드한 뒤 ANI > 96%인 것이 어느 것인지 확인한다. conda base environment(ncbi-genome-download, legacy blast)에서 실행한다. GCF_001267695.1 하나가 이에 해당할 것이다.

list=$(paste -sd, 16_hits.txt)
ncbi-genome-download --genera "$list" -F fasta -M type bacteria -A > list.txt
# genomes.txt에서 19개의 assembly accession을 선별
ncbi-genome-download -A accession.txt -F fasta bacteria
mkdir fasta find refseq/ -name '*fna.gz' -exec cp {} fasta \;
cd fasta; gzip -d *gz
# zga assembly.fasta를 현 디렉토리에 옮긴다.
for f in $(ls *fna)
> do
> ANI.pl  -bl ./blast-2.2.23/bin/blastall -fd ./blast-2.2.23/bin/formatdb -qr assembly.fasta -sb strain2.fa -od result
> done

pyani를 실행하면 heatmap을 볼 수 있다. 30분 정도 소요됨.

mv assembly.fasta assembly.fna
# 20 x 20 nucmer calculation
average_nucleotide_identity.py -o pyani -m ANIm -i . -g

fastANI는 pyani보다 훨씬 빠르다.

On-line tools

http://ggdc.dsmz.de/ggdc.php#
EZBioCloud tools https://www.ezbiocloud.net/tools
Kostas lab http://enve-omics.ce.gatech.edu/ani/
Center for Genomic Epidemiology http://www.genomicepidemiology.org/services/

기타 자료

본 강좌에서는 리눅스 사용법을 직접 다루지는 않는다. 인터넷에는 리눅스(또는 유닉스)의 학습을 위한 자료가 정말 무궁무진하고 그 수준도 천차만별이다.

https://swcarpentry.github.io/shell-novice/

Table of Contents