Application of PacBio long-reads sequencing technology (software)

This page should be reorganized so that it can also cover sequencing applications from Oxford Nanopore Technologies (ONT). In other words, it should cover all long-read, single-molecule sequencing.

Reading material

(Review paper) PacBio sequencing and its applications (2015) link
PacBio blog
PacBio Analytical software - SMRT analysis includes SMRT portal, SMRT analysis APIs, and SMRT view. Application areas
PacBio SMRT analysis wiki
PacBio bioinformatics training wiki

pbh5tools

A Swiss-army knife for interrogating PacBio HDF5 files (cmp.h5, bas.h5)

bash5tools.py can extract read sequences and quality values for both raw and circular consensus sequencing (CCS) read types and create FASTQ and FASTA files.
cmph5tools.py is described as a tool for handling PacBio Alignment File Format (cmp.h5, link) files, but I have not had occasion to use it yet.

--readType accepts ccs, subreads, or unrolled. ccs extracts CCS reads when the bas.h5 file contains them. I am not sure what unrolled means. Does it refer to the raw reads exactly as they are?

$ bash5tools.py input.bas.h5 --outFilePrefix myreads --outType fasta --readType subreads --minReadScore 0.75

I compared the statistics of the p0.[1-3].subreads.fast{a|q} files in the Analysis Results subdirectory with those of the reads extracted from the bas.h5 files using bash5tools.py. Items 2 onward were produced with bash5tools.py.

1. subreads.fastq (all): 1339659706 bp / 124615 seqs; 10750.4 average length
2. (--readType unrolled): 1395878244 bp / 101548 seqs; 13746.0 average length
3. (--readType subreads): 1394529365 bp / 131741 seqs; 10585.4 average length
4. (--readType subreads --minReadScore 0.7): 1347690522 bp / 126271 seqs; 10673.0 average length
5. (--readType subreads --minReadScore 0.75): 1339662407 bp / 124756 seqs; 10738.3 average length
6. (--readType subreads --minReadScore 0.8): 1287559618 bp / 118740 seqs; 10843.5 average length
7. (--readType subreads --minReadScore 0.85): 1038658565 bp / 92346 seqs; 11247.5 average length
8. (--readType subreads --minReadScore 0.9): 654807 bp / 121 seqs; 5411.6 average length
9. (--readType subreads --minReadScore 0.95): no sequence extracted!
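
The figures in the list above (total bases, number of reads, average length) can be recomputed for any extracted file. The following is a minimal sketch, assuming FASTQ records of exactly four lines each (FASTA sequences may be wrapped). The names `read_lengths`, `summarize`, and `format_summary` are illustrative and are not part of pbh5tools.

```python
import io


def read_lengths(handle):
    """Yield the sequence length of each record in a FASTA or FASTQ stream.

    Assumption: FASTQ records occupy exactly four lines (unwrapped sequence).
    """
    first = handle.readline()
    if not first.strip():
        return  # empty input
    if first.startswith("@"):  # FASTQ
        while first.strip():
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality line
            yield len(seq)
            first = handle.readline()
    elif first.startswith(">"):  # FASTA
        length = 0
        for line in handle:
            if line.startswith(">"):
                yield length
                length = 0
            else:
                length += len(line.strip())
        yield length
    else:
        raise ValueError("input is neither FASTA nor FASTQ")


def summarize(handle):
    """Return (total bases, number of reads, mean read length)."""
    total = count = 0
    for n in read_lengths(handle):
        total += n
        count += 1
    mean = total / count if count else 0.0
    return total, count, mean


def format_summary(total, count, mean):
    """Format the statistics in the same style as the list above."""
    return f"{total} bp / {count} seqs; {mean:.1f} average length"


# Tiny in-memory example; for a real file use open("myreads.fastq").
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(format_summary(*summarize(demo)))
# prints "10 bp / 2 seqs; 5.0 average length"
```

For the multi-gigabase files discussed here the function streams line by line, so memory use stays small.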

As --minReadScore approaches 0.9, the amount of output drops sharply. With --minReadScore 0.75, the output is almost identical in size to the subreads file in the Analysis Results subdirectory.
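
The comparison can be made explicit with a little arithmetic on the numbers listed above. This is only a sketch: the figures are copied from the list, while the scoring rule (sum of the relative differences in bases and read count) is my own assumption, chosen for illustration.

```python
# Reference: subreads.fastq from the Analysis Results subdirectory
# (total bp, number of reads), copied from the list above.
REFERENCE = (1339659706, 124615)

# --minReadScore value -> (total bp, number of reads), copied from the list above.
RUNS = {
    0.70: (1347690522, 126271),
    0.75: (1339662407, 124756),
    0.80: (1287559618, 118740),
    0.85: (1038658565, 92346),
    0.90: (654807, 121),
}


def relative_diff(run, ref=REFERENCE):
    """Return the relative differences (bp, read count) between run and ref."""
    return tuple(abs(r - x) / x for r, x in zip(run, ref))


# Assumed scoring rule: the smaller the summed relative difference, the closer.
closest = min(RUNS, key=lambda score: sum(relative_diff(RUNS[score])))
print(closest)  # prints 0.75

# Absolute gap between the 0.75 run and the reference file.
bp_gap = RUNS[0.75][0] - REFERENCE[0]
read_gap = RUNS[0.75][1] - REFERENCE[1]
print(bp_gap, read_gap)  # prints "2701 141"
```

The 0.75 run differs from the reference by only 2701 bp and 141 reads. The two sets are close but not identical, so the filter used for the Analysis Results files probably involves more than a read score cutoff of 0.75.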

SMRT analysis

SMRT analysis system requirements (PDF document)
v2.3.0 installation guide (PDF document)

Canu

Official documentation

Getting it running

$ canu -p BRC5 -d canu_BRC5-3cells_2nd genomeSize=3.7m useGrid=false -pacbio-raw BRC5_raw/BRC5-3cells_raw.fasta

If an error related to gnuplot occurs, refer to the following message. If typing 'gnuplot' on the command line runs it without any error, there is nothing to worry about.

ERROR:  Failed to run gnuplot from 'gnuplot'.
ERROR:  Set option gnuplot=<path-to-gnuplot> or gnuplotTested=true to skip this test and not generate plots.
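
The two options named in the error message can be chosen automatically before launching Canu. The sketch below is a hypothetical helper (`gnuplot_options` is not part of Canu). It only checks whether a `gnuplot` executable is on the PATH, which is weaker than Canu's own test of actually running gnuplot. The rest of the command line is the one shown earlier on this page.

```python
import shutil


def gnuplot_options(which=shutil.which):
    """Return extra Canu options depending on whether gnuplot is on PATH.

    `which` is injectable so the logic can be checked without gnuplot installed.
    """
    path = which("gnuplot")
    if path:
        # Point Canu at the executable that was found.
        return [f"gnuplot={path}"]
    # gnuplot not found: skip Canu's gnuplot test and do without plots.
    return ["gnuplotTested=true"]


cmd = [
    "canu", "-p", "BRC5", "-d", "canu_BRC5-3cells_2nd",
    "genomeSize=3.7m", "useGrid=false",
    *gnuplot_options(),
    "-pacbio-raw", "BRC5_raw/BRC5-3cells_raw.fasta",
]
print(" ".join(cmd))
# To launch the assembly: subprocess.run(cmd, check=True)
```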

Should the true raw data be used as-is, or should the filtered reads be retrieved from SMRT portal and used instead?

SPAdes

Falcon

CLC Genomics Workbench

Unicycler (hybrid assembler)

(Paper) Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads (2017) PMC
GitHub Quick usage

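Unicycler is a very recently released short-read-first hybrid assembler. Its assembly graph, written in GFA format, is best visualized with Bandage (PubMed GitHub Documentation), which the same developer published in 2015. Unicycler also performs circularization without external tools such as Circlator.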

Installation

Installed under /data/anaconda2 using Bioconda (py35 environment).

Usage

$ unicycler -t 24 -1 MA-KW_1.fastq -2 MA-KW_2.fastq -l MA-KW_pacbio.fastq -o unicycler_run_20180518_1

When reassembling with reads that were already corrected in a previous run, running Unicycler as follows should save time.

$ unicycler -t 24 -1 CorrectedReads_1.fastq.gz -2 CorrectedReads_2.fastq.gz -s unparedShortReads.fastq --no_correct -l longreads.gz -o outDirectory
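
The two invocations above differ only in which read files are passed and whether --no_correct is set. A small wrapper makes it harder to mix up the -1 and -l options. This is a sketch only: `unicycler_command` and its parameter names are hypothetical, and it uses just the options that appear on this page (-t, -1, -2, -s, -l, -o, --no_correct).

```python
def unicycler_command(out_dir, long_reads, short_1=None, short_2=None,
                      unpaired=None, threads=24, corrected=False):
    """Build a Unicycler argument list (hypothetical helper, not part of Unicycler)."""
    if (short_1 is None) != (short_2 is None):
        raise ValueError("paired short reads need both -1 and -2 files")
    cmd = ["unicycler", "-t", str(threads)]
    if short_1:
        cmd += ["-1", short_1, "-2", short_2]
    if unpaired:
        cmd += ["-s", unpaired]
    if corrected:
        # Short reads were already corrected in a previous run.
        cmd.append("--no_correct")
    cmd += ["-l", long_reads, "-o", out_dir]
    return cmd


# First run: raw short reads plus PacBio reads.
print(" ".join(unicycler_command(
    "unicycler_run_20180518_1", "MA-KW_pacbio.fastq",
    short_1="MA-KW_1.fastq", short_2="MA-KW_2.fastq")))
# To execute: subprocess.run(cmd, check=True)
```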
