====== Application of PacBio long-reads sequencing technology (software) ======
This page should be reorganized so that it also covers sequencing applications from Oxford Nanopore Technologies (ONT). In other words, it should cover all long-read, single-molecule sequencing.
===== Reading material =====
  * (Review article) PacBio sequencing and its applications (2015) [[http://www.sciencedirect.com/science/article/pii/S1672022915001345|link]]
  * [[http://www.pacb.com/smrt-science/smrt-resources/blog/|PacBio blog]]
  * [[http://www.pacb.com/products-and-services/analytical-software/|PacBio Analytical software]] - SMRT Analysis includes SMRT Portal, the SMRT Analysis APIs, and SMRT View. [[http://www.pacb.com/products-and-services/analytical-software/smrt-analysis/analysis-applications/|Applications]]
  * [[https://github.com/PacificBiosciences/SMRT-Analysis/wiki|PacBio SMRT analysis wiki]]
  * [[https://github.com/PacificBiosciences/Bioinformatics-Training/wiki|PacBio bioinformatics training wiki]]

===== pbh5tools =====
A Swiss-army knife for interrogating PacBio HDF5 files (cmp.h5, bas.h5)
  * https://github.com/PacificBiosciences/pbh5tools
  * https://github.com/PacificBiosciences/pbh5tools/blob/master/doc/index.rst (usage)

  * **bash5tools.py** extracts read sequences and quality values for both raw and circular consensus sequencing (CCS) read types and writes them to FASTQ or FASTA files.
  * **cmph5tools.py** is described as a tool for handling files in the PacBio Alignment File Format (cmp.h5, [[https://pacbiofileformats.readthedocs.io/en/3.0/legacy/CmpH5Spec.html|link]]), but I have not needed it yet.

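--readType accepts ccs, subreads, or unrolled. With ccs, CCS reads are extracted if the bas.h5 file contains them. I am not sure what unrolled means. Does it mean the raw reads exactly as they are? The counts listed further down (fewer but longer reads than with subreads) would be consistent with unrolled being the whole polymerase read, not split at the adapters, but I have not confirmed this.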

  $ bash5tools.py input.bas.h5 --outFilePrefix myreads --outType fasta --readType subreads --minReadScore 0.75

I compared the statistics of the p0.[1-3].subreads.fast{a|q} files in the Analysis Results subdirectory with those of the reads extracted from the bas.h5 files with bash5tools.py. Items 2 onward were produced with bash5tools.py.

  - subreads.fastq, all files combined: 1339659706 bp / 124615 seqs; 10750.4 average length
  - (--readType unrolled): 1395878244 bp / 101548 seqs; 13746.0 average length
  - (--readType subreads): 1394529365 bp / 131741 seqs; 10585.4 average length
  - (--readType subreads --minReadScore 0.7): 1347690522 bp / 126271 seqs; 10673.0 average length
  - (--readType subreads **--minReadScore 0.75**): 1339662407 bp / 124756 seqs; 10738.3 average length
  - (--readType subreads --minReadScore 0.8): 1287559618 bp / 118740 seqs; 10843.5 average length
  - (--readType subreads --minReadScore 0.85): 1038658565 bp / 92346 seqs; 11247.5 average length
  - (--readType subreads --minReadScore 0.9): 654807 bp / 121 seqs; 5411.6 average length
  - (--readType subreads --minReadScore 0.95): no sequence extracted!

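The amount of output drops sharply as --minReadScore approaches 0.9: from about 1.04 Gbp at 0.85 to about 0.65 Mbp at 0.9. With --minReadScore 0.75 the output (1339662407 bp / 124756 seqs) is almost identical to the subreads files in the Analysis Results subdirectory (1339659706 bp / 124615 seqs).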
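
The "bp / seqs / average length" figures in the comparison above can be reproduced with a small awk wrapper. This is a minimal sketch, not necessarily the tool that produced the numbers on this page. The function name ''fasta_stats'' is hypothetical, and the sketch assumes an uncompressed FASTA file (not FASTQ) as input.

```shell
# fasta_stats (hypothetical helper): print total bases, number of sequences,
# and mean read length for one uncompressed FASTA file.
fasta_stats() {
  awk '
    /^>/ { n++; next }                            # header line: count one sequence
    { gsub(/[ \t\r]/, ""); bp += length($0) }     # sequence line: add its length
    END {
      if (n == 0) { print "no sequence found"; exit }
      printf "%.0f bp / %d seqs; %.1f average length\n", bp, n, bp / n
    }
  ' "$1"
}
```

Running ''fasta_stats'' on each FASTA file written by bash5tools.py at a given --minReadScore gives one line per file in the same layout as the list above.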
===== SMRT analysis =====
  * SMRT Analysis system requirements [[http://www.pacb.com/wp-content/uploads/2015/09/SMRT-Analysis-System-Requirements.pdf|PDF document]]
  * v2.3.0 installation guide [[http://www.pacb.com/wp-content/uploads/2015/09/SMRT-Analysis-Software-Installation-v2.3.0.pdf|PDF document]]
===== Canu =====
[[http://canu.readthedocs.io/en/latest/index.html|Official documentation]]

==== Quick first run ====

  $ canu -p BRC5 -d canu_BRC5-3cells_2nd genomeSize=3.7m useGrid=false -pacbio-raw BRC5_raw/BRC5-3cells_raw.fasta

If a gnuplot-related error occurs, refer to the message below. If typing 'gnuplot' at the command line starts it without any error, there is nothing to worry about.
  ERROR:  Failed to run gnuplot from 'gnuplot'.
  ERROR:  Set option gnuplot=<path-to-gnuplot> or gnuplotTested=true to skip this test and not generate plots.
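
The choice between the two options named in the error message can be scripted. The following is a minimal sketch under stated assumptions: the function name ''canu_gnuplot_opt'' is hypothetical, and the only canu options used are the two that appear in the message itself (''gnuplot=<path>'' and ''gnuplotTested=true'').

```shell
# canu_gnuplot_opt (hypothetical helper): print the gnuplot-related option to
# append to a canu command line. Both option names are taken from the error
# message; nothing else about canu is assumed.
canu_gnuplot_opt() {
  if gp=$(command -v gnuplot 2>/dev/null) && [ -x "$gp" ]; then
    printf 'gnuplot=%s\n' "$gp"      # executable found on PATH: pass its full path
  else
    printf 'gnuplotTested=true\n'    # not found: skip the test, no plots are made
  fi
}
```

The output can be appended to the canu command, for example by adding ''$(canu_gnuplot_opt)'' after ''useGrid=false''.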
  
**Should the true raw data be used as is, or should the filtered reads be retrieved from SMRT Portal and used instead?**
===== SPAdes =====
===== Falcon =====
===== CLC Genomics Workbench =====
===== Unicycler (hybrid assembler) =====
  * (Paper) Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads (2017) [[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5481147/|PMC]]
  * [[https://github.com/rrwick/Unicycler|GitHub]] [[https://github.com/rrwick/Unicycler#quick-usage|Quick usage]]

Unicycler is a recently released short-read-first hybrid assembler. Its GFA-format assembly graph is best visualized with Bandage ([[https://www.ncbi.nlm.nih.gov/pubmed/26099265|PubMed]] [[https://github.com/rrwick/Bandage|GitHub]] [[https://github.com/rrwick/Bandage/wiki|Documentation]]), which the same developer published in 2015. Unicycler also performs circularization itself, without an external tool such as Circlator.

==== Installation ====
Installed under /data/anaconda2 with Bioconda (py35 environment).

==== Usage ====

  $ unicycler -t 24 -1 MA-KW_1.fastq -2 MA-KW_2.fastq -l MA-KW_pacbio.fastq -o unicycler_run_20180518_1

When reassembling with reads that were already corrected in a previous run, running Unicycler as follows should save time.

  $ unicycler -t 24 -1 CorrectedReads_1.fastq.gz -2 CorrectedReads_2.fastq.gz -s unparedShortReads.fastq --no_correct -l longreads.gz -o outDirectory