Differences

This shows you the differences between two versions of the page.

--- bioinfo:long_read_sequencing_결과물_다루기 [2023/08/17 09:23] – [Long read의 reference mapping] hyjeong
+++ bioinfo:long_read_sequencing_결과물_다루기 [2023/08/17 09:33] (current) – [Long read의 reference mapping] hyjeong
@@ Line 18: / Line 18: @@
 ==== Canu ====
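+For eukaryotic genomes, canu can outperform current hybrid methods with as little as about 20x of sequencing data. Even so, it is preferable to start an assembly with at least 30 to 60x. The commands below download a 25x E. coli PacBio dataset and assemble it. For Nanopore data, use the -nanopore option instead (sample data download [[https://nanopore.s3.climb.ac.uk/MAP006-PCR-1_2D_pass.fasta|link]]). Input reads may be in FASTA or FASTQ format and may be compressed with gz/bz2/xz. All output files are written under the directory given with -d, using the string given with -p as the file-name prefix. The estimated size of the target genome must also be supplied as an input parameter (genomeSize).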
+  $ curl -L -o pacbio.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq
+  $ canu -p ecoli -d ecoli-pacbio  genomeSize=4.8m -pacbio pacbio.fastq
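The coverage figures quoted above (20x, 30 to 60x, 25x) are total read bases divided by the genome size passed as genomeSize. The following is a minimal sketch for checking that number before starting canu, assuming an uncompressed FASTQ with exactly four lines per record. The function name estimate_coverage and the file toy.fastq are hypothetical and not part of canu.

```shell
# Estimate sequencing depth as (total read bases) / (genome size in bp).
# Assumption: the FASTQ is uncompressed and uses four lines per record.
estimate_coverage() {
  # $1: FASTQ file, $2: genome size in bp
  LC_ALL=C awk -v g="$2" '
    NR % 4 == 2 { bases += length($0) }      # sequence is the 2nd line of each record
    END         { printf "%.1f\n", bases / g }
  ' "$1"
}

# Toy input (hypothetical): two 10-bp reads against a 10-bp "genome".
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' > toy.fastq
estimate_coverage toy.fastq 10
```

For the E. coli example, the call would be estimate_coverage pacbio.fastq 4800000, using the same 4.8 Mb estimate given to genomeSize.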
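+Among the assembly outputs, <prefix>.contigs.fasta holds the sequences that cover both unique and repetitive elements. In contrast, <prefix>.unitigs.fasta holds the sequences split at alternate paths. The description field of each sequence carries metadata such as the following, which helps in interpreting the results.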
+  >tig00000001 len=2449075 reads=6437 class=contig suggestRepeat=no suggestBubble=no suggestCircular=yes
+  >tig00000002 len=27634 reads=884 class=contig suggestRepeat=no suggestBubble=no suggestCircular=no
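The key=value fields in these description lines are easy to tabulate. The sketch below is one possible way to do it with awk, using the two header lines shown above. The file name example.contigs.fasta, the placeholder sequence lines, and the function name summarize_tigs are hypothetical.

```shell
# Build a small FASTA from the two example headers (sequence lines are placeholders).
cat > example.contigs.fasta <<'EOF'
>tig00000001 len=2449075 reads=6437 class=contig suggestRepeat=no suggestBubble=no suggestCircular=yes
ACGT
>tig00000002 len=27634 reads=884 class=contig suggestRepeat=no suggestBubble=no suggestCircular=no
ACGT
EOF

# Print name, length, and the suggestCircular flag of every header as tab-separated columns.
summarize_tigs() {
  awk '/^>/ {
    name = substr($1, 2); len = ""; circ = ""
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")                     # kv[1] is the key, kv[2] the value
      if (kv[1] == "len") len = kv[2]
      if (kv[1] == "suggestCircular") circ = kv[2]
    }
    printf "%s\t%s\t%s\n", name, len, circ
  }' "$1"
}

summarize_tigs example.contigs.fasta
```

With this input, the first row reports tig00000001 as 2449075 bp with suggestCircular=yes, which is what one would expect for a closed bacterial replicon.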
+For more precise definitions of contig and unitig, see the [[https://wgs-assembler.sourceforge.net/wiki/index.php/Celera_Assembler_Terminology|Celera Assembler Terminology]] wiki page. A question-and-answer thread on ResearchGate ([[https://www.researchgate.net/post/Contig-vs-Unitig|Contig vs. Unitig]]) and the [[https://en.wikipedia.org/wiki/Hierarchical_Data_Format|PacBio hybrid assembly manual]] linked from it are also helpful.
+The command introduced above runs correction, trimming, and assembly of raw long reads in a single invocation, but the stages can also be run one at a time with the -correct, -trim, and -assemble options while tuning the detailed parameters of each.
 ==== UniCycler ====
+[[https://github.com/rrwick/Unicycler|UniCycler]] is a program designed for assembling bacterial genomes. With Illumina data only it works as a SPAdes optimiser, whereas with long-read data only it works as a miniasm+Racon pipeline. It produces its most accurate results, however, when run as a hybrid assembler that uses Illumina data and long reads together.
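+In a typical hybrid assembly, short reads (Illumina) are aligned to the long reads to correct their errors, and the corrected reads are then assembled with the overlap-layout-consensus approach. In [[https://github.com/rrwick/Unicycler#method-hybrid-assembly|UniCycler's hybrid assembly]], by contrast, the Illumina data is first assembled with SPAdes into an assembly graph, and the long reads are then added to resolve repeats and obtain a complete genome sequence. In particular, circularisation (the job of a tool such as circlator) and pilon polishing are handled in the final stages, so no separate post-processing is needed.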
+The number of threads defaults to 8 unless it is set with the -t or %%--%%threads option. There are three [[https://github.com/rrwick/Unicycler#conservative-normal-and-bold|run modes]], conservative, normal, and bold, selected with the %%--%%mode option. If no mode is specified, normal mode is used.
+{{ :bioinfo:conservative_normal_bold.png?400 |Unicycler run modes}}
+  $ unicycler -1 short_1.fastq -2 short_2.fastq -l long.fasta -o OUT_DIR -t 16
+If you already have a trusted long-read assembly in [[http://gfa-spec.github.io/GFA-spec/|GFA format]], pass it with the %%--%%existing_long_read_assembly option.
 ==== Flye ====
@@ Line 37: / Line 61: @@
 ==== Assembly graph의 구조 확인 ====
+Use [[https://rrwick.github.io/Bandage/|Bandage]] to visualize the graphs that de novo assemblers produce, such as LastGraph (Velvet), FASTG (SPAdes), or GFA. The program is available for Mac, Linux, and Windows.
+  $ Bandage load assembly_graph.gfa
+Running Bandage with no arguments starts the Bandage GUI. Load a graph via File -> Load graph, then click the 'Draw graph' button to render the graph on screen.
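Before opening a large graph in the GUI, a quick text-level check of the GFA file can be useful. The sketch below counts segment (S) and link (L) records with awk. The file toy.gfa and the function name count_gfa_records are hypothetical, and the check relies only on the GFA convention that the first tab-separated field gives the record type.

```shell
# Count segment (S) and link (L) records in a GFA file.
count_gfa_records() {
  awk -F'\t' '
    $1 == "S" { s++ }                        # one S line per graph node (sequence)
    $1 == "L" { l++ }                        # one L line per edge between two segments
    END { printf "segments=%d links=%d\n", s, l }
  ' "$1"
}

# Toy graph (hypothetical): two segments joined by a single link.
printf 'H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\tGGCC\nL\t1\t+\t2\t+\t0M\n' > toy.gfa
count_gfa_records toy.gfa
```

A single segment with one link back to itself is what a completed circular replicon usually looks like, while many segments with many links point to unresolved repeats.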
 ===== Long read의 reference mapping =====
@@ Line 47: / Line 76: @@
   $ minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam # for Oxford Nanopore reads
   # Convert SAM to BAM
-  $ samtools view –b –S –o aln.bam aln.sam # samtools version  1.9 (for ‘-o’ option)
+  $ samtools view -b -S -o aln.bam aln.sam # samtools version # 1.9 (for '-o' option)