This is an old revision of the document!

Extracting mapped reads

Reference sequence에 매핑된 read를 뽑아내는 방법은 몇 가지가 존재한다. CLC Genomics Workbench에서도 다음 사이트를 참조하여 실행하면 된다.

그러나 이 방법이 특정 reference sequence(혹은 특정 region)에 mapping된 read만을 꺼낼 수 있는지, 혹은 reference set 전체에 대해서도 적용 가능한지는 자세히 확인하지 않았다.

이 페이지에서는 read pair 중 어느 하나만 reference에 mapping이 되었어도 pair를 전부 추출하여 후속 분석 작업에 활용하는 방법을 알아보고자 한다. 다음 그림을 참조하면 쉽게 이해가 될 것이다.

사용할 read mapper는 Sanger Institute의 SMALT이다.

SMALT employs a hash index of short words up to 20 nucleotides long and sampled at equidistant steps along the reference genome. For each sequencing read, potentially matching segments in the reference genome are identified from seed matches in the index and subsequently aligned with the read using dynamic programming.

(SMAT) Indexing & mapping

SMALT v0.7.6은 bioconda 환경으로 설치해 두었다(base environment).

$ smalt index -k 14 -s 8 REF reference.fasta
$ smalt map -n 16 -f bam -o mapped.bam REF reads_1.fastq reads_2.fasta  
$ samtools stats mapped.bam | grep ^SN | cut -f 2- # mapping report 출력

Mapped read를 pair 형태로 꺼내는 방법

나의 질문은 Biostars의 다음 질문과 매우 흡사하다.

Extract One-End Mapped Paired-End Reads From Bwa Bam/Sam Files
Collect Read Pairs Where At Least One Read Is Mapped
(SeqAnswers) SAM: Extract both pairs where at least one pair mapped

다음의 방법은 유일한 것도 아니고 가장 효율적인 것도 아닐 것이다. 그러나 원칙에 가장 충실한 방법일 것이라 자부한다. Read 목록을 참조하여 원본 read file에서 다시 서열을 뽑아내는 유틸리티로서 seqtk를 사용하였다. 이것 역시 bioconda로 재설치하였다.

$ samtools view -F 12 mapped.bam | awk 'NR%2!=0{print $1}' > list_both
$ samtools view -f 8 -F 4 mapped.bam | awk '{print $1}' > list_single
$ cat list_both list_single > list_all
$ seqtk subseq reads_1.fastq list_all > subset_1.fastq
$ seqtk subseq reads_2.fastq list_all > subset_1.fastq

Genome Informatics Laboratory at KRIBB

Table of Contents

Extracting mapped reads

(SMAT) Indexing & mapping

Mapped read를 pair 형태로 꺼내는 방법