Manipulation of FASTQ files

본 위키 사이트 내의 여러 페이지에서 간단한 fastq 파일 조작법을 필요한 곳에 소개해 두었다. 이 페이지에서는 이들을 종합함과 동시에 새로운 기법도 소개함을 목적으로 한다.

포맷 전환 1: fastq => fasta

$ fastq_to_fasta -Q33 -v -i infile.fastq -o outfile.fasta # FASTX_toolkit (-v for verbose)
$ fastq2fasta.pl -a infile.fastq # Brian Knaus의 스크립트; 출력파일은 infile.fa
$ seqtk seq -a infile.fastq > outfile.fa # Seqtk
$ fq2fa --merge --filter infile_1.fastq infile_2.fastq oufile.fa # idba에 포함된 명령어
$ (slow!) paste - - - - < infile.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > outfile.fa

마지막 실행문은 paired file(2개)을 하나의 interleaved file로 병합하면서 동시에 N을 포함한 read를 제거하는 것이다. fq2fa –paired는 실제로 효과가 있는지를 잘 모르겠다.

포맷 전환 2: two paired files => one interleaved file

$ seqtk mergepe infile_1.fastq infile_2.fastq > outfile.pe.fq
$ interleave-reads.py -o outfile.pe.fastq infile_1.fastq infile_2.fastq # khmer
$ shuffleSequences_fastq.pl infile_1.fastq infile_2.fastq oufile.pe.fastq # velvet

마지막 실행문은 velvet 패키지의 contrib 디렉토리에 들어있다(shuffleSequences_fasta.pl, shuffleSequences_fasta.sh, shuffleSequences_fasta.py 및 shuffleSequences_fastq.pl)

포맷 전환 3: one interleaved file => two paired files

$ seqtk seq -1 infile.pe.fastq > outfile_1.fastq; seqtk seq -2 infile.pe.fastq > outfile_2.fastq

포맷 전환 4: one imperfect interleaved file => paired files + orphan file

프로그램에 따라서는 모든 read가 짝을 이룬 interleaved file과 orphan read file을 엄격히 구별하여 공급해야만 하는 것들이 있다. 그러나 각종 필터나 digital normalization을 거치는 과정에서 interleaved file 내에서 read들의 짝 관계가 깨지는 경우가 있다. khmer 패키지의 부속 스크립트인 extract-paired-reads.py는 이와 같이 짝 관계가 깨어진 interleaved file을 받아들여서 <input_file>.pe 및 <input_file>.se 파일을 만들어낸다. 단순히 한쪽 end read의 길이가 짧아진 것은 손상된 것으로 치치 않는다.

$ extract-paired-reads.py tests/test-data/paired.fq

기준 길이 미만의 read는 버리기

HiSeq과는 달리 MiSeq에서 생산된 read는 적은 분량이나마 정해진 사이클 수보다 짧은 read가 섞인 상태이다. 이를 quality trimming과 adapter removal을 거치고 나면 read length의 분포는 다음 그림처럼 크게 변한다. 특히 orphan read(unpaired)의 평균 길이는 pair를 유지한 것에 비하여 더욱 짧아진다.

후속 프로그램에 따라서는 일정한 길이의 read를 필요로하는 것도 있으므로, 전처리를 거치면서 길이 분포가 들쑥날쑥해진 read를 정리하여 특정 cutoff 이상의 것만을 모으는 방법을 알아보자. 이 작업에는 SolexaQA++이 유용하다. SolexaQA++에는 원래 quality trimming 기능이 있지만 본 문서에서는 다른 도구를 이용하여 trimming을 실시한 상황을 가정하고 있다.

$ SolexaQA++ lengthsort input_files (one single-end or two paired-end FASTQ files) [-l|length 25] [-d|directory path]

Options:
-l|--length       length cutoff [defaults to 25 nucleotides]
-d|--directory    path to directory where output files are saved
-c|--correct      when running in paired mode, removes unpaired reads from the two fastq files, saves them into two new *.fastq.clean files, and normally processes them.

단일 파일에 대해서는 아래와 같이 실행한다. interleaved file도 아래의 방법으로 실행한 다음, 결과 파일을 extract-paired-reads.py(khmer package)로 처리하면 된다.

$ SolexaQA++ lengthsort -l 85 infile.fastq # SolexaQA++ v3.1.3

결과 파일은 (1) infile.discard, (2) infile.single, (3) infile.summary.txt, (4) infile.summary.txt.pdf가 만들어진다. 만약 입력 파일이 paired fastq file 2개라면 -c (–correct) 옵션을 주면 된다. cutoff length(-l num)의 기본값은 25이다.

$ SolexaQA++ lengthsort -l 85 -c infile_1.fastq infile_2.fastq
...
Paired reads were written to:
/path-to/infile_1.fastq.clean
/path-to/infile_2.fastq.clean
...
Writing files...

화면에 나타나는 메시지만를 보면 마치 *clean 파일 두 개에 길이 기준을 통과한 깨끗한 read가 기록되었을 것만 같다. 그러나 infile_1.fastq와 infile_1.fastq.clean은 diff로 비교해보면 똑같은 파일이다(_2도 마찬가지)! 왜 사실상 동일한 파일을 .clean이라는 이름으로 새로 작성하는지, 실행 메시지는 왜 저렇게 혼동스럽게 출력하는지 알다가도 모를 일이다. 각 single file에 대해서 길이 기준을 실제로 길이 기준을 통과함과 동시에 짝을 이루는 read는 infile_1.paired와 infile_2.paired에 기록이 되고, paired/single/discard read의 집계 수치는 infile_1.clean.summary.txt(.pdf)에 저장된다.

기타 유용한 유틸리티

FASTX-Toolkit - 더 이상 설명이 필요없는 FASTQ/A 파일 처리 유틸리티의 고전. 아직도 버전은 0.0.14이다.
seqtk: toolkit for processing sequences in FASTA/Q formats
seqkit: a cross-platform and ultrafasta toolkit for FASTA/Q file manaipulation
sickle - a windowed adaptive trimming tools for FASTQ files using quality
cmpfastq - a simple perl program that allows the user to compare QC filtered fastq files. 퍄일 짝을 맞추는 가장 원초적인 도구이다. 그러나 최신 MiSeq read에 대해서는 read ID를 parsing하는 방법이 잘 작동하지 않을 수 있다(Problems with cmpfastq, can't process my fastq /1 and /2 files). 이에 대해서는 BBMap 패키지의 repair.sh를 사용하라는 제안이 있었다.
Brian Bushnell(JGI)의 BBTools - 어쩌면 모든 해답이 여기에 다 들어있는지도 모른다.
trimmomatic
khmer

Genome Informatics Laboratory at KRIBB

Table of Contents