Differences

This shows you the differences between two versions of the page.

--- manipulation_of_fastq_files [2017/04/19 15:38] – [Format conversion 1(fastq => fasta)] hyjeong
+++ manipulation_of_fastq_files [2022/03/30 09:07] (current) – [Format conversion 3: one interleaved file => two paired files] hyjeong
@@ Line 2: / Line 2: @@
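 Simple fastq file manipulation methods have been introduced where needed on several pages of this wiki. This page aims to bring them together and also to introduce new techniques.
-===== Format conversion-1: fastq => fasta =====
+===== Format conversion 1: fastq => fasta =====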
   $ fastq_to_fasta -Q33 -v -i infile.fastq -o outfile.fasta # FASTX_toolkit (-v for verbose)
   $ fastq2fasta.pl -a infile.fastq # Brian Knaus's script; the output file is infile.fa
   $ seqtk seq -a infile.fastq > outfile.fa # Seqtk
   $ fq2fa --merge --filter infile_1.fastq infile_2.fastq outfile.fa # command included in idba
+  $ paste - - - - < infile.fastq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > outfile.fa # slow!
 The fq2fa command merges the two paired files into one interleaved file and, at the same time, removes reads that contain N. I am not sure whether fq2fa --paired actually has any effect.
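The paste-based one-liner can also be written as a single awk pass, which may be worth trying on large files. This is only a minimal sketch: the function name fastq_to_fasta_awk is made up for illustration, and it assumes that every FASTQ record is exactly four lines (no wrapped sequences).

```shell
# Sketch: convert 4-line FASTQ records to FASTA in one awk pass.
# Assumption: each record is exactly four lines (header, sequence, +, quality).
fastq_to_fasta_awk() {
    awk '
        NR % 4 == 1 { print ">" substr($0, 2) }   # header line: swap the leading @ for >
        NR % 4 == 2 { print }                      # sequence line: copy unchanged
    ' "$1"
}
```

Usage: fastq_to_fasta_awk infile.fastq > outfile.fa. Because lines are selected by position rather than by pattern, a quality line that happens to begin with @ is never mistaken for a header.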
-===== Format conversion 2(two paired files => one interleaved file) =====
+===== Format conversion 2: two paired files => one interleaved file =====
   $ seqtk mergepe infile_1.fastq infile_2.fastq > outfile.pe.fq
   $ interleave-reads.py -o outfile.pe.fastq infile_1.fastq infile_2.fastq # khmer
@@ Line 15: / Line 16: @@
 The last command is found in the contrib directory of the velvet package (shuffleSequences_fasta.pl, shuffleSequences_fasta.sh, shuffleSequences_fasta.py, and shuffleSequences_fastq.pl).
-===== Format conversion 3(one interleaved file => two paired files) =====
+===== Format conversion 3: one interleaved file => two paired files =====
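When none of the tools above is installed, interleaving can be sketched with awk alone. The function name interleave_fastq is hypothetical, and the sketch assumes that both files hold 4-line records, in the same order, with the same number of reads; it does not check read names.

```shell
# Sketch: interleave two paired FASTQ files (mate 1 record, then mate 2 record).
# Assumptions: 4-line records, identical read order and read count in both files.
interleave_fastq() {
    awk -v mate2="$2" '
        { print }                                  # copy the current mate-1 line
        NR % 4 == 0 {                              # a full mate-1 record has been printed
            for (i = 0; i < 4; i++) {
                if ((getline line < mate2) > 0) {
                    print line                     # follow it with the matching mate-2 record
                }
            }
        }
    ' "$1"
}
```

Usage: interleave_fastq infile_1.fastq infile_2.fastq > outfile.pe.fq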
-  $ seqtk seq -1 infile.pe.fastq > outfile_1.fastq; seqrtk seq -2 infile.pe.fastq > outfile_2.fastq
+  $ seqtk seq -1 infile.pe.fastq > outfile_1.fastq; seqtk seq -2 infile.pe.fastq > outfile_2.fastq
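The same split can be sketched without seqtk. The function name deinterleave_fastq is hypothetical; the sketch assumes a perfectly interleaved input with 4-line records, so that every block of eight lines is one pair.

```shell
# Sketch: split a perfectly interleaved FASTQ file into two paired files.
# Assumption: every 8 lines form one pair (4 lines for mate 1, then 4 for mate 2).
deinterleave_fastq() {
    awk -v out1="$2" -v out2="$3" '
        {
            if ((NR - 1) % 8 < 4) {
                print > out1        # lines 1-4 of each block of 8: mate 1
            } else {
                print > out2        # lines 5-8 of each block of 8: mate 2
            }
        }
    ' "$1"
}
```

Usage: deinterleave_fastq infile.pe.fastq outfile_1.fastq outfile_2.fastq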
 ===== Format conversion 4: one imperfect interleaved file => paired files + orphan file =====
+Some programs require that an interleaved file in which every read is paired and an orphan read file be supplied strictly separately. However, passing reads through various filters or digital normalization can break the pairing of reads within an interleaved file. extract-paired-reads.py, a script bundled with the khmer package, takes such an interleaved file with broken pairing and produces <input_file>.pe and <input_file>.se files. A pair in which one end read has merely become shorter is not treated as broken.
+  $ extract-paired-reads.py tests/test-data/paired.fq
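To make the idea concrete, the pair/orphan split can be sketched in awk. This is not the khmer implementation: the function name split_pairs_sketch is hypothetical, records are assumed to be four lines each, and two consecutive reads are treated as a pair simply when their names match after a trailing /1 or /2 is removed (the order of /1 and /2 is not checked).

```shell
# Sketch (not khmer code): split a broken interleaved FASTQ into pairs and orphans.
# $1: interleaved FASTQ, $2: output for intact pairs, $3: output for orphan reads.
split_pairs_sketch() {
    : > "$2"; : > "$3"          # create both outputs even if one stays empty
    awk -v pe="$2" -v se="$3" '
        NR % 4 == 1 {
            name = $1                       # read ID up to the first whitespace
            sub(/\/[12]$/, "", name)        # drop a trailing /1 or /2
        }
        { rec = rec $0 "\n" }               # accumulate the current 4-line record
        NR % 4 == 0 {
            if (held != "" && name == held_name) {
                printf("%s%s", held, rec) > pe      # names match: intact pair
                held = ""
            } else {
                if (held != "") printf("%s", held) > se   # previous read had no mate
                held = rec
                held_name = name
            }
            rec = ""
        }
        END { if (held != "") printf("%s", held) > se }   # last read left without a mate
    ' "$1"
}
```

Usage: split_pairs_sketch infile.pe.fastq infile.pe.fastq.pe infile.pe.fastq.se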
 ===== Discarding reads shorter than a length cutoff =====
@@ Line 33: / Line 35: @@
   -d|--directory    path to directory where output files are saved
   -c|--correct      when running in paired mode, removes unpaired reads from the two fastq files, saves them into two new *.fastq.clean files, and normally processes them.
-For a single file, run it as follows.
+For a single file, run it as follows. An interleaved file can also be processed this way, after which the resulting file is passed to extract-paired-reads.py (khmer package).
   $ SolexaQA++ lengthsort -l 85 infile.fastq # SolexaQA++ v3.1.3
-The output files infile.discard infile.single infile.summary.txt and infile.summary.txt.pdf are created. If the input is two paired fastq files, add the -c (--correct) option. The default cutoff length (-l num) is 25.
+The output files are (1) infile.discard, (2) infile.single, (3) infile.summary.txt, and (4) infile.summary.txt.pdf. If the input is two paired fastq files, add the -c (--correct) option. The default cutoff length (-l num) is 25.
   $ SolexaQA++ lengthsort -l 85 -c infile_1.fastq infile_2.fastq
+  ...
+  Paired reads were written to:
+  /path-to/infile_1.fastq.clean
+  /path-to/infile_2.fastq.clean
+  ...
+  Writing files...
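-A message will say that reads that satisfy the length cutoff and are still paired are written to infile_(1,2).fastq.clean. But then what is infile_(1,2).fastq.paired? It differs only very slightly from the *.clean files. The numbers recorded in infile_1.fastq.clean.summary.txt(.pdf) match those of the .paired files.
+Judging only from the on-screen messages, one would think that the two *clean files contain the clean reads that passed the length cutoff. However, comparing infile_1.fastq and infile_1.fastq.clean with diff shows that they are identical files (the same holds for _2)! It is puzzling why an effectively identical file is rewritten under the name .clean, and why the run-time message is so confusing. The reads that actually pass the length cutoff and are still paired are written to infile_1.paired and infile_2.paired, and the paired/single/discard read counts are saved in infile_1.clean.summary.txt(.pdf).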
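The paired/single/discard classification described here can be illustrated with a small awk sketch. It is not SolexaQA++ code: the function name lengthsort_pairs_sketch and the output file names built from the prefix are made up for illustration, and the two inputs are assumed to hold 4-line records in the same order with the same number of reads.

```shell
# Sketch (not SolexaQA++ code): sort paired reads into paired/single/discard by length.
# $1: minimum length, $2: mate-1 FASTQ, $3: mate-2 FASTQ, $4: output prefix (hypothetical naming).
lengthsort_pairs_sketch() {
    : > "${4}_1.paired"; : > "${4}_2.paired"; : > "${4}.single"; : > "${4}.discard"
    awk -v min="$1" -v mate2="$3" -v prefix="$4" '
        function emit(rec, file,    i) { for (i = 1; i <= 4; i++) print rec[i] > file }
        BEGIN {
            p1 = prefix "_1.paired"; p2 = prefix "_2.paired"
            single = prefix ".single"; discard = prefix ".discard"
        }
        { r1[(NR - 1) % 4 + 1] = $0 }              # collect the mate-1 record
        NR % 4 == 0 {
            for (i = 1; i <= 4; i++) {             # read the matching mate-2 record
                if ((getline line < mate2) > 0) r2[i] = line
            }
            ok1 = (length(r1[2]) >= min)
            ok2 = (length(r2[2]) >= min)
            if (ok1 && ok2) {                      # both mates pass: keep as a pair
                emit(r1, p1); emit(r2, p2)
            } else {                               # otherwise a passing mate becomes a single
                emit(r1, ok1 ? single : discard)
                emit(r2, ok2 ? single : discard)
            }
        }
    ' "$2"
}
```

Usage: lengthsort_pairs_sketch 85 infile_1.fastq infile_2.fastq out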
 ===== Other useful utilities =====
-  * [[http://hannonlab.cshl.edu/fastx_toolkit/|FASTX-Toolkit]]
+  * [[http://hannonlab.cshl.edu/fastx_toolkit/|FASTX-Toolkit]] - a classic FASTQ/A file processing utility that needs no further introduction. The version is still 0.0.14.
+  * [[https://github.com/lh3/seqtk|seqtk]]: toolkit for processing sequences in FASTA/Q formats
+  * [[http://bioinf.shenwei.me/seqkit/|seqkit]]: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
   * [[https://github.com/najoshi/sickle|sickle]] - a windowed adaptive trimming tool for FASTQ files using quality
-  * [[http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php|cmpfastq]] - a simple perl program that allows the user to compare QC filtered fastq files
+  * [[http://compbio.brc.iop.kcl.ac.uk/software/cmpfastq.php|cmpfastq]] - a simple perl program that allows the user to compare QC filtered fastq files. It is the most basic tool for re-pairing files. However, its way of parsing read IDs may not work well on recent MiSeq reads ([[http://seqanswers.com/forums/showthread.php?t=24032|Problems with cmpfastq, can't process my fastq /1 and /2 files]]). For this case, there was a [[http://seqanswers.com/forums/showpost.php?p=141460&postcount=45|suggestion]] to use repair.sh from the [[https://sourceforge.net/projects/bbmap/|BBMap]] package.
+  * [[http://jgi.doe.gov/data-and-tools/bbtools/|BBTools]] by Brian Bushnell (JGI) - perhaps every answer is to be found here.
+  * trimmomatic
+  * khmer