K-mer analysis software

khmer

SGA preqc

KAT - The K-mer Analysis Toolkit

General information

Paper: KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics (2016) PubMed
GitHub Documentation

Installation

KAT was installed with bioconda. On CentOS 6.x, the gnuplot installed through bioconda fails with a somewhat awkward error.

gnuplot: error while loading shared libraries: libjpeg.so.8: cannot open shared object file: No such file or directory

It is probably better to put /usr/local/bin (gnuplot version 5.0 patchlevel 3) at the very front of $PATH and run KAT that way. That gnuplot may, however, reject 'set terminal png large size 1024,1024' as a syntax error. The PNG image is still produced.

gnuplot> set terminal png large size 1024,1024
                          ^
         line 0: unrecognized terminal option

Usage

Viewing help

$ kat
$ kat gcp
$ kat filter seq

Hist mode

Creates a histogram file of distinct k-mers (.hist) and a spectra hist plot (.hist.png).

$ kat hist AH10_149ng_1.fastq  AH10_149ng_2.fastq
$ ls
AH10_149ng_1.fastq AH10_149ng_2.fastq kat.hist kat.hist.png
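
To make the idea of a k-mer histogram concrete, here is a minimal Python sketch. It is not KAT's implementation: the reads, k = 3, and the function names are made up for illustration, and reverse complements are ignored for simplicity. It counts every k-mer in a few toy reads and then tallies how many distinct k-mers occur at each multiplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (no canonicalization)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def spectrum(kmer_counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())


# Toy reads, purely illustrative.
reads = ["ACGTAC", "CGTACGT", "GGGA"]
kmer_counts = count_kmers(reads, k=3)
hist = spectrum(kmer_counts)

for multiplicity in sorted(hist):
    print(multiplicity, hist[multiplicity])
```

With real sequencing data the low-multiplicity end of such a table is typically dominated by k-mers arising from errors or contamination, which is what the filtering steps later on this page target.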

GCP mode

Computes the GC content of distinct k-mers and creates a matrix (.mx) and a density plot (.mx.png).

$ kat gcp AH10_149ng_1.fastq AH10_149ng_2.fastq
$ ls
AH10_149ng_1.fastq AH10_149ng_2.fastq kat-gcp.mx kat-gcp.mx.png
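
The following Python sketch illustrates what a GC-versus-count matrix could look like conceptually. It assumes that GC content is measured as the number of G or C bases in each k-mer, which is suggested by the integer bounds such as --low_gc=2 --high_gc=18 used further down this page but is not confirmed here. The k-mer counts and names are invented, and this is not how KAT stores its .mx file.

```python
from collections import Counter


def gc_count(kmer):
    """Number of G or C bases in a k-mer (assumed GC measure)."""
    return sum(base in "GC" for base in kmer)


# Toy k-mer counts, purely illustrative.
kmer_counts = {"ACG": 2, "CGT": 3, "GGG": 1, "TTA": 1}

# Cell (gc, count) holds the number of distinct k-mers with that
# GC count and that multiplicity.
matrix = Counter()
for kmer, count in kmer_counts.items():
    matrix[(gc_count(kmer), count)] += 1

for (gc, count), n in sorted(matrix.items()):
    print(f"GC={gc} count={count} distinct_kmers={n}")
```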

Comparing k-mer count hashes

(To be written)

Filtering

K-mer filtering

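This mode works on k-mers, not reads: it extracts the k-mers themselves (as a k-mer hash) that fall inside, or outside, user-defined bounds. By default it outputs the k-mers between the low and high bounds; with the -i [ --invert ] option it extracts those outside the bounds instead. Even then, the output hash is still named kat.filter.kmer-in.jf27. Using -s [ --separate ] together with -i writes the hash tables for inside and outside the bounds as separate files. Be careful here: in that case the k-mer hash that falls inside the originally intended bounds is the one named out!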

$ kat filter kmer --low_count=200 --high_count=500 --low_gc=2 --high_gc=18 AH10_149ng_1.fastq AH10_149ng_2.fastq
$ ls
AH10_149ng_1.fastq AH10_149ng_2.fastq kat.filter.kmer-in.jf27
$ kat filter kmer --low_count=200 --high_count=500 --low_gc=2 --high_gc=18 -i -s AH10_149ng_1.fastq AH10_149ng_2.fastq
$ ls
AH10_149ng_1.fastq  kat-gcp.mx      kat.filter.kmer-in.jf27   kat.hist
AH10_149ng_2.fastq  kat-gcp.mx.png  kat.filter.kmer-out.jf27  kat.hist.png
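
The naming pitfall described above can be sketched in a few lines of Python. This is an illustration of the behavior as described on this page, not KAT code: the function names, the toy counts, and the choice of inclusive bounds are all assumptions. The first returned table plays the role of the "-in" file and the second the "-out" file.

```python
def gc_count(kmer):
    """Number of G or C bases in a k-mer (assumed GC measure)."""
    return sum(base in "GC" for base in kmer)


def in_bounds(kmer, count, low_count, high_count, low_gc, high_gc):
    """True if both the count and the GC count lie within the bounds
    (inclusive bounds are an assumption)."""
    return (low_count <= count <= high_count
            and low_gc <= gc_count(kmer) <= high_gc)


def filter_kmers(kmer_counts, invert=False, **bounds):
    """Split k-mers into ("in" table, "out" table).

    Without invert, the "in" table holds k-mers inside the bounds.
    With invert, the roles flip: "in" holds k-mers outside the bounds.
    """
    table_in, table_out = {}, {}
    for kmer, count in kmer_counts.items():
        selected = in_bounds(kmer, count, **bounds) != invert
        (table_in if selected else table_out)[kmer] = count
    return table_in, table_out


# Toy data, purely illustrative.
toy = {"ACG": 250, "CGT": 600, "GGG": 300, "TTA": 10}
bounds = dict(low_count=200, high_count=500, low_gc=1, high_gc=2)

kmer_in, kmer_out = filter_kmers(toy, **bounds)
inv_in, inv_out = filter_kmers(toy, invert=True, **bounds)
print("default  in:", sorted(kmer_in), "out:", sorted(kmer_out))
print("inverted in:", sorted(inv_in), "out:", sorted(inv_out))
```

In the inverted call, the k-mer that actually satisfies the bounds ends up in the "out" table, which mirrors the warning above.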

Sequence filtering

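This mode applies the filter to sequences (reads). It is used to remove contamination, to extract contaminated reads, or to extract high-coverage regions. In khmer you supply an abundance value, whereas kat filter seq consults the k-mer hash file given as an argument and filters the sequences that contain those k-mers. The -i and -s options are available here as well. Another difference: when khmer meets a k-mer to be removed while scanning a read, it trims the rest of the read from that point on, whereas kat filter seq keeps whole reads that contain the given k-mers at or above a certain level (set with -T arg). In my opinion this is rather unusual behavior.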

Filter sequences based on whether those sequences contain specific k-mers.

The user loads a k-mer hash and then filters sequences (either in or out) depending on whether those sequences contain the k-mer or not. The user can also apply a threshold requiring X% of k-mers to be in the sequence before filtering is applied.
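
As a rough illustration of threshold-based read filtering, the Python sketch below keeps a read when the fraction of its k-mers found in a given k-mer set reaches a threshold, and optionally inverts that decision. How KAT defines the threshold internally is not verified here; the fraction-of-k-mers definition, the function names, and the toy reads are assumptions.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def matching_fraction(seq, kmer_set, k):
    """Fraction of the read's k-mers that are present in kmer_set."""
    read_kmers = kmers(seq, k)
    if not read_kmers:
        return 0.0
    return sum(km in kmer_set for km in read_kmers) / len(read_kmers)


def filter_reads(reads, kmer_set, k, threshold=0.1, invert=False):
    """Keep reads at or above the threshold, or the others if invert is set."""
    kept = []
    for read in reads:
        hit = matching_fraction(read, kmer_set, k) >= threshold
        if hit != invert:
            kept.append(read)
    return kept


# Toy data, purely illustrative.
target_kmers = {"AAA"}
reads = ["AAAAC", "CCGTT", "AAACGTCGTCGTCG"]

print(filter_reads(reads, target_kmers, k=3))
print(filter_reads(reads, target_kmers, k=3, invert=True))
```

The third read contains the target k-mer once among its 12 k-mers, so it falls below the 0.1 threshold and is treated like a read without the k-mer.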

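Therefore, to remove contamination, that is, reads carrying low-abundance k-mers, the procedure would be as follows.

1. Run kat filter kmer -i to obtain the hash of k-mers that fall outside the specified range.
2. Run kat filter seq -i, because the reads to keep are the ones that do not contain the k-mers obtained in the previous step.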
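
The two steps above can be mimicked on toy data with the Python sketch below. It only illustrates the logic (select rare k-mers, then keep the reads that lack them) and makes no claim about KAT's internals; K, LOW_COUNT, THRESHOLD, and the reads are invented values.

```python
from collections import Counter

K = 3            # toy k-mer size (the page uses 27-mers)
LOW_COUNT = 2    # k-mers seen fewer times than this are "rare"
THRESHOLD = 0.1  # fraction of rare k-mers that marks a read as contaminated


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Three copies of a "genuine" read plus one low-coverage contaminant.
reads = ["ACGTACGT", "GGCATTCA", "ACGTACGT", "ACGTACGT"]

# Step 1: collect k-mers outside the accepted count range (too rare).
counts = Counter(km for read in reads for km in kmers(read))
rare = {km for km, c in counts.items() if c < LOW_COUNT}

# Step 2: keep only reads that do NOT carry rare k-mers above the threshold.
clean = []
for read in reads:
    read_kmers = kmers(read)
    fraction = sum(km in rare for km in read_kmers) / len(read_kmers)
    if fraction < THRESHOLD:
        clean.append(read)

print(clean)
```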

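The following example first builds the hash of k-mers with a count below 50, then treats reads that contain those k-mers at or above a given fraction as contamination and removes them. For a 300 bp MiSeq read, a single 27-mer covers 27/300 = 9% of the read. The default threshold value of 0.1 should also be fine. Does the last step need an extra option, or not? Think it over carefully!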

$ kat filter kmer --low_count=50 -i AH10_149ng_1.fastq AH10_149ng_2.fastq
$ ls
AH10_149ng_1.fastq  AH10_149ng_2.fastq  kat.filter.kmer-in.jf27
$ kat filter seq --threshold 0.1 AH10_149ng_1.fastq AH10_149ng_2.fastq kat.filter.kmer-in.jf27
$ ls kat.filter*fastq
kat.filter.kmer.in.R1.fastq  kat.filter.kmer.in.R2.fastq
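
The arithmetic behind the threshold choice can be checked with a short Python calculation. It shows two ways of expressing "one 27-mer in a 300 bp read": as a share of the bases, which is the 27/300 figure used above, and as a share of the read's k-mers. Which of these the --threshold option corresponds to is not established here, so treat the comparison as an assumption-laden aid rather than a statement about KAT.

```python
read_length = 300  # MiSeq read length used in the text
k = 27             # k-mer size used in the text

# Share of the read's bases covered by a single k-mer.
base_fraction = k / read_length

# A read of length L contains L - k + 1 overlapping k-mers.
kmers_per_read = read_length - k + 1

# Share of the read's k-mers represented by a single k-mer.
kmer_fraction = 1 / kmers_per_read

print(f"bases covered by one {k}-mer: {base_fraction:.1%}")
print(f"k-mers per read: {kmers_per_read}")
print(f"one k-mer as a share of all k-mers in the read: {kmer_fraction:.2%}")
```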

Genome Informatics Laboratory at KRIBB
