Differences

This shows you the differences between two versions of the page.

--- microbial_forensics [2021/10/05 10:54] – [분석과정(2) - distance matrix에서 확률 밀도 함수(pdf) 구하기] hyjeong
+++ microbial_forensics [2021/10/05 12:42] (current) – [plot_pdfs_from_distance_matrix.R 코드의 수정판] hyjeong
@@ Line 21: / Line 21: @@
   # index1 points out within source samples in matrix mat for the first group (only two groups allowed)
-가장 먼저 할 일은 selected_listeria_dist.csv 파일을 읽어들여서 mat 매트릭스에 저장하는 일입니다. 논문 혹은 R 코드에 설명에 의하면 facility1 또는 facility2에 둘 중 하나에 '확실하게' 속하는 것으로 정리를 해야 할 것 같습니다. 따라서 selected_listeria_dist.csv를 읽어들인 다음, 이것의 일부분만을 남겨야 합니다. 논문 4쪽의 오른쪽 단 **3. Results** //3.1 Tracing the source of a listerosis outbtreak to ice cream production facilities// 항목 두번째 단락을 살펴봅시다.
+가장 먼저 할 일은 selected_listeria_dist.csv 파일을 읽어들여서 mat 매트릭스에 저장하는 일입니다. 논문 혹은 R 코드에 설명에 의하면 facility1 또는 facility2에 둘 중 하나에 '확실하게' 속하는 것으로 정리를 해야 할 것 같습니다. 따라서 selected_listeria_dist.csv(148 균주)를 읽어들인 다음, 이것의 일부분만을 남겨야 합니다. 논문 4쪽의 오른쪽 단 **3. Results** //3.1 Tracing the source of a listerosis outbtreak to ice cream production facilities// 항목 두번째 단락을 살펴봅시다.
   The population defined to form probability density functions
-  (PDFs) under H<sub>m</sub> and Ha consisted of all strains originating from
+  (PDFs) under Hm and Ha consisted of all strains originating from
   facilities 1 and 2. All strains of other origins (i.e. environmental
-  strains and those isolated furing earlier disease outbreaks) were
+  strains and those isolated during earlier disease outbreaks) were
   excluded from the evidence calculations because those origins
-  were not included in the investigational hpothesis.
+  were not included in the investigated hypothesis.
+자, 그러면 selected_listeria_dist.csv 중에서 무엇을 배제하는 것이 좋을까요? metadata_listeria.txt 파일에서 첫번째 컬럼에 'Environment(또는 environment)'가 포함된 것은 배제하겠습니다. 그 다음으로는 첫번째 컬럼에 'Clinical'이 포함된 8개 균주의 포함 여부입니다. 이 균주는 논문 Table 1 'Evidential values and corresponding conclusions for the clinical isolates based on the Hamming genetic distance'에 나타낸 8개의 균주와 동일합니다. 이들은 test하려는 균주 자체인데, 이를 pdf plotting에 사용하는 것이 옳으냐의 기로에 서게 됩니다. 논문에서는 'earlier outbreak에서 유래한 것을 제외'했다 하였으니, 이는 포함시키는 것이 맞는 것 같습니다. 이러한 판단에 근거하여 148개 균주 중 분석에 포함시킬 것(111_accessions_all.txt)과 facility1 유래의 것(82_accessions_facility1.txt)을 확정하였습니다. 두 텍스트 파일을 압축한 것은 {{ :data_20211005.zip |여기}}를 클릭하여 다운로드하면 됩니다.
+===== plot_pdfs_from_distance_matrix.R 코드의 수정판 =====
+함수만 선언한 [[https://github.com/FOI-Bioinformatics/MicrobialForensics/blob/master/src/plot_pdfs_from_distance_matrix.R|원본 R 코드]]로는 plot이 잘 그려지지 않아서 약간의 수정을 가했습니다.
+  * 함수 선언 전까지의 9줄은 데이터 파일을 읽어서 처리하기 위한 것입니다.
+  * 함수의 default argument로 주어지는 lim 벡터는 원본(c(0.05,0.18))대로 하면 그림을 그릴 수 있는 범위를 벗어나기에 약간의 시행착오를 거쳐서 수정하였습니다.
+  * Plot 파일의 이름과 형식은 eps가 아닌 pdf로 수정하였습니다.
+  * ggplot()의 그림은 print() 함수를 실행하여 파일로 저장하도록 하였습니다. 원본 코드 그대로는 파일이 제대로 생기지 않았습니다. 아래의 코드를 사용하여 얻은 플롯을 첨부합니다({{ :df_dist.pdf |}}).
+  mat.org=read.table("selected_listeria_dist.csv",sep=",")
+  dim(mat.org) # 148 x 148
+  all = paste(scan("111_accessions.txt", character()), "fa", sep=".")
+  mat = mat.org[all, ]
+  mat = mat[, all]
+  dim(mat) # 111 x 111
+  index0 = rownames(mat)
+  f1=paste(scan("82_accessions_facility1.txt", character()), "fa", sep=".")
+  index1 = which(index0 %in% f1)
+  plot_pdfs_from_distance_matrix<-function(mat,index1,lim=c(-0.005, 0.025),alpha=0.5,filename="df_dist.pdf",filetype="pdf"){
+  require(ggplot2)
+  nr<-nrow(mat)
+  ntot<-(nr-1)*nr/2
+  index.tot<-seq(1,nr)
+  index2<-setdiff(index.tot,index1)
+  df<-data.frame(rep(0,ntot))
+  colnames(df)<-"sample_i"
+  df$sample_j<-rep(0,ntot)
+  df$Type<-"between"
+  df$Value<-rep(0,ntot)
+  k<-1
+  nr1<-nr-1
+  for(i in 1:nr1){
+    i1<-i+1
+    for(j in i1:nr){
+      if((i %in% index1 && j %in% index1) || (i %in% index2 && j %in% index2)){
+        df[k,1]<-i
+        df[k,2]<-j
+        df[k,3]<-"within"
+        df[k,4]<-mat[i,j]
+      }
+      else{
+        df[k,1]<-i
+        df[k,2]<-j
+        df[k,4]<-mat[i,j]
+      }
+      k<-k+1
+    }
+  }
+  df2<-df[,3:4]
+  if(is.null(filename)==FALSE){
+    if(filetype=="eps"){
+      postscript(filename)
+    }
+    else{
+      pdf(filename)
+    }
+  }
+  p = ggplot(df2,aes(x=Value, fill=Type)) + geom_density(alpha=alpha) +scale_x_continuous(limits = lim)
+  print(p)
+  if(is.null(filename)==FALSE){
+    dev.off()
+  }
+    return(df)
+  }
+  # 원본 R 코드는 바로 직전까지에 해당합니다. 실제 함수를 호출하기 위해 다음의 한 줄을 삽입했습니다.
+  dat = plot_pdfs_from_distance_matrix(mat, index1)
+===== 앞으로 해결할 문제 =====
+위의 코드를 실행하여 얻어진 분포(dat 데이터프레임으로 반환됨)를 kernel density estimation으로 처리하여 연속형 확률 분포와 유사하게 만드는 것이 바로 다음에 진행해야 할 일입니다. 구체적인 코드는 제시하지 않았습니다. 다음으로는 8개의 clinical sample에 대하여 다음의 likelihood ratio('evidential value')를 구하는 것입니다.
+{{ :lr.png?200 |}}
+이 수식으로 표현된 evidential value를 구하는 방법은 논문 5쪽 왼쪽단에 다음과 같이 기술하였습니다.
+  To calculate evidential values, the genetic distances from
+  sequences associated with facility 1 were calculated for the
+  sequences obtained from each patient, and the densities under
+  Hm and Ha were compared according to Eq. (4). For each distance,
+  a likelihood ratio was obtained from the PDFs and the average
+  likelihood ratio for each patient were reported. All the calculated
+  evidential values were strongly supportive or contradictive,
+  depending on the origin of each isolate (Table 1), and the
+  assignments of clinical samples to their respective sources were
+  fully consistent with the findings of Chen et al. [29].