overrep_kmer: Generate overrepresented kmers of length k based on their observed to expected ratio

Description

Generate overrepresented kmers of length k based on their observed to expected ratio at each position across all sequences in the dataset. The expected proportion of a length k kmer assumes site independence and is computed as the sum, over bases, of the count of each base in the kmer times the probability of observing that base in the data set, i.e. P(A) × count_in_kmer(A) + P(C) × count_in_kmer(C) + ... The observed to expected ratio is computed as log2(obs/exp). Kmers with obsexp_ratio > 2 (that is, observed at more than four times the expected proportion) are considered overrepresented and appear in the returned data frame along with their position in the read.
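The thresholding step described above can be illustrated with a minimal sketch in base R. It does not call the package. The observed and expected proportions (`obs`, `exp`) are hypothetical values chosen for illustration, and the output column names are assumed to match the Value section.

```r
# Hypothetical per-position proportions for three kmers (illustrative values only)
kmers <- data.frame(
  Position = c(1, 1, 2),
  Kmer     = c("AAAA", "ACGT", "GGGG"),
  obs      = c(0.0400, 0.0050, 0.0200),  # observed proportion at that position
  exp      = c(0.0040, 0.0040, 0.0040),  # expected proportion (taken as given here)
  stringsAsFactors = FALSE
)

# Log2 observed-to-expected ratio, as in the description
kmers$Obsexp_ratio <- log2(kmers$obs / kmers$exp)

# Keep kmers whose log2 ratio exceeds 2, i.e. observed more than 4x expected
overrep <- kmers[kmers$Obsexp_ratio > 2, c("Position", "Obsexp_ratio", "Kmer")]
print(overrep)
```

Because the ratio is on a log2 scale, the cutoff of 2 corresponds to a fourfold enrichment, so the kmer observed at only 1.25 times its expected proportion is dropped.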

Usage

overrep_kmer(infile, k, output_file = NA)

Arguments

| Argument | Description |
|---|---|
| infile | Path to gzipped FASTQ file |
| k | The kmer length |
| output_file | File to save plot to. Default NA. |

Value

A data frame with columns: Position (in read), Obsexp_ratio, and Kmer.

Examples

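```r
library(qckitfastq)

# Locate the example gzipped FASTQ file shipped with the package
infile <- system.file("extdata", "test.fq.gz", package = "qckitfastq")

# Find overrepresented kmers of length 4
overrep_kmer(infile, k = 4)
```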
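Once a result is returned, it can be explored like any other data frame. The sketch below uses a hypothetical stand-in (`res`) rather than real output, and assumes the column names listed under Value. It ranks kmers by their ratio and counts how many were flagged at each read position.

```r
# Hypothetical stand-in for the value returned by overrep_kmer()
res <- data.frame(
  Position     = c(3, 1, 3, 7),
  Obsexp_ratio = c(2.4, 3.1, 2.9, 2.2),
  Kmer         = c("ACGT", "AAAA", "TTTT", "GGCC"),
  stringsAsFactors = FALSE
)

# Rank kmers from most to least overrepresented
ranked <- res[order(res$Obsexp_ratio, decreasing = TRUE), ]

# Count how many overrepresented kmers were flagged at each read position
per_position <- table(ranked$Position)

print(head(ranked, 3))
print(per_position)
```

A cluster of flagged kmers at the same position can be a hint worth following up, for example when checking for adapter or primer contamination.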