chapter-10-sequence

Chapter 10: Sequence Data

Files

egfr_flank.fasta comes from Ensembl (see below).
contam.fastq is used as an example FASTQ file. It's the same file we used in chapter 7 for less examples.
Heng Li's readfq.py, downloaded from https://github.com/lh3/readfq/blob/master/readfq.py. See it's commit history, which is mentioned in the book: https://github.com/lh3/readfq/commits/master/readfq.py
untreated1_chr4.fq: a FASTQ file generated from the pasillaBamSubset file untreated1_chr4.bam.

egfr_flank.fasta File

The example FASTA file was downloaded from the Ensembl biomart interface. This example file contains the upstream flanking regions (only 200bp, so the example prints in a single page width) for the mouse Epidermal growth factor receptor (Egfr) gene. The biomart query in XML format is:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query  virtualSchemaName = "default" formatter = "FASTA" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
  <Dataset name = "mmusculus_gene_ensembl" interface = "default" >
    <Filter name = "upstream_flank" value = "200"/>
    <Filter name = "ensembl_gene_id" value = "ENSMUSG00000020122"/>
    <Attribute name = "ensembl_gene_id" />
    <Attribute name = "ensembl_transcript_id" />
    <Attribute name = "transcript_flank" />
  </Dataset>
</Query>

I've formatted the file with the following:

seqtk seq -l 60 mart_export.txt.gz > egfr_fank.fasta

Base Qualities

The quality line used in examples is:

JJJJJJJJJJJJGJJJJJIIJJJJJIGJJJJJIJJJJJJJIJIJJJJHHHHHFFFDFCCC

If you want to copy and paste it and try converting yourself.

Indexed FASTA Files

This example uses the mouse genome (Ensembl release 75) for an example. You can get this with:

$ wget ftp://ftp.ensembl.org/pub/release-75/fasta/mus_musculus/dna/Mus_musculus.GRCm38.75.dna.toplevel.fa.gz

untreated1_chr4.fq

Created a FASTQ file from:

$ samtools view untreated1_chr4.bam | \
 bioawk -c sam '{s=$seq; q=$qual; if(and($flag, 16)) {s=revcomp($seq);q=reverse($qual)} print "@"$qname"\n"s"\n+\n"q}' > untreated1_chr4.fq

CASAVA Manual

Much, much more information about the FASTQ format produced by Illumina's CASVA base calling and read processing pipeline can be found in the CASAVA manual.

FASTA Extraction Benchmarks

Note that extract.py suffers some limitations. First, this script is not robust to regions that are in the correct format -- we never explicitly check that the regular expression matches region. This is an easy fix; we could either check the object returned from re.match() is not None or wrap the statement in a try/except block. Second, this script does not do explicit bounds checking, meaning the user would not be warned if they tried to access a range outside the sequence length.

TODO seqtk comparison

Name	Name	Last commit message	Last commit date
parent directory ..
Mus_musculus.GRCm38.75.dna.chromosome.8.fa.gz	Mus_musculus.GRCm38.75.dna.chromosome.8.fa.gz	more material	Mar 1, 2015
README.md	README.md	added pre-indexed files	Aug 28, 2014
contam.fastq	contam.fastq	updated readme, ch10 files	Aug 21, 2014
egfr_flank.fasta	egfr_flank.fasta	updated readme, ch10 files	Aug 21, 2014
nuccount.py	nuccount.py	added nuccount.py	Feb 28, 2015
readfq.py	readfq.py	updated readme, ch10 files	Aug 21, 2014
untreated1_chr4.fq	untreated1_chr4.fq	updated readme, ch10 files	Aug 21, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Files

chapter-10-sequence

chapter-10-sequence

README.md

Chapter 10: Sequence Data

Files

egfr_flank.fasta File

Base Qualities

Indexed FASTA Files

untreated1_chr4.fq

CASAVA Manual

FASTA Extraction Benchmarks

Collapse file tree

Files

chapter-10-sequence

Directory actions

More options

Directory actions

More options

Latest commit

History

chapter-10-sequence

Folders and files

parent directory

README.md

Chapter 10: Sequence Data

Files

egfr_flank.fasta File

Base Qualities

Indexed FASTA Files

untreated1_chr4.fq

CASAVA Manual

FASTA Extraction Benchmarks