Resources for Chapter 7 -- Unix Data Tools

Bentley's Programming Pearls column with Donald Knuth and Doug McIlroy is available on the ACM website. One particularly excellent quote from McIlroy that I didn't include is:

Knuth has shown us here how to program intelligibly, but not wisely. I buy the discipline. I do not buy the result. He has fashioned a sort of industrial-strength Fabergé egg -- intricate, wonderfully worked, refined beyond all ordinary desires, a museum piece from the start.

Regular Expressions

xkcd 1171

There are numerous good resources on regular expressions, but one of the best ways to tackle a regular expression problem is with an interactive regular expression debugger.

A note about the Mus_musculus.GRCm38.75_chr1.bed file

There's a subtle (yet exceedingly common in bioinformatics) off-by-one error in this file (sorry!). I thought about fixing it during the book editing process, but I actually think it's a valuable lesson, so I'll keep it in there. Can you find out what it is?

If you need a hint: read Chapter 9's section on genomic range formats and use shasum to compare the test.txt file to Mus_musculus.GRCm38.75_chr1.bed.
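For example, comparing checksums is a one-liner (this assumes both files are in your working directory):

$ shasum test.txt Mus_musculus.GRCm38.75_chr1.bed

If the two hashes match, the files are identical; if they don't, a tool like diff can show you exactly which lines differ.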

Dealing with a Variable Number of Spaces

Plaintext data that uses a variable number of spaces to delimit columns looks clean in the terminal but can be a nightmare to parse. Still, some programs will occasionally output data this way (usually to provide easily readable output to users). However, data in this format will not work with Unix data tools like cut; it first needs to be converted to tab-delimited (or CSV). Using the tool sed, this is quite easy:

$ sed 's/  */	/g' badly_formatted.txt > tab_delimited.txt

Note that the pattern is a space followed by " *" (a space and then zero or more additional spaces), so it matches runs of one or more spaces, and that the replacement character between the last pair of slashes is a literal tab (you can enter this in your shell by pressing Control-v and then the tab key). However, this will introduce a slew of problems if your columns themselves have spaces in them (which is common in data). This is why tab-delimited and CSV formats are preferable to variable-space formats.
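An alternative approach (not from the book, just a common awk idiom) is to let awk do the splitting: by default it treats any run of whitespace as a field separator, and reassigning a field forces the record to be rebuilt using the output field separator:

$ awk -v OFS="\t" '{$1=$1; print}' badly_formatted.txt > tab_delimited.txt

The same caveat applies here: if the column values themselves contain spaces, they will be split apart.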

Grep Tricks

Here's an interesting grep trick that can make it 50 times faster.
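One widely shared speedup of this sort (which may or may not be the exact trick referenced) is to bypass locale-aware matching by forcing the C locale, often combined with fixed-string matching via -F. The file name below is just a hypothetical example:

$ LC_ALL=C grep -F "AGATCGGAAGAGC" reads.fastq > adapter_hits.txt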

Parsing GTF Group Column with Awk/Bioawk vs Python

This can be quite messy... consider:

  bioawk -c gff '$3 ~ /gene/ && $2 ~ /protein_coding/ \
      {split($group, a, "; ");                        \
      print $seqname,$end-$start, a[1]}' Mus_musculus.GRCm38.75_chr1.gtf | \
      sed -e 's/gene_id //' -e 's/"//g'

This assumes that gene_id will always be the first key/value pair in the group column (a safe assumption for well-formatted GTFs).
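As a quick follow-up (my own example, not from the book), the output is plain whitespace-delimited text, so it chains nicely with other Unix tools; for instance, sorting on the length column shows the longest protein-coding genes on this chromosome:

  bioawk -c gff '$3 ~ /gene/ && $2 ~ /protein_coding/ \
      {split($group, a, "; ");                        \
      print $seqname,$end-$start, a[1]}' Mus_musculus.GRCm38.75_chr1.gtf | \
      sed -e 's/gene_id //' -e 's/"//g' | sort -k2,2nr | head -n 5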

At some point I may take the time to do this fully in Python. For example, the last column can easily be turned into a dictionary with:

group = 'gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";'
def parse_keyvals(x):
    # split a single 'key "value"' entry on the space and strip the quotes from the value
    key, val = x.split(" ")
    return (key, val.replace('"', ''))

# drop the trailing ";", split the entries on "; ", and build a dictionary
keyvals = dict([parse_keyvals(x) for x in group.strip(";").split("; ")])

# keyvals:
# {'gene_source': 'havana', 'gene_biotype': 'pseudogene', 'gene_name': 'Gm16088', 'gene_id': 'ENSMUSG00000090025'}
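As a rough sketch (my own addition, not from the book), the same function can be applied line by line to pull gene_id out of every gene feature in the GTF, reusing parse_keyvals from above:

with open("Mus_musculus.GRCm38.75_chr1.gtf") as gtf:
    for line in gtf:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if fields[2] != "gene":
            continue  # keep only gene features
        # the ninth column holds the 'key "value";' attribute pairs
        keyvals = dict(parse_keyvals(x) for x in fields[8].strip().rstrip(";").split("; "))
        print(keyvals["gene_id"])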

Conclusion

If you're interested in where the "one feverish night" quote came from, it's from a document by Doug McIlroy.