Resources for Chapter 7 -- Unix Data Tools

Bentley's Programming Pearls column with Donald Knuth and Doug McIlroy is available on the ACM website. One particularly excellent quote from McIlroy that I didn't include is:

Knuth has shown us here how to program intelligibly, but not wisely. I buy the discipline. I do not buy the result. He has fashioned a sort of industrial-strength Fabergé egg -- intricate, wonderfully worked, refined beyond all ordinary desires, a museum piece from the start.

Regular Expressions

xkcd 1171

There are numerous good resources on regular expressions, but one of the best ways to tackle a regular expression problem is with an interactive regular expression debugger.

A note about the Mus_musculus.GRCm38.75_chr1.bed file

There's a subtle (yet exceedingly common in bioinformatics) off-by-one error in this file (sorry!). I thought about fixing it during the book editing process, but I actually think it's a valuable lesson, so I'll keep it in there. Can you find out what it is?

If you need a hint: read Chapter 9's section on genomic range formats and use shasum to compare the test.txt file to Mus_musculus.GRCm38.75_chr1.bed.
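For example, comparing checksums is a one-liner (this assumes both files are in your working directory):

$ shasum test.txt Mus_musculus.GRCm38.75_chr1.bed

If the two hashes match, the files are identical; if they don't, a tool like diff can show you exactly which lines differ.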

Dealing with a Variable Number of Spaces

Plaintext data that uses a variable number of spaces to delimit columns looks clean in the terminal but can be a nightmare to parse. Still, some programs will occasionally output data this way (usually to provide easily readable output to users). However, data in this format will not work with Unix data tools like cut; it first needs to be converted to tab-delimited (or CSV). Using the tool sed, this is quite easy:

$ sed 's/  */	/g' badly_formatted.txt > tab_delimited.txt

Note that the pattern is a space followed by " *" (a space and then zero or more additional spaces), so it matches runs of one or more spaces, and that the replacement character between the last pair of slashes is a literal tab (you can enter this in your shell by pressing Control-v and then the tab key). However, this will introduce a slew of problems if your columns themselves have spaces in them (which is common in data). This is why tab-delimited and CSV formats are preferable to variable-space formats.
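An alternative approach (not from the book, just a common awk idiom) is to let awk do the splitting: by default it treats any run of whitespace as a field separator, and reassigning a field forces the record to be rebuilt using the output field separator:

$ awk -v OFS="\t" '{$1=$1; print}' badly_formatted.txt > tab_delimited.txt

The same caveat applies here: if the column values themselves contain spaces, they will be split apart.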

Grep Tricks

Here's an interesting grep trick that can make it 50 times faster.
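One widely shared speedup of this sort (which may or may not be the exact trick referenced) is to bypass locale-aware matching by forcing the C locale, often combined with fixed-string matching via -F. The file name below is just a hypothetical example:

$ LC_ALL=C grep -F "AGATCGGAAGAGC" reads.fastq > adapter_hits.txt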

Parsing GTF Group Column with Awk/Bioawk vs Python

This can be quite messy... consider:

  bioawk -c gff '$3 ~ /gene/ && $2 ~ /protein_coding/ \
      {split($group, a, "; ");                        \
      print $seqname,$end-$start, a[1]}' Mus_musculus.GRCm38.75_chr1.gtf | \
      sed -e 's/gene_id //' -e 's/"//g'

This assumes that gene_id will always be the first key/value pair in the group column (a safe assumption for well-formatted GTFs).
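As a quick follow-up (my own example, not from the book), the output is plain whitespace-delimited text, so it chains nicely with other Unix tools; for instance, sorting on the length column shows the longest protein-coding genes on this chromosome:

  bioawk -c gff '$3 ~ /gene/ && $2 ~ /protein_coding/ \
      {split($group, a, "; ");                        \
      print $seqname,$end-$start, a[1]}' Mus_musculus.GRCm38.75_chr1.gtf | \
      sed -e 's/gene_id //' -e 's/"//g' | sort -k2,2nr | head -n 5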

At some point I may take the time to do this fully in Python. For example, the last column can easily be turned into a dictionary with:

group = 'gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";'
def parse_keyvals(x):
    # split a single 'key "value"' entry on the space and strip the quotes from the value
    key, val = x.split(" ")
    return (key, val.replace('"', ''))

# drop the trailing ";", split the entries on "; ", and build a dictionary
keyvals = dict([parse_keyvals(x) for x in group.strip(";").split("; ")])

# keyvals:
# {'gene_source': 'havana', 'gene_biotype': 'pseudogene', 'gene_name': 'Gm16088', 'gene_id': 'ENSMUSG00000090025'}
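As a rough sketch (my own addition, not from the book), the same function can be applied line by line to pull gene_id out of every gene feature in the GTF, reusing parse_keyvals from above:

with open("Mus_musculus.GRCm38.75_chr1.gtf") as gtf:
    for line in gtf:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if fields[2] != "gene":
            continue  # keep only gene features
        # the ninth column holds the 'key "value";' attribute pairs
        keyvals = dict(parse_keyvals(x) for x in fields[8].strip().rstrip(";").split("; "))
        print(keyvals["gene_id"])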

Conclusion

If you're interested in where the "one feverish night" quote came from, it's from a document by Doug McIlroy.