Bentley's Programming Pearls column with Donald Knuth and Doug McIlroy is available on the ACM website. One particularly excellent quote from McIlroy that I didn't include is:
Knuth has shown us here how to program intelligibly, but not wisely. I buy the discipline. I do not buy the result. He has fashioned a sort of industrial-strength Fabergé egg -- intricate, wonderfully worked, refined beyond all ordinary desires, a museum piece from the start.
There are numerous good resources on regular expressions, but one of the best ways to tackle regular expression problems is with an interactive debugger like:
There's a subtle (yet exceedingly common in bioinformatics) off-by-one error in this file (sorry!). I thought about fixing it during the book editing process, but I actually think it's a valuable lesson, so I'll keep it in there. Can you find out what it is?
If you need a hint: read chapter 9's section on genomic range formats and use shasum to compare the test.txt file to Mus_musculus.GRCm38.75_chr1.bed.
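If you'd rather do the checksum comparison from Python than with shasum, here is a minimal sketch. The helper names sha1_of_file and same_contents are my own for illustration, and it assumes both files sit in the current directory. It only tells you whether the files differ, not where, so the puzzle stays intact.

```python
import hashlib

def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_contents(path_a, path_b):
    """True if the two files have identical SHA-1 digests."""
    return sha1_of_file(path_a) == sha1_of_file(path_b)

# Usage (assumes both files exist in the working directory):
# same_contents("test.txt", "Mus_musculus.GRCm38.75_chr1.bed")
```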
Plaintext data that uses a variable number of spaces to delimit columns looks
clean in the terminal but can be a nightmare to parse. Still, some programs
will occasionally output data this way (usually to provide easily readable data
to users). However, data in this format will not work with Unix data tools
like cut; it first needs to be converted to tab-delimited (or CSV). Using the
tool sed this is quite easy:
$ sed 's/  */	/g' badly_formatted.txt > tab_delimited.txt
Note that the pattern is two spaces followed by * (so it matches one or more
spaces), and the character between the final two slashes is a literal tab (you
can enter this in your shell using control-v followed by tab). However, note
that this will introduce a slew of problems if your columns themselves have
spaces in them (which can be common in data). This is why tab-delimited and CSV
formats are preferable to variable spaces.
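To make the pitfall concrete, here is a small Python sketch of the same conversion. The function name spaces_to_tabs and the two rows are made up for illustration. The second row has a space inside its last column, so it ends up with one column too many.

```python
import re

def spaces_to_tabs(line):
    """Replace each run of one or more spaces with a single tab."""
    return re.sub(r" +", "\t", line.rstrip("\n"))

# Hypothetical rows, made up for illustration.
rows = [
    "chr1   100    200   geneA",
    "chr1   300    450   hypothetical protein",  # last column contains a space
]

converted = [spaces_to_tabs(row) for row in rows]
for line in converted:
    # The first row splits into 4 fields; the second splits into 5.
    print(len(line.split("\t")), line.split("\t"))
```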
Here's an interesting grep trick to make it 50 times faster.
This can be quite messy... consider:
bioawk -c gff '$3 ~ /gene/ && $2 ~ /protein_coding/ \
{split($group, a, "; "); \
print $seqname,$end-$start, a[1]}' Mus_musculus.GRCm38.75_chr1.gtf | \
sed -e 's/gene_id //' -e 's/"//g'
This assumes that the gene ID (the gene_id key) will always be the first entry in the group column (a safe assumption for well-formatted GTFs).
I may take the time to do this with Python. For example, the last column can easily be turned into a dictionary with:
group = 'gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";'
def parse_keyvals(x):
    key, val = x.split(" ")
    return (key, val.replace('"', ''))
keyvals = dict([parse_keyvals(x) for x in group.strip(";").split("; ")])
# keyvals:
# {'gene_source': 'havana', 'gene_biotype': 'pseudogene', 'gene_name': 'Gm16088', 'gene_id': 'ENSMUSG00000090025'}
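Putting the pieces together, a rough Python sketch of the whole bioawk task might look like the following. The function names and the example lines (IDs and coordinates) are made up for illustration. Like the bioawk command, it assumes the biotype sits in the second (source) column. Because it looks up gene_id by key, it does not depend on gene_id being the first entry in the group column.

```python
def parse_group(group):
    """Turn a GTF group/attribute string into a dict of key -> unquoted value."""
    items = [item.strip() for item in group.strip().strip(";").split(";")]
    pairs = (item.split(" ", 1) for item in items if item)
    return {key: val.strip('"') for key, val in pairs}

def protein_coding_gene_lengths(lines):
    """Yield (seqname, end - start, gene_id) for protein-coding gene rows."""
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        seqname, source, feature, start, end = cols[:5]
        # Substring checks mirror the awk regex matches on $3 and $2.
        if "gene" in feature and "protein_coding" in source:
            attrs = parse_group(cols[8])
            yield seqname, int(end) - int(start), attrs["gene_id"]

# Hypothetical GTF lines, made up for illustration.
example_lines = [
    '1\tprotein_coding\tgene\t1000\t5000\t.\t+\t.\tgene_name "A1"; gene_id "GENE_A"; gene_biotype "protein_coding";',
    '1\tpseudogene\tgene\t7000\t7500\t.\t+\t.\tgene_id "GENE_B"; gene_name "B1"; gene_biotype "pseudogene";',
    '1\tprotein_coding\ttranscript\t1000\t5000\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A";',
]

results = list(protein_coding_gene_lengths(example_lines))
print(results)  # [('1', 4000, 'GENE_A')]
```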
If you're interested in where the "one feverish night" quote came from, it's from a document by Doug McIlroy.