XPath Scripting Magic
Recently, a co-worker wanted to grab all of the Dia mailing list archives and do something with them on his local machine. What was he doing? Who cares; the way he accomplished it was nice.
The first part of the solution, which is the cool part (after that it's nothing too fancy), is that he downloaded the mailing list archive page, converted it to XHTML, then printed out the href attribute of every anchor tag whose href contains "gz" (the gzipped archives). Man, that sounds like a lot of work. Here's the line:
wget -O - \
http://mail.gnome.org/archives/dia-list/index.html \
|tidy -q -asxml \
|xmlstarlet sel \
-N x="http://www.w3.org/1999/xhtml" -t \
-m "//x:a[contains(@href, 'gz')]" -v @href -n
Now, you probably need to go get HTML Tidy and XMLStarlet, but after that you should be good to go. You can sudo apt-get install xmlstarlet tidy on Ubuntu, and there are Windows versions of everything available, including a port of wget.
I probably don't need to explain this to anyone, but I will anyway (skip to the next paragraph if you haven't stopped reading already). wget -O - http://some.url grabs a page and dumps the contents to stdout. tidy -q -asxml turns that page into XHTML. The xmlstarlet sel command is where the magic happens. -N x="http://www.w3.org/1999/xhtml" binds the prefix x to the XHTML namespace, which tidy -asxml puts on every element; XPath 1.0 can't match namespaced elements without a prefix, so a plain //a would find nothing. -t starts an output template, -m "//x:a[contains(@href, 'gz')]" matches an XPath expression (notice the re-appearance of our x: prefix), and -v @href -n prints the value of an XPath expression followed by a newline. You could also use the "-o" flag to output a string literal in there somewhere if you're into that sort of thing.
The output looks like this:
2006-May.txt.gz
2006-April.txt.gz
2006-March.txt.gz
2006-February.txt.gz
2006-January.txt.gz
...
2001-July.txt.gz
2001-June.txt.gz
2001-May.txt.gz
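If you want full URLs instead of bare file names, the -o flag can glue a prefix onto each match. Here's a minimal sketch, assuming the hrefs are relative to the directory the index page lives in (the base value is an assumption read off the index URL, not something the page confirms). It runs against a tiny made-up XHTML sample so you can try it without the network:

```shell
#!/bin/sh
# Assumed base URL: the directory of the archive index page.
base="http://mail.gnome.org/archives/dia-list/"

# Made-up stand-in for tidy's XHTML output, so the example runs offline.
sample='<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="2006-May.txt.gz">Gzipped text</a>
<a href="2006-May/thread.html">Thread</a>
</body></html>'

# Read XHTML on stdin; print base + href for every anchor whose href contains "gz".
list_urls() {
  xmlstarlet sel \
    -N x="http://www.w3.org/1999/xhtml" -t \
    -m "//x:a[contains(@href, 'gz')]" -o "$base" -v @href -n
}

if command -v xmlstarlet >/dev/null 2>&1; then
  printf '%s\n' "$sample" | list_urls
else
  echo "xmlstarlet not found; install it first" >&2
fi
```

Swap the sample for the wget and tidy stages of the original pipeline and you should get one absolute URL per line, which you could then pipe into wget -i - to fetch them all.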
This sort of stuff seems like it could prove useful at some point in my future. Of course, that's just one way to solve it. In fact, this strikes me as the kind of thing that would make a good take-home interview question for technical problem-solving skills. It'd be interesting to see how different people do it, especially since you don't need to put any technology restriction on the solution and, with a little tweaking of the problem, it may be difficult to simply search for an answer on Google.
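For comparison, here's a rough sketch of a quick-and-dirty alternative that skips tidy and XPath entirely and just pattern-matches the raw HTML. It assumes double-quoted, lowercase href attributes and a grep that supports -o (GNU and BSD grep do). The sample line is made up for illustration, and extract_gz is a hypothetical helper name:

```shell
#!/bin/sh
# Made-up fragment of an index page, so the example runs offline.
sample='<a href="2006-May.txt.gz">Gzipped text</a> <a href="2006-May/thread.html">Thread</a>'

# Read HTML on stdin; print the value of every double-quoted href containing "gz".
extract_gz() {
  grep -o 'href="[^"]*gz[^"]*"' | sed 's/^href="//; s/"$//'
}

printf '%s\n' "$sample" | extract_gz
```

This is shorter, but it falls over on single-quoted or unquoted attributes, which is exactly the mess that running the page through tidy first lets the XPath version ignore.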