John Chamberlain
Developer Diary · You Heard It Here First · 16 January 2018
The Wayback Machine Saves JohnChamberlain.com
Back in 2006, my web site went off the web. Although I might say I lost interest in blogging, the real reason was a hard drive crash. I had the whole web site on the hard drive of my main computer and no backup. I lost a lot of data on that drive, including my web site. I took it to a data recovery specialist. Those guys are all the same: they take your $500 and then give some excuse why they cannot rescue your data. In my case the guy said, "Is it encrypted?" No, it is not encrypted, you are just incompetent.

Recently, I tried to restore an SSD from work and the same thing happened. First I sent it to Drive Savers, one of the top data recovery services, and they returned it saying that the drive is internally encrypted by design and they have no way to recover the key from the microcontroller. Hmm, ok, so the drive manufacturer deliberately makes it impossible to recover your data from the SSD. I then tried another company. I explained exactly what Drive Savers had told me, and they assured me that they could recover data from the drive, so I sent it to them. Two weeks later I got the same story: sorry, we cannot recover your data. What a waste of time.

By the way, if you think SSDs cannot fail because they have no moving parts, think again. SSDs actually fail sooner than disk drives.

I gave up my web site for lost, but years later it occurred to me: the Wayback Machine!!! It saves all the web pages on the Internet. I went there, and sure enough, there it was, nearly all the pages and images. I manually downloaded each page because I did not want to miss anything, which might happen if I used an automated system. After mucho clicking I had the 200 pages or so that constituted my old web site. Unfortunately, I had to do it twice, because it turns out that if you "Save Web Page, Complete" in Firefox, the browser inexplicably inserts newlines to paginate the text. So, I had to download all 200 pages again using "Save Web Page, HTML only." Then the real work began.

[Image: Wayback Machine]
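
As an aside, the Wayback Machine can apparently serve a snapshot without any of its rewriting if you append "id_" to the timestamp in the URL, the same way it uses "im_" for images. For anyone who knows their page list up front, the downloading could be scripted. A rough sketch with wget (the timestamp and page names here are illustrative, not the exact ones from my site):

for page in index.html diary_001.html diary_002.html; do
    wget -O "$page" "http://web.archive.org/web/20060508190551id_/http://johnchamberlain.com/$page"
done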

Like most content reflectors, the Wayback Machine rewrites all the URLs in its pages and adds all kinds of JavaScript garbage to the page: ads, analytics, headers, footers, beacons. You name it, they add it. Removing all this stuff took a lot of search-and-replace scripts. The average techie would do this with sed, but that would have been a mistake. The problem with sed and its ilk is that, like most unix tools, it is stream-oriented (Stream EDitor, get it?). If you have search patterns that span multiple lines or arbitrary regions of text, sed is very awkward. Perl is the same story: once you have a multi-line problem, things get complicated. The right approach is to use Vim. Vim's regex interpreter is more sophisticated than sed's (believe it or not) and can handle multi-line patterns easily. Vim can also handle many files easily through its command-line variant, ex. The resulting commands look like this:

find -type f -name "*.html" -exec ex -sc '%s%http://web.archive.org/web/200.........../http:%http:%g' -cx {} \;
find -type f -name "diary*.html" -exec ex -sc '%s%/web/200...........im_/http://johnchamberlain.com:80/dot_clear.gif%/graphics/dot_clear.gif%g' -cx {} \;
find -type f -name "diary*.html" -exec ex -sc '%s%/web/20060508190551/http://johnchamberlain.com/diary_%diary_%g' -cx {} \;
find -type f -name "diary*.html" -exec ex -sc '%g%color="dimgray" size="2">Developer Diary History%d' -cx {} \;
find -type f -name "diary*.html" -exec ex -sc '%g%cell/PDA: %d' -cx {} \;
find -type f -name "diary*.html" -exec ex -sc '%s/\n//g' -cx {} \;
find -type f -name "diary*.html" -exec ex -sc '%s//\r/g' -cx {} \;
find -type f -name "*.html" -exec ex -sc '%s/<div style="position:fixed\_.* //g' -cx {} \;
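
To make the first command concrete: the Wayback Machine turns every link on the page into an archive link, so the script has to turn

<a href="http://web.archive.org/web/20060508190551/http://johnchamberlain.com/diary_001.html">

back into

<a href="http://johnchamberlain.com/diary_001.html">

(the page name here is illustrative). The % signs are just the substitution delimiter, chosen so the slashes in the URL do not need escaping, and the eleven dots match the rest of the 14-digit snapshot timestamp. In each command, ex -s runs silently, the first -c performs the substitution, and the final -cx writes the file and exits.
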
Imagine writing about 100 such commands, which is what it took. In the last command above, the \_.* is the magic pattern that will match any number of characters, even across multiple lines. You have to be careful when using it, because it will match the longest possible text, so if the terminal part of the pattern occurs multiple times (like at the end of the file), it will match all the way to the last occurrence and just delete everything. That happened to me once, but luckily I had backups of all the files. Eventually, I restored everything to a working state. Thank you, Wayback Machine!
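
Vim does have a guard against this: \{-} is its non-greedy quantifier, so \_.\{-} matches across lines but stops at the first occurrence of whatever follows it. A safer version of that deletion, assuming the offending divs actually end in a closing tag, would look something like:

find -type f -name "*.html" -exec ex -sc '%s/<div style="position:fixed\_.\{-}<\/div>//g' -cx {} \;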

It feels good to write "Revised 16 January 2018".
