Published in Blogging on Monday, March 14th, 2005
For items that aren't worthy of a full weblog post but are worth sharing, more and more people are turning to a small two- or three-sentence post that links to something interesting on the web. This has proven to be a great way to quickly share ideas and opinions, but are these types of posts more likely to lose their value as the months and years pass?
These types of posts, typically called link blogs, provide a method for sharing a link and a short opinion or idea. They are a great way to store resources and to get the word out on a hot new topic. They differ from a full (traditional?) weblog post in that the focus or purpose of the post is what lies at the other end of the link; that is, the meat isn't in the post itself but in the resource being linked to. So what happens if a resource goes missing?
We all know that URIs are supposed to be permanent, but let's face it: servers fail, data gets lost, and sometimes we just don't have 20/20 foresight. As a result, that useful link you posted in your link blog is now a path to nowhere.
Where a traditional blog post would normally have something interesting to offer in and of itself, a link-blog style post is much more likely to become useless if its target resource goes missing.
When I was writing the script used on this site, I had in the back of my mind what Anne wrote some time ago about the perfect blogging solution. In that post he mentions:
All links should be stored in a separate database table. Referenced to from within the post.
- This is needed so links can be easily checked. (If you find errors in the following, please contact me and I will update it accordingly.)
- If the link returns a 200, leave it. (You might want to check if the title is updated though.)
- If the link returns a 301 (permanent redirect), it can be updated. (The old URI could be stored in a separate database field temporarily so that the end user can see what changes are made.)
- If the link returns a 410, it can be removed immediately. (No need to check again, do inform the end user.)
- If the link returns a 404, it should be checked each day one time 10 days long and after that by the end user to see if the file is really missing.
- All other error codes should get the same treatment as 404.
- All other status codes should get the same treatment as 200.
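The rules in Anne's list can be read as a small decision table. Here is a minimal sketch of that reading in Python; the function name, the action labels, and the `failures_so_far` counter are my own assumptions for illustration and are not part of Anne's post. "Error codes" is taken to mean any 4xx or 5xx status.

```python
def decide_action(status, failures_so_far=0, max_daily_checks=10):
    """Map an HTTP status code to an action, following the list above.

    `failures_so_far` is a hypothetical counter of earlier failed daily
    checks for this link; the stored link record would need to keep it.
    """
    if status == 301:
        # Permanent redirect: the stored URI can be updated.
        return "update"
    if status == 410:
        # Gone: remove immediately and inform the end user.
        return "remove"
    if status >= 400:
        # 404 and every other error code: retry once a day for ten days,
        # then hand the decision over to the end user.
        if failures_so_far + 1 >= max_daily_checks:
            return "ask_user"
        return "retry_tomorrow"
    # 200 and every other non-error status: leave the link alone.
    return "keep"
```

A checker would call this once per link per run, store the returned action, and increment the failure counter whenever the result is `retry_tomorrow`.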
Anne may have been outlining his perfect system, but this type of thing could be implemented on existing systems using a little PHP. Anyone interested in link validation could take a look at John Coggeshall's article from October, 2001 entitled A PHP Link Validation Script.
There's enough in that article to put something decent together, though given how people are building their link blogs by simply making smaller posts in traditional weblog software (and not storing the links separately, as Anne mentioned) some regex may come in handy for anyone who wants to do this.
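As a rough illustration of the regex idea, here is a sketch in Python (rather than PHP) that pulls absolute http and https links out of a post body. The pattern and the function name are assumptions of mine, and a regular expression will miss unusual markup, so treat it as a starting point rather than a robust HTML parser.

```python
import re

# Matches the href value of an <a> tag, quoted with either ' or ".
HREF_PATTERN = re.compile(
    r"""<a\s[^>]*?href\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def extract_links(post_html):
    """Return the absolute http(s) links in a post, in order, without duplicates."""
    links = []
    for match in HREF_PATTERN.finditer(post_html):
        url = match.group(2).strip()
        # Skip fragments, mailto: links, and relative paths.
        if url.lower().startswith(("http://", "https://")) and url not in links:
            links.append(url)
    return links
```

Running this over each stored post yields the list of URIs that the validation script then needs to check.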
So assuming that we have a decent link checking script up that we can run periodically to validate our links, what do we do when it comes across a broken link?
There are a few services out there that could help provide a user with an alternative location for your original link, should the real page go missing.
Google's cache works well if a page is temporarily down (example), and if the link is to a blog post, reverting to a search on Technorati for the post in question may allow the user to find the information they are interested in, albeit from another source that linked to the missing URI (example).
By implementing a few extra automated functions along with the link validator, a person could offer some alternative links and a short explanation as to what happened:
The last time I tested this link, the page was not available (404). Go ahead and try it if you like, otherwise you may find it in Google's cache, or perhaps you can find the information that you are after by checking other sites that link to it.
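A notice like this could be generated automatically. The sketch below is one way to do it in Python, under some loud assumptions: the two URL templates are mine, not taken from any service's documentation, and an archive lookup and a plain web search stand in for the cache and "who links to this" services mentioned above. Swap in whatever services you actually rely on.

```python
from urllib.parse import quote_plus

# Assumed URL patterns, used for illustration only.
ARCHIVE_TEMPLATE = "https://web.archive.org/web/{url}"
SEARCH_TEMPLATE = "https://www.google.com/search?q={query}"


def fallback_links(url):
    """Build alternative locations to offer when `url` is unavailable."""
    return {
        "original": url,
        "archive": ARCHIVE_TEMPLATE.format(url=url),
        # Search for the exact URI to find other pages that mention it.
        "search": SEARCH_TEMPLATE.format(query=quote_plus('"' + url + '"')),
    }


def broken_link_notice(url, status):
    """Return a short plain-text explanation with the alternative links."""
    links = fallback_links(url)
    return (
        f"The last time I tested this link, the page was not available ({status}). "
        f"Go ahead and try it if you like: {links['original']}. "
        f"Otherwise you may find an archived copy at {links['archive']}, "
        f"or find other sites that mention it via {links['search']}."
    )
```

The link validator would call `broken_link_notice` for any link it has marked as failing and show the result alongside the original post.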
Comments and Feedback
When I'm looking for pages that seem to have vanished, another place to look is in the Wayback Machine at web.archive.org. And thanks for linking to that article on link validation -- I need to write up link validation for my bookmark system I use. Also, I'm going down Anne's list for my own software, and that's been an unchecked item so far.
I think the biggest problem with link rot ultimately lies within web server design. Comparing the HTTP specification in the abstract with how it is actually implemented in web servers leaves a lot to be desired. Because of this, maintaining links is left to the diligence of the application (e.g. weblog, forum) writer, who likely has a fairly good knowledge of HTML but much less of HTTP. Making a "proper" web application (one that includes, say, 410 responses) is hard and requires an intimate knowledge of HTTP, URIs, and their abstractions and semantic characteristics. To make matters worse, not only do web servers provide no help in this area, but the tools (ASP.Net, PHP) people use to make these applications offer no support either, and in many cases (especially ASP.Net) mimic traditional web server behavior.
Free cache is a possible option to cache the content of your intended link. A favelet can also be used: http://mathibus.com/archives/2005/02/13/free-cache-favelet/
Don't forget that your link checking script should honor each target site's robots.txt rules, and that many sites ban robots altogether. A link checking script that I've written in the past merely marks a link as gone in this case, but I'd love to hear better ideas.
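For anyone wanting to try this, here is a small sketch of a robots.txt check using Python's standard `urllib.robotparser` module. The function name and the `link-checker` user agent string are made up for illustration, and fetching the robots.txt file itself is left out so the example stays self-contained.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="link-checker"):
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

When this returns False, the checker cannot tell whether the page still exists, so recording the link as "unchecked" rather than "gone" may be the safer label.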
I claim that setting up "The last time I tested this link, the page was not available [...]" is no option since it ain't really helpful for the user, I fear.
Next, link rot is definitely "unlovely", but checking, looking for alternative locations (if a link doesn't work anymore) and, if necessary, removing them must do the trick - so Anne's proposal seems perfectly reasonable.
Alas, link posts do have a problem.
Hey all, thanks for the input.
I had forgotten about the Wayback machine, and Mathibus' cache bookmarklet (and that page caching option).
Jens, why do you say that it isn't an option?
Simply no longer using the link, in the case of a link blog, will result in having to remove the resource.
By providing an optional source of information, at the bare minimum you are helping the user on their way to perhaps finding what they are after, and you can keep that resource alive. By making it clear to the user what has happened, they can choose whether to continue searching for the resource or not.
> Jens, why do you say that it isn't an option?
Well, a user won't want to search for anything he suspects to be already there, if you know what I mean. I fear that this is a real problem - and really, if somebody only posts a link, the entire post depends on that link. If it doesn't work, your post "doesn't work", either.
I am still having issues getting to that page.
Hey Chris, what page are you referring to?