Feedability: Node.js Feed Proxy With Readability

This is something I’ve planned to do for a long time, so I’m really happy that I was finally able to do it.

Commercial news sites have a common (and annoying) habit of publishing only short article excerpts (if anything at all) in their feeds. Although I can totally understand why they do this (to increase page impressions and ad revenue), it is really distracting when you read with a feed reader and need to open a web browser every time you want to read the complete article. This gets even worse in situations where you have no Internet connection available. Another possible scenario is that you’ve waited too long and the article has already been unpublished. And yes, the German public broadcasting stations are actually forced (by the German publisher lobby) to “unpublish” (“depublizieren”) their content after some days. These are (for me at least) totally valid points that justify the effort to do something about it.

It is of course not a new problem; as I said, I’ve planned to do something about it for quite some time now. One of the first solutions I came up with was to use site-specific regular expressions to extract the content of articles and build a new feed (including the extracted full article text). I’m not the only one who thought of this solution: there are, for example, some Snownews/Liferea/Newsbeuter filter scripts available that do exactly that. This works, but it would be better to have a more generic solution than specifying and maintaining regular expressions (or XPath, for that matter) for all the news sites and (“commercial”) blogs I read.
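As a rough illustration of the regular-expression approach, here is a minimal sketch. The hostnames, the rule table `siteRules` and the function `extractArticle` are all made up for this example; they are not taken from any of the filter scripts mentioned above.

```javascript
// Hypothetical per-site rules: each hostname maps to a regular expression
// whose first capture group is assumed to contain the article body.
const siteRules = new Map([
  ["news.example.com", /<div class="article-body">([\s\S]*?)<\/div>/],
  ["blog.example.org", /<article[^>]*>([\s\S]*?)<\/article>/],
]);

// Return the extracted article HTML, or null when the site has no rule
// or the rule does not match the page.
function extractArticle(hostname, html) {
  const rule = siteRules.get(hostname);
  if (!rule) return null;
  const match = rule.exec(html);
  return match ? match[1].trim() : null;
}
```

The catch is visible right away: a nested `<div>` inside the article body would cut the match short, and every redesign of a site means fixing its rule by hand.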

A more powerful approach to extracting the articles would be to use content extraction or template detection algorithms. I wrote an article (in German) about that when I played with some of these algorithms a while back. But I couldn’t find a suitable implementation that was developed and stable enough to do this, and I wasn’t really crazy about writing one of my own either.

Then, in 2009, came arc90’s Readability, which implements a mature content extraction algorithm in client-side JavaScript. It is not perfect, but I guess it is by far the best open source solution available for this right now. One problem in particular that I’ve noticed is comment sections below articles: sometimes the comments include more text than the actual article, which can confuse Readability into thinking that the comments are the actual main content. So although it works most of the time, you should expect problems like this. The first application that I’m aware of that uses Readability for feeds is the Apple feed reader “Reeder”, which can fetch the full text for selected articles.

There are some attempts to port Readability to other languages, but I’ve never seen a complete reimplementation. A few days ago I stumbled upon a Readability Node.js library written by Arrix Zhou that uses just a slightly modified version of the original. Since I’d planned to learn Node.js anyway (like most people, I’ve only written client-side JavaScript before), I used the opportunity to write Feedability:

Feedability is written in JavaScript using the V8 engine and Node.js. It requires the node-readability and node-expat libraries, which can be installed using npm. Feedability implements a small HTTP server; you simply append the URL of the feed you want to read to the server address (for instance http://127.0.0.1:1912/http://example.com/atom.xml). The Node.js server will download the feed and parse it for item links (article links); it will also remove any existing content excerpts. Then it crawls all articles and uses Readability to extract the content of the received pages. The original feed is extended with the full text of the articles and sent to the user (the feed reader software). Feedability also supports filters based on jQuery selectors.
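To make the URL scheme a bit more concrete: in the example above, the feed address follows the first slash after the proxy's host and port. The following is only a sketch of how such a request could be decoded, not the code Feedability actually uses; the function name `extractFeedUrl` and the restriction to http/https are my assumptions.

```javascript
// Hypothetical helper: given the raw request URL as a Node.js HTTP server
// sees it (e.g. "/http://example.com/atom.xml"), return the feed URL,
// or null if the request does not carry a usable one.
function extractFeedUrl(requestUrl) {
  // Drop the leading slash(es) that separate the proxy address from the feed URL.
  const candidate = requestUrl.replace(/^\/+/, "");
  try {
    const parsed = new URL(candidate); // throws on anything that is not an absolute URL
    // Assumption: only http and https feeds are worth fetching.
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      return null;
    }
    return candidate;
  } catch (err) {
    return null;
  }
}
```

Inside an `http.createServer` callback this would be called with `request.url`; a request such as `/favicon.ico` yields `null` and can be answered with an error status instead of being fetched.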

I’ve tested it with Atom, RSS 1.0 and RSS 2.0 feeds, but there are some known bugs; for instance, the character encoding sometimes breaks. As I said, this is my first Node.js application, and there are some parts that I’m particularly unhappy with, for example the current feed parser/generator based on expat (lib/feed.js). Maybe I’m going to rewrite that sometime.
