This is something I’ve been planning to do for a long time, so I’m really happy that I was finally able to do it.
There is a common (and annoying) habit among commercial news sites of publishing only short article excerpts (if anything at all) in their feeds. I can understand why they do this (to increase page impressions and ad revenue), but it is really distracting when you read with a feed reader and have to open a web browser every time you want to read the complete article. It gets even worse when you have no Internet connection available. Another possible scenario is that you wait too long and the article has already been unpublished; the German public broadcasting stations are actually forced (by the German publisher lobby) to “unpublish” (“depublizieren“) their content after a few days. These are, for me at least, perfectly valid reasons that justify the effort to do something about it.
It is of course not a new problem; as I said, I’ve been planning to do something about it for quite some time. One of the first solutions I came up with was to use hand-written regular expressions to extract the content of articles and build a new feed (including the extracted full article text). I’m not the only one who thought of this solution: there are, for example, some Snownews/Liferea/Newsbeuter filter scripts available that do exactly that. This works, but it would be better to have a more generic solution than specifying and maintaining regular expressions (or XPath expressions, for that matter) for all the news sites and (“commercial”) blogs I read.
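To illustrate the site-specific approach those filter scripts take, here is a minimal sketch of extracting an article body with a hand-written regular expression. The HTML snippet and the `article` class name are made up for the example; a real script would need one such pattern per site, which is exactly the maintenance burden described above.

```javascript
// Sketch: site-specific article extraction via a regular expression.
// The markup and the "article" class name are hypothetical examples.
const html =
  '<html><body><div class="article">Full text here.</div></body></html>';

// [\s\S]*? matches across newlines (JS "." does not), non-greedily,
// so we stop at the first closing </div>.
const match = html.match(/<div class="article">([\s\S]*?)<\/div>/);
const body = match ? match[1] : null;

console.log(body); // "Full text here."
```

This breaks as soon as the site changes its markup, which is why a generic solution is preferable.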
A more powerful approach to extracting the articles is to use content extraction or template detection algorithms. I wrote an article (in German) about that when I played with some of these algorithms a while back. But I couldn’t find an implementation that was mature and stable enough for this, and I wasn’t really crazy about writing one of my own either.
To use Feedability, you prefix the feed URL with the address of the local server, for instance: http://127.0.0.1:1912/http://example.com/atom.xml. The Node.js server downloads the feed and parses it for item links (article links); it also removes any existing content excerpts. Then it crawls all articles and uses readability to extract the content of the received pages. The original feed is extended with the full text of the articles and sent to the user (the feed reader software). Feedability also supports filters based on jQuery selectors.
I’ve tested it with Atom, RSS 1.0, and RSS 2.0 feeds, but there are some known bugs; for instance, the character encoding sometimes breaks. As I said, this is my first Node.js application, and there are some parts I’m particularly unhappy with, for example the current feed parser/generator based on expat (lib/feed.js); maybe I’ll rewrite that sometime.