Friday 26 October 2012

Doing RSS right (2) - including content

In addition to the issues I described in 'Doing RSS right', there's another problem with RSS feeds, though at least this one doesn't apply to Atom.

The problem is that there's nothing in RSS to say if the various blocks of text are allowed to contain markup, and if so which. Apparently (see here):
"Userland's RSS reader—generally considered as the reference implementation—did not originally filter out HTML markup from feeds. As a result, publishers began placing HTML markup into the titles and descriptions of items in their RSS feeds. This behavior has become expected of readers, to the point of becoming a de facto standard"
This isn't just difficult, it's unresolvable. If you find

<strong>Boo!</strong>

in feed data you simply can't know if the author intended it as an example of HTML markup, in which case you should escape the brackets before including them in your page, or as an emphasised 'Boo!', in which case you are probably expected to include the data as it stands.
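To make the ambiguity concrete, here is a minimal Python sketch of the two readings. The variable names are mine, purely for illustration; the point is that both outputs are plausible and nothing in the feed says which one the author wanted.

```python
from html import escape

feed_text = "<strong>Boo!</strong>"

# Reading 1: the author meant literal markup (an example of HTML),
# so the brackets are escaped before the text goes into the page.
as_example = escape(feed_text)

# Reading 2: the author meant emphasised text,
# so the string would be inserted unchanged.
as_markup = feed_text

print(as_example)  # &lt;strong&gt;Boo!&lt;/strong&gt;
print(as_markup)   # <strong>Boo!</strong>
```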

And if you are expected to include the data as it stands you have the added problem that including HTML authored by third parties in your pages is dangerous. If they get their HTML wrong they could wreck the layout of your page (think missing close tag) and, worse, they could inject JavaScript into your pages or open you up to cross-site scripting attacks by others. As I wrote here and here, if you let other people add any content to your pages then you are essentially giving them editing rights to the entire page, and perhaps the entire site.

However, given how things are, unless you know from agreements or documentation that a feed will only ever contain text, you are going to have to assume that the content includes HTML. Stripping out all the tags would be fairly easy, but probably isn't going to be useful because it will turn the text into nonsense: think of a post that includes a list.
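As a rough illustration of why blanket tag stripping disappoints, the following Python sketch removes every tag from a short post containing a list. The TagStripper class and the sample post are invented for this example; it uses only the standard library's html.parser.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text between tags; every tag is discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    stripper = TagStripper()
    stripper.feed(fragment)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)


post = "<p>To do:</p><ul><li>buy milk</li><li>feed cat</li></ul>"
print(strip_tags(post))  # To do:buy milkfeed cat
```

The list items run together because the structure that separated them lived entirely in the tags.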

The only safe way to deal with this is to parse the content and then only allow that subset of HTML tags and/or attributes that you believe to be safe. Don't fall for the trap of trying to filter out only what you consider to be dangerous because that's almost impossible to get right, and don't let all attributes through because they can be dangerous too - consider <a href="javascript:...">.
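The allowlist idea can be sketched in Python with the standard library's html.parser. Treat this as an illustration of the approach rather than something to deploy: the tag, attribute and scheme lists are assumptions of mine, and a maintained library is still the better choice. Everything not explicitly allowed is dropped, all text is re-escaped, and links survive only with an absolute http or https URL.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical allowlists for illustration. Void elements such as <br> are
# left out because this sketch closes every tag it opens.
ALLOWED_TAGS = {"b", "strong", "i", "em", "a", "p", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href"}}
ALLOWED_SCHEMES = {"http", "https"}
SKIP_CONTENT = {"script", "style"}  # drop these elements and their text


def safe_href(value):
    """Return True only for absolute http(s) URLs."""
    try:
        return urlsplit(value.strip()).scheme in ALLOWED_SCHEMES
    except ValueError:
        return False


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags vanish; their text is still kept
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            if name == "href" and not safe_href(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag in self.open_tags:
            # Close everything opened since this tag, so output stays balanced.
            while self.open_tags:
                closing = self.open_tags.pop()
                self.out.append(f"</{closing}>")
                if closing == tag:
                    break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))

    def result(self):
        self.close()  # flush buffered text
        while self.open_tags:  # repair missing close tags
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(fragment):
    sanitizer = AllowlistSanitizer()
    sanitizer.feed(fragment)
    return sanitizer.result()


dirty = ('<p onclick="x()">Hi <a href="javascript:alert(1)">bad</a> '
         '<a href="https://example.org/">ok</a>'
         '<script>alert(1)</script><b>unclosed')
print(sanitize(dirty))
```

Note that the sketch builds its output from scratch rather than deleting pieces of the input, which is what makes it an allowlist: anything it doesn't recognise never reaches the page.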

What should you let through? Well, that's hard to say. Most of the in-line elements, like <b>, <strong>, <a> (carefully), etc. will probably be needed. Also at least some block-level stuff - <p>, <div>, <ul>, <ol>, etc. And note that you will have to think carefully about the character encoding both of the RSS feed and the page you are substituting it into, otherwise you might not realise that +ADw-script+AD4- could be dangerous (hint: take a look at UTF-7).
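The UTF-7 hint can be checked in a couple of lines of Python. The payload below is the string from the paragraph above; the variable name is mine. A filter that looks for angle brackets in the raw bytes finds none, yet anything that interprets the same bytes as UTF-7 sees a script tag.

```python
payload = b"+ADw-script+AD4-"

# Read as ASCII, the bytes contain no angle brackets at all.
print(payload.decode("ascii"))  # +ADw-script+AD4-
print(b"<" in payload)          # False

# Read as UTF-7, the same bytes decode to an opening script tag.
print(payload.decode("utf-7"))  # <script>
```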

If at all possible, I'd suggest not doing this yourself and using a reputable library for the purpose instead. Selecting such a library is left as an exercise for the reader.

See also Doing RSS right (3) - character encoding.
