Earlier this week I wrote about sanitizing CSS, and I’ve been thinking about it a bit more. Like many RSS aggregators, for security and presentation reasons the current version of FeedDemon strips all inline styles before displaying a feed, and I thought this was the best approach. But after seeing the Wikipedia feed that Sam Ruby pointed me to, I’m rethinking that.
Just so you know what I’m talking about, take a look at the screenshots linked below. The first shows the Wikipedia feed in FeedDemon with all inline styles removed, while the second shows the same feed with styles intact:
As you can see, the feed is far more useful with the styles intact. So rather than blindly strip all inline styles, the next version of FeedDemon will use a "whitelist" of allowed CSS properties and values. FeedDemon’s whitelist will be based on the same rules that Bloglines uses, as outlined in the Sanitization Rules wiki. However, I may make FeedDemon’s whitelist even stricter, since I’m not convinced that it’s wise to enable things like background images and CSS cursors in feed content.
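To make the whitelist idea concrete, here is a minimal sketch in Python. The property list, the color pattern, and the names `sanitize_style` and `is_allowed` are all assumptions made up for illustration; they are not the Bloglines rules or FeedDemon's actual code. Splitting on semicolons is also deliberately naive, and a real parser would need to cope with quoted and escaped values.

```python
import re

# Hypothetical whitelist for illustration only; these are NOT the actual
# Bloglines or FeedDemon rules. Each property maps to a check on its value.
KEYWORDS = {
    "font-weight": {"normal", "bold"},
    "font-style": {"normal", "italic"},
    "text-align": {"left", "right", "center", "justify"},
    "text-decoration": {"none", "underline", "line-through"},
}
COLOR_PROPERTIES = {"color", "background-color"}
# Accepts #rgb, #rrggbb, or a bare keyword such as "red"; rejects url(...) etc.
COLOR_RE = re.compile(r"^(#[0-9a-f]{3}|#[0-9a-f]{6}|[a-z]+)$")


def is_allowed(prop, value):
    """Return True only if the property is known and its value passes the check."""
    value = value.lower()
    if prop in KEYWORDS:
        return value in KEYWORDS[prop]
    if prop in COLOR_PROPERTIES:
        return bool(COLOR_RE.match(value))
    return False  # anything not explicitly whitelisted is dropped


def sanitize_style(style):
    """Keep only whitelisted declarations from an inline style string."""
    kept = []
    for declaration in style.split(";"):  # naive split, see note above
        prop, colon, value = declaration.partition(":")
        prop, value = prop.strip().lower(), value.strip()
        if colon and is_allowed(prop, value):
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)
```

With this sketch, `sanitize_style("color: red; background-image: url(http://example.com/x.png); font-weight: bold; cursor: crosshair")` returns `"color: red; font-weight: bold"`: background images and cursors are dropped simply because they are not on the list.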
At this point, you might be wondering why RSS aggregators need to bother whitelisting inline styles – why not just leave all the inline styles intact? Beyond the security issues, one problem is that some people will use things like excessively large font sizes to make their posts stand out. Other people will deliberately insert "prank" CSS, like a page full of offensive images designed to ruin the reading experience.
These annoyances aren’t really a problem when the post is viewed by itself or within its feed – after all, if you subscribe to a feed that annoys you, you’ll simply unsubscribe from it. But it’s a different story when it’s combined with posts from other feeds in a "river of news" view, or in a search feed from Technorati or Google. The latter issue is the one that concerns me the most, since theoretically someone could ruin a ton of RSS search feeds by littering their blog with popular keywords, and then injecting some nasty CSS into the blog’s feed.
Luckily for me, I’ve already got a ton of CSS parsing code which I wrote for TopStyle, so it won’t be a big deal to add inline style whitelisting to FeedDemon. But if you’re an aggregator developer who’d also like to whitelist inline styles and you don’t have a background in CSS, you might appreciate a few tips I learned the hard way:
- Assuming valid CSS is an invalid assumption. Trust me: just like HTML and RSS, plenty of people use completely invalid CSS. Things like unclosed quotes and declarations without colons can trip up your parser if you assume that inline styles will be correctly written.
- Quotes can be escaped. Although you’ll rarely see it in practice, characters inside CSS values may be escaped with a backslash. The best-known use is the box model hack, which relies on escaped quotes to trick outdated browsers into ignoring specific styles (ex:
<p style="width:400px;voice-family: "\"}\"";voice-family:inherit;width:300px;">). In other words, your parser can’t assume that quotes always mark the start or end of a value.
- Quotes are optional, and single quotes are allowed. Although XHTML requires attribute values to be inside quotes, browsers don’t enforce this requirement. In addition, it’s fine to use single quotes instead of double quotes around values. So make sure your parser handles all three variants: unquoted, single-quoted, and double-quoted.
- Negative values are sometimes allowed. Unlike values for padding properties, values for margin properties can be negative, so don’t reject a length just because it starts with a minus sign.
- Pixels are often treated as the default length unit. One of the things I’m doing in FeedDemon is stripping excessively large font sizes (ex:
<p style="font-size: 800px">), which requires enforcing a max size based on the length unit. If you plan to do the same thing, keep in mind that when the length unit is missing, browsers may assume that pixels (px) were intended, even though a non-zero length without a unit isn’t valid CSS. So
<p style="font-size: 12"> may be rendered the same as
<p style="font-size: 12px">.
- Font sizes can get you into trouble. If you "flow" multiple posts in the same newspaper page (as FeedDemon does), you have to be careful that a font size declared in an unclosed tag in one post doesn’t affect subsequent posts. The problem gets worse with relative font sizes (ex:
<p style="font-size: smaller">), since improperly nested relative font sizes could result in a tiny single-pixel font size (or a huge font size when "larger" is used).
- Floats can also get you into trouble. If your aggregator uses a multi-column newspaper view, be careful that floated elements don’t overlap posts in adjacent columns (ex:
<img style="float:right" src="http://nick.typepad.com/images/basil.gif" />). And you might want to consider only permitting images to be floated, to avoid having floating DIVs, etc., causing problems.
- Strip class and id attributes. If your newspaper view relies on classes and/or ids to identify items in the page, I recommend removing class and id attributes from the actual posts – otherwise a post could use the same class names that you use in your newspaper, potentially creating all kinds of havoc.
- Remove top-level tags. Although they shouldn’t be there, I’ve seen some feeds that contain top-level tags such as BODY and HTML in their posts. Imagine the impact on your river of news if some prankish feed author inserts a styled BODY tag into their feed.
- If your aggregator embeds IE, get out of the local zone. This applies more to script than it does to CSS, but it bears repeating: if you’re embedding the WebBrowser object, don’t allow locally displayed content to operate in the local zone. If you’re not sure what I’m talking about, refer to my earlier post on this topic.
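The first few tips (invalid CSS, escaped quotes, single versus double quotes) can be sketched as a tolerant declaration splitter. This is only an illustration in Python, not FeedDemon's or TopStyle's parser: the function name `split_declarations` is hypothetical, it assumes the style attribute's value has already been extracted from the HTML, and silently dropping a declaration with an unclosed quote is my own conservative choice rather than something the CSS specification mandates.

```python
def split_declarations(style):
    """Split an inline style into (property, value) pairs, tolerating bad input."""
    declarations = []
    current = []
    quote = None  # the quote character we are inside of, if any
    i = 0
    while i < len(style):
        ch = style[i]
        if ch == "\\" and i + 1 < len(style):
            # A backslash escapes the next character, so an escaped quote
            # never opens or closes a string.
            current.append(style[i:i + 2])
            i += 2
            continue
        if quote:
            if ch == quote:
                quote = None  # closing quote
        elif ch in "\"'":
            quote = ch  # opening quote, single or double
        elif ch == ";":
            # Only a semicolon outside quotes ends a declaration.
            declarations.append("".join(current))
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    if quote is None:
        declarations.append("".join(current))
    # else: the last declaration has an unclosed quote, so it is dropped.

    pairs = []
    for declaration in declarations:
        prop, colon, value = declaration.partition(":")
        # Skip empty declarations and ones with no colon or no value.
        if colon and prop.strip() and value.strip():
            pairs.append((prop.strip().lower(), value.strip()))
    return pairs
```

Fed the box model hack from the tip above, this keeps the escaped-quote value in one piece instead of splitting inside it, and input such as `color red; margin: 0` yields only the `margin` declaration.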
In addition to the above tips, the W3C’s rules for handling CSS parsing errors may also be of help.
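The tips about negative values, missing length units, and oversized fonts could be combined into a single length check along these lines. Again, this is a hedged Python sketch: the per-unit maximums, the unit list, and the name `length_allowed` are assumptions chosen for illustration, not the limits FeedDemon actually uses. Keyword values such as "smaller" are rejected here on the assumption that they are handled by a separate check.

```python
import re

# Hypothetical per-unit font-size caps; FeedDemon's real limits are not
# given in this post.
MAX_FONT_SIZE = {"px": 40.0, "pt": 30.0, "em": 2.5, "%": 250.0}
KNOWN_UNITS = {"px", "pt", "pc", "em", "ex", "in", "cm", "mm", "%"}

# An optional sign, a number, then an optional unit (letters or "%").
LENGTH_RE = re.compile(r"^([+-]?(?:\d+\.?\d*|\.\d+))([a-z%]*)$")


def parse_length(value):
    """Return (number, unit), or None if the value is not a numeric length."""
    match = LENGTH_RE.match(value.strip().lower())
    if not match:
        return None
    # A missing unit is treated as pixels, mirroring lenient browsers.
    return float(match.group(1)), (match.group(2) or "px")


def length_allowed(prop, value):
    """Decide whether a numeric length is acceptable for the given property."""
    parsed = parse_length(value)
    if parsed is None:
        return False  # keywords like "smaller" need their own check
    number, unit = parsed
    if unit not in KNOWN_UNITS:
        return False
    if number < 0:
        # Margins may be negative; padding and font sizes may not.
        return prop.startswith("margin")
    if prop == "font-size":
        limit = MAX_FONT_SIZE.get(unit)
        return limit is not None and 0 < number <= limit
    return True
```

Under these assumed limits, `length_allowed("font-size", "800px")` is rejected, while `"12"` and `"12px"` are both accepted and treated identically, and `-10px` is accepted for `margin-left` but not for `padding-left`.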