Dave Winer writes about relying on titles for de-duping items across feeds:
“You can do a decent job of figuring out if you’ve seen an item before and not show it to the user if you look at the title of the story….It ain’t perfect, but then neither is anything else in the world.”
Dave, I understand where you’re coming from – as an aggregator developer, I’ve certainly seen enough to know that many feeds are far from perfect, and it’s up to developers such as myself to handle imperfect feeds without involving the end user. But I disagree that title is reliable enough to use for de-duping. If you’re subscribed to a lot of blog or news feeds, then yes, this likely works well enough (for the exact reasons you stated). But there are many other feed sources out there, some of which use the same title over and over again (and they often don’t use guids, so you have to rely on various combinations of link to determine uniqueness).
For example, I’m subscribed to a bug report feed in which all edits to the original bug come through with the same title. I’m also subscribed to a feed from a web-based forum in which all replies to a specific forum post have the same title prefixed with “RE:”. We’ll see many more feeds like this as RSS continues to break out of the blog/media world, so I don’t think title alone is good enough for de-duping.
In other words, not only is the world less than perfect, it’s even more less than perfect than we thought ;)
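To make the point concrete, here is a minimal Python sketch of per-feed de-duping that prefers the guid and only falls back to other fields when it is missing. It is an illustration under stated assumptions, not how FeedDemon or any other aggregator actually works: the `Item` class, `item_key`, `unseen`, and the example URLs are all hypothetical names invented for this example, and the fallback order (guid, then link plus title, then title plus description) is just one plausible choice.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Item:
    """Hypothetical, simplified feed item (not a real parser's type)."""
    title: str = ""
    link: str = ""
    guid: Optional[str] = None
    description: str = ""


def item_key(feed_url: str, item: Item) -> str:
    """Build an identity key for an item, scoped to a single feed."""
    if item.guid:
        # Best case: the publisher supplied an explicit identifier.
        basis = ("guid", item.guid)
    elif item.link:
        # Assumed fallback: link plus title.
        basis = ("link+title", item.link, item.title)
    else:
        # Last resort when there is neither a guid nor a link.
        basis = ("title+description", item.title, item.description)
    raw = "\x1f".join((feed_url,) + basis)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def unseen(feed_url: str, items: Iterable[Item], seen: Set[str]) -> List[Item]:
    """Return items whose key is not in `seen`, recording new keys as it goes."""
    fresh = []
    for item in items:
        key = item_key(feed_url, item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh


# Two edits to the same bug: identical title and link, different guids.
bug_feed = "http://example.com/bugs.xml"
edits = [
    Item(title="Bug 123", link="http://example.com/bug/123", guid="bug-123-edit-1"),
    Item(title="Bug 123", link="http://example.com/bug/123", guid="bug-123-edit-2"),
]
seen_keys: Set[str] = set()
first_pass = unseen(bug_feed, edits, seen_keys)   # both edits are kept
second_pass = unseen(bug_feed, edits, seen_keys)  # nothing new on a re-fetch
```

A title-only check would have dropped the second edit here. Note the limitation, though: if that bug feed had no guids and every edit shared the same link, this fallback would collapse the edits too, which is exactly why the right combination has to be judged feed by feed.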
16 thoughts on “Why the Title Doesn’t Work”
I wasn’t talking about all contexts for all sets of feeds, but rather about items within a single publication, the NY Times, where it works. It also worked for the BBC.
Having written my own share of syndication software I have to completely concur with Nick.
Feeds really need a guid/id to help consuming software avoid potential duplication issues. The title, link, and pubdate can even fail if, say for instance, the feed has no timestamp or the feed publisher reorgs their site and changes the link.
Ah – sorry, Dave, I missed that. So I guess we’re in agreement that relying on title for de-duping only works for a certain subset of feeds.
I’m a happy FeedDemon customer, but I would love an option on each feed to suppress items that have the same title and the same date. I have lots of problems with duplicate items in feeds.
Michael, could you share the URLs of a couple of these feeds?
Nick, it’s happened on quite a few feeds. I did report it to your support once before. I don’t think the problem is with FeedDemon though. Often I can see that an item has been changed by the author and appears as a duplicate. It hasn’t happened this week except for one feed, but they had moved to WordPress and the items all had new dates.
I must say that it seems to be happening less and less lately, so hopefully the feeds are improving.
I use RSS a lot; however, I have never set up an RSS feed (other than automagically with software), so I know nothing technical about RSS feeds. Forgive me if my question is a little pointless :)
Doesn’t each item in an RSS feed have a unique ID, so that if you get the same item from 2 or more sources it is not listed again because you have already seen it? This is how things work with usenet posts, and it works pretty well IMHO. A number postfixed with the domain name of the original feed or something would work fine, eg firstname.lastname@example.org for this post then email@example.com for the next, etc.
Would this not work with RSS? As I said, I have very little knowledge; however, usenet has many more posts per day than each site has new items posted to its RSS feed (even Scoble can’t post that much heh).
Or does something like this already exist but it doesn’t work? Or it does work if people use it but nobody uses it :)
Your point is *very* well taken. More than half of my common feeds wouldn’t work with Winer’s “why the title works” solution. If it were just news, I wouldn’t mind a few false positives. But for other feeds I’d sure hate to miss an item.
Morgan, items in RSS 2.0 and Atom feeds are *supposed* to have unique IDs, but they often don’t (and RSS 0.9x feeds never do).
I subscribe to a blog where the author posts regular “Quote of the Day” (that’s the title) posts.
I just had the problem again so I thought I’d post the feed:
The post “Human Behavior” appeared as an unread item. I’m 100% sure all items in the feed were marked as read. I checked and there was no duplicate of the item in FeedDemon. Normally when I have this problem there is a duplicate of the item, but about 10% of the time it seems to just have “forgotten” that the item was read.
Nick: You are seeing Atom feeds without an id? I can see where many (most?) RSS feeds do not have a guid, but Atom has made id an explicit requirement from the beginning. I ask because it would surprise me to learn that a significant number of Atom feeds exist without one.
I just had exactly the same problem on the same feed. This time the post “Silent H” became unread, and there was no duplicate item.
Timothy, I have seen some Atom feeds without IDs, but admittedly very few.
I have the same problem here: the title is not enough, even when the publications all come from dpa (Deutsche Presse Agentur). A lot of German newspapers rely on the dpa messages, but it is not straightforward to get unique titles and RSS items from their messages, because they are corrected now and then and sometimes deleted. So it can happen that the RSS feed is generated and the original posting is dropped, which makes it very hard to determine whether a message is the same. Checking the title is really not enough for newspapers.
So the idea isn’t necessarily to use in RSS 2.0, rather we have to determine uniqueness on a per-feed basis? That’s the strongest justification for using Atom I’ve heard in a long while…