Web 3.0 Does Not Validate

This weekend much of the geekosphere was buzzing about the “Web 3.0” article in the NY Times, but from where I stand, Web 3.0 does not validate.

Apparently, Web 3.0 is the latest re-branding of the Semantic Web, an attempt to turn the Web of documents into a Web of data. Don’t get me wrong – the goals of the Semantic Web are good ones, and I believe many of those goals will be met in my lifetime. But too much of the Semantic Web relies on data being valid – that is, valid XML, XHTML, RDF, etc. – and too many of us will never publish valid data.

Unless the world comes up with a way to punish those who publish invalid data, invalid data will always exist. Yeah, companies like Google could be the punishers by refusing to index data that isn’t valid, but what are the chances of that happening? Google’s Web search is successful in part because it makes sense of the chaos of the invalid Web. Why mess with that formula?

If the Semantic Web hopes to exist, it’s going to have to deal with invalid HTML, badly-formed XML, and RSS with vague entity escaping. It’s also going to have to filter out every new variation of spam, and be smart enough to know when people lie.
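
Here’s a rough sketch of that gap in Python. The feed snippet is invented, and feedparser is just one example of a liberal parser; the point is that a strict XML parser rejects the whole document over one stray entity, while a liberal one flags the problem and keeps going.

```python
import xml.etree.ElementTree as ET

import feedparser  # pip install feedparser -- one example of a liberal parser

# An invented feed with the kind of vague entity escaping real RSS is full of:
# &nbsp; is not one of the five entities XML predefines.
feed = """<rss version="2.0"><channel>
<title>Example&nbsp;Feed</title>
</channel></rss>"""

# A strict XML parser gives up on the whole document.
try:
    ET.fromstring(feed)
except ET.ParseError as err:
    print("strict parser:", err)  # complains about the undefined entity

# A liberal parser flags the feed as not well-formed ("bozo") but keeps going.
parsed = feedparser.parse(feed)
print("bozo flag:", parsed.bozo)           # truthy -- the feed is broken...
print("title:", parsed.feed.get("title"))  # ...but a title may still come back
```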

The Semantic Web may happen, but if it does, it’s going to be a helluva lot messier than the architects would like.

14 thoughts on “Web 3.0 Does Not Validate”

  1. You are totally missing the point here.. First, Web 3.0 is just another buzzword, invented not by tech people but by sales/marketing people..
    Second, right now we’re stuck in an era where loads of people are still writing relatively low-level data..
    Stuff like RDF will be generated from existing data sources, and most people who publish that stuff will likely never have to worry about the syntax, in the same way people publish .mp3s, .docs, .pdfs or whatever (see the sketch after this comment)..
    You can also compare it with TCP/IP/HTTP.. protocols we use every day and never have to worry about. Of course there will always be buggy generators, but what’s the point of writing one if you can’t interface with proper parsers?
    Sadly, incompatibilities happen.. like with SOAP, but RDF is stable and well-defined as of right now.. We should not worry about the amateur writing shitty markup, but about the big vendors that have the actual power to turn an incompatibility into a semi-standard.
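
    A minimal sketch of that idea, assuming Python’s rdflib as the generator (the record dict and the example.org names are invented): the publisher touches plain data, and the serializer owns the syntax.

    ```python
    # Sketch only: rdflib (pip install rdflib) stands in for "the generator";
    # the record dict and the example.org names are hypothetical.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import FOAF

    record = {"id": "42", "name": "Alice Example",
              "homepage": "http://example.org/alice"}

    EX = Namespace("http://example.org/people/")
    g = Graph()
    person = EX[record["id"]]
    g.add((person, FOAF.name, Literal(record["name"])))
    g.add((person, FOAF.homepage, URIRef(record["homepage"])))

    # The serializer, not the publisher, is responsible for the syntax,
    # so the output is well-formed by construction.
    print(g.serialize(format="turtle"))
    ```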

  2. Evert, what you’re suggesting is that future tools will generate valid syntax, yet past experience has proved this wrong (people said that about HTML, and then XML). What is really needed is more tools that can read invalid syntax.
    TCP/IP works because geeks such as myself have agreed to try to follow the rules. But once a technology grows beyond the geekosphere, it’s unreasonable to assume that it can remain syntactically valid.

  3. I missed the hype – which is what I’m sure this is. I thought that was what Web 2.0 was all about.
    I think that systems, and users of systems, that produce broken code will be increasingly invisible – or rather unfindable and unsearchable. I have a feeling the usefulness of the semantic web will aid its own proliferation, and as for those who do not conform, you just won’t hear about them unless they shout and spend lots of cash getting themselves noticed.

  4. STOP, STOP THIS FUCKING WEB 2.0.1.2.3.2.4 BULLSHIT! JUST STOP! YOU’RE RUINING THE INTERNET!!!!

  5. Your arguments are flawed.
    As stated in an earlier comment, much of the data published is formatted automatically using tools, which eliminates much of the invalid use of markup.
    You also equate invalid markup with not being able to determine if a person is lying. There’s no way, by any stretch, to see the logical connection between these two. And no one proposes that the semantic web would be a mind reader.

  6. Shelley, the fact that much of the data will be formatted by tools doesn’t mean it will be valid. People made the same arguments about HTML, but for years we’ve dealt with web authoring tools that generate invalid markup. And of course, right now we’re dealing with tons of invalid RSS feeds despite similar arguments that we could rely on tools to generate well-formed XML files.
    Also, I wasn’t trying to equate invalid markup with being able to determine whether a person is lying, but I can see how my imprecise writing could be interpreted that way. I just think that the Semantic Web assumes too much about the quality and reliability of data.

  7. Web 3.0? Isn’t 2.0 Still In Beta?

    Just when it seems like Web 2.0 is finally getting somewhere, some people say it’s over and are moving on to Web 3.0. I guess the YouTube sale was the end of the line for some.
    I can almost see some kind of Wired/Tired list…

  8. Web 3.0 – In The Beginning

    Web 3.0. Buzzword? A new term to launch the next wave of investor-financed startups that don’t have a viable business plan? Or is it the “next big thing” we have all been waiting for? We barely had a chance to sit down, savor, and sift through the m…

  9. Don’t count on Google to validate anything. It’s not in their genes.
    Have you ever run a validator over the pages they produce??

  10. Ok, my turn –
    re. “The Semantic Web may happen, but if it does, it’s going to be a helluva lot messier than the architects would like.”
    I believe it’s starting to happen, and it certainly is messy.
    Evert got a key point in early – there’s all this stuff already in databases, moved around by software. Why should expressing it in a slightly different fashion make it any the less reliable?

  11. After having to trawl through thousands of feeds, dealing with all the ‘intricacies’ (too polite) to ‘sanitise’ them for presentation, I have to heartily agree.
    Call the XML Police!! ;)

  12. Hi,
    what an interesting report about Web 3.0.
    The whole world is talking about Web 2.0 and Bubble 2.0, because nobody knows exactly what Web 2.0 actually means.
    Amid this confusion, I read an article about Web 3.0 in a German newspaper a few days ago. It was very amusing.
    Best wishes from Germany

  13. I’ve had this out with some semwebbers recently. There’s an entire layer missing in the semantic web, which is reverse-engineering structured information out of semi-structured and ill-formed nonsense. You’re right – the semweb is making lab-level assumptions about data quality that don’t hold up for minutes in the field. It’s a very GOFAI way to think. The winners here will be those who parse at any cost (see the sketch after this comment).
    Shelley: “Your arguments are flawed.”
    Really? Look around you. We’re drowning in junk markup.
    Danny: “there’s all this stuff already in databases, moved around by software. Why should expressing it in a slightly different fashion make it any the less reliable?”
    This doesn’t make any sense – it already *is* unreliable. Data in databases is engineered or validated to be reliable, sure. Yet lots of the malformed junk on the Web comes straight from a DB.
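
    A minimal sketch of that missing layer, assuming Python with BeautifulSoup as the tolerant parser (the tag soup and the class names are invented):

    ```python
    # "Parsing at any cost": pull structured data out of tag soup.
    # The snippet and class names are hypothetical; beautifulsoup4 is the
    # tolerant parser here (pip install beautifulsoup4).
    from bs4 import BeautifulSoup

    tag_soup = """
    <div class=review>
      <span class=product>Widget 3000</span>
      <b class=price>$19.99
    </div>
    """  # unquoted attributes, an unclosed <b> -- a strict parser would reject this

    soup = BeautifulSoup(tag_soup, "html.parser")
    product = soup.find("span", class_="product")
    price = soup.find("b", class_="price")
    print(product.get_text(strip=True), "-", price.get_text(strip=True))
    # -> Widget 3000 - $19.99
    ```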

Comments are closed.