Glenn's Web Factory

Monday, December 05, 2005

Is the web Normalized? Should we care?

Brad Feld raises an interesting question regarding normalization of the web's data. He speaks of the web becoming increasingly denormalized as information is replicated and spread throughout the webosphere.

There are a few interesting questions raised by this observation.
  • Can we speak of the web's data as "normalized" or "denormalized"?
  • If the web were to become more denormalized, is this good or bad?
  • What can we do (and should we do it) to try to re-normalize the web's data?
One of the goals of normalization is to remove redundancies in the data. If you are storing the same information in several places, normalization will centralize the data and reference it from those varied locations. So, while I do not believe all the aspects of RDBMS normalization necessarily apply to data on the web, we can certainly talk about centralization and redundancy of canonical data. And here I am using "canonical" to refer to data for which there is an authoritative source - be it physical or theoretical. For example, the factual attributes of films or audio recordings or books. It would be denormalized to store "authoritative copies" of the author of a given book in more than one place. Are we doing this on the web?
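
To make the idea concrete, here is a minimal sketch in Python. The book, the sites and the `author_for` helper are all made up for illustration. The first structure stores the author twice; the second stores it once and has every listing reference it.

```python
# Denormalized: every listing carries its own copy of the author.
# (All names and values here are hypothetical.)
denormalized = [
    {"site": "store", "title": "Example Book", "author": "A. Writer"},
    {"site": "reviews", "title": "Example Book", "author": "A. Writer"},
]

# Normalized: one authoritative record, referenced by key from each listing.
books = {"book-1": {"title": "Example Book", "author": "A. Writer"}}
listings = [
    {"site": "store", "book_id": "book-1"},
    {"site": "reviews", "book_id": "book-1"},
]


def author_for(listing):
    """Look the author up in the single authoritative record."""
    return books[listing["book_id"]]["author"]


# A correction is made once, at the authoritative record...
books["book-1"]["author"] = "A. B. Writer"

# ...and every listing that references it sees the new value.
corrected = [author_for(listing) for listing in listings]

# The denormalized copies still hold the old value until each one is fixed by hand.
stale = [row["author"] for row in denormalized]
```

After the correction, every normalized listing reports the new author, while the two denormalized rows still disagree with the source.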

One trend we are seeing on the web is the offering of APIs for information access. This means that information is being made available (and often updatable) by everyone and from everywhere. This is really a move towards normalization, in the sense that the "authoritative copy" of the data remains only on the server that offers the API. Any service that reads that data and offers it as part of its own service is acting as an "aggregator".

If a service like Google Base helps to index a database and make it searchable by the world, it is acting more like a "cache" than a non-normalized database. Meaning, the original DB is still the "authoritative copy" and updates to it will eventually "percolate" down into the Google Base, which simply provides faster, easier access to this data.

This talk of authoritative copies of data is not limited to such obviously database-friendly information as film and book attributes. Text from a blog entry also qualifies and should be scrutinized for normalization. The text you are now reading is being entered into blogger.com, and until otherwise stated, glennswebfactory.blogspot.com is the authoritative source of this data. If an aggregator or other such service copies this text to ease access to it by others, this again can be seen as a non-authoritative copy to speed distribution or enable organization of the data. But glennswebfactory.blogspot.com will remain the sole authoritative source, and as long as any replication services allow for updates to percolate from the authoritative source, I would argue that no denormalization has occurred.

To answer the questions above: I do think we can speak of normalization on the web, and I do think it is an important consideration. If we have multiple authoritative sources for the same data, we increase the work needed to maintain them, and we risk the data getting out of sync (wherein we have two differing opinions as to what the data should be).

To ensure we keep the web as normalized as possible, consider the following guidelines:
  • If you are a creator of original data, offer an API to access this data. Ensure that the API allows for updates (if relevant) to be propagated to users of your data. Provide clear guidelines on the use of the API and the nature of your data so that propagators of your data keep it reasonably fresh (for example, provide expiration dates on the data).
  • If you obtain data from an original source, follow its guidelines for keeping your copies of that data fresh. If no such guidelines are published, think about the nature of the data and employ your own methodologies to ensure freshness. (The author of a book is unlikely to change once it's published, but the president of a company may change with some frequency.) Be sure to clearly post information on your site regarding the authoritative source of your data, so that if questions arise as to the currency of your information, users can go to the source.
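
The two guidelines above can be sketched together in a few lines of Python. This is only an illustration under my own assumptions: `ReplicaCache`, `CachedRecord` and the `fetch` callback are hypothetical names, and `fetch` stands in for whatever API the original source offers, returning a value plus how many seconds it may be kept. The replica remembers where each value came from and goes back to the source once the copy expires.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CachedRecord:
    value: str
    source: str        # where the authoritative copy lives
    expires_at: float  # after this time the copy must be refreshed


class ReplicaCache:
    """A non-authoritative copy that refreshes itself from the source on expiry."""

    def __init__(
        self,
        source: str,
        fetch: Callable[[str], Tuple[str, float]],  # hypothetical API: key -> (value, ttl seconds)
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.source = source
        self.fetch = fetch
        self.clock = clock
        self.records: Dict[str, CachedRecord] = {}

    def get(self, key: str) -> CachedRecord:
        record = self.records.get(key)
        # Go back to the authoritative source if we have no copy or it has expired.
        if record is None or self.clock() >= record.expires_at:
            value, ttl = self.fetch(key)
            record = CachedRecord(value, self.source, self.clock() + ttl)
            self.records[key] = record
        return record
```

A book's author could be given a very long lifetime and a company president a short one; either way the `source` field lets a reader go to the original when the currency of the copy is in doubt.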

Saturday, December 03, 2005

Skype Hype

So according to Mena Trott, the next step in blogging is making the blogger all the more accessible by including their Skype contact information on all their publications.
"This allows you to see a button on a blog and start talking to the person who publishes that blog," Mena Trott, co-founder and president of Six Apart, said in a phone interview. "That is the next step in blogging."
I'd have to say that, with all due respect, Mena, I disagree.

I think this is a classic mistake that has been made repeatedly in the communications space: that making increasingly interactive (and interruptive) modes of communication is always the path of progress.

I always disliked having a phone at my desk that I was required to answer. Inevitably it would ring just as I was mentally walking through some code path and was holding a bunch of stuff in my head. Answering the phone meant losing all that data and the 15 to 20 minutes it took to get into that "zone".

Email is much more conducive to the kind of work I have generally done. When an email arrives, I get a small notification and I can get to it at my leisure. To me, it reaches the perfect balance between immediacy and non-invasiveness.

I use Skype, and I use it often. But even with Skype, my contacts generally IM me first to see if the timing is right for a conversation. Often I ask for 10 or 15 minutes to finish a current task. This works well - particularly since I have only about a dozen contacts.

Blog commenting is an essential part of the 2-way experience that makes blogging a more interesting form of journalism. As this blog becomes more read (I hope this happens!), I look forward to reading lots of comments and feeding off others' ideas. But I look forward to doing this on my schedule. In fact, if a reader were to Skype me and discuss my entry directly with me, the discussion would cease to benefit other readers, which is largely the point.

I personally think the future of blogging lies more in making the medium more accessible and easier to use for the billions of people not yet involved. It also lies with helping to connect bloggers and their potential audiences. It lies with making it quicker and easier to find a blog or blog entry that will be interesting to a reader. It lies with making it easier, faster, more ubiquitous and more fun to "keep up" with your favorite bloggers. It lies with enabling effective search services across past blog entries for research purposes.

Those wanting to be interrupted with Skype conversations can already display their Skype name on their blog. They can put their home address and phone number too. They can even pull a Scoble and frequently detail their dinner plans and eat with their readers.

I really value input from readers... But don't expect to see the Skype Me button on my blog.

Friday, December 02, 2005

Virgin's "Exercise Your Music Muscle" Challenge

Recently Virgin launched a campaign called "Exercise Your Music Muscle".

The challenge is to examine a picture and name up to 74 bands whose names are represented within. There are quite a few very obvious ones ("Rolling Stones", "Queen", "Sex Pistols", "Smashing Pumpkins", etc.) but some are less literal. In fact, the possible solution space (the names of all bands) is so large, I'm sure one could take time and name well over 500 bands!

It seems the "contest" is just to be able to name a single band "correctly" which enters you into a drawing for the prizes. This is rather lame. In fact, readers of this blog already have 4 answers that are certain to be "correct"!

But before chiding Virgin too much, how else could they really hope to fairly administer a contest such as this? Should they just enter into the drawing the much smaller group of people who name all 74 bands "correctly"? Should they simply award prizes as people submit "correct" answers in the order they are received?

Web-based collaboration aids in solving difficult problems, and contests are absolutely fair game. A flickr page is dedicated to solving this challenge, and they already have something like 81 bands identified.

Administering web contests is very difficult. When there are prizes involved, people seek ways to exploit the contest to their advantage against the "spirit" of the contest. It would seem the "spirit" of this challenge is to see how many bands an individual can name by examining the picture, not to see how adept one is at searching for the answers with Google!

Of course the real objective with a campaign like this is to promote a brand, and Virgin has succeeded in doing so. The more people are talking about it (this blog entry included!), the more successful it has become.

So, while administering web-based contests is difficult, sometimes it simply doesn't matter.