Glenn's Web Factory

Monday, December 05, 2005

Is the web Normalized? Should we care?

Brad Feld raises an interesting question regarding normalization of the web's data. He speaks of the web becoming increasingly denormalized as information is replicated and spread around throughout the webosphere.

There are a few interesting questions raised by this observation.
  • Can we speak of the web's data as "normalized" or "denormalized"?
  • If the web were to become more denormalized, is this good or bad?
  • What can we do (and should we do it) to try to re-normalize the web's data?
One of the goals of normalization is to remove redundancies in the data. If you are storing the same information in several places, normalization will centralize the data and reference it from those varied locations. So, while I do not believe all the aspects of RDBMS normalization necessarily apply to data on the web, we can certainly talk about centralization and redundancy of canonical data. And here I am using "canonical" to refer to data for which there is an authoritative source - be it physical or theoretical. For example, the factual attributes of films or audio recordings or books. It would be denormalized to store "authoritative copies" of the Author of a given book in more than one place. Are we doing this on the web?
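As a rough sketch of that idea (the record names and IDs here are hypothetical, invented purely for illustration), a normalized layout keeps one authoritative record and has everything else refer to it by key rather than copy it:

```python
# Hypothetical data: one authoritative record per book, keyed by an ID.
books = {
    "book-1": {"title": "Example Title", "author": "A. Writer"},
}

# Other records point at the book by key instead of copying the author.
reviews = [
    {"book_id": "book-1", "stars": 4},
    {"book_id": "book-1", "stars": 5},
]


def author_for(review):
    """Resolve the author through the single authoritative record."""
    return books[review["book_id"]]["author"]


# Correcting the author in one place is seen by every record that refers to it.
books["book-1"]["author"] = "A. B. Writer"
authors = [author_for(r) for r in reviews]
```

Had each review stored its own copy of the author, the correction would have to be repeated once per copy, which is exactly the maintenance burden normalization tries to avoid.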

One trend we are seeing on the web is the offering of APIs for information access. This means that information is being made available (and often updatable) by everyone and from everywhere. This is really a move towards normalization, in the sense that the "authoritative copy" of the data remains only on the server that offers the API. Any service that reads that data and offers it as part of its own service is acting as an "aggregator".

If a service like Google Base helps to index a database and make it searchable by the world, it is acting more like a "cache" than a non-normalized database. That is, the original DB is still the "authoritative copy" and updates to it will eventually "percolate" down into Google Base, which simply provides faster, easier access to this data.

This talk of authoritative copies of data is not limited to such obviously database-friendly information as film and book attributes. Text from a blog entry also qualifies and should be scrutinized for normalization. The text you are now reading is being entered into blogger.com, and until otherwise stated, glennswebfactory.blogspot.com is the authoritative source of this data. If an aggregator or other such service copies this text to ease access to it by others, this again can be seen as a non-authoritative copy to speed distribution or enable organization of the data. But glennswebfactory.blogspot.com will remain the sole authoritative source, and as long as any replication services allow for updates to percolate from the authoritative source, I would argue that no denormalization has occurred.

To answer the questions above: I do think we can speak of normalization on the web, and I do think it is an important consideration. If we have multiple authoritative sources for the same data, we increase the work needed to maintain them, and we risk the data getting out of sync (wherein we have two differing opinions as to what the data should be).

To ensure we keep the web as normalized as possible, consider the following guidelines:
  • If you are a creator of original data, offer an API to access this data. Ensure that the API allows for updates (if relevant) to be propagated to users of your data. Provide clear guidelines on the use of the API and the nature of your data to ensure propagators of your data keep it reasonably fresh (provide expiration dates on the data, for example).
  • If you obtain data from an original source, follow its guidelines for keeping your copies of that data fresh. If no such guidelines are published, think about the nature of the data and employ your own methodologies to ensure freshness. (The author of a book is unlikely to change once it's published, but the president of a company may change with some frequency.) Be sure to clearly post information on your site regarding the authoritative source of your data, such that if questions arise as to the currency of your information, users can go to the source.
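To make the second guideline a little more concrete, here is a minimal sketch of a non-authoritative copy that records its source and refreshes itself once an expiration period has passed. Everything in it is an assumption made for illustration: the `CachedCopy` class, the `fetch` callable standing in for a call to the source's API, the example URL and the 60-second lifetime are all hypothetical.

```python
import time


class CachedCopy:
    """A non-authoritative copy that knows its source and when it goes stale."""

    def __init__(self, fetch, source_url, ttl_seconds, clock=time.time):
        self._fetch = fetch            # hypothetical callable that reads the authoritative source
        self.source_url = source_url   # posted so readers can go to the source
        self._ttl = ttl_seconds        # expiration period, ideally suggested by the data's owner
        self._clock = clock            # injectable clock, which makes the class easy to test
        self._value = None
        self._expires_at = None        # None means nothing has been fetched yet

    def get(self):
        """Return the cached value, re-reading the source once the copy has expired."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


# Usage with a stand-in "authoritative source" and a fake clock.
source = {"president": "Alice"}
now = [0.0]
copy = CachedCopy(
    fetch=lambda: source["president"],
    source_url="https://example.com/api/company",  # hypothetical URL
    ttl_seconds=60,
    clock=lambda: now[0],
)

first = copy.get()             # first read always goes to the source
source["president"] = "Bob"    # the authoritative copy changes
now[0] = 30.0
stale = copy.get()             # still inside the lifetime, so the old value is served
now[0] = 60.0
fresh = copy.get()             # lifetime elapsed, so the update percolates down
```

A short lifetime suits data like a company's president, while something as stable as a book's author could reasonably be cached for far longer.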

Saturday, December 03, 2005

Skype Hype

So according to Mena Trott, the next step in web blogging is making bloggers all the more accessible by including their Skype contact information on all their publications.
"This allows you to see a button on a blog and start talking to the person who publishes that blog," Mena Trott, co-founder and president of Six Apart, said in a phone interview. "That is the next step in blogging."
I'd have to say that, with all due respect, Mena, I disagree.

I think this is a classic mistake that has been made repeatedly in the communications space: that making increasingly interactive (and interruptive) modes of communication is always the path of progress.

I always disliked having a phone at my desk that I was required to answer. Inevitably it would ring just as I was mentally walking through some code path and was holding a bunch of stuff in my head. Answering the phone meant losing all that data and the 15 to 20 minutes it took to get into that "zone".

Email is much more conducive to the kind of work I have generally done. When an email arrives, I get a small notification and I can get to it at my leisure. To me, it reaches the perfect balance between immediacy and non-invasiveness.

I use Skype, and I use it often. But even with Skype, my contacts generally IM me first to see if the timing is right for a conversation. Often I ask for 10 or 15 minutes to finish a current task. This works well - particularly since I have only about a dozen contacts.

Blog commenting is an essential part of the 2-way experience that makes blogging a more interesting form of journalism. As this blog becomes more read (I hope this happens!), I look forward to reading lots of comments and feeding off others' ideas. But I look forward to doing this on my schedule. In fact, if a reader were to Skype me and discuss my entry directly with me, the discussion would cease to benefit other readers, which is largely the point.

I personally think the future of blogging lies more in making the medium more accessible and easier to use for the billions of people not yet involved. It also lies with helping to connect bloggers and their potential audiences. It lies with making it quicker and easier to find a blog or blog entry that will be interesting to a reader. It lies with making it easier, faster, more ubiquitous and more fun to "keep up" with your favorite bloggers. It lies with enabling effective search services across past blog entries for research purposes.

Those wanting to be interrupted with Skype conversations can already display their Skype name on their blog. They can put their home address and phone number too. They can even pull a Scoble and frequently detail their dinner plans and eat with their readers.

I really value input from readers... But don't expect to see the Skype Me button on my blog.

Friday, December 02, 2005

Virgin's "Exercise Your Music Muscle" Challenge

Recently Virgin launched a campaign called "Exercise Your Music Muscle".

The challenge is to examine a picture and name up to 74 bands whose names are represented within. There are quite a few very obvious ones ("Rolling Stones", "Queen", "Sex Pistols", "Smashing Pumpkins", etc.) but some are less literal. In fact, the possible solution space (the names of all bands) is so large, I'm sure one could take time and name well over 500 bands!

It seems the "contest" is just to be able to name a single band "correctly" which enters you into a drawing for the prizes. This is rather lame. In fact, readers of this blog already have 4 answers that are certain to be "correct"!

But before chiding Virgin too much, how else could they really hope to fairly administer a contest such as this? Should they just enter into the drawing the much smaller group of people who name all 74 bands "correctly"? Should they simply award prizes to people who submit "correct" answers, in the order they are received?

Web-based collaboration aids in solving difficult problems, and contests are absolutely fair game. A flickr page is dedicated to solving this challenge, and they already have something like 81 bands identified.

Administering web contests is very difficult. When there are prizes involved, people seek ways to exploit the contest to their advantage against the "spirit" of the contest. It would seem the "spirit" of this challenge is to see how many bands an individual can name by examining the picture, not to see how adept one is at searching for the answers with Google!

Of course the real objective with a campaign like this is to promote a brand, and Virgin has succeeded in doing so. The more people are talking about it (this blog entry included!), the more successful it has become.

So, while administering web-based contests is difficult, sometimes it simply doesn't matter.

Saturday, November 26, 2005

They use computers!

Walk, don't run! right over to Videlectrix.com and play some of their exciting videogames. Armed with the power-slogan "We use computers... to make video games!" you know you are in for some excitement from screen 1!

All their artwork is created by award-winning artists! And the theme music is so awesome, you'll be thrilled to know they offer it all as MP3 downloads! Hello iPod!!! My personal music fav is the theme from 50k Racewalker!

Thursday, November 24, 2005

More Lessons from the Sony Fiasco - Source Code Origin Transparency

Many of us have watched in subdued horror as the Sony BMG DRM drama plays out. Since the October 31 discovery that Sony had released music audio CDs with a rootkit installed (spyware which annoys paying customers in attempts to keep them honest while having no effect on experienced music pirates or anyone who has tape) we've seen Sony compound their mistakes by providing damaging removal software (don't install this!) and by attempting to ignore, deny or downplay the seriousness of their offenses. (For the sarcastic among you, get your I "heart" rootkit T-shirts here!)

The public outcry (and class action complaints!) seems to be getting through to some degree, but now comes another discovery: the software contained on the discs appears to include open source software that is used in violation of its license. Sony purchased this software from a third party called First4Internet (who has declined comment about the matter), but it clearly contains code fragments from the LAME open source project. This raises lots of questions of accountability and it will be interesting to see who is held responsible for this violation.

But regardless, it emphasizes the need for companies who purchase software to heavily scrutinize the code's origins to ensure compliance with all applicable licenses and regulations. Corporate software purchasers should demand absolute transparency regarding all code used in an application and obtain a signed statement attesting to it.

With the tremendous amount of available source code on the Internet, developers are increasingly depending on code, libraries or components developed by others. In turn, these libraries or components may depend on other components and so on. With the endless array of licensing options available, each with its own rules for use or extension, a fair amount of scrutiny is required to determine just how "owned" a software product is, and under just what conditions it may legally be used, modified, distributed or sold.

For example, Yokohama (my company's flagship product) incorporates a small handful of "third-party" components to enhance its functionality:
  • TinyMCE - An excellent rich text editing component produced by Moxiecode Systems AB. Nature of use: "linked library". License: LGPL.
  • FileUpload and DBCP from the Apache Jakarta Commons project for handling file submissions via the web and database pooling. Nature of use: "linked library". License: Apache
  • Matt Kruse's Calendar Popup for easy entry of dates. Nature of use: "linked library". License: Custom (Allows free use, must retain original header)
  • Walter Zorn's DHTML Tooltips for enhanced tooltips. Nature of use: "linked library". License: LGPL.
The nature of "linked library" usage means the code has not been copied and pasted into your own application, but is left whole on its own and linked to via a script call (for JavaScript libraries) or via a CLASSPATH (for Java libraries). For each of these libraries, this allows for full distribution with commercial products and does not impose their licenses upon your proprietary applications which link to them.

That is the full disclosure of our software's included third-party code. I recommend that every ISV create a similar list for their clients (it wouldn't necessarily have to be published on the web). Furthermore, I strongly recommend that companies who purchase software demand such a list and require a signed statement as to its accuracy.
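For anyone who prefers to keep such a list in machine-readable form, here is one possible sketch. The component entries come from the list above; the `Component` class, the `APPROVED_LICENSES` policy and the `needs_review` helper are hypothetical names I've made up for illustration, and a purchaser would substitute their own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One entry in a third-party code disclosure list."""
    name: str
    nature_of_use: str
    license: str


# Entries taken from the disclosure list above.
MANIFEST = [
    Component("TinyMCE", "linked library", "LGPL"),
    Component("Apache Jakarta Commons FileUpload", "linked library", "Apache"),
    Component("Apache Jakarta Commons DBCP", "linked library", "Apache"),
    Component("Matt Kruse's Calendar Popup", "linked library", "Custom"),
    Component("Walter Zorn's DHTML Tooltips", "linked library", "LGPL"),
]

# Hypothetical policy: licenses the purchaser has already reviewed and accepted.
APPROVED_LICENSES = {"LGPL", "Apache"}


def needs_review(manifest, approved):
    """Return components whose license or nature of use warrants a closer look."""
    return [
        c for c in manifest
        if c.license not in approved or c.nature_of_use != "linked library"
    ]


flagged = [c.name for c in needs_review(MANIFEST, APPROVED_LICENSES)]
```

With this made-up policy, only the custom-licensed calendar component is flagged, which simply means someone should read its terms before signing off.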

I believe such precautions would go a long way towards limiting the liability for companies who might otherwise find themselves in Sony's shoes.

Comments?

Sunday, November 20, 2005

Introduction

Well, it's time for me to jump on this blog bandwagon! It's something I've been meaning to do for some time now, but haven't, as it has seemed hard to justify the time investment.

But I hope I can give something of use to this community in the form of ideas and solutions and observations to issues that are encountered by enterprise web developers and small companies with a strong emphasis on development for the web.

Much of what I discuss will likely have a Java focus, but I try to watch closely the goings on of other frameworks such as Ruby on Rails, Zope, or PHP offerings - and I think many of the interesting problems are language agnostic.

Additionally, I hope this blog will be of interest to non-programmers who have a strong interest in trends of web applications and creating "best-of-breed" solutions for their sites or for their clients' sites.

Here at my company (bluejava) we challenge every status quo notion of this business and try to improve upon it. So far, I believe we have succeeded in handling many complex problems better than I have seen them managed before - and I look forward to sharing these methodologies with others and opening them up for comment and discussion.

Feel free to email or comment here with any questions or comments about web development in the enterprise, and I will do my best to get to them - perhaps answering them in this public forum for the benefit and scrutiny of others.

Thanks for reading, and here's to a better web experience!