Is the web Normalized? Should we care?
Brad Feld raises an interesting question regarding normalization of the web's data. He speaks of the web becoming increasingly denormalized as information is replicated and spread throughout the webosphere.
There are a few interesting questions raised by this observation.
- Can we speak of the web's data as "normalized" or "denormalized"?
- If the web were to become more denormalized, is this good or bad?
- What can we do (and should we do it) to try to re-normalize the web's data?
One trend we are seeing on the web is the offering of APIs for information access. This means that information is being made available (and often updatable) to everyone, from everywhere. This is really a move towards normalization, in the sense that the "authoritative copy" of the data remains solely on the server that offers the API. Any service that reads that data and offers it as part of its own service is acting as an "aggregator".
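To make this concrete, here is a minimal sketch of the relationship I have in mind, in Python. The names (BookRecord, AuthoritativeStore, Aggregator) are hypothetical stand-ins for any service that owns its data and any service that republishes it; the point is simply that all writes happen in one place, and the aggregator only ever reads through it.

```python
# A sketch of the "authoritative copy" relationship. All names here are
# hypothetical illustrations, not any particular site's API.

from dataclasses import dataclass


@dataclass
class BookRecord:
    isbn: str
    title: str
    author: str


class AuthoritativeStore:
    """The one place where the data may be created or updated (the API owner)."""

    def __init__(self) -> None:
        self._records: dict[str, BookRecord] = {}

    def put(self, record: BookRecord) -> None:
        # All writes happen here and only here.
        self._records[record.isbn] = record

    def get(self, isbn: str) -> BookRecord:
        return self._records[isbn]


class Aggregator:
    """Reads from the authoritative store and republishes; it never owns the data."""

    def __init__(self, source: AuthoritativeStore) -> None:
        self._source = source

    def show(self, isbn: str) -> str:
        record = self._source.get(isbn)  # always defer to the authoritative copy
        return f"{record.title} by {record.author} (from the authoritative store)"


store = AuthoritativeStore()
store.put(BookRecord("0-0000-0000-0", "Some Title", "Some Author"))
print(Aggregator(store).show("0-0000-0000-0"))
```

If the aggregator instead kept and edited a private copy of its own, we would be back to a denormalized web.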
If a service like Google Base helps to index a database and make it searchable by the world, it is acting more like a "cache" than a non-normalized database. That is, the original DB is still the "authoritative copy", and updates to it will eventually "percolate" down into Google Base, which simply provides faster, easier access to the data.
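One way to picture this "cache, not a second copy" behavior is a simple time-to-live rule: serve reads from the cache, but go back to the authoritative source once an entry gets old. This is only an illustration of the percolation idea; CachedIndex and its TTL policy are my own invention, not a description of how Google Base actually refreshes its index.

```python
# A sketch of a read-through cache with a time-to-live (TTL), so that
# updates made at the authoritative source eventually percolate down.

import time


class CachedIndex:
    """Serves reads quickly, but re-fetches once an entry is older than the TTL."""

    def __init__(self, fetch_from_source, ttl_seconds: float = 3600.0) -> None:
        self._fetch = fetch_from_source  # callable: key -> value, hits the authoritative DB
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, object]] = {}  # key -> (fetched_at, value)

    def get(self, key: str):
        now = time.time()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh enough: serve the cached value
        value = self._fetch(key)  # stale or missing: go back to the source
        self._entries[key] = (now, value)
        return value


# The aggregator answers most reads from its cache, but an update made at
# the source will show up here within one TTL period.
index = CachedIndex(lambda key: f"record for {key} from the source DB", ttl_seconds=60.0)
print(index.get("isbn-123"))
```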
This talk of authoritative copies of data is not limited to such obviously database-friendly information as film and book attributes. Text from a blog entry also qualifies and should be scrutinized for normalization. The text you are now reading is being entered into blogger.com, and unless otherwise stated, glennswebfactory.blogspot.com is the authoritative source of this data. If an aggregator or other such service copies this text to make it easier for others to access, that copy can again be seen as a non-authoritative one, made to speed distribution or to help organize the data. But glennswebfactory.blogspot.com will remain the sole authoritative source, and as long as any replication services allow for updates to percolate from the authoritative source, I would argue that no denormalization has occurred.
To answer the questions above: I do think we can speak of normalization on the web, and I do think it is an important consideration. If we have multiple authoritative sources for the same data, we increase the work required to maintain them, and we risk the data getting out of sync (wherein we end up with two differing opinions as to what the data should be).
To ensure we keep the web as normalized as possible, consider the following guidelines:
- If you are a creator of original data, offer an API to access this data. Ensure that the API allows for updates (if relevant) to be propagated to users of your data. Provide clear guidelines on the use of the API and the nature of your data so that propagators of your data keep it reasonably fresh (for example, provide expiration dates on the data).
- If you obtain data from an original source, follow any published guidelines for keeping your copy of that data fresh. If no such guidelines are published, think about the nature of the data and employ your own methodology to ensure freshness. (The author of a book is unlikely to change once it's published, but the president of a company may change with some frequency.) Be sure to clearly post information on your site regarding the authoritative source of your data, so that if questions arise as to the currency of your information, users can go to the source. A rough sketch of both sides of this arrangement follows below.
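To tie the two guidelines together, here is a rough sketch of what the producer and consumer sides might look like. The field names ("source_url", "expires_at") and the publish/refresh helpers are hypothetical conventions made up for illustration, not an existing standard.

```python
# Producer attaches an expiration date and a pointer to the authoritative
# source; consumer re-fetches from that source once the copy has expired.

import time


def publish(record: dict, source_url: str, max_age_seconds: float) -> dict:
    """Producer side: attach the authoritative source and an expiration date."""
    return {
        **record,
        "source_url": source_url,
        "expires_at": time.time() + max_age_seconds,
    }


def refresh_if_stale(copy: dict, fetch_from_source) -> dict:
    """Consumer side: if the copy has expired, go back to the authoritative source."""
    if time.time() < copy["expires_at"]:
        return copy  # still fresh; keep serving it
    return fetch_from_source(copy["source_url"])  # stale; re-fetch from the source


# A company's president may change, so that record gets a short max age;
# a book's author will not, so its copy could be trusted for much longer.
company = publish(
    {"company": "Example Corp", "president": "A. Person"},
    source_url="https://example.com/api/company",
    max_age_seconds=24 * 3600,
)
```

Carrying the source URL with every copy is what lets a reader (or another aggregator) walk back to the authoritative source whenever the copy expires or its accuracy is questioned.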