2008 Web Search is still in 1979

April 28, 2008 at 12:51 am 1 comment

Last Thursday (04/24/2008), I had the privilege of talking to Dr. Jim Martin’s Natural Language Processing (NLP) graduate class at the University of Colorado at Boulder about the work that we are doing at Filtrbox and the role that current NLP students will play in the future of information technology. This blog post is the basis of my message to the class.

As I have written before, the problem that we face today is how to harness the data available on the web so that applications can interpret it meaningfully. This problem is rooted in the assumption that the data stored on the web is “unstructured”. Most of the data processed by applications today is stored in some form of structure, e.g. a relational database. The data on the web is not; it is perceived as discrete pieces of data scattered all over the web.

I told the class that part of what I am doing at Filtrbox is an attempt to prove that the data on the web is not as “unstructured” as we may think today. Within that data there is a lot of structure, many relationships and a general interconnectedness, no matter how “discrete” we may think it is. With effective mining of the data and good applications, we can interpret the data and produce meaningful information. However, we are still far from applications that can interpret this data effectively. The reason is that we have to address the problem of information retrieval (IR) first, before we can get to writing applications.

To recognize where we are today on the continuum of web data information retrieval and applications, the evolution of enterprise applications gives us a great analogy:

Enterprise applications are where they are today primarily because they have a structured data storage model (Relational Database or RDB) and a standard access model (Structured Query Language or SQL). Before there were the enterprise applications that we know today, there were only RDBs and SQL. While RDB work dates back to the 1960s, the RDBs that most people are familiar with today had their beginnings in the 1970s. The first commercially available implementation of RDB+SQL (or what is widely believed to be the first) was Oracle, from the company then known as Relational Software, in 1979. It provided the ability to query an RDB for data using SQL, but no applications as we know them today. By analogy, this is where the web is today. We can go to Google or our favorite RSS readers (RDB analogy) and query for web data using a weak REST API or search form (SQL analogy), but we have no applications comparable to what exists in the enterprise today to interpret that data. So, simply put, today we are where enterprise applications were in 1979.
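To make the analogy concrete, here is a minimal sketch in Python. It is only an illustration: the company names, table layout and page texts are made up, and the functions `structured_query` and `keyword_search` are hypothetical helpers, not anything Filtrbox or a search engine actually exposes. The same three facts are stored once as rows and once as free text. SQL can answer a precise question about the rows, while a keyword search over the text only hands back pages that someone (or some application) still has to read and interpret.

```python
import sqlite3

# Made-up data: the same three facts, once as rows and once as free text.
ROWS = [
    ("Acme", "database", 1979),
    ("Globex", "search", 1998),
    ("Initech", "monitoring", 2007),
]
PAGES = [
    "Acme shipped a relational database in 1979.",
    "Globex launched a search engine in 1998.",
    "Initech began monitoring web feeds in 2007.",
]


def structured_query(before_year):
    """Answer a precise question ("who launched before year X?") with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE company (name TEXT, sector TEXT, year INTEGER)")
    conn.executemany("INSERT INTO company VALUES (?, ?, ?)", ROWS)
    cursor = conn.execute(
        "SELECT name FROM company WHERE year < ? ORDER BY year", (before_year,)
    )
    names = [row[0] for row in cursor]
    conn.close()
    return names


def keyword_search(term):
    """Return every page containing the term; interpreting it is left to the reader."""
    return [page for page in PAGES if term.lower() in page.lower()]


if __name__ == "__main__":
    print(structured_query(2000))
    print(keyword_search("1979"))
```

The structured query returns names that an application can act on directly. The keyword search returns raw text, and the "before 2000" question cannot even be asked of it without first extracting the year from each sentence, which is the kind of structure-finding work that NLP is expected to do.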

My message to the class was that applications like Filtrbox are barely starting to scratch the surface when it comes to implementing applications on top of web data. That is because, although it is 2008, we are still in 1979. The stumbling block is the perception of the “unstructured” nature of web data. Today’s NLP students will play a large role tomorrow in identifying and establishing structure in the “unstructured” web data in order to move us beyond 1979.

Entry filed under: Enterprise software, RSS, Software Engineering, Startups, Web 2.0.

1 Comment

  • […] A couple of weeks ago, I attended the Yahoo Open Hack Day at the Yahoo Campus in Sunnyvale, CA.  At Open Hack Day, Yahoo opened up all their technologies for a few chosen hackers to play with and evaluate for a weekend.  The technology that I was most interested in was BOSS (Build your Own Search Service). BOSS is “Yahoo!’s open search web services platform”.  Simply put, this means Yahoo has opened up its web index for anyone to use using the BOSS API.  This is unprecedented and opens up a ton of opportunities to advance some of the topics that I have discussed on this blog, primarily NLP: Unstructured thinking for unstructured data and 2008 Web Search is still in 1979. […]
