NLP: Unstructured thinking for unstructured data

February 29, 2008 at 8:56 am

In my last blog post, I talked about how we have had to develop Natural Language Processing (NLP) algorithms to overcome the lack of standardization on the web.  At Filtrbox, the deeper we dig into the web for information, the more I find that we have to use an NLP concept here, or half of one there, to mine unstructured data.  NLP concepts increasingly figure into the majority of our algorithms.  I have also begun to notice that my thought process as a software architect, designer and developer shows the influence of NLP and machine learning concepts much more than before.

I think NLP fundamentals are essential for anyone who wants to take on the challenge of building the next generation of web applications that process the unstructured data on the web.  Yes, there are efforts to build a structured web via initiatives such as the Semantic Web and the various APIs being proposed.  I respect these efforts; however, I would not rely on them alone.  The proposed APIs provide access to structured data stored on various islands on the web.  Users whose data does not live on those islands cannot reach it through the APIs.  The Semantic Web is the initiative that will bring us closest to structured data on the web.  However, given its painfully slow adoption, it looks like it's going to be a while before we have some structure on the web.

The challenge is what we do now, while we wait for these initiatives to mature.  I think that today, instead of waiting for content publishers to structure their content, we should process that content as is and programmatically infer its structure.  Applying NLP concepts is one way to make those inferences, and it takes us a step closer to programmatic input, processing and storage of unstructured data.  We have traditionally thought in terms of structured data, programmed for structured data and stored structured data.  The challenge posed by the web today is an opportunity for software engineers to break new ground and start thinking in terms of, programming for and storing unstructured data.
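To make "programmatically infer the structure" a little more concrete, here is a minimal sketch in Python. It is not how Filtrbox does it; the function name `infer_structure`, the tiny stop-word list and the date pattern are all assumptions made up for illustration, and real NLP pipelines go much further than regular expressions and word counts. The idea is only to show raw text going in and a structured record coming out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "to", "and", "is", "for", "we", "that"}

# Matches dates written like "February 29, 2008" (one format only, for illustration).
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b"
)


def infer_structure(raw_text, top_n=3):
    """Guess a few structured fields from free text (hypothetical helper)."""
    # Treat the first non-empty line as the title and the rest as the body.
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    body = " ".join(lines[1:])

    # Look for the first date-like string anywhere in the text.
    date_match = DATE_PATTERN.search(raw_text)

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body) if s]

    # Count lowercase words, skipping stop words and very short tokens.
    words = re.findall(r"[a-z]+", body.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)

    return {
        "title": title,
        "date": date_match.group(0) if date_match else None,
        "sentence_count": len(sentences),
        "top_terms": [term for term, _ in counts.most_common(top_n)],
    }
```

Feeding it a scraped post gives back a dictionary with a title, a date (if one is found), a sentence count and the most frequent terms, which is the kind of record we can store and query even though the publisher never supplied any of those fields.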


Entry filed under: Social Networking, Software Engineering, Web 2.0.
