Posts filed under ‘Social Networking’

NLP: Unstructured thinking for unstructured data

In my last blog post, I talked about how we have had to develop Natural Language Processing (NLP) algorithms to overcome the lack of standardization on the web. At Filtrbox, the deeper we dig into the web in search of information, the more I find that we have to use an NLP concept here or half an NLP concept there to mine unstructured data. NLP concepts now figure into the majority of our algorithms. I have also begun to notice that my thought process as a software architect, designer and developer shows the influence of NLP and machine learning concepts much more than before.

I think NLP fundamentals are essential for anyone who wishes to take on the challenge of building the next generation of web applications that process the unstructured data on the web. Yes, there are efforts to build a structured web through initiatives such as the Semantic Web and the various APIs being proposed. I respect these efforts; however, I would not rely on these initiatives alone. The proposed APIs provide access to structured data stored on various islands on the web. Data that does not live on those islands is not accessible via the APIs. The Semantic Web is the initiative that will bring us closest to structured data on the web. However, given its painfully slow adoption, it looks like it’s going to be a while before we have some structure on the web. The challenge is what to do now while we wait for these initiatives to mature. I think that, instead of waiting for content publishers to structure their content, we should process their content as is and programmatically infer its structure. Applying NLP concepts is one way to make these structural inferences, and it takes us a step closer to programmatic input, processing and storage of unstructured data. We have traditionally thought in terms of structured data, programmed for structured data and stored structured data. The challenge posed by the web today is an opportunity for software engineers to break new ground and start thinking about, programming for and storing unstructured data.


February 29, 2008 at 8:56 am 2 comments

A case for standardizing blog templates

Alex Isikold of AdaptiveBlue has published a great post on “How YOU can make the web more structured”. One section of this post, “Standardizing Blog Templates Across Platforms”, really resonates with me. Isikold suggests that blogging platforms such as WordPress and TypePad standardize their templates. Why is this important?

To help answer this question, here is the Web 2.0 school of thought that I subscribe to. Let’s start with an enterprise database analogy. The basic assumption is that blogs are nothing but a data store. While the information in a blog makes for an interesting read, it is about as interesting as reading the data in a text column of a relational database. The data in a single text column may carry a lot of meaning, but its meaning and usefulness are enhanced when it is combined with other columns in the same table, with other tables in the same database, or even with data in other databases. The wealth of data is hidden in its interconnections with other data. To harvest that wealth, applications are built on top of the databases that reference the data and make relational semantic inferences between the data in the database(s). Today, blogs are the database(s). What is lacking are the applications that harvest the wealth of information stored in the blogs. These are the applications that the next wave of Web 2.0 companies (including mine) are working on.

The development of these next-generation applications is being hindered by the lack of a consistent structure (a standard) in blog data. What Isikold is bringing attention to is that, unlike relational databases, which adhere to a relational database management system standard (characterized by a simple TABLE/COLUMN/ROW + SQL structure that has been consistent over the years), blogs have no such standard. The structure of blogs is currently left up to blogging platforms such as WordPress, TypePad, etc. Blogging standards today are akin to having Oracle, SQL Server and MySQL each use a different standard for storing and retrieving information: not only a different standard for each database, but a different standard for each version of each database. To make matters worse, imagine each of these databases being customizable by anyone, so that anyone could change the standard to one of their liking. If these databases were in such a state, it would be very difficult to write any applications that leverage their data. The ODBC and JDBC standards would be very unreliable, if not useless. Such is the state of the blogosphere today when one looks at it from a data interface perspective.

As many of you know, I am currently devoted to working on the layer of applications that leverages the data in blogs and beyond in order to make such data more useful to users. The lack of standardization (as described above) makes it difficult to identify the content in blogs. Content identification is important because an application needs to tell actual blog post text apart from other text on the blog so that analyses and inferences can be made appropriately. I have been monitoring the different types of templates in an attempt to predict template patterns for the different blogging platforms (mainly WordPress, TypePad, Blogger, MovableType). I came to the conclusion that pattern prediction is only successful up to a point, for the following reasons:

1) the original templates from the blogging platform vendors come in multiple major and minor versions that do not tag template content in a predictably consistent way, and

2) there are modified/hand-coded templates floating out there which are totally unreliable.

As a result of these observations, I have resorted to writing my own content identification algorithms, which combine template pattern predictor algorithms with NLP-based semantic algorithms for identifying blog post text. While this has served me well up to now, a blog template standard would be very beneficial not only to me but also to the many people who have not figured out how to get past the problem.
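To make the idea of combining the two approaches concrete, here is a minimal sketch in Python. It is not the algorithm I actually use: the class names in KNOWN_POST_CLASSES, the function names and the thresholds are all assumptions made up for illustration, and a crude "does this block read like prose?" check stands in for real NLP. The sketch first tries to predict the template pattern and only falls back to the text-based check when no known pattern matches.

```python
from html.parser import HTMLParser

# Hypothetical class names a stock template might use for the post body.
KNOWN_POST_CLASSES = {"entry-content", "post-body", "entry-body"}
BLOCK_TAGS = {"div", "p", "li", "td", "article", "section"}


class BlockCollector(HTMLParser):
    """Records every block element with its classes and the text it holds."""

    def __init__(self):
        super().__init__()
        self.stack = []   # blocks that are currently open
        self.blocks = []  # every block seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            classes = set((dict(attrs).get("class") or "").split())
            block = {"tag": tag, "classes": classes, "own": [], "all": []}
            self.stack.append(block)
            self.blocks.append(block)

    def handle_endtag(self, tag):
        # Close the nearest open block with this tag; tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self.stack:
            return
        self.stack[-1]["own"].append(text)  # text directly inside the block
        for block in self.stack:            # text including nested blocks
            block["all"].append(text)


def looks_like_prose(text, min_words=8):
    """Crude stand-in for NLP: long enough and has sentence punctuation."""
    return len(text.split()) >= min_words and any(p in text for p in ".!?")


def identify_post_text(html):
    """Return (post_text, method), method being 'template' or 'heuristic'."""
    parser = BlockCollector()
    parser.feed(html)
    # Step 1: template pattern prediction against known class names.
    for block in parser.blocks:
        if block["classes"] & KNOWN_POST_CLASSES:
            return " ".join(block["all"]), "template"
    # Step 2: fall back to keeping only the blocks that read like prose.
    candidates = (" ".join(b["own"]) for b in parser.blocks)
    return " ".join(t for t in candidates if looks_like_prose(t)), "heuristic"
```

Calling identify_post_text on a page from a stock template returns the contents of the matching block, while a hand-coded template with unknown class names drops through to the prose check, which skips short navigation labels such as “Home” or “Archives”.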

Isikold is suggesting that a standard be adopted with the goal of giving blog templates a consistent structure. This means adopting a template standard that identifies the different types of data in the different parts of a blog post. Isikold is suggesting that, on a blog post, the template should make it easy to identify the blog post text, the sidebar, the name of the author, the date the blog post was published, the tags for the blog post content and the blog post’s comments. I believe the adoption of this simple template will go a long way in helping to bring the next wave of Web 2.0 applications to market faster. I support a blog template standard.
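To illustrate how much simpler content identification would become, here is a rough sketch in Python of what reading a standardized template could look like. The class names in STANDARD_FIELDS are hypothetical, invented for this example; neither Isikold’s post nor any blogging platform defines them. The sketch assumes only that every template marks each part of a post with one agreed-upon class name.

```python
from html.parser import HTMLParser

# Hypothetical markers for the parts of a post that the standard would name.
STANDARD_FIELDS = {
    "post-text", "post-sidebar", "post-author",
    "post-date", "post-tag", "post-comment",
}


class StandardTemplateParser(HTMLParser):
    """Collects the text found under each standard field marker."""

    def __init__(self):
        super().__init__()
        self.open = []  # (tag, field or None) for every open element
        self.fields = {name: [] for name in STANDARD_FIELDS}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next((c for c in classes if c in STANDARD_FIELDS), None)
        if field:
            self.fields[field].append([])  # start a new entry for this field
        self.open.append((tag, field))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name.
        for i in range(len(self.open) - 1, -1, -1):
            if self.open[i][0] == tag:
                del self.open[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        # Attribute the text to the innermost enclosing standard field.
        for _, field in reversed(self.open):
            if field:
                self.fields[field][-1].append(text)
                break


def parse_post(html):
    """Return a dict mapping each standard field to a list of text entries."""
    parser = StandardTemplateParser()
    parser.feed(html)
    return {
        name: [" ".join(chunks) for chunks in entries]
        for name, entries in parser.fields.items()
    }
```

With such a standard in place, no pattern prediction or text analysis is needed to separate the post from its surroundings: parse_post(html)["post-text"] is the post, and the author, date, tags and comments are simple lookups in the same dictionary.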

February 4, 2008 at 9:06 pm

Enterprise 2.0 in an “anti social” enterprise world

There has been a lot of talk recently about Enterprise 2.0, a.k.a. Enterprise Social Software, but there seems to be a dearth of vision for it. As a person who spent many years engineering software for the enterprise, here are my two cents:

Today, the words “enterprise” and “social” convey two contradictory notions. The enterprise today is characterized by its emphasis on the productivity of the individual employee. For the 8+ hours that employees are at work, they are supposed to be 100% productive (even though we all know that this rarely happens). “Social” is a word that does not really exist in the vernacular of the productivity-oriented enterprise, especially as it relates to software. Everything in the enterprise is geared towards productivity, so every enterprise software vendor attempts to tag its software with the phrase “productivity tool”. With all this obsession with productivity, the enterprise is very “anti social”. As a result, social computing is not perceived inside the enterprise the way it is outside. While those outside the enterprise harness the benefits of social software for a variety of business needs on a daily basis, to some in the enterprise, social software still carries the stigma of non-productive, time-wasting, web-based consumer applications that you use at home (not at work). I do not believe that “enterprise” and “social” have to be contradictory. I think there is a lot of social software that can be very beneficial to the enterprise, but the enterprise will not fully embrace it until three things happen:

  1. Social software companies need to leverage the concepts that are being applied by web-based consumer applications rather than try to implement these applications in the enterprise as they are. Trying to implement a Facebook in an enterprise is not necessarily the right approach; however, applying the concept of a “social graph” to a Sales Department, or “implicit web” concepts to lead qualification and cross-selling, has a better shot at being successful in the enterprise.

  2. Elimination of the word “social”. While this may sound silly, I think I may be onto something here. Words like “productivity” and “collaboration” mean something in the enterprise. Take, for example, a concept that can be very useful in an enterprise intranet: calling it “social classification” will not carry as much weight in the enterprise as calling it “collaborative classification”. So, instead of “social networking”, maybe start using “productive networking” or “collaborative networking”. No more “social graph”; it’s now a “collaborative graph” :) A good example is IM/Chat, which some enterprise vendors renamed to “Real Time Communications Suite” (well, you guessed it, the word “Chat” is too social); it is quickly becoming a staple within the enterprise.

  3. The enterprise needs to make a mind shift from its current notion of “productivity tools”. The enterprise is beginning to absorb a generation of employees who are proficient with “social” tools. Why not leverage those tools to make these employees even more “productive”?

The perceived gap between “enterprise” and “social” exists only at a semantic level. The convergence of the enterprise space and the social space is inevitable; however, for some of the more popular applications, it’s not a matter of simply transplanting the application as is, but rather of transplanting the concept.

I believe Enterprise Social Software/Enterprise 2.0 is here to stay. Recall that several years ago many companies resisted giving employees access to the web in the enterprise because it would affect “productivity”. It looks like a similar battle is brewing here.

November 2, 2007 at 8:40 am

