NLP: Unstructured thinking for unstructured data
In my last blog post, I talked about how we have had to develop Natural Language Processing (NLP) algorithms in order to overcome the lack of standardization on the web. At Filtrbox, the deeper we dig into the web for information, the more I find that we have to use an NLP concept here or half an NLP concept there to facilitate the process of mining unstructured data. NLP concepts now figure in the majority of our algorithms. I have begun to notice that my thought process as a software architect, designer and developer shows the influence of NLP and machine learning concepts much more than before.
I think NLP fundamentals are essential for those who wish to undertake the challenge of building the next generation of web applications that process the unstructured data on the web. Yes, there are efforts to build a structured web via initiatives such as the Semantic Web and the various APIs being proposed. I respect these efforts; however, I would not rely on these initiatives alone. The proposed APIs provide access to structured data stored on various islands on the web. Users who do not have their data on those islands cannot reach it via the API. The Semantic Web is the initiative that will bring us closest to structured data on the web. However, given its painfully slow adoption, it looks like it’s going to be a while before we have some structure on the web. The challenge is what to do now while we wait for these initiatives to mature. I think that today, instead of waiting for content publishers to structure their content, we should process their content as is and programmatically infer its structure. Applying NLP concepts is one way to make these inferences, and it takes us a step closer to programmatic input, processing and storage of unstructured data. We have traditionally thought in terms of structured data, programmed for structured data and stored structured data. The challenge posed by the web today is an opportunity for software engineers to break new ground and start thinking about, programming for and storing unstructured data.
2 comments February 29, 2008
A case for standardizing blog templates
Alex Isikold of AdaptiveBlue has published a great post on “How YOU can make the web more structured”. A section of this post, “Standardizing Blog Templates Across Platforms”, really resonates with me. Isikold is suggesting that blogging platforms such as WordPress and TypePad standardize their templates. Why is this important?
To help answer this question, here is the Web 2.0 school of thought that I subscribe to. Let’s start off with an enterprise database analogy. The basic assumption is that blogs are nothing but a data store. While information in a blog makes for an interesting read, it is about as interesting as reading data in a text column in a relational database. While the data in a single text column may have a lot of meaning, its meaning and usefulness are enhanced when the data is combined with other columns in the same table, with other tables in the same database, or even with data in other databases. The wealth of data is hidden in its interconnections with other data. In order to harvest that wealth, applications are built on top of the databases that reference the data and make relational semantic inferences across it. Today, blogs are the databases. What is lacking are the applications that harvest the wealth of information stored in the blogs. These are the applications that the next wave of Web 2.0 companies (including mine) are working on.
The pace of these next generation applications is being hindered by the lack of a consistent structure (standard) in blog data. What Isikold is bringing attention to is that unlike relational databases, which adhere to a relational database management system standard (characterized by a simple TABLE/COLUMN/ROW+SQL structure that has been consistent over the years), blogs have no such standard. The structure of blogs is currently left up to the blogging platforms such as WordPress, TypePad etc. Blogging standards today are akin to having Oracle, SQL Server and MySQL each use a different standard for storing and retrieving information. Not only a different standard for each of the databases, but a different standard for each version of each database. Exacerbating the problem further, imagine each of these databases being customizable by anyone, so that anyone can change the standard to one of their liking. If these databases were in such a state, it would be very difficult to write any applications that leverage data from them. The ODBC and JDBC standards would be very unreliable, if not useless. Such is the state of the blogosphere today when one looks at it from a data interface perspective.
As many of you know, I am currently devoted to working on the layer of applications that leverages the data in blogs and beyond in order to make such data more useful to users. The lack of standardization (as described above) makes it difficult to identify the content in blogs. Content identification is important because an application needs to be able to tell actual blog post text apart from other text on the blog so that analyses and inferences can be made appropriately. I have been monitoring the different types of templates in an attempt to predict template patterns for the different blogging platforms (mainly WordPress, TypePad, Blogger, MovableType). I came to the conclusion that pattern prediction is only successful up to a certain point, for the following reasons:
1) the original templates from the blogging platform vendors come in multiple major and minor versions that do not tag the template content in a predictably consistent way, and
2) there are modified/hand-coded templates floating out there which are totally unreliable.
As a result of these observations, I have resorted to writing my own content identification algorithms that include a combination of template pattern predictor algorithms and NLP based semantic blog post text identification algorithms. While this has served me well up to now, a blog template standard will be very beneficial not only to me but also to the many people who have not figured out how to get past the problem.
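To illustrate the general shape of such a two-step approach, here is a minimal Python sketch. It is not the algorithms described above: the class names in KNOWN_POST_CLASSES are assumed examples, and the fallback is a crude "most non-link text wins" heuristic standing in for real NLP-based identification.

```python
from html.parser import HTMLParser

# Assumed example class names; real templates vary by platform and version.
KNOWN_POST_CLASSES = {"entry-content", "post-body", "entry-body"}


class DivCollector(HTMLParser):
    """Collects the text directly inside each <div>, tracking link text separately."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open <div> blocks, innermost last
        self.blocks = []   # <div> blocks that have been closed
        self.in_link = 0   # nesting depth of <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = set((dict(attrs).get("class") or "").split())
            self.stack.append({"classes": classes, "text": [], "link_chars": 0})
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            block = self.stack[-1]
            block["text"].append(text)
            if self.in_link:
                block["link_chars"] += len(text)


def find_post_text(html):
    """Return the text of the <div> most likely to hold the blog post."""
    collector = DivCollector()
    collector.feed(html)
    collector.close()
    blocks = [b for b in collector.blocks if b["text"]]
    if not blocks:
        return ""
    # Step 1: template pattern. Trust a known class name when one is present.
    for block in blocks:
        if block["classes"] & KNOWN_POST_CLASSES:
            return " ".join(block["text"])

    # Step 2: fallback heuristic. Prefer the block with the most non-link text,
    # since sidebars and navigation are mostly links.
    def non_link_chars(block):
        return sum(len(t) for t in block["text"]) - block["link_chars"]

    return " ".join(max(blocks, key=non_link_chars)["text"])
```

The ordering is the point of the sketch: a recognized template tag is trusted when present, and the text-based guess is used only for modified or unknown templates.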
Isikold is suggesting that a standard be adopted with the goal of giving blog templates a consistent structure. This means the adoption of a template standard that identifies the different types of data on the different parts of a blog post. Isikold is suggesting that on a blog post, the template should make it easy to identify the blog post text, the side bar, the name of the author, the date that the blog post was published, the tags for the blog post content and the blog post’s comments. I believe the adoption of this simple template will go a long way in helping to bring the next wave of Web 2.0 applications to market faster. I support a blog template standard.
Add comment February 4, 2008
Correct RSS date format
If you see a date like “01/02/07” in an RSS feed, what do you do? You write a blog post about it.
The applications that I am working on are reliant on some calculations using RSS dates. I have noticed that the RSS date specification is probably the most taken for granted part of the RSS spec. It is taken for granted because many consumers of RSS program around the date inconsistencies so there is not much of an outcry. However, when you see a date like 01/02/07, you have to stop and say something.
To those developers generating RSS feeds, please take a look at the date format required by the RSS specification. I will summarize it here:
The RSS date must conform to the RFC-822 (refer to the BNF for “date-time” in section 5) date time format. Examples of this format are:
Mon, 04 Feb 2008 08:00:00 EST
Mon, 04 Feb 2008 13:00:00 GMT
Mon, 04 Feb 2008 15:00:00 +0200
Do not just execute a stringifying method on your date object before writing it to the RSS feed. Set the date format to the above mentioned format first before writing it to the RSS feed.
To validate whether your date is correct, you can use http://feedvalidator.org
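As a rough illustration of formatting the date explicitly rather than stringifying it, here is a minimal Python sketch using the standard library's email.utils module, which emits and parses RFC 822 style dates. The helper names to_rss_date and looks_like_rss_date are made up for this example, and the parser check is lenient, so it is not a substitute for the feed validator.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_tz


def to_rss_date(dt):
    """Format a timezone-aware datetime as an RFC 822 date for an RSS feed."""
    if dt.tzinfo is None:
        raise ValueError("RSS dates need a time zone; pass an aware datetime")
    # Normalize to UTC so every date in the feed uses the same zone.
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def looks_like_rss_date(text):
    """Lenient sanity check: does the string parse as an RFC 822 style date?"""
    return parsedate_tz(text) is not None


published = datetime(2008, 2, 4, 15, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_rss_date(published))           # Mon, 04 Feb 2008 13:00:00 GMT
print(looks_like_rss_date("01/02/07"))  # False
```

Normalizing to GMT is a design choice for the sketch, not a requirement; a numeric offset such as +0200 is equally acceptable in the format.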
2 comments February 4, 2008
A LEGENDary tribute
This afternoon my wife and I visited the recently opened Denver Museum of Contemporary Art. All the exhibits are great; however, there is one exhibit that consumed the majority of our time (and that of other museum goers as well). No, it was not some complex, hard-to-figure-out abstract art. It was the simple “Legend (a portrait of Bob Marley), 2005” by Candice Breitz.
Here is what Candice Breitz put together: In March 2005, 30 different people were filmed at the Gee Jam Studio in Port Antonio, Jamaica singing a compilation of Bob Marley songs a cappella (no instrumental accompaniment). All 30 shots are then played simultaneously on a 30 channel installation viewed through 30 different flat-screen TVs (one person per TV screen). The coolest thing about this is that even though it looks like one giant movie screen from afar, you get a spatial effect of the sound (the sound comes directly from the location of the person on the screen). That is simply because there are 30 different TVs with speakers right next to each screen, so the audio comes directly from the location of the TV screen. I definitely sat there for more than 30 minutes (I could have watched all 62 minutes and 40 seconds of it) because, first, I am a huge Marley fan and second, it was fascinating watching these 30 individuals sing these legendary songs. They were not perfect singers, did not necessarily hold a tune and did not necessarily know the words to the songs. However, I was captivated by their expressions: their facial expressions and their body language, both when they knew the words and when they were clueless. I loved the simplicity of the whole concept.
This is a great exhibit to check out when you are in Denver, especially if you are a Marley fan. Be warned that this exhibit is pretty loud (which I think may annoy some people). The voices of these 30 individuals echo through the whole museum. If you are a Marley fan, it’s a great sound track while you check out the cool exhibits that they have at the Denver Museum of Contemporary Art.
Add comment January 27, 2008
Filtrbox is hiring
*Solid web application development skills
*Experience with Natural Language Processing concepts (a plus)
*Actionscript 2 or 3 (a plus)
*System administration skills, Linux, Apache, Tomcat, MySQL (a plus)
*Must be energetic, motivated and creative
Add comment January 25, 2008
Advice for TechStars applicants
TechStars has announced that it is now accepting applications for 2008. I was part of the inaugural TechStars 2007 last year and here is my advice for TechStars applicants:
Apply
My first piece of advice to aspiring entrepreneurs is simply: apply to TechStars. It’s a great opportunity if you are looking for help with your start-up idea. Applying to TechStars was one of the best things that we ever did at Filtrbox.
Team
Your idea is important but the team is even more important. Putting together a team that can effectively execute on the idea is of paramount importance. The reason for the team being more important than the idea is that during the course of the summer, your idea may change a number of times, so it is important that you have a solid team that can deal with the changes and effectively execute on them. While it is important to put together a team with complementary skills, in my opinion it is far more important to have a team that is execution oriented and that has great chemistry. Keep in mind that TechStars is making an investment in you in the form of $15,000.00; they are not going to make that investment in a bunch of buddies with a great idea who cannot deliver.
Prototype
Have a body of work to show to TechStars to complement your application. I am assuming that your odds of being accepted to TechStars are far better if you have a working prototype or a full blown working application, because I believe all the teams that were chosen to be part of TechStars last year had some body of work to show. For some of the teams, the body of work was a prototype manifestation of their idea and for some, it was work that they had done in the past. If you do not have a prototype, I would suggest you get busy and start working on one right now.
No part timers
TechStars is a full time gig. If you currently have a job, be ready to quit your job. I did. TechStars is a grueling summer, and it will be difficult to do it part time. The only thing that you can afford to do part time during TechStars is sleep; be prepared for only a couple of hours of sleep a night and all nighters, especially as you get closer to Investor Day.
On a serious note, I think that it is important for you to realize this now so that you can prepare your loved ones for the fact that you may have to leave your current job or school if you get accepted into TechStars. It’s not easy leaving your high paying job or your degree program, both for you and for those around you, so you might as well let those around you know today that in the event that you get accepted into TechStars, you may have to quit. Most importantly, I would suggest that you plan your finances now.
Some of you might be saying to yourselves “Why do all that? What if I don’t get accepted?” If you are asking yourself this question, in my opinion, you are not a TechStar. Pessimism is definitely NOT a characteristic that they look for at TechStars.
Relocation
Be prepared to relocate to beautiful Boulder, Colorado or surrounding areas. As I said under “No part timers”, TechStars is a full time job, so it is important that you are willing to relocate to Boulder. My primary reason for this suggestion is that it makes for a much more cohesive team. Once you get accepted, the team is no longer limited to the list of people that you submitted in your application. The “team” grows when you get into TechStars because it includes all your advisors and all the people who are not necessarily your advisors but are simply rooting for you to succeed; your cheerleaders. You need to be able to schedule coffee, lunch, dinner or whatever with your “team” and interact with them. The great thing about Boulder is that the tech community is pretty much concentrated around Pearl Street so you have easy access to everyone.
Last year, we had teams with members relocating from as far away as Sweden. For those of you who are used to working in geographically dispersed teams, this is an opportunity to meet for the first time (the team with a team member from Sweden literally had their first face to face meeting at TechStars). I cannot overstate the team building benefits of a team working in one place. A better team is more likely to build a better product.
On a more important note, you need to realize this so that you can let your loved ones know that relocation to Boulder may be one of the consequences of being accepted to TechStars. You also need to start preparing yourself financially for the relocation.
Advisors
If you are not a person who takes advice well, I would suggest that you do not apply. While TechStars does not force you to take the advice of the array of advisors that they have lined up, attending TechStars and being unwilling to take advice defeats the whole purpose. At TechStars, they like to say that it’s a “hatchery” where individuals with great ideas can get advice to help them turn the ideas into meaningful products and the individuals into viable companies. So TechStars is all about being able to listen to other people’s opinions and taking some advice from seasoned entrepreneurs.
I have written a blog post in the past on Seth Levine, our awesome advisor at TechStars. Advisors do not come in any better quality than people like Seth, so be prepared to take advantage of them.
Presentation skills
Practice your presentation skills. This might not seem to be all that important but if you think about it, as an entrepreneur you are going to HAVE to make pitches, a lot of pitches, so you might as well hone your skills now. With respect to TechStars, at one point you are going to have to pitch to TechStars before you get accepted, so improving your presentation skills can only help your case. If you have a great idea, spend the time to make sure that you can communicate the idea effectively so that everyone else realizes how great your idea is.
Note that not everyone on your team has to be a great presenter, however, you need at least one person who can effectively communicate your idea.
Idea mutation
Don’t be married to your idea. Be prepared for the fact that your idea MAY change significantly during the course of the summer. Not all ideas will change; ours did not change significantly. However, some did change significantly and for the better. For those who are married to their idea, keep in mind that the opportunities in the tech industry are very fluid and as an entrepreneur sometimes you have to take advantage of opportunities as they present themselves. Take, for example, the Facebook Platform, which was released during TechStars 2007 and opened up a lot of unanticipated opportunities. The TechStars team that abandoned their original idea in order to take advantage of the new Facebook Platform was bringing in some revenue by the end of TechStars and is now the proud owner of one of the premier Facebook applications.
Burritos
Be prepared to eat a lot of burritos. Lots and lots of burritos.
I hope that these tips help aspiring applicants. I am in no way part of the TechStars organization so please take the above as my opinion, and my opinion only. Good luck and I hope to see you in Boulder this summer.
9 comments January 22, 2008
That software may be around for a very long time….write it well.
During the holidays I was surfing the web and discovered forums dedicated to software that I wrote almost a decade ago. It felt really good discovering that there are hordes of consultants out there being certified on architecture, designs and API that I conceived and developed (There is nothing like discovering that people’s passing of a certification hinges upon them knowing the meaning of a phrase or term that you coined).
Feeling proud of myself and maybe even a little boastful, I decided to anonymously answer a question in one of the free forums since I would “obviously” be the final authority on such matters. As soon as I posted the “obviously correct” answer to the question, there was a response from one veteran consultant who indicated that I did not know what I was talking about, I had it all wrong and he proceeded to teach me the correct usage of the part of the software under discussion. WHOA!!! Wait a minute!!! But, I created the software!!! You can’t tell me the “correct usage” of my own API. It turns out that after so many years of consulting on the software, many consultants have come up with very creative workarounds and ingenious uses of the software. I tip my hat to them because they are now doing things with the software that I did not even imagine at the time that I designed and developed the software. I was both proud and humbled after reading the response from the consultant.
This experience reminded me of the importance of architecting, designing and developing enduring software because you never know how long your code will be out there making a difference in people’s lives.
Add comment January 13, 2008
For a startup company, “every day has meaning”
A couple of days ago I had coffee with Tim Wolters, one of Boulder’s thought leaders. We discussed several topics including the startup life. During our conversation he used a phrase to sum up life in the startup lane, “every day has meaning”. For those of you who have been asking me about my take on life in the startup lane, I’d say that is the most accurate phrase to sum it all up.
“Every day has meaning”, what does it really mean?
Well, here is a reality check for you: An early stage startup is characterized by the abundance of scarcity. Everywhere you look there is a scarcity of one resource or another: just enough capital to last till the next funding round, paychecks that flirt with the poverty line, not enough engineers to write code, not enough servers, not enough hours in the day... not enough of almost everything. To maximize your scarce resources you make sure that you use them wisely, especially your time. Time is allocated such that you get the most out of the time you spend on each task with whatever little you have. Unlike in a large company, the consequences of not putting your time to good use in a startup are amplified because, in a startup company, YOU are the company. YOU are the impact. A startup’s survival depends on how much YOU put into it from day to day. Every day is the difference between fulfilling your startup ambition and life in a cubicle. Every day has meaning.
2 comments November 15, 2007
Thoughts on Enterprise 2.0
I was at the Defrag Conference in Denver today. There was a lot of talk around Enterprise 2.0 and Web 2.0. Here is a comment that I made last week on Brad Feld’s great post (More Thoughts on Consumer Internet Innovations Migrating to the Enterprise) which discussed a lot of the Enterprise 2.0/Web 2.0 issues that were discussed at Defrag today:
The successful migration of Web 2.0 to the enterprise hinges on the successful adoption of the Web 2.0 concepts into the enterprise. I will go into some specific application examples here to help move this discussion forward:
Supply Chain Management: Most enterprise software has an event notification system of some sort. Most of the events are in proprietary formats, and thus are only consumable by a specific vendor’s software. RSS can move the enterprise towards client agnostic event notifications. For example, if database triggers on an inventory control table generate RSS, an inventory manager can be notified on their mobile e-mail client or RSS reader of choice whenever the stock unit re-order level is reached. RSS provides more choices for the consumption of event notifications beyond the e-mail and app notifications that are generally used today.
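To make the re-order example concrete, here is a minimal Python sketch of the feed-generation half, using only the standard library. The event fields (sku, on_hand, reorder_level, at), the feed title and the example.com link are all hypothetical; how the database trigger hands events to this code is left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def reorder_feed(events):
    """Build an RSS 2.0 document with one item per stock unit needing re-order.

    Each event is a dict with hypothetical keys: sku, on_hand, reorder_level,
    and at (a timezone-aware UTC datetime).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Inventory re-order alerts"
    ET.SubElement(channel, "link").text = "http://inventory.example.com/alerts"
    ET.SubElement(channel, "description").text = (
        "Stock units at or below their re-order level"
    )
    for event in events:
        if event["on_hand"] > event["reorder_level"]:
            continue  # still above the re-order level, nothing to announce
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = (
            f"Re-order {event['sku']}: {event['on_hand']} on hand"
        )
        guid = ET.SubElement(item, "guid", isPermaLink="false")
        guid.text = f"{event['sku']}@{event['at'].isoformat()}"
        # RFC 822 date, as required by the RSS specification.
        ET.SubElement(item, "pubDate").text = format_datetime(
            event["at"], usegmt=True
        )
    return ET.tostring(rss, encoding="unicode")


events = [
    {"sku": "WIDGET-42", "on_hand": 5, "reorder_level": 10,
     "at": datetime(2008, 2, 4, 13, 0, tzinfo=timezone.utc)},
    {"sku": "GADGET-7", "on_hand": 50, "reorder_level": 10,
     "at": datetime(2008, 2, 4, 13, 5, tzinfo=timezone.utc)},
]
feed_xml = reorder_feed(events)
```

Any RSS reader or mail client that understands feeds could then subscribe to the output, which is the client agnostic part of the argument.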
Sales and Marketing: The sales and marketing organization within the enterprise is the best example of a candidate for an application that implements social networking concepts because the interaction patterns between the members of this organization mirror the social networking that occurs outside the enterprise. At a previous job, I would often venture to the sales and marketing department’s section of the company intranet; I noticed that they used any tool they could get their hands on in order to communicate. This included bulletin boards (to announce deals), threaded discussions (to discuss strategy for prospects) and good old e-mail blasts (to find out if anyone knows anything about a prospect). The current hodge-podge of tools used by sales and marketing organizations can benefit from applications that aggregate social networking concepts around information sharing.
Human Resources: An enterprise’s human resources organization can benefit from the concept of a “social graph”. Social graph concepts enhance the traditional org chart. For example, an application that implements an HR social graph describing relationships between employees such as “employee A worked with employee B under the management of employee C on project X in year ZZZZ” is very valuable to a member of the organization who is looking to put together a team for a new project. By examining an employee’s historical social graph, the organization can better assess an employee’s experience, and the graph also helps the employee in career planning.
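As a small illustration of what such a graph might look like in code, here is a Python sketch. The Assignment record and the worked_with function are hypothetical names invented for this example, and the data is made up; a real HR system would hold far richer relationships.

```python
from collections import namedtuple

# One record per "employee worked on project X in year Y under manager M".
Assignment = namedtuple("Assignment", "employee manager project year")


def worked_with(assignments, employee):
    """Return {colleague: [(project, year, manager), ...]} for one employee.

    Two people are linked when they share a project in the same year.
    """
    mine = {(a.project, a.year) for a in assignments if a.employee == employee}
    colleagues = {}
    for a in assignments:
        if a.employee != employee and (a.project, a.year) in mine:
            colleagues.setdefault(a.employee, []).append(
                (a.project, a.year, a.manager)
            )
    return colleagues


history = [
    Assignment("alice", "carol", "X", 2006),
    Assignment("bob", "carol", "X", 2006),
    Assignment("dave", "erin", "Y", 2007),
    Assignment("alice", "erin", "Y", 2007),
    Assignment("frank", "erin", "Z", 2007),
]
print(worked_with(history, "alice"))
# {'bob': [('X', 2006, 'carol')], 'dave': [('Y', 2007, 'erin')]}
```

Someone staffing a new project could run this kind of query to see who has already worked together, and under whom, before looking at the org chart.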
Customer Relationship Management: The implementation of Web 2.0 in CRM is pretty obvious and has been well articulated. CRM is an example of a case where complete Web 2.0 applications can be transplanted into the enterprise, e.g. a LinkedIn-like application provides the same benefits as part of a CRM suite as-is.
Here are some of my thoughts around enterprise technology:
Widgets: While widgets have really taken off in the Web 2.0 world, the major enterprise vendors have been traditionally strong in this space through the concept of portlets. Ironically, this is a concept that came about in the Web 1.0 days. However, there is still some opportunity around delivery and integration.
AJAX: When enterprise software vendors moved from the desktop app model to the web app model, their customers clamored for the same rich user experience that they had on the desktop. The experience on the desktop was enhanced by a better UI event model. As a result, for the past several years most UI teams at major enterprise software vendors have been dedicated to re-capturing the desktop experience. They primarily achieved this by hacking together some AJAX-like functionality well before the AJAX that we know of today. In my assessment, the user wants the desktop experience, so it is highly unlikely that vendors will invest their UI teams’ development cycles in AJAX; they are more likely to spend their cycles porting their code to RIA frameworks.
Data: There is a wealth of data that is very useful to the enterprise, especially to marketing organizations, that resides outside the enterprise (on the web). However, enterprise software vendors mostly build applications that access and manipulate data that has been gathered and structured in their databases (and preferably using their apps). This traditional reluctance of the enterprise software vendors to go outside the firewall provides an opportunity for Web 2.0 applications that gather, structure (and store) data resident outside the enterprise for use within the enterprise.
Semantic Web: In this case, I prefer to call it the Semantic Enterprise. Enterprise applications from the major vendors come with a heavy dose of semantic information both for the application itself and the data that it generates in the form of meta-data. Content Management systems from enterprise vendors usually provide a lot of meta-data (not necessarily RDF but XML nonetheless) to describe the content. This meta-data is a great starting point for Web 2.0 applications that implement semantic web concepts.
Meta data: Enterprise software systems are heavily meta-data driven (for reasons that I will not go into here). This means user interfaces, application interfaces, data sources and data are all described using meta-data. The implementation partners of the vendors have access to metadata generators and sometimes the meta-data spec itself. If one wants to develop Web 2.0 applications for the enterprise, approach the vendor for the meta-data generators or the spec itself and you should be ready to go. Most vendors are working on SOA frameworks, so you should have no problem integrating your application.
In short, the Enterprise 2.0 approach that I am advocating here is first understanding how the current enterprise systems work before judiciously determining where to apply concepts vs. products. Obviously other people have other approaches and it would be great if they can chime in.
Lastly, let me take a stab at the Facebook-type application question from your reader. I think a Facebook-type application can be adopted for an enterprise, with some restrictions. One has to realize that “social” networks in an enterprise are not organic. The nature of the enterprise does not lend itself to organic networks similar to those that form outside the enterprise. In an enterprise you may not be able to choose your “friends”. Your “friends” were chosen for you when you accepted that job offer. So an implementation of a Facebook-type application may have to use a different granularity for its “users”. For example, it may have to be a group oriented Facebook-type application rather than a user oriented Facebook-type application.
1 comment November 6, 2007
Enterprise 2.0 in an “anti social” enterprise world
There has been a lot of talk regarding Enterprise 2.0 a.k.a Enterprise Social Software recently but there seems to be a dearth of vision for Enterprise 2.0. As a person who spent many years engineering software for the enterprise, here is my two cents:
Today, the words “enterprise” and “social” convey two contradictory notions. The enterprise today is characterized by its emphasis on the productivity of the individual employee. For the 8+ hours that an employee is at work, they are supposed to be 100% productive (even though we all know that this rarely happens). “Social” is a word that does not really exist in the vernacular of the productivity oriented enterprise, especially as it relates to software. Everything in the enterprise is geared towards productivity, thus every enterprise software vendor attempts to tag their software with the phrase “productivity tool”. With all this obsession with productivity, the enterprise is very “anti social”. Thus, the perception of social computing inside the enterprise is not really the same as that of the people outside it. While those outside the enterprise harness the benefits of social software for a variety of business needs on a daily basis, to some in the enterprise, social software still carries the stigma of being a non-productive, time-wasting, web-based consumer application that you use at home (not at work). I do not believe that the terms “enterprise” and “social” are contradictory. I think that there is a lot of social software that can be very beneficial to the enterprise but the enterprise will not fully embrace it until three things happen:
- Social software companies need to leverage the concepts that are being applied by web-based consumer applications rather than try to implement these applications in the enterprise as they are. Trying to implement a Facebook in an enterprise is not necessarily the right approach; however, applying the concept of a “social graph” for a Sales Department or “implicit web” concepts for lead qualification and cross selling will have a better shot at being successful in the enterprise.
- Elimination of the word “social”. While this may sound silly, I think I may be onto something here. Words like “productivity” and “collaboration” mean something in the enterprise. For example, del.icio.us concepts can be very useful in an enterprise intranet; however, calling that concept “social classification” will not carry as much weight in the enterprise as “collaborative classification”. So, instead of “social networking” maybe start using “productive networking” or “collaborative networking”. No more “social graph”, it’s now a “collaborative graph” :) A good example is IM/Chat, which was renamed to “Real Time Communications Suite” by some enterprise vendors (well, you guessed it, the word “Chat” is too social); it is quickly becoming a staple within the enterprise.
- The enterprise needs to make a mind shift from its current notion of “productivity tools”. The enterprise is beginning to absorb a generation of employees who are proficient with “social” tools. Why not leverage the social tools to make them even more “productive”?
The perception of the gap between “enterprise” and “social” exists only at a semantic level. The convergence of the enterprise space and the social space is inevitable; however for some of the more popular applications, it’s not a matter of simply transplanting the application as-is but rather, transplanting the concept.
I believe Enterprise Social Software/Enterprise 2.0 is here to stay. Recall that several years ago many companies resisted employee access to the web in the enterprise because it would affect “productivity”. It looks like a similar battle is brewing here.
Add comment November 2, 2007