Posts filed under 'Software Engineering'

2008 Web Search is still in 1979

On Thursday last week (04/24/2008), I had the privilege of talking to Dr. Jim Martin’s Natural Language Processing (NLP) graduate class at the University of Colorado at Boulder about the work that we are doing at Filtrbox and the role that current NLP students will play in the future of information technology.  This blog post is the basis of my message to the class.

As I have written before, the problem that we face today is how to harness the data that is available on the web so that applications can interpret it meaningfully.  This problem is rooted in the assumption that the data stored on the web is “unstructured”.  Most of the data processed by applications today is stored in some form of structure, e.g. a relational database.  The data on the web is not; it is perceived as discrete pieces of data scattered all over the web.

I told the class that part of what I am doing at Filtrbox is an attempt to prove that the data on the web is not as “unstructured” as we may think today.  Within that data, there is a lot of structure, relationship and general interconnectedness no matter how “discrete” we may think it is.  With effective mining of the data and good applications, we can apply interpretation to the data and produce meaningful information.  However, we are still far from applications that can apply effective interpretive meaning on this data.  The reason for this is that we have to address the problem of information retrieval (IR) first before we can get to the writing of applications. 

To recognize where we are today on the continuum of web data information retrieval and applications, a look at the evolution of enterprise applications gives us a great analogy:

Enterprise applications are where they are today primarily because they have a structured data storage model (Relational Database or RDB) and a standard access model (Structured Query Language or SQL).  Before there were enterprise applications as we know them today, there were only RDBs and SQL.  While database work dates back to the 1960s, the RDBs that most people are familiar with today had their beginnings in the 1970s.  The first commercially available implementation of RDB+SQL (or the one widely believed to be the first) was Oracle, from the company then known as Relational Software, in 1979. It provided the ability to query an RDB for data using SQL, but no applications as we know them today.  By analogy, this is where we are with the web today. We can go to Google or our favorite RSS readers (RDB analogy) and query for web data using a weak REST API or search form (SQL analogy), but we have no applications comparable to what exists in the enterprise today to interpret that data.  So, simply put, today we are where enterprise applications were in 1979.

My message to the class was that applications like Filtrbox are barely starting to scratch the surface when it comes to implementing applications on top of web data.  That is because, although it’s 2008, we are still in 1979.  The stumbling block is the perception of the “unstructured” nature of web data. Today’s NLP students will play a large role tomorrow in identifying and establishing structure in the “unstructured” web data in order to move us beyond 1979.

1 comment April 28, 2008

Filtrbox Pizzabox Bug-Bash

Last night, Filtrbox commandeered “The Bunker” at Techstars (thanks to David Cohen) for the first Filtrbox Pizzabox Bug-Bash.  We invited Boulder locals to come in and help us test Filtrbox as well as provide us feedback on the product thus far. The event was a success and I would like to thank all those who were in attendance.  The feedback that we received from testers was great.   Look out for more information about this event on the Filtrbox blog. We had an awesome evening of fun, pizza and beer; here are some pictures from last night:

[Four photos from the Filtrbox Pizzabox Bug-Bash]

1 comment March 27, 2008

NLP: Unstructured thinking for unstructured data

In my last blog post, I talked about how we have had to develop Natural Language Processing (NLP) algorithms in order to overcome the lack of standardization on the web.  At Filtrbox, the deeper we dig into the web, exploring its inner depths for information, the more I find that we have to use an NLP concept here or half an NLP concept there to facilitate the process of mining unstructured data. NLP concepts increasingly figure into the majority of our algorithms.  I have begun to notice that my thought process as a software architect, designer and developer shows the influence of NLP and machine learning concepts much more than before.

I think NLP fundamentals are essential for those who wish to undertake the challenge of building the next generation of web applications that process the unstructured data on the web.  Yes, there are efforts to build a structured web via initiatives such as the semantic web and the various APIs being proposed. I respect these efforts; however, I would not rely on these initiatives alone.  The proposed APIs provide access to structured data stored on various islands on the web.  For those users who do not have their data on those islands, their data is not accessible via the API.  The Semantic Web is the initiative that will bring us closest to structured data on the web.  However, as we are witnessing its painfully slow adoption, it looks like it’s going to be a while before we have some structure on the web. The challenge is what to do now while we wait for these initiatives to mature. I think that today, instead of waiting for content publishers to structure their content, we should process their content as is and programmatically infer its structure.  Applying NLP concepts is one way we can make these structural inferences.  Applying NLP will take us a step closer to programmatic input, processing and storage of unstructured data.  We have traditionally thought in terms of structured data, programmed for structured data and stored structured data.  The challenge posed by the web today is an opportunity for software engineers to break new ground and start thinking about, programming for and storing unstructured data.

2 comments February 29, 2008

A case for standardizing blog templates

Alex Iskold of AdaptiveBlue has published a great post on “How YOU can make the web more structured”.  A section of this post, “Standardizing Blog Templates Across Platforms”, really resonates with me.  Iskold suggests that blogging platforms such as WordPress and TypePad standardize their templates.  Why is this important?

To help answer this question, here is the Web 2.0 school of thought that I subscribe to.  Let’s start off with an enterprise database analogy. The basic assumption is that blogs are nothing but a data store.  While information in a blog makes for an interesting read, it is about as interesting as reading data in a text column in a relational database.  While the data in a single text column may have a lot of meaning, its meaning and usefulness are enhanced when the data is combined with other columns in the same table, with other tables in the same database, or even with data in other databases. The wealth of data is hidden in its interconnections with other data. In order to harvest the wealth of data in databases, applications are built on top of the databases that reference and make relational semantic inferences between the data in the database(s).  Today, blogs are the database(s). What is lacking are the applications that harvest the wealth of information stored in the blogs.  These are the applications that the next wave of Web 2.0 companies (including mine) are working on.

The pace of these next generation applications is being hindered by the lack of a consistent structure (standard) in blog data. What Iskold is bringing attention to is that unlike relational databases, which adhere to a relational database management system standard (characterized by a simple TABLE/COLUMN/ROW+SQL structure that has been consistent over the years), blogs have no such standard. The structure of blogs is currently left up to the blogging platforms such as WordPress, TypePad etc. Blogging standards today are akin to having Oracle, SQL Server and MySQL each using a different standard for storing and retrieving information: not only a different standard for each of the databases, but a different standard for each version of each database.  Exacerbating the problem further, imagine that each of these databases could be customized by anyone, and that anyone could change the standard to one of their liking. If these databases were in such a state, it would be very difficult to write any applications that leverage data from them. ODBC and JDBC standards would be very unreliable, if not useless.  Such is the state of the blogosphere today when one looks at it from a data interface perspective.

As many of you know, I am currently devoted to work on the layer of applications that leverages the data in blogs and beyond in order to make such data more useful to users.  The lack of standardization (as described above) makes it difficult to identify the content in blogs.  Content identification is important because an application needs to be able to tell the difference between actual blog post text and other text on the blog so that analyses and inferences can be made appropriately.  I have been monitoring the different types of templates in an attempt to predict template patterns for the different blogging platforms (mainly WordPress, TypePad, Blogger, MovableType).  I came to the conclusion that pattern prediction is only successful up to a certain point, for the following reasons:

1) the original templates from the blogging platform vendors come in multiple major and minor versions that do not tag template content with predictable consistency, and

2) there are modified/hand-coded templates floating around out there that are totally unreliable.

As a result of these observations, I have resorted to writing my own content identification algorithms that combine template pattern predictor algorithms with NLP-based semantic blog post text identification algorithms.  While this has served me well up to now, a blog template standard would be very beneficial not only to me but also to the many people who have not figured out how to get past the problem.
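
The two-step idea described above, trying known template patterns first and falling back to an analysis of the text itself, can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the algorithms actually used at Filtrbox: the class names in KNOWN_POST_CLASSES are hypothetical examples of template markers, and the fallback simply picks the block with the most words, which is a crude stand-in for real NLP-based text identification.

```python
from html.parser import HTMLParser

# Hypothetical template markers; real platforms and versions differ.
KNOWN_POST_CLASSES = ("entry-content", "post-body", "entry-body")


class BlockCollector(HTMLParser):
    """Collects the text found directly inside each <div> of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open divs: [class attribute, text chunks]
        self.blocks = []  # closed divs: (class attribute, joined text)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([dict(attrs).get("class") or "", []])

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            class_attr, chunks = self.stack.pop()
            self.blocks.append((class_attr, " ".join(chunks)))

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # Credit the text to the innermost open div only.
            self.stack[-1][1].append(text)


def find_post_text(html):
    """Return the most likely blog post text in an HTML page."""
    collector = BlockCollector()
    collector.feed(html)

    # Step 1: template pattern match against known class names.
    for class_attr, text in collector.blocks:
        if any(known in class_attr.split() for known in KNOWN_POST_CLASSES):
            return text

    # Step 2: fallback for modified templates: the wordiest block wins.
    if not collector.blocks:
        return ""
    return max(collector.blocks, key=lambda block: len(block[1].split()))[1]


standard_page = (
    '<div class="sidebar">Archives Links</div>'
    '<div class="entry-content"><p>Hello world.</p></div>'
)
custom_page = (
    '<div class="nav">Home About</div>'
    '<div class="stuff"><p>This is the actual post text with many words.</p></div>'
)
print(find_post_text(standard_page))
print(find_post_text(custom_page))
```

The first page is resolved by the template match; the second has an unknown template and is resolved by the fallback. A heavily customized template with a long sidebar would fool a word count this simple, which is exactly the kind of unreliability that makes a template standard attractive.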

Iskold is suggesting that a standard be adopted with the goal of giving blog templates a consistent structure.  This means adopting a template standard that identifies the different types of data in the different parts of a blog post. Iskold suggests that the template should make it easy to identify the blog post text, the side bar, the name of the author, the date that the blog post was published, the tags for the blog post content and the blog post’s comments.  I believe the adoption of this simple template will go a long way in helping to bring the next wave of Web 2.0 applications to market faster.  I support a blog template standard.

Add comment February 4, 2008

Correct RSS date format

If you see a date like “01/02/07” in an RSS feed, what do you do?  You write a blog post about it. 

The applications that I am working on are reliant on some calculations using RSS dates.  I have noticed that the RSS date specification is probably the most taken for granted part of the RSS spec.  It is taken for granted because many consumers of RSS program around the date inconsistencies so there is not much of an outcry.  However, when you see a date like 01/02/07, you have to stop and say something. 

To those developers generating RSS feeds, please take a look at the RSS date format specifications as per the RSS specification.  I will summarize it here: 

The RSS date must conform to the RFC-822 date-time format (refer to the BNF for “date-time” in section 5).  Examples of this format are:

Mon, 04 Feb 2008 08:00:00 EST

Mon, 04 Feb 2008 13:00:00 GMT

Mon, 04 Feb 2008 15:00:00 +0200

Do not just call a stringifying method on your date object before writing it to the RSS feed.  Explicitly format the date in the above-mentioned format first, and then write it to the RSS feed.
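
As a rough illustration of the difference, here is a minimal Python sketch. The helper name rss_pub_date is made up for this example; format_datetime and parsedate_to_datetime come from the standard library's email.utils module, which produces and reads RFC 822 style dates. The numeric offset form (+0000) is used here instead of a zone name such as GMT; both appear in the examples above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def rss_pub_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 822 date-time string."""
    if moment.tzinfo is None:
        # A naive datetime carries no zone, so feed readers would have to guess.
        raise ValueError("attach a time zone before formatting an RSS date")
    return format_datetime(moment)


published = datetime(2008, 2, 4, 13, 0, 0, tzinfo=timezone.utc)

# Plain stringification gives a format that does not belong in an RSS feed.
print(str(published))           # 2008-02-04 13:00:00+00:00

# Explicit formatting gives the expected shape.
print(rss_pub_date(published))  # Mon, 04 Feb 2008 13:00:00 +0000

# The same instant expressed with a +0200 offset.
plus_two = published.astimezone(timezone(timedelta(hours=2)))
print(rss_pub_date(plus_two))   # Mon, 04 Feb 2008 15:00:00 +0200

# On the consuming side, a well-formed date parses back to the same instant.
print(parsedate_to_datetime("Mon, 04 Feb 2008 08:00:00 EST") == published)  # True
```

Note that the formatter derives the day name from the date itself, so the weekday and the date cannot disagree.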

To validate whether your date is correct, you can use http://feedvalidator.org

2 comments February 4, 2008

Filtrbox is hiring

At Filtrbox, we are on a quest to create software that helps people “know what they don’t know”.  How do we go about doing that, you may ask.  Well, if you want to know how we do that, come and join us because WE ARE HIRING. If you meet the following requirements, you have an opportunity to be part of the best software development team in Boulder, Colorado:

*Solid Java skills
*Solid web application development skills
*Experience with Natural Language Processing concepts (a plus)
*Actionscript 2 or 3 (a plus)
*System administration skills, Linux, Apache, Tomcat, MySQL (a plus)

*Must be energetic, motivated and creative

Send your resume to jobs at filtrbox dotcom

Add comment January 25, 2008

That software may be around for a very long time….write it well.

During the holidays I was surfing the web and discovered forums dedicated to software that I wrote almost a decade ago. It felt really good discovering that there are hordes of consultants out there being certified on architecture, designs and APIs that I conceived and developed. (There is nothing like discovering that people’s passing of a certification hinges upon them knowing the meaning of a phrase or term that you coined.)

Feeling proud of myself and maybe even a little boastful, I decided to anonymously answer a question in one of the free forums since I would “obviously” be the final authority on such matters.  As soon as I posted the “obviously correct” answer to the question, there was a response from one veteran consultant who indicated that I did not know what I was talking about, I had it all wrong and he proceeded to teach me the correct usage of the part of the software under discussion. WHOA!!! Wait a minute!!! But, I created the software!!! You can’t tell me the “correct usage” of my own API. It turns out that after so many years of consulting on the software, many consultants have come up with very creative workarounds and ingenious uses of the software.  I tip my hat to them because they are now doing things with the software that I did not even imagine at the time that I designed and developed the software.  I was both proud and humbled after reading the response from the consultant.  

This experience reminded me of the importance of architecting, designing and developing enduring software because you never know how long your code will be out there making a difference in people’s lives.

Add comment January 13, 2008

“simplicity of solution” is an essential element of quality

Last month, I posted a response to Brad Feld’s question, “Why Do Computers Suck So Much?”, which I think deserves its own posting here because I feel so strongly about it:

Computers do not suck so much, it’s the software engineers. Software engineers suck even more. Being a software engineer myself, I think over the years the quality of engineering in software has been slowly going downhill and it has been no secret. While I know some of the reasons why software is sometimes deliberately designed so that it is difficult to use with certain third party products (forced migration strategy anyone??), I would like to take your question as an opportunity to rant about my dissatisfaction with the quality of software design and engineering in general. It’s just something that has been irking me for some time.

Software engineers tend to forget that software engineering is a craft. It’s a craft whose beauty is in the detailed attention paid to the relevance of the functionality and the quality of the finished product. Engineers tend to forget this fact and their managers tend to be ignorant of it. Engineering managers have been lacking in identifying great engineers. More often than not, engineering managers do not understand the composition dynamics of a great software engineering team. Most managers mistake the person who can whip up the code the fastest (and rant about the differences between technologies and languages) for a great engineer, and they stack their teams with a whole lot of those types of engineers. A great engineer is a person who can not only find smart solutions to complex problems but can also simplify those solutions. The smartest guy in the room is not necessarily a great engineer but just a smart guy. In working with some of the engineers and their “program/product managers” from some of the “tier one” software companies, I noticed that a lot of them were simply smart guys and not necessarily great engineers, because they failed to make abstractions of software that were anything less than “smart abstractions”. (The average user is better served not by smart abstractions of software but rather by simplified abstractions.) The end result is software that just does not “simply work” as the average user would expect. Instead, users must know that this and that needs to be done before they can complete their tasks because to a geek engineer, his/her tester and their program/product manager, that is “so obvious”. What they fail to understand is that users have an affinity towards simple abstractions, as evidenced by the fact that when all else fails, the user falls back on the simplest abstraction of all: “turning it off and turning it back on”.

I have not been around long enough to say software is no longer being developed the way it used to be, but I have been around long enough to know that software is neither being designed nor engineered the way it should be. Software engineers, who view their work as a craft, should view “simplicity of solution” as an essential element of the quality of their work.

Add comment June 24, 2007

Getting Real – the forgotten chapter

After listening to an agile software development presentation at Rally Software that included excerpts from the book du jour for web 2.0 developers today, “Getting Real” by 37 Signals, I went back to a thought that I had after I read the book several months ago. While digesting all the information that I had read, I realized that “Getting Real” offers a great template for agile web app developers; however, it also puts a lot of responsibility on the software developer above and beyond what has been traditionally expected of developers. After reading the book, you realize that a developer is no longer a person who possesses code cranking skills only, but a person who needs a conglomeration of skills in order to satisfy the principles in “Getting Real”.

What skills will I be looking for in the next person we hire for our web 2.0 company, which follows the “Getting Real” agile software development model? As a former software architect, I identified that the closest skills for a person who fits the bill are those of a software architect (in addition to developer skills, of course). Now, I am not trying to turn “Getting Real” on its head and introduce the role of an architect. I am merely saying that the skill that should be innate (or trained) in a person who can thrive in a startup company implementing the “Getting Real” principles is commensurate with the skill required of a software architect.

In addition to suggestions in Chapter 8 of the book, I would hire a person who possesses the following skills that are often associated with software architects:

Constant understanding of a system’s organizational structure

Since we have no req specs with “Getting Real”, the developer must constantly keep in mind the overall view of the system as well as its constituent functional components and their relationships. This goes a long way in cranking out code faster because one must constantly understand how changes affect other parts of the system.

Ability to curb unbounded complexity

It takes a certain level of skill to deliver software that provides value, is simple to use and is powered by a non-complex system. Simply saying ’no’ to feature requests does not necessarily equate to a less complex system.

Leadership

The ability to influence and inspire is a quality that is continually evident in the 37 Signals guys themselves. “Getting Real” principles result in a product that is characterized by an impassioned boldness of the product and the people, and leadership is definitely an essential quality in achieving both.

Effective communication

A developer who successfully follows the principles outlined in the book must be able to communicate on three axes: X-axis - horizontal communication with the other members of the team; Y-axis - communication with your startup’s management, board, advisors, investors and any other developers under him/her; Z-axis - communication with the product’s users. “Getting Real” emphasizes the importance of engaging with your users, and this will yield the best results if your developers have effective communication skills.

Understand and appreciate business strategy

In order to “hire the right customers”, “have an enemy” or “underdo your competition”, a developer must understand the business strategy.

Political Skills

While traditional software architects must possess the skill for political navigation through an organization while championing the product, I think the political skill required here is a little different. A developer must be able to identify when politics start to affect the product in ways that are not in the best interest of the product and put a stop to it. Even small startup teams such as those suggested by “Getting Real” have a certain level of political dynamics not only within the team itself but also among other entities that interface with the team such as investors, advisors etc.

In addition to the “Staffing” recommendation in the book, I will definitely be using all the above-mentioned to evaluate our next hire because developers just cannot afford to be simple code crankers anymore.

1 comment June 21, 2007

