Opinions and discussion on content management and document management by two of the biggest guys in the business. *Measured by combined weight

The Myth of Unstructured Data

Author: Lee

Disclaimer

The opinions shared here are those of the individual contributors and not those of their employers or of Big Men On Content as a whole.

In a recent post, Joe Shepley examined the conflict between structured and unstructured approaches to data management. This is a recurring theme for me, primarily because I hate the term unstructured data. There are a number of reasons why, but it boils down to the fact that there is no real line between the two. There are practical distinctions, but more often than not one group or the other uses the terms to limit its own scope and throw problems it doesn’t want to deal with at another box on the org chart.

From a technical perspective, the distinction between the two concepts is the same idea that I examined in my post Mandelbrot and BPM. In that piece I reference the coastline paradox: the measured length of a coastline depends on how it is measured. The closer you examine it, the longer the coastline becomes. To paraphrase myself, the degree of structure that data has is a function of an undefined measurement representing the distance of the content from an observer/modeler, and of the unit of measure employed in the design.
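To make the coastline paradox concrete, here is a minimal sketch in Python. It uses a Koch curve as a stand-in coastline, which is an assumption of this example and not something taken from the Mandelbrot and BPM post. The function names `koch_points` and `measured_length` are hypothetical. The same fixed curve is measured several times, each time with a ruler one third the length of the previous one.

```python
import math

def koch_points(level):
    """Return the vertices of a Koch curve running from (0, 0) to (1, 0)."""
    points = [(0.0, 0.0), (1.0, 0.0)]
    cos60, sin60 = 0.5, math.sqrt(3) / 2
    for _ in range(level):
        refined = [points[0]]
        for (x1, y1), (x2, y2) in zip(points, points[1:]):
            dx, dy = (x2 - x1) / 3.0, (y2 - y1) / 3.0
            a = (x1 + dx, y1 + dy)            # end of the first third
            b = (x1 + 2 * dx, y1 + 2 * dy)    # start of the last third
            # Peak of the bump: the middle third rotated by 60 degrees.
            peak = (a[0] + dx * cos60 - dy * sin60,
                    a[1] + dx * sin60 + dy * cos60)
            refined.extend([a, peak, b, (x2, y2)])
        points = refined
    return points

def measured_length(points, step):
    """Measure the curve while visiting only every `step`-th vertex."""
    sampled = points[::step]
    if sampled[-1] != points[-1]:
        sampled.append(points[-1])
    return sum(math.dist(p, q) for p, q in zip(sampled, sampled[1:]))

# One fixed "coastline" with 4**5 + 1 vertices (assumed detail level).
coast = koch_points(5)

# Measure it with progressively shorter rulers: 1, 1/3, 1/9, ...
lengths = [measured_length(coast, 4 ** (5 - k)) for k in range(6)]
for k, length in enumerate(lengths):
    print(f"ruler = {3.0 ** -k:.4f}  measured length = {length:.4f}")
```

Nothing about the curve changes between measurements. Only the ruler does, and each shorter ruler reports a length 4/3 times the previous one. The analogy to content is that the amount of structure you find depends on the resolution at which you choose to model it.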

No content is unstructured. Some content simply lacks the uniformity of rectilinear (boxlike) structures. Decisions on how best to manage content should be made on pragmatic grounds, but too often they are based on habit. All content management systems have access to and manage structured data about and in content. It is simply more efficient to treat some data as too complex to manage, and to entrust that unit of work to a consuming application or a human mind.

I much prefer the terms structured and complex, as I think this pair better represents the technical challenge.

One might argue that a raster image (TIFF) of a document, as a simple one-dimensional representation of a higher form, lacks structure. Yet the file formats for this and similar data have a great deal of structure, as do the rendered images themselves. It is true that the document loses its ability to communicate its own purpose to a machine by simple means when the data fields become ink on a page and are then transformed into pixels in a really long string embedded in a file.
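As a rough illustration of how much structure the file format itself carries, the sketch below assembles the first bytes of a little-endian TIFF and reads them back with Python's `struct` module. The layout (byte order mark, magic number 42, offset of the first image file directory, 12-byte directory entries) follows the TIFF 6.0 format. The helper names `build_tiny_tiff` and `read_tiff_fields` are hypothetical, the page dimensions are made up, and the bytes contain no pixel data, so this is not a viewable image.

```python
import struct

TAG_NAMES = {256: "ImageWidth", 257: "ImageLength"}

def build_tiny_tiff(width, height):
    """Assemble a TIFF header and one directory (illustration only, no pixels)."""
    # Byte order "II" (little-endian), magic number 42, first IFD at offset 8.
    header = struct.pack("<2sHI", b"II", 42, 8)
    entries = [
        (256, 3, 1, width),   # ImageWidth, field type 3 = SHORT
        (257, 3, 1, height),  # ImageLength, field type 3 = SHORT
    ]
    ifd = struct.pack("<H", len(entries))
    for tag, field_type, count, value in entries:
        # Each entry is 12 bytes: tag, type, count, then the value itself.
        ifd += struct.pack("<HHII", tag, field_type, count, value)
    ifd += struct.pack("<I", 0)  # offset 0 means no further directories
    return header + ifd

def read_tiff_fields(data):
    """Read the first directory of a little-endian TIFF with inline values."""
    byte_order, magic, ifd_offset = struct.unpack_from("<2sHI", data, 0)
    if byte_order != b"II" or magic != 42:
        raise ValueError("not a little-endian TIFF")
    (entry_count,) = struct.unpack_from("<H", data, ifd_offset)
    fields = {}
    for i in range(entry_count):
        tag, field_type, count, value = struct.unpack_from(
            "<HHII", data, ifd_offset + 2 + 12 * i)
        fields[TAG_NAMES.get(tag, tag)] = value
    return fields

fields = read_tiff_fields(build_tiny_tiff(1700, 2200))
print(fields)  # {'ImageWidth': 1700, 'ImageLength': 2200}
```

The container is entirely fielded data. What the scan loses is the meaning of the marks on the page, not structure as such.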

In some respects it is analogous to hieroglyphics, where pictographs carry abstract meaning. The data may not be easily interpretable by machines, but document recognition and capture technologies improve daily. The fact that recognition still has an error rate does not mean that the content lacks structure, simply that the form it takes is too complex for simple processes to understand.

Content often has a highly complex structure that is beyond the capabilities of mere databases. Unstructured data as a term, however, is a myth, and content is anything but simple. Believing this myth can lead us to ignore the complexity of what we call content, and this denial of structure is often the root cause of the difficulties we face in implementing systems to manage it.

8 Responses »

  1. Lee, same comment I posted to Joe’s blog —

    Check out my 2005 article, Structure, Models and Meaning, http://intelligent-enterprise.informationweek.com/showArticle.jhtml?articleID=59301538

    “The imprecise unstructured label and the focus on management divert attention from the real issue: extracting and exploiting the information within binary (as opposed to fielded) data objects. The challenge of modeling and making sense of information content falls in the analytic rather than data management domain. Most unstructured data is merely unmodeled…”

    And you’ll find an October 19 conference I’m organizing, Smart Content, to be a great venue for discussing these topics: http://smartcontentconference.com

    Seth

  2. Well said, excellent article. It seems that as a market we have made little progress in convincing the world outside our content management community that unstructured is an invalid term. Shifting to analytics over, or rather in addition to, modeling is an interesting approach that appears to have more traction.

  3. I also advocate for using terms like uniform/non-uniform or regular/non-regular. Unstructured/structured are so widely used we can’t easily ignore them, but introducing the more precise terms does help the uninitiated appreciate the wealth of capabilities that have emerged for managing and processing content that doesn’t fit neatly into rows and columns.

  4. Jerry
    You make a good point. It is all about audience. Unfortunately, my experience has been that far too many people in the structured data world consider themselves “initiated”, and it is a challenge to get them out of their right-angled paradigms.

    The completely untrained pick up on this much faster as there are no preconceived notions to unlearn.

  5. Lee, Just followed a link to your article in a February 2011 tweet from @Middlewave.

    It’s all about degrees of structure as I wrote in a 2004 post titled “The Structured-Unstructured Information Continuum”.

    We don’t need new terms. We simply need to understand that most assets fall somewhere along a continuum of structure and should be managed accordingly.

