Big Men On Content

The Myth of Unstructured Data

In a recent post, Joe Shepley examined the conflict between structured and unstructured approaches to data management. This is a recurring theme for me, primarily because I hate the term unstructured data. There are a number of reasons why, but it boils down to the fact that there is no real line between the two. There are practical distinctions, but more often than not the terms are used by one group or the other to limit its own scope and throw problems it doesn’t want to deal with at another box on the org chart.

From a technical perspective, the distinction between the two concepts is the same idea that I examined in my post Mandelbrot and BPM. In that piece I reference the coastline paradox: the measured length of a coastline depends on how it is measured. The more closely you examine it, the longer the coastline becomes. To paraphrase myself, the degree of structure that data has is a function of an undefined measurement that represents the distance of the content from an observer or modeler, together with the unit of measure employed in the design.
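For readers who want to see the coastline paradox in numbers, here is a minimal Python sketch. It is my illustration, not something from the Mandelbrot and BPM post: a Koch curve stands in for a coastline (an assumption, since real coastlines are not exact fractals), and the hypothetical helper `measured_length` plays the observer who samples the same curve with a coarser or finer ruler.

```python
import math

def koch_curve(level):
    """Return the vertices of a Koch curve running from (0, 0) to (1, 0)."""
    points = [(0.0, 0.0), (1.0, 0.0)]
    for _ in range(level):
        refined = [points[0]]
        for (x1, y1), (x2, y2) in zip(points, points[1:]):
            dx, dy = (x2 - x1) / 3, (y2 - y1) / 3
            a = (x1 + dx, y1 + dy)              # end of the first third
            b = (x1 + 2 * dx, y1 + 2 * dy)      # start of the last third
            # Peak of the bump: the middle third rotated by 60 degrees.
            peak = (a[0] + 0.5 * dx - math.sqrt(3) / 2 * dy,
                    a[1] + math.sqrt(3) / 2 * dx + 0.5 * dy)
            refined += [a, peak, b, (x2, y2)]
        points = refined
    return points

def measured_length(points, stride):
    """Path length seen by an observer who only samples every stride-th vertex."""
    sampled = points[::stride]
    if sampled[-1] != points[-1]:
        sampled.append(points[-1])              # always keep the far endpoint
    return sum(math.dist(p, q) for p, q in zip(sampled, sampled[1:]))

coast = koch_curve(5)                           # 1024 segments
for stride in (256, 64, 16, 4, 1):              # coarse ruler first, fine ruler last
    print(f"every {stride:>3} vertices: length {measured_length(coast, stride):.3f}")
```

The data never changes between measurements; only the ruler does. Each finer stride recovers one more level of detail, and the reported length grows by a factor of 4/3 every time.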

No content is unstructured. Some content simply lacks the uniformity of rectilinear (boxlike) structures. Decisions on how best to manage content should be made on pragmatic grounds, but too often they are made out of habit. All content management systems have access to, and manage, structured data about and in content. It is simply more efficient to treat some data as too complex to manage, and so you entrust that unit of work to a consuming application or a human mind.

I much prefer the terms structured and complex, as I think this pair better represents the technical challenge.

One might argue that a raster image (TIFF) of a document, as a simple one-dimensional representation of a higher form, lacks structure. Yet the file formats for this and similar data have a great deal of structure, as do the rendered images themselves. It is true that the document loses its ability to communicate its own purpose to a machine by simple means when the data fields become ink on a page and are then transformed into pixels in a really long string embedded in a file.
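To make the point about file formats concrete, here is a small Python sketch that reads the header and first image file directory (IFD) of a TIFF. The function name `read_tiff_tags` and the in-memory sample are my own illustrative assumptions; the sample holds only a header and two tags, so it is not a complete, viewable image, and the parser decodes only single SHORT and LONG values.

```python
import struct

def read_tiff_tags(data):
    """Return {tag: value} for the first IFD of a TIFF byte string."""
    order = {b"II": "<", b"MM": ">"}[data[:2]]          # byte-order mark
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(count):
        start = ifd_offset + 2 + 12 * i                 # each entry is 12 bytes
        tag, ftype, n = struct.unpack(order + "HHI", data[start:start + 8])
        raw = data[start + 8:start + 12]
        if ftype == 3 and n == 1:                       # one SHORT (16-bit)
            value = struct.unpack(order + "H", raw[:2])[0]
        elif ftype == 4 and n == 1:                     # one LONG (32-bit)
            value = struct.unpack(order + "I", raw)[0]
        else:
            value = raw                                 # left undecoded in this sketch
        tags[tag] = value
    return tags

# Hypothetical sample: a little-endian header plus an IFD with two entries.
sample = (
    b"II" + struct.pack("<HI", 42, 8)                   # header, IFD at byte 8
    + struct.pack("<H", 2)                              # two directory entries
    + struct.pack("<HHII", 256, 4, 1, 640)              # tag 256 = ImageWidth
    + struct.pack("<HHII", 257, 4, 1, 480)              # tag 257 = ImageLength
    + struct.pack("<I", 0)                              # no further IFDs
)
print(read_tiff_tags(sample))
```

Even before a single pixel is read, the file declares its byte order, its layout, and its dimensions in rigidly typed fields. What is missing is not structure but any record of what the ink on the page meant.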

In some respects it is analogous to hieroglyphics, where pictographs carry abstract meaning. The data may not be easily interpretable by machines, but document recognition and capture technologies improve daily. The fact that an error rate still exists in recognition does not mean that the content lacks structure. It means only that the form it takes is too complex for simple processes to understand.

Content often has a highly complex structure that is beyond the capabilities of mere databases. Unstructured data as a term, however, is a myth, and content is anything but simple. Believing this myth can lead us to ignore the complexity of what we call content, and this denial of structure is often the root cause of the difficulties we face in implementing systems to manage it.
