How AI Will Kill Unstructured Content

There is a plethora of marketing videos and analysts’ posts about new AI products popping up every day. While products could be evaluated ad nauseam, what I would like to focus on is what I think is the most interesting implication that this wave of affordable Enterprise AI suggests: the death of unstructured content.

First, some context. I was privileged to see Forrester’s Mike Gualtieri present a fantastic introduction to AI. I highly recommend his blog. He began the session by establishing that there are two very distinct classes of Artificial Intelligence: Pragmatic and Pure.

Pragmatic AI represents that set of capabilities where algorithms can process data and predict outcomes based on models as well as humans can or, in most cases, faster and with a higher degree of accuracy. The practical application of machine learning, one of the components of pragmatic AI, is improving predictive computing. This is giving us such life-changing advancements as self-driving cars, natural language processing and robotic process automation. The more data that is ingested, the more the system is able to improve its accuracy in predicting outcomes. It learns and improves without additional programming.

Pure AI is still something in the future. Pure AI does not simply process data. It perceives, processes and formulates new concepts or solutions based on conditions it was not expressly programmed to encounter and is essentially indistinguishable from human interaction. This is plainly the stuff of science fiction at this point. Fun to talk about but not something that is likely to impact your bottom line in the short term.

Pragmatic AI, or cognitive computing as it is now more commonly referred to, is finding its way into the enterprise through offerings like IBM’s Watson and Microsoft’s Cognitive Services. As traditional practitioners of ECM, though, it is important that we understand the implications of affordable AI in the enterprise.

Not tomorrow, and not next week, but someday in the relatively near future, AI will make unnecessary a term many have built their entire careers on. AI will one day kill Unstructured Content. As far as I am concerned, good riddance.

To be fair, what dies is not the content itself but rather the distinction. I have long complained about the term unstructured data. I have never liked it because, while it had a practical purpose in guiding how to build a system twenty years ago, it always seemed like you were giving up. The “content” (data), whether it be a letter or a contract or a policy, did not neatly fit into a row and a column, so you gave up and treated it as a blob with metadata. A file. It is still data. What does it matter if the data is a table or a document, and why should there be less discipline in managing it or less capability expected from it? Unstructured data always had lower expectations because it is “just files.” At least that is what we thought until we figured out really important stuff was buried in them.

Surrendering to the “idea” of unstructured data had implications of its own. While you can get business value from tracking state and transition of the entity, eventually whole industries developed around trying to understand the data inside the files themselves. SGML and later XML reinserted structure into file formats but that proved too hard and bulky. OCR too was the magic bullet that always missed its target in the eyes of the customer. ECM exists as a practice largely because dividing data into these categories was necessary to manage the sheer scale of content created by any organization of reasonable size. It was necessary because of a machine’s limited ability to understand information that had not been pre-interpreted into fixed contexts (tables, records and fields – oh my).

Enter artificial intelligence. At the lowest level, predictive algorithms have been creeping into the processing of documents for years in capture technologies. We are at a point of critical mass, however, where the machine can now “know” what a document is, understand what it says, recognize errors, infer corrections and decide what actions to take, all without a human in the process to interpret those conditions and make those decisions. Document-centric transactions have progressively diminishing dependence on humans for error correction, and these forms of work are starting to be assumed by Robotic Process Automation. This, frankly, was the easy part.

The ability of AI to analyze big data sets, if applied equally to tables and content, changes everything. Whether data is structured or unstructured stops being important if you have a unified means of interpreting that data, extracting insight and taking action.

The next level of advancement is when documents cease to be necessary at all. To an AI, documents are simply containers. Multiple vendors, for example, are leveraging pragmatic AI processes not only to assist in contract creation but also to ingest contracts and extract the terms and conditions on their own. Eventually, integrations can then implement the contract in an ERP system without humans (or lawyers) ever having to be involved.

With this level of intelligence, why does the contract file itself need to be managed? You can enforce, track, and control the cost of the instrument at the term level, providing more discrete control of the business relationship. When all of the data within a file is interpretable by the machine, the container itself is rendered obsolete. Extend this to the purchase order or the invoice and again the container matters less than the individual line item records.
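
To make the term-level idea a little more concrete, here is a minimal Python sketch of what a contract might look like once it is held as individual term records rather than as a file. Everything in it is an assumption for illustration only: the `ContractTerm` record, its fields, the helper functions and the sample values are hypothetical, and in practice an AI extraction step would populate the records from the ingested contract.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContractTerm:
    """One machine-readable term taken from a contract (hypothetical schema)."""
    contract_id: str
    clause: str          # e.g. "license", "support"
    description: str
    annual_cost: float   # cost attributable to this term, 0.0 if none
    expires: date


def terms_expiring_before(terms, cutoff):
    """Return the terms that lapse before the cutoff date."""
    return [t for t in terms if t.expires < cutoff]


def annual_cost_by_contract(terms):
    """Total the annual cost of each contract from its individual terms."""
    totals = {}
    for t in terms:
        totals[t.contract_id] = totals.get(t.contract_id, 0.0) + t.annual_cost
    return totals


# Made-up sample data standing in for the output of an extraction step.
terms = [
    ContractTerm("C-001", "license", "Software license fee", 12000.0, date(2026, 1, 31)),
    ContractTerm("C-001", "support", "Premium support", 3000.0, date(2025, 6, 30)),
    ContractTerm("C-002", "lease", "Equipment lease", 8000.0, date(2027, 3, 15)),
]

expiring = terms_expiring_before(terms, date(2026, 1, 1))
totals = annual_cost_by_contract(terms)
```

Once the terms are records, questions such as "what lapses this year?" or "what does this relationship cost us?" become simple queries across every contract at once, with no need to open a file.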

Contracts are in fact some of the most structured documents we manage. It makes sense that this is one of the first document classifications to be affected by this “deconstruction” and transformed into smart contracts. Once you give up the idea of unstructured data across the enterprise data set, you can start to see the implications that technologies like blockchain can have on how we actually build these systems in the near future.

Documents as a discrete, trackable unit of work may always be convenient. We are approaching a tipping point, though, where documents as a concept stop being needed by the machines. It is the humans that need the container because we simply cannot deal with data at the level of analytical resolution that an AI can. We must be careful not to let paradigms like the document limit the progress of efficiency in design. With this change, the distinction of unstructured versus structured becomes obsolete and we need to take advantage of it. Raise your expectations. Expect more from this data.

Expecting more from your data is where AI leads you. To be successful in getting business value from the content we manage, we need to start focusing on a new skill. Instead of simply managing the creation-through-deletion lifecycle, the skill of asking questions about data will grow in importance. That is a topic I plan on covering in my next post.

Unstructured Content Is Dead – Long Live the Question

4 thoughts on “How AI Will Kill Unstructured Content”

  1. Interesting post, Lee. As you say, invoices and contracts are at the structured end of the content world and our brains can start to break those down into a more structured view, i.e. terms. What about the other end of the scale though, e.g. videos or images? Would you say it is our inability to break those down into a more structured view conceptually which will throw up the biggest challenge? Humans think about an image by what we ‘see’ in it, e.g. a photo will be broken down by concepts such as the person in it, the colour of their hair, the colour of their eyes etc., whereas machines take a less emotional view of the image?

    1. I chuckled to myself when I saw this as the first comment. I started to address non-document formats but the post would have just been too long. In short – yes, I do agree that is the greater challenge. Again, the premise is that when we start from a presupposition that data is “unstructured” we stop looking for structure and miss value. Video surveillance is a good example. Every video stream has enormous amounts of data associated with it. Geolocation, angle of perspective, date/time, weather conditions, etc. Objects and people moving in and out of frame establish points of correlation with other sources. There is a great scene in the movie Patriots Day where a room full of agents is combing through millions of minutes of video footage and, once the anomaly of a person looking the “wrong way” at the moment of the explosion is spotted, they walk backward in the timeline from perspective to perspective to get a clear picture of the bombers’ faces. The data isn’t unstructured. Our ability to correlate it was limited by the toolset. What I want people to think about is the importance of not segregating the analytical tools and their user experiences by arbitrary classifications like unstructured content simply because we store the data in a file rather than a table. To paraphrase an old adage – we need to think out of the file. AI will make this possible. One day.

  2. Hi Lee

    Fantastic post, and it was great to chat with you at Enterprise World, but… there is always a but, eh? I don’t agree with much of what you state here, if even just on a conceptual level:

    “…..documents as a concept stops being needed by the machines. It is the humans that need the container because we simply cannot deal with data at the level of analytical resolution that an AI can.”

    So, you can’t get rid of “unstructured information” because of stupid humans! Yes, I agree with your blockchain-enabled smart contracts example: machine to machine, cognitive computing enhanced, robotic process automated work that does not need a lawyer or contract negotiator. However, there are many, many other forms of contracts that do still require considerable negotiation, and because humans at this point are doing the negotiation, the contract remains a document – a format they understand and can work with. As such it will be stored in a document management system, or content management system, which to your point is “unstructured information”. I have done big XML Structured Authoring projects, where the “document” is indeed a container for an XML manifest and XML information files, but they were stored in a “traditional ECM” platform. So is that structured information, or unstructured? Perhaps it is just the terminology that is at fault? Perhaps a document with a ton of metadata is semi-structured information?

    So just because Watson, Magellan, iManage RAVN, KIRA Systems, and no doubt many other products can lever the rich history of search and text analytics and add cognitive processing to crack open documents, index the contents, analyse the contents, and generate actionable insights, does not to me mean that these atoms of unstructured information are suddenly structured, it just means that thanks to advances in technology, we can treat them differently, and finally maybe run the same kind of analytics against them that the worlds of ERP, CRM etc have done for a while.

    1. Well, if we agreed on everything we wouldn’t need each other, would we? My war is against the word unstructured. It inspires mediocrity. My point is not that there is no point of diminishing returns to the indexing. There obviously is. AI, however, has the potential to eliminate the need to assume an indexing scheme needs to exist at all. We all go back and forth on the tag versus pre-defined metadata debate. It’s a religious argument. In one corner are the Tagonites (a recent denomination) who think simple entity extraction is all that is necessary for navigation. The more orthodox Metadatians (rather like that one) follow a data model design principle for content attributes and labor over taxonomies as scripture. Both are providing contexts for humans to classify and understand data.

      AI “may” not need this one day. This article from the Atlantic recounts an experiment where two AI instances essentially invent their own language to collaboratively solve a problem. On what basis should we believe that an AI will find it necessary to pre-classify/structure data (tag or index) in order to perform a given task? That is how we think but not necessarily the only way TO think. As I said in a post long ago – call it complex – but don’t call it unstructured simply because we cannot yet take full advantage of the order within. Yes – this can degrade into semantics, but words create boundaries seen and unseen.

      As to your negotiation example – who wouldn’t want an impartial algorithm to simplify the language and tell them what they just agreed to? A harder problem is things like novels. How does one conceptualize and then visualize plot and character development – or does that dissection kill the art? That is another topic entirely. Back to business. My hope is that AI lets us take advantage of structure we cannot see without it.
