Walk through any expo floor and chances are you will see half a dozen vendors touting “text analytics.” In document management, it’s the cool “new” topic. The problem is that text analytics is not new; it has been around for almost 18 years, and many of the vendors now promoting it have supported it for years. It can change the way organizations work not only with their documents but also with the information stored in those documents.
It is often easier to explain what text analytics is not than what it is.
Text Analytics Is Not Zonal OCR or Full-Text Search
One of the biggest challenges in understanding text analytics is that too many technologies claim to perform it. So, let’s start with a definition: Text analytics uses tools and techniques to extract meaning from documents or other content sources. Under a very loose definition, things like zonal optical character recognition (OCR) or full-text search would fall into this category, but these weren’t called “text analytics” two years ago.
With zonal OCR, individuals define a location on a page to be scanned for specific information based on recognized patterns. Full-text search gets a little closer. Concept searching performs pattern recognition, and it often includes vocabularies to recognize variations such as verb tenses. However, these technologies have been mainstream in document and content management for several decades. The attempt to make them seem fresh by rebranding them with this term hinders the adoption of real text analytics.
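To make the contrast concrete, here is a minimal sketch of zonal, pattern-based extraction. The “zone” is just a fixed slice of each page’s recognized text, and the invoice-number format is a hypothetical pattern chosen for illustration. Nothing here extracts *meaning*; it only matches shapes of characters, which is exactly why this is not text analytics.

```python
import re

# Hypothetical invoice-number format, chosen for illustration only.
INVOICE_PATTERN = re.compile(r"INV-\d{6}")

def extract_invoice_number(page_text, zone=(0, 80)):
    """Search only the characters inside the defined zone of the page."""
    start, end = zone
    match = INVOICE_PATTERN.search(page_text[start:end])
    return match.group(0) if match else None

page = "INV-004213  ACME Corp  Statement of Account ..."
print(extract_invoice_number(page))  # INV-004213
```

If the invoice number moves outside the defined zone, or changes format, the extraction silently fails; there is no understanding of what an invoice number is.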
Text Analytics Is Based on Computational Linguistics
The current generation of text analytics uses concepts from computational linguistics to look at content. Computational linguistics is a field of its own, which uses computer science principles to analyze written and spoken language. It looks at morphology (word forms, such as grammatical tense) and sentence syntax. It also considers computational semantics—the fact that a single word may have multiple meanings. For instance, a “bank” can mean a financial institution, the edge of a river, the turning of an aircraft, or using the rail in a game of billiards.
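A toy sketch of computational semantics: pick the sense of “bank” whose description shares the most words with the surrounding sentence. This is a Lesk-style overlap heuristic, and the sense glosses below are invented for illustration; real systems use far richer lexicons and models.

```python
# Invented sense glosses for the word "bank" (illustration only).
SENSES = {
    "financial": {"money", "deposit", "loan", "account", "institution"},
    "river": {"river", "water", "edge", "shore", "mud"},
    "aviation": {"aircraft", "turn", "roll", "wing", "pilot"},
}

def disambiguate(sentence):
    """Choose the sense whose gloss overlaps most with the sentence's words."""
    context = set(sentence.lower().split())
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate("she opened a deposit account at the bank"))       # financial
print(disambiguate("the canoe drifted toward the river bank"))        # river
```

Even this crude overlap count shows why context, not the word alone, carries the meaning.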
The most common use of computational linguistics has been in the field of machine translation—also called automated linguistic translation. Machine translation looks at the entire sentence to translate one language into another. Computational linguistics is used to process the sentence or phrase into the different parts of speech and then return a translated copy (e.g., US English to Canadian French).
In the past, text analytics has been used to group like documents for categorization or classification (affinity groups) or to summarize large documents (document essence). The latest examples are found in identifying document outliers, as in fraud detection. When identifying a document outlier, text analytics attempts to match a document to its expected classification. If a document falls too far outside this classification, it is flagged as an outlier. The key to mapping the document relationship comes from Word2Vec.
Text Analytics Is About Word2Vec
The primary concept used to perform such processes is Word2Vec (think words to vector). It looks at each individual word and computes a vector for it in a multi-dimensional space, often with hundreds of dimensions. The computations are such that when these vectors are plotted, similar words are mapped in close proximity to each other. Because of this, even jargon can be plotted using Word2Vec. These vector maps are then used to identify like terms as well as outliers. Word2Vec concepts have been extended to Doc2Vec, which looks at whole documents in the same way.
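A toy illustration of the Word2Vec idea: words as vectors, with similar words close together. The vectors below are invented for illustration; a real model learns hundreds of dimensions from a large corpus of text.

```python
import math

# Invented word vectors (illustration only); a trained model learns these.
EMBEDDINGS = {
    "invoice": [0.90, 0.10, 0.00],
    "receipt": [0.85, 0.20, 0.05],
    "payment": [0.80, 0.25, 0.10],
    "giraffe": [0.05, 0.10, 0.90],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def nearest(word):
    """Return the most similar other word in the toy vocabulary."""
    return max((w for w in EMBEDDINGS if w != word),
               key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))

print(nearest("invoice"))  # receipt
```

The same nearest-neighbor lookup over learned vectors is what lets text analytics surface like terms, including domain jargon the model was never explicitly taught.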
Putting Text Analytics to Use
The challenge with text analytics is that people are looking only for existing use cases. Classification and categorization are simple examples that exist today. For text analytics to really take off, it needs new ideas for implementation. Information technology (IT) can already tell whether documents are similar, find like documents, and find documents with irregularities. These same calculations can be applied to the words inside documents, which is where text analytics shows its power. We need to understand its potential, and only then can we understand the questions that text analytics can help to answer.
A showcase solution in document management was how it helped bring Pfizer’s Viagra to market three months early by improving the FDA submission process. Those three extra months meant millions of dollars in revenue, but Viagra was discovered accidentally, from reactions reported in clinical trial documents. A showcase solution for text analytics could be finding the next new drug by reviewing collections of millions of clinical trial documents.
This article was originally published on DOCUMENT Strategy Media. Please visit here for the original post. I have since updated this article to fit the more common term “Text Analytics.”