It Is More Than Metadata – The Meaning Behind The Mining
June 9, 2013
Posted by Lee Dallas in Content Management.
Tags: Barack Obama, Content Management, Guardian, Internal Revenue Service, IRS, Metadata, National Security Agency, NSA, Verizon
I am regularly asked, “What do you do for a living?” and I resorted long ago to just saying I work with computers. Most people are just being polite and really do not want to know any more than that. Every once in a while, though, my real job pops up in common culture in the form of a single word or concept, and suddenly what I do has relevance.
The first time this happened was during the Clinton/Lewinsky scandal, when suddenly the word ‘parse’ was in every journalist’s story. The idea was that to know what the president “really” meant, you had to take apart every word and phrase and analyze it in context. Now it is common practice to devote hours of discussion to parsing every line in a president’s speech.
Today the content management world has a new term moving into the culture: the word metadata. For those of us who work in content management, metadata is such a familiar concept that we often forget that most people have no idea what it is or, more troubling, just how much you can do with it.
We have learned in recent days about a secret program to collect call data from major carriers and surrender it to the government for analysis. The politicians rush to calm the public by reassuring us that no one is listening to the actual content (data), that only the metadata, the information about the calls, is being captured for analysis. I would argue that this is the greater risk. Not because it goes too far, but because it permits interpretive bias with the illusion of analytical certainty.
The power of metadata is two-fold. First, it allows the information it describes to be classified. This classification makes it possible to filter out information that is not interesting to the analyst. Data at this volume is useless without a way to group it into logical divisions and then essentially ignore everything you are not interested in. The second power is that it makes it possible to identify relationships between pieces of information that you would otherwise not see as connected.
Crime shows have been educating the public on this for years. The detectives start with a list of suspects who share the same modus operandi (classification). They then cross-reference that list with the location data from the suspects’ phones at the time of the crime (phone call metadata). Now the list of all possible suspects is narrowed to a manageable few to investigate by other means. You can also use other metadata, such as who else they have texted or called, to establish relationships, and then associate those people with location data to identify possible accomplices. All of this is done from the comfort of your data center.
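The detective workflow above can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the phone numbers, the `LocationRecord` fields, the area labels, and both helper functions are hypothetical and do not reflect how any real carrier or agency stores or queries its records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocationRecord:
    """Hypothetical location metadata: which phone was where, and when."""
    phone: str
    cell_area: str  # coarse location label (assumed format)
    hour: int       # hour of day, 0-23


# Step 1 (classification): people who match the modus operandi.
suspects = {"555-0101": "Avery", "555-0102": "Blake", "555-0103": "Casey"}

# Invented location metadata for the night of the crime.
location_records = [
    LocationRecord("555-0101", "downtown", 22),
    LocationRecord("555-0102", "harbor", 22),
    LocationRecord("555-0103", "downtown", 9),
    LocationRecord("555-0199", "downtown", 22),  # not on the suspect list
]

# Invented call metadata: (caller, callee). No call content anywhere.
call_records = [
    ("555-0101", "555-0199"),
    ("555-0102", "555-0150"),
    ("555-0199", "555-0101"),
]


def phones_near_scene(records, area, hour):
    """Return the phones seen in the given area at the given hour."""
    return {r.phone for r in records if r.cell_area == area and r.hour == hour}


def contacts_of(phone, calls):
    """Return every number that called, or was called by, the given phone."""
    contacts = set()
    for caller, callee in calls:
        if caller == phone:
            contacts.add(callee)
        elif callee == phone:
            contacts.add(caller)
    return contacts


# Step 2 (cross-reference): keep only suspects whose phone was at the scene.
present = phones_near_scene(location_records, "downtown", 22)
narrowed = sorted(p for p in suspects if p in present)

# Step 3 (relationships): contacts of those suspects who were also at the scene.
possible_accomplices = sorted(
    c
    for p in narrowed
    for c in contacts_of(p, call_records)
    if c in present and c not in suspects
)

print("Narrowed suspects:", narrowed)
print("Possible accomplices:", possible_accomplices)
```

Nothing in the sketch ever touches what was said on a call. Three suspects become one, and a previously unknown number surfaces, purely from who, where, and when.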
Notice that in this example I use a past event, the crime, as the context for the relationships. What happens, though, when you try to use this same data in a predictive mode? You can still create the web of relationships and touch points, but without more information about the context, why these relationships matter must be inferred. I text a man every seven days who happens to be a drug dealer. I could be scheduling a drop, or he could just be running a lawn service too and I need help with a different kind of grass.
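To make the weekly-text scenario concrete, here is a minimal Python sketch of how such a pattern might be surfaced from message metadata alone. The records, the labels ("me", "dealer", "sister"), and the `is_regular` threshold are all assumptions made up for illustration, not a description of any real analytic tool.

```python
from datetime import date

# Hypothetical message metadata: (sender, recipient, date sent). No content.
messages = [
    ("me", "dealer", date(2013, 5, 5)),
    ("me", "dealer", date(2013, 5, 12)),
    ("me", "dealer", date(2013, 5, 19)),
    ("me", "dealer", date(2013, 5, 26)),
    ("me", "sister", date(2013, 5, 7)),
    ("me", "sister", date(2013, 5, 9)),
    ("me", "sister", date(2013, 5, 20)),
]


def contact_gaps(records, sender, recipient):
    """Return the gaps, in days, between consecutive messages for one pair."""
    days = sorted(d for s, r, d in records if s == sender and r == recipient)
    return [(later - earlier).days for earlier, later in zip(days, days[1:])]


def is_regular(gaps, min_repeats=3):
    """Flag a contact as 'regular' if the same gap repeats enough times.

    The min_repeats threshold is an arbitrary assumption for this sketch.
    """
    return len(gaps) >= min_repeats and len(set(gaps)) == 1


dealer_gaps = contact_gaps(messages, "me", "dealer")
sister_gaps = contact_gaps(messages, "me", "sister")

print("dealer:", dealer_gaps, "regular?", is_regular(dealer_gaps))
print("sister:", sister_gaps, "regular?", is_regular(sister_gaps))
```

The code can establish that the contact is regular. It has no way to say whether the schedule is for a drop or for mowing the lawn; that label has to be supplied by whoever reads the output.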
The experts will tell you that analysis of the other data in our respective networks will increase or decrease the likelihood that I am involved in illegal behavior. For that to work, though, you must have everybody’s data, not just the data of those who are of immediate interest. The point here is that while the tools become more accurate in their predictions with more data, the result remains an inference. A guess. An accusation without any more evidence than a passing business relationship.
How do you solve for that? The natural next step is more data. Data that will provide the context needed to establish a threat. That means the content of the calls and messages themselves, and it will be what is captured next. The technology for capturing and automatically analyzing that content is advancing as well, and this is critical. As impressive as the numbers around the records being collected are, they are trivial when compared to the number of ones and zeros that make up the calls themselves.
I suspect many people are willing to trade this level of observation of our daily lives so long as we believe it is for benevolent purposes. Disturbingly, though, in the last few weeks we have learned not only about this data collection but also about the abuse of administrative power along ideological lines by the IRS. It is one thing to catch terrorists. It is quite another to mine such data on adversaries because you disagree with their politics. Given the powers the IRS possesses, the act of accusation is itself punitive. The agency has proven in a very public way this week its capacity for abuse. It raises the question: do you want that same IRS to know everyone you text or call, and to have the power to infer guilt by association alone?
The potential for abuse is not diminished because the content itself is not included. If anything, that abuse is enabled more by the degree to which interpretation must be used to establish the intent of the calls and the relationships they represent. There is clearly a trade-off between security and privacy, and this type of collection may be part of the price of maintaining our safety. But do not believe it poses no threat simply because the government does not listen in directly.
The metadata really does matter and is enough to endanger what privacy remains.