You never give me your answers... You only give me your funny data

Photo – Iain Macmillan.
Warnings about data analysis using examples from my experiences with police data - and the three things you need for data analysis.  Plus lots of Abbey Road references. 

Disclaimer: All of the ideas expressed in this article are my personal statements and opinions, and do not reflect the opinions/statements of the City of Urbana.
Getting injected into the Police data discussion has been fascinating.  I started as a City IT Director 7 months before the summer of 2014 exploded, with the deaths of Eric Garner (New York) and Michael Brown (Ferguson). As fits a progressive City, Urbana’s Traffic Stop Data Task Force was slightly ahead of the curve, convening for the first time in June, 2014. So police data was a focus for our IT Division from the beginning.

In IT, our role is querying data - hopefully to clarify issues and facilitate decision making. Unfortunately, too much of the time our job is explaining that data is NOT available. Or that the data is not structured well. (A recent example: our data on building inspections captures paragraphs of text. It’s incredibly useful when you want to read the history of a single house – but you can’t analyze across the records.)

One sweet dream came true today...

Of course, sometimes we do have the data - and it is sublimely satisfying when our work lets people better understand an issue. In March 2016, our City Council reconsidered the local ordinance for Cannabis possession. People accused the City of using these violations (a $300 ticket) to get easy money.  We queried, and found that exactly 50 people paid fines in 2015 for a total of $15,000.  Council debated and enacted a policy change to drop the fine to $50. While doing so, they knew the financial impact was an estimated loss of $12,500.
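The arithmetic behind that estimate is simple enough to sketch. This is only a back-of-the-envelope check, and it assumes (as the estimate itself seems to) that the same number of people would pay under the new fine. The variable names are mine, for illustration.

```python
# Back-of-the-envelope check of the Cannabis fine numbers quoted above.
# Assumption: the number of paid tickets stays at the 2015 level.
paid_tickets_2015 = 50
old_fine = 300  # dollars per ticket before the change
new_fine = 50   # dollars per ticket after the change

old_revenue = paid_tickets_2015 * old_fine
new_revenue = paid_tickets_2015 * new_fine
estimated_loss = old_revenue - new_revenue

print(f"2015 revenue at ${old_fine} per ticket: ${old_revenue:,}")
print(f"Same tickets at ${new_fine} per ticket: ${new_revenue:,}")
print(f"Estimated annual loss: ${estimated_loss:,}")
```

The point is less the math than the fact that one crisp query (how many people paid, and how much) was all Council needed.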

But not every situation is this clear-cut. Recently I spoke to a University class that was partnering with the County Racial Justice Task Force. The students were excited, but unsure of where to start. I was happy to catch them at the best time: before they’d met with the Task Force.  This was an excellent opportunity to lower their expectations. (If that sounds bad to you… “to right-size their expectations.”)

Greeting them as fellow data analysts, I emphasized three things:
  1. You need questions to answer
  2. You need data captured in a useful way
  3. You need permission to use the data

He say one and one and one is three…

Start with the first point. Often when people talk about data, they are looking at a broad topic and throwing data at it, hoping something sticks out. But that’s not going to work. Data isn’t going to solve problems on its own. All data can do is answer specific questions, like the total amount of Cannabis fines in a year.

But if someone asks a more nebulous question like “are Police stopping too many African-American drivers?” then it’s vastly more difficult to answer. Computers need crisply framed requests. The hard part is breaking a question down into something we can query, but that’s what it takes.

However, “vastly more difficult” doesn’t mean impossible. In Urbana’s case, one of the outcomes of the Task Force was the City’s hiring of a police analyst. As one of her first tasks, she tackled the question of whether traffic stops by race were disproportionate.

So now the hard part… how do you define “disproportionate?” Ultimately, traffic accidents were used as a benchmark, assuming they are the most accurate cross-section of drivers that exists. Census data is less accurate not only because it’s increasingly out of date, but also because Urbana’s businesses and university attract drivers who may not live here permanently. This is explained in Section 2 of her report.

Once the question is properly framed, a meaningful comparison can be made of the ratios produced from two simple queries of demographics: traffic stops by race and accidents by race. At the end of the process the data showed that African American drivers were stopped disproportionately often relative to their share of accidents.
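To make the comparison concrete, here is a minimal sketch of that kind of ratio calculation. The counts are invented placeholders, not Urbana’s numbers, and the group labels and function names are my own. The real figures and method are in the analyst’s report.

```python
from typing import Dict


def share_by_group(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw counts per group into each group's share of the total."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def disproportion_ratio(stops: Dict[str, int],
                        benchmark: Dict[str, int]) -> Dict[str, float]:
    """Share of stops divided by share of the benchmark, per group.

    Groups with no benchmark share are skipped rather than divided by zero.
    """
    stop_share = share_by_group(stops)
    bench_share = share_by_group(benchmark)
    return {
        group: stop_share[group] / bench_share[group]
        for group in stop_share
        if bench_share.get(group, 0) > 0
    }


# Placeholder counts for illustration only (NOT real data).
stops = {"group_a": 300, "group_b": 700}
accidents = {"group_a": 200, "group_b": 800}

ratios = disproportion_ratio(stops, accidents)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

A ratio above 1.0 means a group makes up a larger share of stops than of the benchmark; a ratio below 1.0 means the opposite. Everything interesting is in the choice of benchmark, which is exactly why framing the question was the hard part.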

So the next logical question is: "why?" And that makes the traffic stop data a good example of the second need as well: you need data captured in a useful way.

I want you... I want you so bad... It's driving me mad... It's driving me mad.

The arrest data indicates what violations people were charged with as the result of the traffic stop, but before 2015 it didn’t capture why the person was originally pulled over. Thus, in 2015 the Task Force recommended a new data element to capture the “Motivation for Stop”.  (See Section 3.1 of the report for more on this.)

This is an example of adding data to meet reporting needs, and in many cases that’s all we can do. But we can’t overdo this – every new element directly impacts data entry.  And it’s worth noting that it often only gets you data from that point forward. In this case, the only way to find out the Motivation for Stop retroactively was to read every police report.  Any logic that tried to infer it from the existing data would be impossibly complex, and probably still inaccurate, plus you’re increasing the risk of human error. (Footnote 1)

By the way, the Cannabis example also benefitted from good data – and represents the value in strong data entry procedures. Impressively, our records staff updated the Ordinance Violation record with the actual amount paid, which was not known until after the fine was paid to the Finance Department. (Finance couldn’t help in this case because they don’t distinguish the types of violations when they book the revenue – it’s all just “Ordinance Violations”.)

And though she thought I knew the answer.. Well, I knew that I could not say

Which brings us to the third need, the permission to use the data. This was taken for granted in the examples above because it was City IT requesting data from City Departments for City Council. But it’s a challenge for anyone else who wants to work with government data. Speaking from the inside on this, I feel that government cannot be too careful about protecting privacy, with the acknowledged trade-off being a lack of openness with data. (I’m not even going to talk about redaction here… except for Footnote 2.)

Our default approach to privacy is to obscure all Personally Identifiable Information. What that means in practice is we try to ignore the people in the data. We don’t want to know who these people are, because we are looking for trends in the aggregate from large volumes of anonymous data.
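As a rough illustration of “ignoring the people in the data,” the sketch below drops identifying fields before counting anything. The field names and sample records are hypothetical, not our actual schema, and dropping direct identifiers is only a first step: small groups can still be re-identified from what remains.

```python
from collections import Counter

# Hypothetical field names treated as personally identifiable.
PII_FIELDS = {"name", "date_of_birth", "home_address"}


def strip_pii(record: dict) -> dict:
    """Return a copy of the record without the identifying fields."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


def count_by(records: list, field: str) -> Counter:
    """Count records per value of a non-identifying field."""
    if field in PII_FIELDS:
        raise ValueError(f"refusing to aggregate on identifying field: {field}")
    return Counter(strip_pii(record)[field] for record in records)


# Made-up sample records for illustration only.
records = [
    {"name": "Person A", "date_of_birth": "1990-01-01",
     "home_address": "1 Example St", "violation": "speeding"},
    {"name": "Person B", "date_of_birth": "1985-05-05",
     "home_address": "2 Example St", "violation": "speeding"},
    {"name": "Person C", "date_of_birth": "1979-09-09",
     "home_address": "3 Example St", "violation": "expired registration"},
]

totals = count_by(records, "violation")
print(totals)
```

The analysis only ever sees the totals, which is the whole idea.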

There are grey areas, of course. The students were learning screen-scraping, which is programming that grabs data from websites and stores it. They could scrape the Sheriff’s jail log, and they wanted to learn more information about the arrestees. One idea was to match people’s information with social media data based on their name, age, and home city from the jail log.

This fails the “ick” test with me. My suggestion was not to go beyond the original data that was made public, and that if they wanted something beyond that they should request it properly, and not scrape it from the site. (It’s more accurate and a heck of a lot less work.) Which they eventually did.

Ultimately, that only got them back to the first need – now they had data, but what questions should they ask?  Because they had done this in the wrong order: asking for data first and figuring out questions later. In their defense, the end of the semester was approaching fast.

And in the end… The love you take is equal to the love you make

Back on that early day when I spoke to the students, and I got to the end of my Three Things You Need, I hoped I had sufficiently lowered their expectations. But that’s a downer, so I felt like I needed something upbeat to leave them with.

Here’s some good news, I told them. As data analysts our role is to answer questions with data, not to think up the questions.

What I didn’t tell them was the bad news: analyzing police data is extraordinarily difficult. That’s why we hire people with the specific skill sets to focus on it. Furthermore, the emotionally charged debates make it hard for people to sit back and dispassionately ask quantifiable questions. (As opposed to questions about City spending, for example.)

Well that’s a downer ending, too. So let me leave you with some upbeat thoughts.
Taking the long view, I see a general trend of improvement. The results of the Government Open Data experiment are becoming visible. The data’s getting better, people have more access to the data, and people are learning how to use the data to answer questions. We have citizen analysts, citizen journalists, and citizen data scientists. The momentum created so far will push Governments to improve what they capture and release. (Footnote 3)

Better data days are ahead – if only people can think of the right questions to ask.

He got feet down below his knee…

Footnote 1 - Human error in data analysis is something I worry about a lot. There’s room for human error at every step in the process. Bad data can come from programming bugs in the software or from bad data entry. Even with good data, people can screw up the data extraction in many ways. Sure, there are the obvious ones, like a poorly written query or a misconstructed join. Then there are the sneaky, subtle ones, like the time we were downloading to Excel spreadsheets that were cutting off after about 66,000 rows. (FYI – because the extraction tool couldn’t handle Excel formats after Excel 2007.) Thank goodness someone spotted intermittent drop-offs in the data.
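One cheap guard against that kind of silent cut-off is to compare the row count the source reports with the row count that actually landed in the export. The sketch below is an illustration, not the tool we used: the function names are mine, and it reads a CSV export for simplicity. The 65,536 figure is the per-sheet row limit of the legacy .xls format.

```python
import csv
import io

XLS_MAX_ROWS = 65_536  # per-sheet row limit of the legacy .xls format


def count_data_rows(csv_file) -> int:
    """Count rows in an open CSV file, not counting the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header, if there is one
    return sum(1 for _ in reader)


def check_export(expected_rows: int, exported_rows: int) -> str:
    """Compare the source's row count with what the export contains."""
    if exported_rows == expected_rows:
        return "ok"
    # A sheet that stops at the limit (with or without a header row)
    # is a strong hint that the export was truncated.
    if exported_rows in (XLS_MAX_ROWS, XLS_MAX_ROWS - 1):
        return "truncated at the .xls row limit"
    return f"row count mismatch: expected {expected_rows}, got {exported_rows}"


# Tiny in-memory stand-in for an exported file.
exported = io.StringIO("id,value\n1,a\n2,b\n")
print(check_export(expected_rows=3, exported_rows=count_data_rows(exported)))
```

In practice the expected count would come from a separate count query against the source system, run at the same time as the export.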

Footnote 2 - OK, let’s talk about redaction. Protecting privacy by obscuring parts of the text, audio, or video requires effort in proportion to the amount of data. It takes someone a lot of time to redact, and automated redaction is beyond our current capabilities. (Consider: would you trust your privacy to an automated redaction system?)

And let me please climb on a soapbox for a moment and demand some awareness that time spent redacting, or even analyzing if redaction is needed, is time spent not doing more productive work. It’s not that redaction isn’t important; it is essential - to protect privacy. But it’s a misuse of government resources if the request does not justify the effort. As an example: “asking every state agency to publicly release all emails that don’t require redaction” (i.e. asking them to review every e-mail to determine which ones need to be redacted).

Footnote 3 - The counterpoint to the upward trend of Open Data is the increased workload it puts on Governments. My opinion: we need to be responsive to citizens, but also to protect the privacy of the people in the data. While the larger governments can establish positions dedicated to these functions, for small and medium Governments it becomes just another ball to juggle. As a taxpayer, mull this over: in Illinois there is no charge for people, or for-profit companies, to submit as many data requests as they want if they receive the data electronically. (If they request paper copies, there is an ability to charge them 15 cents a page after 50 pages. And if you want it on a CD or DVD, you must pay only the cost of the media.) Once a request is received, we have strict deadlines for how quickly we must respond, which can mean stopping our other work to fulfill the request.

Her Majesty's a pretty nice girl, But she doesn't have a lot to say…

I feel compelled to explain the Beatles quotes and picture on this article. While writing it, I went through several titles that I didn’t like, mostly because they’d been used to death. (“Data Data Everywhere” was an early example. It’s been done.)

Walking the dog one day, playing with different titles in my mind, I hit upon the Beatles lyric that I twisted. Then came the thought of using an alternate Abbey Road picture, which was a more interesting picture by far than the heat map I was originally going to use. And finally, I thought it would be nice to add some intermediary quotes from other Abbey Road songs, because I’m a Beatles freak and that’s the kind of stuff we do. (Want more proof? There’s a whole post about Abbey Road.)

For example, “Her Majesty” was the final track on Abbey Road, and came after all the songs. It doesn’t really fit with the rest, and is sort of slapped on at the end. Kind of like this note. (The song was originally included by accident, a lovely bit of Beatles trivia.)