
You never give me your answers... You only give me your funny data

Photo – Iain Macmillan.
Warnings about data analysis using examples from my experiences with police data - and the three things you need for data analysis.  Plus lots of Abbey Road references. 

Disclaimer: All of the ideas expressed in this article are my personal statements and opinions, and do not reflect the opinions/statements of the City of Urbana.
Getting injected into the Police data discussion has been fascinating.  I started as a City IT Director 7 months before the summer of 2014 exploded, with the deaths of Eric Garner (New York) and Michael Brown (Ferguson). As fits a progressive City, Urbana’s Traffic Stop Data Task Force was slightly ahead of the curve, convening for the first time in June, 2014. So police data was a focus for our IT Division from the beginning.

In IT, our role is querying data - hopefully to clarify issues and facilitate decision making. Unfortunately, too much of the time our job is explaining that data is NOT available. Or that the data is not structured well. (A recent example: our data on building inspections captures paragraphs of text. It’s incredibly useful when you want to read the history of a single house – but you can’t analyze across the records.)

One sweet dream came true today...

Of course, sometimes we do have the data - and it is sublimely satisfying when our work lets people better understand an issue. In March 2016, our City Council reconsidered the local ordinance for Cannabis possession. People accused the City of using these violations (a $300 ticket) to get easy money.  We queried, and found that exactly 50 people paid fines in 2015 for a total of $15,000.  Council debated and enacted a policy change to drop the fine to $50. While doing so, they knew the financial impact was an estimated loss of $12,500.
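The arithmetic behind that estimate fits in a few lines of Python. This is only an illustrative sketch: the variable names are mine, and it assumes the same number of people would be ticketed and would pay under the lower fine.

```python
# Figures from the 2015 query described above.
people_who_paid = 50
old_fine = 300  # dollars per ticket before the change
new_fine = 50   # dollars per ticket after the change

old_revenue = people_who_paid * old_fine

# Assumption: the same number of people pay under the new fine.
new_revenue = people_who_paid * new_fine
estimated_loss = old_revenue - new_revenue

print(f"2015 revenue:      ${old_revenue:,}")
print(f"Projected revenue: ${new_revenue:,}")
print(f"Estimated loss:    ${estimated_loss:,}")
```

The point is less the math than the fact that one specific question ("how much did we collect?") produced one specific number Council could weigh.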

But not every situation is this clear-cut. Recently I spoke to a University class that was partnering with the County Racial Justice Task Force. The students were excited, but unsure of where to start. I was happy to catch them at the best time: before they’d met with the Task Force.  This was an excellent opportunity to lower their expectations. (If that sounds bad to you… “to right-size their expectations.”)

Greeting them as fellow data analysts, I emphasized three things:
  1. You need questions to answer
  2. You need data captured in a useful way
  3. You need permission to use the data

He say one and one and one is three…

Start with the first point. Often when people talk about data, they are looking at a broad topic and seeking to throw data at it, hoping something sticks. But that’s not going to work. Data isn’t going to solve problems on its own. All data can do is answer specific questions, like the total amount of Cannabis fines in a year.

But if someone asks a more nebulous question like “are Police stopping too many African-American drivers?” then it’s vastly more difficult to answer. Computers need crisply framed requests. The hard part is breaking down a question into something we can query, but that’s what it takes.

However, “vastly more difficult” doesn’t mean impossible. In Urbana’s case, one of the outcomes of the Task Force was the City’s hiring of a police analyst. As one of her first tasks, she tackled the question of whether traffic stops by race were disproportionate.

So now the hard part… how do you define “disproportionate?” Ultimately, traffic accidents were used as a benchmark, assuming they are the most accurate cross-section of drivers that exists. Census data is less accurate not only because it’s increasingly out of date, but also because Urbana’s businesses and university attract drivers who may not live here permanently. This is explained in section 2 of her report.

Once the question is properly framed, a meaningful comparison can be made of the ratios produced from two simple queries of demographics: traffic stops by race and accidents by race. At the end of the process the data showed that African American drivers were stopped disproportionately to accidents.
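As a rough sketch of what that comparison looks like (not the analyst's actual method, which is in her report): each query yields counts by group, the counts become shares, and dividing one share by the other gives a ratio. The function names are hypothetical and the counts below are placeholders I made up, not Urbana data.

```python
def shares(counts):
    """Convert raw counts per group into each group's share of the total."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def disproportion_ratio(stops, benchmark):
    """Share of stops divided by share of the benchmark, per group.

    A ratio near 1.0 means a group is stopped about as often as the
    benchmark would predict; above 1.0 means more often.
    """
    stop_share = shares(stops)
    benchmark_share = shares(benchmark)
    return {g: stop_share[g] / benchmark_share[g] for g in stops}


# Placeholder counts for illustration only, NOT real data.
stops_by_group = {"group_a": 300, "group_b": 700}
accidents_by_group = {"group_a": 200, "group_b": 800}

ratios = disproportion_ratio(stops_by_group, accidents_by_group)
for group, ratio in ratios.items():
    print(f"{group}: {ratio:.2f}")
```

Notice that the queries are trivial. All of the difficulty was in choosing the benchmark, which is a judgment call no query can make for you.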

So the next logical question is: "why?" And that makes the traffic stop data a good example of the second need, also: you need data captured in a useful way.

I want you... I want you so bad... It's driving me mad... It's driving me mad.

The arrest data indicates what violations people were charged with as the result of the traffic stop, but before 2015 it didn’t capture why the person was originally pulled over. Thus, in 2015 the Task Force recommended a new data element to capture the “Motivation for Stop”.  (See Section 3.1 of the report for more on this.)

This is an example of adding data to meet reporting needs, and in many cases that’s all we can do. But we can’t overdo this – every new element directly impacts data entry.  And it’s worth noting that it often only gets you data from that point forward. In this case, the only way to find out the Motivation for Stop retroactively was to read every police report.  Any logic that tried to infer it from the existing data would be impossibly complex, and probably still inaccurate, plus you’re increasing the risk of human error. (Footnote 1)

By the way, the Cannabis example also benefitted from good data – and represents the value in strong data entry procedures. Impressively, our records staff updated the Ordinance Violation record with the actual amount paid, which was not known until after the fine was paid to the Finance Department. (Finance couldn’t help in this case; they don’t distinguish the types of violations when they book the revenue – it’s all just “Ordinance Violations”.)

And though she thought I knew the answer.. Well, I knew that I could not say

Which brings us to the third need, the permission to use the data. This was taken for granted in the examples above because it was City IT requesting data from City Departments for City Council. But it’s a challenge for anyone else who wants to work with government data. Speaking from the inside on this, I feel that government cannot be too careful about protecting privacy, with the acknowledged trade-off being a lack of openness with data. (I’m not even going to talk about redaction here… except for Footnote 2.)

Our default approach to privacy is to obscure all Personally Identifiable Information. What that means in practice is we try to ignore the people in the data. We don’t want to know who these people are, because we are looking for trends in the aggregate from large volumes of anonymous data.
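In code, "ignoring the people" might look something like the sketch below: drop the identifying fields first, then work only with counts. The field names and rows are invented for illustration; a real records system will have its own.

```python
from collections import Counter

# Hypothetical field names; a real records system will use its own.
PII_FIELDS = {"name", "date_of_birth", "home_address"}


def strip_pii(record):
    """Return a copy of the record with the PII fields left out."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


def count_by(records, field):
    """Count records per distinct value of one field."""
    return Counter(record[field] for record in records)


# Made-up rows for illustration only.
raw_records = [
    {"name": "Person A", "date_of_birth": "1990-01-01",
     "home_address": "1 Example St", "violation": "speeding"},
    {"name": "Person B", "date_of_birth": "1985-06-15",
     "home_address": "2 Example St", "violation": "speeding"},
    {"name": "Person C", "date_of_birth": "1972-11-30",
     "home_address": "3 Example St", "violation": "expired plates"},
]

anonymous_records = [strip_pii(record) for record in raw_records]
violation_counts = count_by(anonymous_records, "violation")

for violation, count in violation_counts.most_common():
    print(f"{violation}: {count}")
```

Dropping the obvious fields is only a first step. Combinations of the remaining fields can sometimes still point to one person, which is one more reason to stick to aggregates over large volumes.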

There are grey areas, of course. The students were learning screen-scraping, which is programming that grabs data from websites and stores it. They could scrape the Sheriff’s jail log, and they wanted to learn more information about the arrestees. One idea was to match people’s information with social media data based on their name, age, and home city from the jail log.

This fails the “ick” test with me. My suggestion was not to go beyond the original data that was made public, and that if they wanted something beyond that they should request it properly, and not scrape it from the site. (It’s more accurate and a heck of a lot less work.) Which they eventually did.

Ultimately, that only got them back to the first need – now they had data, but what questions should they ask?  Because they had done this in the wrong order: asking for data first and figuring out questions later. In their defense, the end of the semester was approaching fast.

And in the end… The love you take is equal to the love you make

Back on that early day when I spoke to the students, and I got to the end of my Three Things You Need, I hoped I had sufficiently lowered their expectations. But that’s a downer, so I felt like I needed something upbeat to leave them with.

Here’s some good news, I told them. As data analysts our role is to answer questions with data, not to think up the questions.

What I didn’t tell them was the bad news: analyzing police data is extraordinarily difficult. That’s why we hire people with the specific skill sets to focus on it. Furthermore, the emotionally charged debates make it hard for people to sit back and dispassionately ask quantifiable questions. (As opposed to questions about City spending, for example.)

Well that’s a downer ending, too. So let me leave you with some upbeat thoughts.
Taking the long view, I see a general trend of improvement. The results of the Government Open Data experiment are becoming visible. The data’s getting better, people have more access to the data, and people are learning how to use the data to answer questions. We have citizen analysts, citizen journalists, and citizen data scientists. The momentum created so far will push Governments to improve what they capture and release. (Footnote 3)

Better data days are ahead – if only people can think of the right questions to ask.

He got feet down below his knee…

Footnote 1 - Human error in data analysis is something I worry about a lot. There’s room for human error at every step in the process. The software could be creating bad data from programming bugs or bad data entry. Even with good data, people can screw up the data extraction in many ways. Sure, there are the obvious ones like a poorly written query or a mis-constructed join. But there are also sneaky, subtle ones, like the time we were downloading to Excel spreadsheets that were cutting off after about 66,000 rows. (FYI – because the extraction tool couldn’t handle Excel formats after Excel 2007.) Thank goodness someone spotted intermittent drop-offs in the data.
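One cheap guard against that kind of silent truncation is to compare the row count at the source with the row count in the extract. The sketch below is my own illustration, not the tool we used; the function name is hypothetical. The one hard fact in it is that the legacy .xls worksheet format tops out at 65,536 rows.

```python
XLS_MAX_ROWS = 65_536  # row limit of the legacy .xls worksheet format


def check_extract(source_count, extracted_count):
    """Return a list of warnings about an extract; an empty list means none found."""
    warnings = []
    if extracted_count != source_count:
        warnings.append(
            f"Row count mismatch: source has {source_count}, "
            f"extract has {extracted_count}."
        )
    # One row fewer than the limit covers extracts that include a header row.
    if extracted_count in (XLS_MAX_ROWS, XLS_MAX_ROWS - 1):
        warnings.append(
            "Extract size matches the legacy .xls row limit; it may be truncated."
        )
    return warnings


for message in check_extract(source_count=80_000, extracted_count=65_536):
    print(message)
```

It only works if you can get an independent count from the source system, but when you can, it takes seconds and catches exactly the kind of error nobody thinks to look for.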

Footnote 2 - OK, let’s talk about redaction. Protecting privacy by obscuring parts of the text, audio, or video requires effort in proportion to the amount of data. It takes someone a lot of time to redact, and automated redaction is beyond our current capabilities. (Consider: would you trust your privacy to an automated redaction system?)

And let me please climb on a soapbox for a moment and demand some awareness that time spent redacting, or even analyzing if redaction is needed, is time spent not doing more productive work. It’s not that redaction isn’t important; it is essential - to protect privacy. But it’s a misuse of government resources if the request does not justify the effort. As an example: “asking every state agency to publicly release all emails that don’t require redaction” (i.e. asking them to review every e-mail to determine which ones need to be redacted).

Footnote 3 - The counterpoint to the upward trend of Open Data is the increased workload it puts on Governments. My opinion: we need to be responsive to citizens, but also to protect the privacy of the people in the data. While the larger governments can establish positions dedicated to these functions, for small and medium Governments it becomes just another ball to juggle. As a taxpayer, mull this over: in Illinois there is no charge for people, or for-profit companies, to submit as many data requests as they want if they receive the data electronically. (If they request paper copies, there is an ability to charge them 15 cents a page after 50 pages. And if you want it on a CD or DVD, you must pay only the cost of the media.) Once a request is received, we have strict deadlines for how quickly we must respond, which can mean stopping our other work to fulfill the request.

Her Majesty's a pretty nice girl, But she doesn't have a lot to say…

I feel compelled to explain the Beatles quotes and picture on this article. While writing it, I went through several titles that I didn’t like, mostly because they’d been used to death. (“Data Data Everywhere” was an early example. It’s been done.)

Walking the dog one day, playing with different titles in my mind, I hit upon the Beatles lyric that I twisted. Then came the thought of using an alternate Abbey Road picture, which was a more interesting picture by far than the heat map I was originally going to use. And finally, I thought it would be nice to add some intermediary quotes from other Abbey Road songs, because I’m a Beatles freak and that’s the kind of stuff we do. (Want more proof? I wrote a whole post about Abbey Road.)

For example, “Her Majesty” was the final track on Abbey Road, and came after all the songs. It doesn’t really fit with the rest, and is sort of slapped on at the end. Kind of like this note. (The song was originally included by accident, a lovely bit of Beatles trivia.)
