But First... Let Me Take a Selfie

Excel 3D Map of my drives home. Colors are waypoints and height is the trip duration. Read on to see what the hell this is.

This piece began before the pandemic, but the use of our phones as tracking devices for contact tracing prompts tough questions about privacy and tracking that make this timelier than I expected. See the middle section for thoughts on this.

******************************************************************************

Disclaimer: All of the ideas expressed in this article are my personal statements and opinions, and do not reflect the opinions/statements of the City of Urbana.

******************************************************************************

“OK Google… Track Me All Day Long”

Google Location History logs your every move, every day – if you turn it on. (To their credit, it’s off by default.) It’s worth asking - why would anyone want to do this? Well, I did it for a few years because I was curious about the results, in a narcissistic kind of way. What would it say about me? (Spoiler alert: not much but see below for details.) But also, as a data geek: what kind of data do they collect?

Last July I stopped tracking after reviewing some of the results. Since it wasn’t that interesting (blame me, not Google) it didn’t seem worth it to let any company track me wherever I go while my phone is on. And Apple people, don’t feel smug: with the iPhone 11 you had to wait until 2020 for an iOS release that let you completely turn off tracking.

Reviewing my travels while Sheltering in Place

Stuck at home during the pandemic, I downloaded the results and did some final analysis before I purge the data. Why? Because the data's there, and it seemed like a good opportunity to get some practice with different tools. And hey, did I mention the narcissism?

I’ve broken the rest of this post into three sections: what I found out, some thoughts on privacy and tracking - especially as it relates to the pandemic, and more detail about how I did this analysis. Detail about parsing the JSON files will come last because I’m a realist about attention spans.

What I Found Out About Myself

Ultimately, nothing unexpected. I spend most of my time at (in order): home, work, exercising, restaurants, temple, movie theaters, Scout meetings, coffee shops, and beer establishments. That’s a list I could have made even without Google! I had to dig deep to find even a few tidbits of interest.

My Preferred Movie Theater

Focusing on the years 2016-2018 (see the third section below – about my methodology - for why I used this range), my attendance by movie theater shifted from The Art Theater to the megaplexes over three years. Probably, this had more to do with my personal relationship with the Art during 2017-2018 than anything else. (Let’s just say that I didn’t feel welcome and leave it at that.) However, on a broader scale it could point to a general trend of lower attendance at Art House theaters due to competition from Video-on-Demand and streaming services, something I’ve written about recently: here was my dire January 2020 post, and here was an even more depressing COVID update.

Another datapoint in the Google data is the duration of your stay. Actually you must calculate the duration, but the data has the start and end times.

Adding up the data, gets this interesting fact: in 2016 I spent 2.3 days of time (55 hours) at the Art Theater, and 3.6 days in theaters overall. (In 2017 it dropped to 1.9 days overall and was back up to 3.2 days overall in 2018.) That’s a lot of movie-going!

Where Did I Spend Long Stretches of Time?

Sticking with duration, the places I spent the longest continuous times were hotels and vacation spots. Camping sites with Scouts are under-represented in this list because there was often no cell signal, so I turned my phone off. (Just one example of how seemingly clean data can be inaccurate!)

Here was one benefit of the Location History: looking over this list was a nice reminder of past events, an especially wistful sentiment during the pandemic. For my GMIS Illinois friends: the Bloomington-Normal Marriott was #1 on this list, with visits that averaged 32 hours – indicating how much I stay in the hotel continuously during the GIANTS training symposium… demonstrating the quality of the event!

Another data corruption: my home and office time weren’t anywhere close to appearing in the longest durations, for different reasons. At home, we had a landline so I shut off my phone at night. At work, it appears that different parts of our building are considered different Google locations – so moving around for meetings broke up my durations.

If I was really interested in finding out my total time spent at work (and I’m not…) I would need to aggregate the different locations at our buildings into a “work” grouping. I could do the same thing with the various coffee shops & bars I frequent, etc. to get total time in each type of business – that would be a normal step of data cleansing and standardization.

What’s the Best Route to Work?

Thinking about the questions I wanted to ask my data, this was the first one that came to mind! I live about 8 miles from my office, and I have several different ways that I can get to work or home. Which one of them is best?

Since I’m sure you’re dying to know, here are the results, but please see the Methodology section for a discussion about the convoluted process I took to determine this (believe it or not, this took many hours to figure out!):

Going to work, there was more gradation:

The fastest route was Curtis (average: 25.2 minutes, max: 29, min: 19)
Second was Windsor (average: 26.6 minutes, max: 51, min: 15)
Third was Kirby – the most direct, but it cuts through the University campus (average: 29.8 minutes, max: 46, min: 18)

Take that, Google maps, which always recommends Kirby to me!

It’s a classic case here of risk/reward. Curtis Road is the longest route, but the least populated – so it’s a more dependable trip. (Except when I got stuck at the dreaded freight train crossing.) The others have a better upside due to shorter distances but are more prone to traffic due to the University.

Coming home, the differences were slight:

The fastest (Kirby) and second place (Windsor) averaged only 1 second apart at 21.3 minutes! The ranges were close, too. Kirby: 16-32 minutes and Windsor 18-30.
Curtis brought up the rear with 23.7 minutes. The range was 20-26 minutes.

The trips home are faster and have smaller ranges because I often drive home later in the day, after other cars have cleared out. In the morning, I’m driving at the same time as most other people. (I know that my NoVa friends are grimacing that I’m worried about commutes that average 25 minutes. All I can say to that is “come live in an amazing place like Urbana-Champaign!”)

Privacy, Tracking, and You

Let’s be clear: Google is not providing Location History as a benevolent free service. I'm assuming it’s a byproduct of the tracking they’re already doing for advertising. If you shut it off, my guess is that they don’t keep the history, but they do share your current location with advertisers. I’m going to give Google the benefit of the doubt, that they’re not secretly keeping your History if you opt-out. (Although they did remove “Don’t be Evil” from their Code of Conduct. https://gizmodo.com/google-removes-nearly-all-mentions-of-dont-be-evil-from-1826153393)

Try Not to be Evil

Here’s a question to consider: do you care that Google (and its advertisers) know which businesses you frequent? What if they’re using it to send you useful coupons for stores that you like? It’s not like any person is reviewing your history and deciding to send you coupons, your place visit automatically triggers a coupon offer because the advertiser set up a rule to do so.

It's a privacy dilemma that reminds me of the use of our phone call history by the National Security Agency (NSA) during the early 2000s, made famous by Edward Snowden. Leaving aside the more egregious steps by the NSA (day zero exploits, reading e-mail contents, hacking data centers, etc.), I’m thinking about the first revelation that the public learned about: the NSA searched call histories between phone numbers to analyze the web of phone contacts.

A lot of people (myself included) had a “Meh” reaction to this. My thinking was: I’m not a terrorist, I don’t have anything to hide, and this seems like an effective tool for investigations. Setting aside the groups that DO have something to hide in their phone records - a group that goes beyond terrorists - many people objected immediately to this as a violation of privacy rights. (I’m avoiding the discussion of whether this was constitutionally legal under the PATRIOT Act, as that’s an open question AND that portion of the Act has expired.)

Ultimately, I agree with both sides: yes, it’s a privacy violation, but I believe it’s an acceptable trade-off for the benefits to our country. (Keep in mind these facts about me: I grew up around Washington DC where mine and many other families worked directly or indirectly for the government. I work for a City government now. I’ve spent my entire career on government systems. I believe that having a government is better than not having one – and I’ve spent my career trying to make government function better.)

What’s different with Google Location History is that it’s not the government doing it; it’s a corporation – although one you choose to engage with. Unlike the NSA, who could at least fall back on the argument that they were using the data to secure the country, the use of this data is to better analyze customers for advertising.

I’m going to get to the point. I think that data collection (not just location data) should follow these guidelines, in which I’m invoking the Nudge theory to keep people secure who simply accept the default settings:

It’s turned off by default.
If you turn it on, you have two options: keep it personal (the default choice) or share it with “corporate partners”.
If you choose to share your information with “corporate partners” you should receive financial benefits. (This falls under the more general theory of monetizing our personal data, which seems like the best compromise that will let companies like Google and Facebook continue to practice their business models without being regulated out of existence.)

COVID-19, Contact Tracing Apps, and Privacy

During the current pandemic, governments are pushing apps that monitor where you were for contact tracing. There are different solutions (e.g. Bluetooth handshakes or keeping your phone's geolocations) but all of them track your history and share that information so that the app can notify you when you’ve been in proximity to someone who was sick, or can trigger notifications to other people if you get sick.

How do you feel about that? Is it an acceptable trade-off of your privacy to keep your health? More altruistically, how about to notify others if you become sick?

I’m torn on this one. If we were certain that the data were anonymized and never to be used for a different purpose, then I could support this for the public good. (But how can we ever be sure? Sorry, but I live in the real-life version of IT, not the theoretical one…)

Most of the scenarios I’ve heard make the app voluntary – but what if Google and Android (not to mention Huawei…) did this for everyone, automatically? Other countries are better able to force this on their citizens than we are in America. I can’t see the US government carrying through with this.

Ultimately, this is the problem – making contract tracing apps involuntary would be the only viable way to roll these out. (Again… I deal with the real-life aspects of IT.) If it’s voluntary then you’ll have low participation – some people will defy because they don’t want to give up their privacy and others will never install the app due to technical challenges or lack of motivation. Much like herd immunity, these apps simply aren’t effective if they only cover a minority of the populace.

My Methodology

In this section, I’ll make some observations about the data – and its issues – first, because like ANY dataset this is an example where over-confidence in clean data could lead to wrong conclusions. Then I will describe how I approached this, because you might be interested in doing some of this analysis yourself.

Problems with the Data

As with most real-life datasets, the Google Location History had some problems.

First, the downloaded data was simply missing for chunks of time. I don’t think that I was shutting the service on and off (it’s possible, but I can’t recall that), but my 2014 and 2015 Location History is piecemeal: the number of locations per month ranges from 1 to 115. Definitely something was wacky from July 2014 - Oct 2015!

Starting in 2016, it became more consistent – and that’s why my analysis above only used 2016-2018. In 2019 I shut it off, so I only have part of that year.

Second, as mentioned earlier, I created some data issues through my own use. By shutting off my phone at times (at night or while camping), I impacted the results by under-counting time in those locations. It’s not as critical in this analysis, but it’s a good example of how seemingly accurate data can be skewed unless you really understand what’s in it. (This is a good place to direct you to two of my other pieces on data analysis and the problems with real-life data: You Never Give Me Your Answers... and A Data River Runs Through It. And a wonderful place to highlight an amazing data story: how the world was saved from nuclear war by one brave man’s insight that the data seemed inaccurate: https://en.wikipedia.org/wiki/1983_Soviet_nuclear_false_alarm_incident)

Just a few examples that corrupted this information:

I moved during the time covered, so my “home” location changed.
Walking around at work caused me to appear in different locations, even though I was always “at work”.
If I was trying to find my favorite lunch place, the fact that I go to 3 different Subway locations would splinter their counting. (Not that Subway is my favorite, but I do go there a lot.)

Earlier, I described the solution to all of these problems: data aggregation. Think about this as adding a column to a spreadsheet of locations visited, and then filling out a higher-level code to describe locations. I would tag both of my home addresses as “home”, all of the work locations as “work”, all of the Subways, etc. This can be a slow and laborious process (it requires considering every location value), although there are ways to automate it with the useful VLOOKUP in Excel or - in a more elegant approach - with Artificial Intelligence.

Unpacking the Google Data

This section is about dealing with the downloaded Location History. The Google data comes as a series of JSON files, one per month.

Here is an excellent article about the structure of the Google Location files – this one from gifGuide2Code helped me immensely in getting started: https://gifguide2code.com/2018/02/12/json-how-to-read-your-google-location-data/

My reaction to looking at the JSON files was (in this order):

Someone must have written something already to do this analysis – right? To a certain extent they have (https://github.com/Scarygami/location-history-json-converter and https://gifguide2code.com/2019/03/24/python-what-the-heck-is-regex/) but these are more code libraries than a simple tool.
OK, this is a code library – so what a great excuse to learn Python! But after wasting a morning loading libraries and fumbling around, I bailed. Maybe someday I’ll learn Python properly, but not now.
There have got to be other tools to read JSON, right? Yes, and one of those tools is PowerBI, of which I have a rudimentary understanding. So, I used this as a chance to dig deep on PowerBI.

Ultimately, I used PowerBI to transform the data from packed JSON files into something useful – but I found PowerBI hard to use for the queries I wanted! At that point I fell back on Excel because I am comfortable with it and I was able get through the final analysis that was taking me so long in PowerBI. I’m going to blame myself here – I’m sure that PowerBI can do what I wanted but I gave up when it got hard, because I already knew how to do it in Excel. (When you only have a hammer…)

Cracking Open the JSON Files

The first problem with the JSON file is that it’s a series of nested objects. (See the gifGuide2Code link above for more on this.) So I used the PowerQueryEditor in PowerBI to transform the file – and it worked well for this once I got the hang of it.

The first thing I had to do with the file was create a series of Expand Value steps to open the nested objects. This was as easy as clicking on the header of any field that had the icon with two arrow pointing in different directions! Nice. Once you do this, it becomes a big table that you can deal with.

To show what the steps looked like, I placed the full text of my expansion here: https://blog.tectonicspeed.com/p/expanding-google-location-history-in.html

Development tip: I worked on one month’s data at first, then I brought in all the monthly JSON files later. In the text that’s shown in the link above, the first section was added later to read all of the files in a directory. Definitely… get one month’s working first!

Two Types of Data In One Place

The next challenge is that once you expand out the nested data, you will find that there are two types of records commingled.

ActivitySegment records show a series of snapshots as you move from place to place. Since Google isn’t sure how you’re moving (WALKING, CYCLING, IN_PASSENGER_VEHICLE were common for me), it will provide the same ActivitySegment data for different methods – and give you a probability rating for each.
PlaceVisit records capture where you were, and the start and end-times. These were easier to deal with than the Activity Segment records because there was only record per PlaceVisit.

At this point, I branched into two PowerBI transformations – one for each of the two types of records to prep the data for analysis. Each of them used a simple filter statement to get rid of the other kind of record.

Data practitioners know that data is messy, so some cleanup is inevitable to make for good analysis. This data was no different:

All times are in TimeStampMS format. (This format counts milliseconds since the beginning of Jan 1, 1970 – the “epoch date”. Somewhere there is someone who was born at exactly that time… and may not realize the importance of their birthday. Fun fact, dates before 1970 are negative!) I added a step to my PowerQueryEditor to convert those, using logic I found on the Web, although first I had to convert those Time fields to Numbers. The two steps together look like this for Location Start Time:

#"Changed Type1" = Table.TransformColumnTypes(#"Filtered Rows",{{"Value.placeVisit.duration.startTimestampMs", Int64.Type}}),
#"Added Custom" = Table.AddColumn(#"Changed Type2", "Value.placevisit.duration.starttime", each #datetime(1970,1,1,0,0,0) + #duration(0,0,0,[Value.placeVisit.duration.startTimestampMs]/1000)),

PlaceVisits had the start and end times, but I had to calculate duration. Easy enough, it’s (End Time) – (Start Time).
Latitude and Longitude values are presented without a decimal. I had to divide them by 10 million.
To avoid having duplicate sets of ActivitySegment records, I filtered it to only include ones where the probability was 90 or higher. This might leave some data out of the analysis, but for the purpose of analyzing my driving routes to & from work, that was fine. It was generally clear that I was driving – the probabilities were 98 or higher. For shorter trips it was less confident, so a different analysis would need a different probability filter.

Show Me the Way(point) to Go Home...

I did my analysis of PlaceVisits first. This one was easier, although here’s where my frustration with PowerBI manifested. I knew what I wanted to do (in Excel it would be a Pivot Table) but found working with the same idea in PowerBI to be much harder. After futzing around some, this was where I bailed and fell back to Excel. In the free version of PowerBI you can’t extract to an Excel format (really, Microsoft?) so I had to export a csv, then bring it into Excel. At that point, I pivot-tabled my way to exactly the analysis wanted, such as: comparing place visits by quantity, by the sum of duration, and visits to the same place over several years.

Working with my driving routes was a bigger challenge, and required a more interesting solution.

Once I used PowerBI to filter out only the high-probability records, I had a series of records for each trip. I then had to figure out which trips to include in my analysis.

I wound up vetting the list by a combination of Activity (high probably of “IN_PASSENGER_VEHICLE”), starting and ending longitude (I drive from the West side of my area to the East side), and starting time (before 8 AM to go to work, after 5 to come home). This was effective in giving me a list of records that I was pretty confident represented me driving to work or home. Then I split those up between trips going to work and trips coming home.

So, how do I know which route I was driving on any particular day?

Here’s where it gets tricky. The ActvitySegments are a series of snapshots called “waypoints”. All waypoints from the same trip have the same start and end values for latitude, longitude, start time, and end time. But you do not know the time that the waypoint was captured! I also couldn’t find much of a pattern in the number of waypoints. See the graph below, which shows the duration of all of my ActivitySegments (X-Axis) and the number of waypoints (Y-Axis). Notice that there’s almost no correlation between the length of the trip and the number of waypoints! FYI – I found online that the maximum number of waypoints is 10, but I didn’t have any trips with more than 7 in my filtered dataset.

So that left me with data that shows an irregular series of waypoints for each drive, as separate records. Like this (Lat/Longs are truncated for privacy…)

But an irregular series of records isn’t useful. So I flattened the records into a single record per trip. This took a little more Excel work that I care to admit, but afterwards the 7 records above turned into one record, like this (for space, I’ve removed Lat/Long sets 3-6):

But still, how do I know which trips follow each route? For this one, I turned to Excel 3D Maps. If you haven’t played with this feature yet, then just know that it’s an incredibly awesome tool for visualizing geographic data.

Glossing over several steps, I’ll just say that I ultimately wound up mapping several waypoint lat/long pairs into separate layers in an Excel 3D map. Ultimately, I only did the original location and waypoints 1, 2, and 3. If I used later waypoints (like 4, 5, 6, or 7) that would be more accurate, but I would be omitting a lot of trip records that didn’t include that many waypoints.

However, if I looked at only the first waypoint, it’s also misleading. The picture below is an Excel 3D Map of waypoint 1 values. The purple squares on the map is where waypoint 1 was geo-located, and the height of the block is the duration of the ENTIRE trip that day.

Looks good right? But it’s entirely missing the Kirby Road route! When I add in waypoint 2 (green) then suddenly that route is visible, too. (It’s the East-West route North of the other two that only shows green records in the picture below.)

I decided to bring in the third waypoint for verification – that’s the picture at the top of the blog, although as you can see it introduces some “noise” where I might have veered off to other locations on the way home. The third waypoints are blue in that picture.

Ultimately, the 3D maps helped me understand two things:

That I should use waypoint 2 for my analysis to “know” which route I was taking and still maximize the number of records in my analysis.
By visually inspecting the records, I learned the latitude of the three routes I was comparing. Happily for me, they run parallel to each other, so this was easy. If you live in an area that isn't a big grid, this will be harder.

At that point, I went back to the Excel data and filtered trips based on tight ranges of latitudes for waypoint 2 to figure out which trips followed each route. Once I had the data filtered, it was just one final step to get the average times and ranges. And that was it!

Hey, if you’ve read this far, then you have my thanks! This piece started as some musing last summer about my Location History… who knew that it would explode into this monstrosity of detail - or that (thanks to the pandemic) I’d have the time to do all of this!!?

Tectonic Speed of Government

Search This Blog