A Data River Runs Through It

Photo borrowed from this site: http://chucksadventures.blogspot.com/2015/10/
How do you connect data from different systems? (Hint: not easily...) This article originally appeared in Government Finance Review magazine as "Using Data Analysis for Better Decision Making" (June 2019). I changed the title because the magazine assigned theirs, and it seems appropriate that "data rivers" feed "data lakes."

**********************************************************************
Disclaimer: All of the ideas expressed in this article are my personal statements and opinions, and do not reflect the opinions/statements of the City of Urbana.
**********************************************************************

Let’s start with the bad news: there are no magic solutions here. Data Connectivity is an aspirational goal, like Operational Efficiency.  As defined in this article, “Data Connectivity” is the act of taking information from different computer systems and combining it to gain better insights. As you have probably witnessed already, it’s harder than it sounds.

Why is Data Connectivity hard? It makes sense that computer systems should be able to talk to each other. And in fact, they can. There are endless ways for computer systems to share information, starting with the prosaic csv file (csv = comma-separated values) and going all the way through real-time direct access. The problem is the data. 

You are probably familiar with one way that systems talk: when one system imports or exports a file of information from/to a second system – an “Interface,” as we IT people like to call it. Financial systems have lots of them; they import files for posting to the General Ledger and export formats like bank files or IRS layouts. (Since this article appeared in Government Finance Review magazine, the examples will all be for Financial Systems, but the same ideas would apply no matter what data is involved.)

Data Connectivity is slightly different from an Interface - it’s the idea of pooling data from two or more systems so that you can ask questions about the combined information. Think of it as two different systems both exporting information to a third location, a reporting database, where people can run queries against it.
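As a rough illustration of that “third location,” here is a minimal sketch in Python that loads extracts from two systems into one shared SQLite reporting database. The file, table, and column names are invented for the example:

```python
import sqlite3
import pandas as pd

# Hypothetical extract files from two separate systems.
finance = pd.read_csv("finance_extract.csv")   # e.g., General Ledger postings
permits = pd.read_csv("permits_extract.csv")   # e.g., building permits

# Load both extracts into one reporting database that people can query.
conn = sqlite3.connect("reporting.db")
finance.to_sql("gl_postings", conn, if_exists="replace", index=False)
permits.to_sql("permits", conn, if_exists="replace", index=False)

# A single query can now reach the data from both source systems.
print(pd.read_sql("SELECT COUNT(*) AS permit_count FROM permits", conn))
conn.close()
```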

The challenge of Data Connectivity is finding the commonality between systems. How do you make a Financial System “speak” to a Property Database?   Or Police Arrests? Usually they have different transaction formats, reference codes, and even inconsistent code values to represent the same thing. (There’s an example later with organization codes.) The rest of this article explains how to approach this problem, although of course actual results will vary because every situation is different.

Before we begin, some terminology… Computer systems generally have two types of data: transactional data and reference codes. Transactional data is the detailed history of events, which are classified using the reference codes. Financial systems are full of transactional data: general ledger postings, invoices (paid or billed), budget requests, etc. Reference codes are the pre-defined values that group transactional data such as object codes, organization codes, vendor codes, funds, grants, etc. 

Step One – Know Your Data 

Connecting data starts with an understanding of what’s there and how it’s stored.  Actually, let’s back up – it should start with having a business question that’s worth the effort. “Let’s just mix all the data together and see what we find” is not a good approach.  Before anyone expends significant effort connecting data, make sure that there’s a clear statement of purpose and that there is a consensus from the people who understand the data that the information is available. 

Notice the key phrase “people who understand the data” – that’s the necessary ingredient to create any meaningful data connectivity. That’s especially true of Financial Systems, which have lots of transactions and have data structures that are more complex than most.  (Extracting the entire General Ledger is useless if every query against it adds up to zero!)  I’m not trying to score points with finance folks here – it’s a fact that understanding how to exclude accrual postings from a revenue query is harder than querying something concrete like the number of multi-family Building Permits.  

For example, pretend someone wanted to combine business tax collections by address with other data about the location such as crime incidents, 311 calls, etc. You need to know how to find revenue transactions in the General Ledger or a subsidiary file, depending on your system.  To get the address you might need to combine the business address from a Customer file (“reference data”) with the tax collection amounts (“transaction data”). Finally, it’s important to know if you expect the dollar amounts to be negative because they came from revenue postings.
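To make that concrete, here is a minimal sketch (pandas, with invented file and column names) of joining the Customer reference data to the tax collection transactions and flipping the sign on the revenue amounts:

```python
import pandas as pd

# Hypothetical extracts: the Customer reference file and revenue transactions.
customers = pd.read_csv("customers.csv", dtype={"customer_id": str})
collections = pd.read_csv("tax_collections.csv", dtype={"customer_id": str})

# Join the transaction data to the reference data on the shared customer ID.
combined = collections.merge(
    customers[["customer_id", "address"]], on="customer_id", how="left"
)

# Revenue postings often arrive as negative amounts; flip the sign for reporting.
combined["collected"] = -combined["amount"]

# Total collections by business address.
by_address = combined.groupby("address", as_index=False)["collected"].sum()
print(by_address.head())
```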

It’s not easy, which is why the first step of connecting the data is to know what you’re dealing with!

Step Two – Find the Commonalities

OK, let’s assume that there’s a worthwhile business question and the appropriate data has been extracted from each system. Now, you must find the common reference points that let you tie the sets of data together.  

Let’s go back to the example of business tax collections – we want to link collections to other geographic data about the business location. Street Address is a widely used data point, but also demonstrates a common problem: the same information can be stored in many ways!  One system might have street numbers in a separate field, while another system might mix them in one field with the street name. Worse are street directions… even if they show up, they may be inconsistent: “W Main” vs. “W. Main” vs. “West Main”. (If addresses are a total mess, then the best approach is to pass them all through a geo-coding process, and let it try to match them all to geographic coordinates.)
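For lighter cases, a small normalization routine can smooth out the direction and punctuation differences before matching. Here is a rough sketch in Python; the rules are hypothetical and would need tuning against local data:

```python
import re

# Map common direction spellings onto one standard form; expand as needed.
DIRECTIONS = {"west": "W", "w": "W", "east": "E", "e": "E",
              "north": "N", "n": "N", "south": "S", "s": "S"}

def normalize_address(addr: str) -> str:
    """Uppercase, strip punctuation, and standardize direction words."""
    tokens = re.sub(r"[.,]", " ", addr).split()
    return " ".join(DIRECTIONS.get(t.lower(), t.upper()) for t in tokens)

# "W Main", "W. Main", and "West Main" all normalize to the same value.
assert normalize_address("W Main") == normalize_address("W. Main") \
       == normalize_address("West Main") == "W MAIN"
```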

Addresses are relatively easy, however, because at least there is a standard set of values that most people agree on. (Another example: purchasing commodity codes, although there are multiple standard sets.) That’s not true for most data – so matching up values is more work.

Finance data experts should expect lots of these requests. Finance Systems are chock-full of data, especially the money data that’s part of many business questions. A Finance Chart of Accounts is often wonderfully precise and can slice data in many directions: objects, organizational structure, budget lines, and other elements. The problem is that other systems might not be structured the same way.

Here’s a real-life example: Urbana’s Public Works Department delivers services across many programs, such as Traffic Signals, Facilities, Forestry, etc. Those appear as 16 organizational codes in the Financial data. However, Public Works’ Work Order system was set up with a different organizational coding structure – with 12 divisions. While some of the 12 match the 16 Finance codes, not all of them are one-to-one, so matching work orders to the program budgets requires a crosswalk that includes more detail than the Division codes, such as the type of work performed. It’s possible, but it ain’t easy. The good news is that the Operations Manager came up with the best answer: change the Work Order system to use the exact same organization codes as Finance! (When two systems clash, the bigger one that’s harder to change often wins… good news for Finance Systems.)
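Before a fix like that, the crosswalk itself can be as simple as a lookup table. The sketch below is hypothetical (made-up division names and org codes), but it shows the shape of the problem: some divisions map one-to-one, while others need a second attribute such as work type to split them:

```python
# Hypothetical crosswalk from Work Order (division, work type) to a Finance org code.
# Some divisions map one-to-one; others need the work type to split them.
CROSSWALK = {
    ("FORESTRY", None): "4110",        # one-to-one match
    ("STREETS", "SIGNALS"): "4210",    # one division feeding two Finance orgs
    ("STREETS", "PAVEMENT"): "4220",
}

def finance_org(division, work_type=None):
    """Look up the Finance org code, falling back to a division-only match."""
    return CROSSWALK.get((division, work_type)) or CROSSWALK.get((division, None))

print(finance_org("STREETS", "SIGNALS"))   # 4210
print(finance_org("FORESTRY", "MOWING"))   # 4110 (division-only match)
```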

You might find that there are no common reference points at all. If that’s the case, then that data’s probably not going to connect - not everything does! This is why you have the conversations before you spend the effort building anything.

Sometimes the result of this effort is to realize that you need to collect additional data fields in one or both systems. That’s an entirely acceptable result – as long as you start the effort now to capture it. (Because it will take time to get the data field added, let alone accumulate enough data to be meaningful.)  

Here’s another real-life example from our Public Works Department. Urbana has a Landscape Recycling Center, which recycles yard waste into mulch and other products it sells. In the Financial System, it is an Enterprise Fund and uses the Citywide object codes to track expenditures. Sales are entered in a cash register, which produces a nightly extract file that is transformed and loaded into the Finance System’s billing module to post revenue, with a different billing code assigned to each individual product sold. This gives Finance a clear picture of overall costs and detailed revenue by item. However, Public Works wants to know the profitability of each item, to determine if any should be discontinued (or repriced). The problem is that the Finance data doesn’t have the costs per item – only the costs in the aggregate. A great question was asked: Can the Work Order system data be combined with the Finance revenue to calculate the cost of each product?

It turned out that the answer was “No, we’re not tracking work at that level of detail.” That led to discussions about how that work could be tracked in the Work Order system, although there is no silver bullet solution here because the workers do many tasks all day and it’s not a great environment to carry around tablets for remote entry. (A soapbox comment: there’s a common dilemma demonstrated here between capturing better data and letting people get their jobs done.  When this happens, it’s important to find the right balance – and that requires conversations with everyone involved. It’s a sadly ironic situation if the goal of gathering data to analyze efficiency creates more inefficiency in daily work!)

Step Three – Verify the Results

If you’ve been through IT projects, you know that testing is an important part of any computer system – and data analysis is no different. It’s especially important to verify the data if you expect to share it publicly, because once data is out there it’s very difficult to retract. 

There are lots of ways for something to go wrong when creating extracts and transforming the data. “Transform” is the catch-all term we IT folks like to use for making changes to the data after it’s extracted. This could range from reformatting a date to adding new calculated fields, but it’s all still programming by humans – so mistakes are inevitable. 

The best way to test extracted data is to run queries on the data in its new home and run the same query on the original data. You should get the same results. In addition to summary totals, you should also spot check a few records field-by-field to verify the data mappings. So, in the Business Tax revenue example, you would run monthly collection totals to verify the amounts extracted and choose a few records to verify that the address and reference codes are correct compared to the originals. 
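That comparison can even be scripted. Here is a minimal sketch (pandas, with hypothetical file and column names) that compares monthly totals in the source extract against the loaded reporting copy:

```python
import pandas as pd

# Hypothetical extracts: the original source data and the transformed, loaded copy.
source = pd.read_csv("source_collections.csv", parse_dates=["posting_date"])
loaded = pd.read_csv("reporting_collections.csv", parse_dates=["posting_date"])

def monthly_totals(df):
    """Sum amounts by calendar month."""
    return df.groupby(df["posting_date"].dt.to_period("M"))["amount"].sum()

# The totals should match exactly; any non-zero difference points to records
# that were lost or mangled during the transform step.
diff = monthly_totals(source).subtract(monthly_totals(loaded), fill_value=0)
print(diff[diff != 0])
```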

Another time to verify the data is when someone finds surprising results. One of the reasons for data collection is to gain new insights, but if you find something suspicious… then don’t forget to be suspicious of the data! Here’s a true story: Urbana’s Open Data included police records pulled down from the source system as Excel files, then transformed and loaded to the public portal. A data expert in Police noticed a pattern of periodic drop-offs in incidents that no one in IT had spotted. The culprit? The Excel files were an older format that limited sheets to 65,536 rows – so any exported records beyond that limit were being lost! (I love Excel, but it has some nasty habits that make it bad for data transformations. Leading zeroes can be lost – a problem when you have reference codes like “0102” – and Excel likes to change Parcel Numbers to scientific notation. And don’t get me started on Excel Dates... really, Excel, you need your own date format? Really?)
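One way to sidestep the leading-zero and scientific-notation problems, at least when the transform step runs in code rather than in Excel, is to force those columns to be read as text. A quick pandas sketch, with hypothetical file and column names:

```python
import pandas as pd

# Read codes and parcel numbers as text so "0102" stays "0102" instead of
# becoming 102, and parcel numbers don't flip into scientific notation.
incidents = pd.read_csv("police_extract.csv",
                        dtype={"org_code": str, "parcel_number": str})
```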

Final Thoughts

This article started with the bad news that there is no magical fix to connect data, but let’s end on a more positive note about the possibilities that lie ahead. Even 10 years ago the idea of mashing up data from different systems was too expensive and complex for most government organizations, but now anyone can find open-source (i.e. free) tools to extract, merge, and host datasets. We need these tools because demand for connected data is growing - from internal managers, elected officials, and our citizens. As people gain experience with data analysis, it is only natural that they will start asking broader data questions.

You might be one of those people receiving those questions. (If so, congratulations – you’re a data scientist!) Even if you’re not getting questions yourself, you might know a lot about the Financial System, the Chart of Accounts, or other software packages run in your organization. This puts you in a position to help those requesters when the questions come. 

Your goal is to help the requesters clarify their business problems based on the available data and determine what data should be used and how to interpret it. The steps above can give a structure to the approach, but every situation is different – and the best place to start is always getting the experts and requesters in the same room to have a discussion. (Whiteboards are optional, but useful.)
 
If you want to learn more about your data, a great approach is learning the reporting tools. Try to create a query on your own and gain some experience about how the data is stored for reporting. (The good news: most commercial software does a great job making it accessible.) Most data extracts can be addressed through basic queries and an export to a commonly used data format.
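For instance, a starter exercise against a reporting copy of the General Ledger might sum postings by organization code and export the result to a csv file. The file and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical General Ledger extract: sum postings by organization code
# and export the summary to a csv file anyone can open.
gl = pd.read_csv("gl_postings.csv", dtype={"org_code": str})
summary = gl.groupby("org_code", as_index=False)["amount"].sum()
summary.to_csv("postings_by_org.csv", index=False)
```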

Let’s hope that the trend towards better use of data for decision making will continue. Data analysis can be a wonderful tool to inform and guide problem solving and can provide unexpected insights. Data analysis can also be abused, manipulated, or incorrect. Our goal as data experts is to encourage the informed use of accurate data – an aspirational goal, indeed.