The infrastructure, skills and processes required to analyze data are very different from those needed to simply collect and organize it. Here’s what enterprise IT leaders need to know about building the foundations of data-driven decision making.
Enterprises have an abundance of data. Most of it is what’s known as transactional data, hoovered up by applications such as web apps, ERPs or CRM systems. Its collection helps to automate daily processes and make them more trackable or auditable. It helps companies to capture incremental sales or make incremental improvements to their systems. And it helps preserve a historical record of actions and services delivered.
But as the volume of this transactional data has grown to mountainous proportions, many companies have realized that this mountain also contains insights into trends and patterns that can guide decision making and innovation. If, that is, they can do two things: Ensure it’s of sufficient quality and get it to the right place for further analysis.
Doing this requires us to move from a transactional data collection and organization mindset – an area where most companies I encounter are already highly savvy – to an analytical mindset. Along the way, we must ensure that those using the data feel confident in it and are comfortable basing decisions on it.
This journey can be difficult. The skills, processes and infrastructures related to data analytics are very different from those related to data collection and organization. And the architecture typically needs to be purpose-built.
So if you’re one of those companies that knows they could be doing more with their data, before you jump in, let’s take a look at some of the decisions you’ll need to make on your journey to becoming data-driven.
Decision 1: Identifying your data
The first place to start is identifying the data you currently have.
This is not as obvious as it sounds. It involves understanding not only what this data is but also where it is and how you can best get to it. You also need to understand its provenance: How did it come to be there, and what decisions and processes along the way might have impacted its quality?
Quality can be a particular challenge with transactional systems because human error, shortcuts and omissions at the point of entry can accumulate over the years to have significant impacts. And as data lakes built by different teams get added to these transactional flows, their best guesses at the intentions of the original builders may also introduce quality issues.
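One way to make these accumulated entry errors visible is to profile a sample of records before deciding where the data should live. The following is a minimal sketch rather than a recommended tool: the function name `profile_records`, the field names and the sample orders are all hypothetical, and a real assessment would cover far more than blank fields and duplicate keys.

```python
from collections import Counter


def is_blank(value):
    """Treat None and whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def profile_records(records, key_field, required_fields):
    """Return simple quality metrics for a list of dict records."""
    # Count records where each required field is missing or blank.
    missing = {
        field: sum(1 for r in records if is_blank(r.get(field)))
        for field in required_fields
    }
    # Find key values that appear more than once.
    key_counts = Counter(r.get(key_field) for r in records)
    duplicate_keys = sorted(
        k for k, n in key_counts.items() if k is not None and n > 1
    )
    return {
        "total": len(records),
        "missing": missing,
        "duplicate_keys": duplicate_keys,
    }


# Hypothetical sample pulled from a transactional system.
orders = [
    {"order_id": "A1", "customer": "Acme", "amount": "120.00"},
    {"order_id": "A2", "customer": "", "amount": "75.50"},
    {"order_id": "A2", "customer": "Globex", "amount": None},
    {"order_id": "A3", "customer": "Initech", "amount": "19.99"},
]

report = profile_records(orders, "order_id", ["customer", "amount"])
print(report)
```

Even a rough count like this gives users and sponsors something concrete to discuss when deciding whether the data can be trusted.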
You also need to understand whether or not you can run your analysis on the system in which it currently resides. Typically you can’t, or shouldn’t, because of the risks posed to day-to-day operations of running these computations on database systems serving front-end applications.
So, some of the earliest decisions you make will be based on the answers to these two questions: Where do you need or want that data to live? And what are the operational factors and regulatory conditions that might influence this?
Decision 2: Identifying the opportunity
Enterprises likely have at least a dozen use cases where they suspect their data could be better leveraged. But it’s cost-prohibitive and organizationally very difficult to take on too many projects at once.
Identifying the most achievable use case with the biggest impact is an important early decision. Key questions to ask include: What does your desired end result look like? Is it better dashboards and visualizations, automating report generation for month-end financials, or leveraging predictive analytics to support management and executive decision making?
Whatever your specific use case, it likely falls into one of three broad categories with an associated user profile, which will also influence some of the decisions you make later about infrastructure. These categories are:
- BI/visualization: This use case is centered on enabling better reporting and decision making, and users will tend to be non-technical. They won’t be building features on the data lake or adding to your IT infrastructure.
- Automation and machine learning: In this use case, you might have your operational and reporting data someplace else already, but you want to make it available to machine learning processes to drive prescriptive and predictive insights. That requires making large, historical and often very specific data sets available to data scientists quickly.
- Feeding other transactional systems: This final use case is centered on making data from system A available to system B, to drive additional business processes and outcomes. A system like this will be packaging and preparing small pieces of data from the data lake, and sharing them with the destination system.
Decision 3: Identifying current and future infrastructure needs
The use case you’re looking to solve will influence your immediate technology decisions around accessing the data from your data lake, and building the pipelines for delivering that data to the relevant systems and users.
But when making those infrastructure decisions, it’s important to bear in mind the need for adaptability. It’s highly likely that in the future you’ll want to serve one of the other use cases. Organizations that start to see wins with data analytics tend to quickly develop a large appetite for more and wider applications. For example, we helped a customer in the oil and gas industry take data from an existing financial forecasting system and make it available to a wider audience via its data lake. The success of that use case soon led it to want to take that same data and put it into a specific financial modeling system for executive planning.
So your next key decision is whether to build an infrastructure around your data lake that’s used only for your specific system or use case (which is unlikely to serve you for long), or to build a foundation that accommodates future use cases, too. Key early questions to ask that help establish this adaptability mindset include: Can this same data also help build predictive models or drive automation elsewhere in the business? If so, what are the additional systems to consider?
What you ultimately decide to build will be based on the subtleties of what you want to accomplish now – and your best attempts at anticipating future uses of that data.
Common missteps to avoid
Among the most common missteps is building your data pipelines before confirming that your data is of good quality and has a clear use case. Also, be sure that when your pipelines are built, they’re built in accordance with software engineering best practices.
Assessment of data quality is tricky because it can be a subjective measure, but the minimum benchmark is that users can trust it enough to confidently make decisions based on the insights it generates. Meanwhile, carrying over best practices from software engineering is becoming more urgent as the disciplines of data analyst and developer begin to converge. For example, too often analysts-turned-developers overlook basic CI/CD processes, execute code in production systems by hand, or architect their projects in ways that make them difficult to maintain and evolve.
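As an illustration of what carrying over those practices can look like, a pipeline step can be written as a small pure function with an automated test that a CI job runs on every change, instead of being executed by hand in production. This is a sketch under stated assumptions: `monthly_revenue`, the row layout and the ISO date format are hypothetical and not drawn from any particular system.

```python
from collections import defaultdict
from decimal import Decimal


def monthly_revenue(rows):
    """Sum amounts per YYYY-MM month.

    Pure function: it touches no database or file, so it can be tested
    in isolation. Assumes ISO dates such as "2024-03-15" (hypothetical).
    """
    totals = defaultdict(Decimal)
    for row in rows:
        month = row["date"][:7]  # keep the "YYYY-MM" prefix
        totals[month] += Decimal(row["amount"])
    return dict(totals)


def test_monthly_revenue():
    """A check that a CI job could run before any deployment."""
    rows = [
        {"date": "2024-03-01", "amount": "10.50"},
        {"date": "2024-03-15", "amount": "4.50"},
        {"date": "2024-04-02", "amount": "7.00"},
    ]
    expected = {"2024-03": Decimal("15.00"), "2024-04": Decimal("7.00")}
    assert monthly_revenue(rows) == expected


test_monthly_revenue()
```

Keeping the logic separate from the systems it reads and writes is what makes the step testable, and therefore easier to maintain and evolve.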
But the golden rule, imparted to me by one of our Strategic Account Principals, is this: The number-one success factor for any data project is sustainable political will.
These are long-horizon projects. To generate and sustain the required political will for the long haul, you need to quickly show value to your users and maintain the engagement of someone at the leadership level who is willing to make the investment work.