Editor’s Note: In 2021, organizations will continue to adapt to the “new normal.” Across recommendation engines and analytics for sales and marketing, health records management and vaccine research in healthcare, and Industry 4.0 and logistics in manufacturing, data is at the center of it all. Data practices and use cases are constantly evolving, prompting advancements across IT operations.
What will 2021 hold for the data landscape? We asked Juan Riojas, our Chief Data Officer, and Narenda Chennamsetty, Principal Architect, to weigh in with their top five data trends and predictions for the coming year.
Data becomes a primary enabler for customer experience
Data will become the conductor of customer experience. Organizations are already using AI, testing and optimization, and real-time personalization to drive customer experience. To win at customer experience, you need to have information readily available. The business case might be personalization for better customer experience, but the actual enabler of that is data, machine learning and the data science that supports it. To deliver the hyper-personalized experiences that consumers want, organizations need to be right there with the right data points to drive the experience in the moment. And that can only be accomplished with the right technologies, ingestion frameworks and AI/machine learning capabilities.
Real-time data capture, analysis and response, also referred to as continuous intelligence, will become the norm. Netflix-type recommendation engines are going to become more mainstream. Especially after COVID-19, organizations better understand the benefits of being able to react quickly to changing user behaviors. As more users embrace online capabilities, the entire demographic has shifted from leisure (shopping, news, social media) to necessity (work, healthcare, schooling). That shift will result in more tools that go beyond data analytics to generate a better understanding of what the data means for the business, how to act on it and what insight it provides.
Advances in data engineering drive adoption
Skills and resource gaps are no longer a barrier to entry to advanced data insights. Cloud providers, like Amazon, have developed pre-built AI/machine learning solutions so that you don’t have to figure out how to put all of the puzzle pieces together yourself. Production automation tools are doing the same for the manual and engineering-dependent process of operationalizing data models. Kubeflow, for example, was designed to orchestrate machine learning workflows on Kubernetes, automating much of the work of delivering models. By layering AIOps tools on top, organizations can add operational intelligence around machine learning and data science pipelines to move from model to production faster.
The need for speed is also changing the traditional data warehouse from batch-oriented to streaming data. Batch jobs are giving way to streaming architecture. Large-scale data processing centered around batch processing has evolved from the early years of MapReduce algorithms, through Hadoop technologies, to the emergence of the widely adopted Apache Spark distributed computation framework. Not only did Spark achieve massive performance gains due to its in-memory pipeline computations, but it also came packaged with machine learning and streaming capabilities. With cloud providers augmenting Spark offerings with features like serverless execution and auto-scaling capabilities, it has become an indispensable tool in a data engineer’s toolchain.
However, there now seems to be yet another shift in the landscape, particularly in stream analytics. In contrast to Spark’s streaming approach, wherein streams (unbounded datasets) are processed as a series of micro-batches, newer stream processing engines such as Cloud Dataflow and Flink, along with the Apache Beam programming model, use a true streaming approach where data is processed record by record as it moves through a series of operators in distributed data pipelines. Not only do these processing engines carry lower operational overhead, they also unlock exciting capabilities.
Unlike Spark’s original streaming model, in which data is windowed in the order it arrives, the new engines offer advanced and flexible windowing based on the timestamp attributes of each record.
For example, in the context of analyzing clickstream data in real time, a user can have short bursts of click-event activity over an indeterminate time period on a web page. Utilizing a session window, you can capture all of that activity in one big chunk, then perform accurate analytics instead of slicing the activity into arbitrarily sized fixed windows, as a micro-batch approach would.
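To make the session-window idea concrete, here is a minimal sketch in plain Python rather than in any particular streaming engine. It assumes click events arrive as hypothetical (user_id, event_time_in_seconds) pairs, possibly out of order, and that a session ends after a chosen idle gap. The function names and the gap value are illustrative assumptions, not part of Beam, Flink or Spark.

```python
from collections import defaultdict


def session_windows(events, gap_seconds):
    """Group (user_id, event_time) clicks into per-user sessions.

    A session closes once the user has been idle for more than
    gap_seconds, measured on the event's own timestamp.
    Returns a list of (user_id, session_start, session_end, click_count).
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = []
    for user_id, stamps in by_user.items():
        stamps.sort()  # order by event time, not by arrival order
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev > gap_seconds:
                # Idle gap exceeded: close the current session, open a new one.
                sessions.append((user_id, start, prev, count))
                start, count = ts, 0
            prev = ts
            count += 1
        sessions.append((user_id, start, prev, count))
    return sessions


def fixed_windows(events, size_seconds):
    """Count clicks per user in fixed, clock-aligned windows, for comparison."""
    counts = defaultdict(int)
    for user_id, ts in events:
        window_start = ts // size_seconds * size_seconds
        counts[(user_id, window_start)] += 1
    return dict(counts)


# One burst of four clicks that straddles the 60-second boundary,
# delivered out of order, followed by a much later single click.
clicks = [("u1", 62), ("u1", 55), ("u1", 58), ("u1", 65), ("u1", 400)]

print(session_windows(clicks, gap_seconds=30))
print(fixed_windows(clicks, size_seconds=60))
```

With these sample clicks, the session grouping keeps the four-click burst together as one unit, while the 60-second fixed windows split the same burst in two simply because it crosses a window boundary.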
Google Cloud Platform (GCP) pairs Cloud Dataflow with Apache Beam to provide a unified approach to both batch and streaming. The combination provides stronger consistency and better semantics on how we process data. If, for example, you have log data from servers going into a Kafka cluster and you want each event that occurred on the application to be processed only once, you need what is called “exactly once” semantics. In a batch-oriented environment, that would require ad-hoc coding and additional software. With Dataflow and Beam, that capability is integrated into the framework. The modern data warehouse, in general, is based on entirely different technologies than before. We see organizations moving away from Microsoft SQL Server toward Redshift and Snowflake to take advantage of columnar data structures.
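As a rough illustration of the ad-hoc coding that “exactly once” processing can require when the framework does not provide it, the sketch below deduplicates events by ID before applying them. It is a simplified, in-memory assumption of how such logic might look, not a description of how Dataflow or Beam implement the guarantee. The event shape (an "id" and a "bytes" field) and the class name are hypothetical.

```python
class DedupingProcessor:
    """Apply each event's effect at most once, even if it is delivered twice.

    Delivery from a log or queue is often at-least-once, so retries can
    produce duplicates. Remembering processed IDs makes the handler
    effectively exactly-once. A real pipeline would keep this state in
    durable storage; a set is used here only to keep the sketch short.
    """

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()

    def process(self, event):
        event_id = event["id"]  # assumed unique per application event
        if event_id in self._seen_ids:
            return False  # duplicate delivery: skip it
        self._handler(event)
        self._seen_ids.add(event_id)
        return True


total_bytes = []


def record_bytes(event):
    total_bytes.append(event["bytes"])


processor = DedupingProcessor(record_bytes)

# "e1" is delivered twice, as can happen after a retry.
deliveries = [
    {"id": "e1", "bytes": 10},
    {"id": "e2", "bytes": 5},
    {"id": "e1", "bytes": 10},
]
results = [processor.process(event) for event in deliveries]

print(results)
print(sum(total_bytes))
```

The point of the sketch is the bookkeeping: someone has to track which events have already been applied, and a framework with built-in exactly-once semantics takes that burden off the pipeline author.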
We predict an uptick in the adoption of stream processing engines in 2021. For businesses to apply advancements, like NLP, graph or time-series analysis, the modern data warehouse will start leveraging AI and other advanced analytics technologies. We will no longer have one monolith where you can query and get your reports and analytics. The data warehouse will evolve into a set of tools with very different capabilities, like natural language processing, search, graph analytics, or even overlapping capabilities.
If you're a data scientist, you won't have to hurriedly learn a bunch of new technologies. Instead, just focus on what your model does and everything else is abstracted out for you. Streaming architecture will vastly reduce the latency (technical- and process-related) between data production and actionable data, enabling faster availability of information for decision making.
Data security keeps pace – in a new race
Due to the increase in malicious attacks related to the pandemic, organizations are tightening their security belts. With so many people working from home, security, access and privacy across more endpoints call for a different approach. IT teams and data professionals are working together to manage the increased exposure and risk of the growing number of endpoints across BYOD and IoT devices.
Traditional siloed approaches, such as simply limiting access to the data warehouse, are no longer sufficient. Data isn’t received in just one system anymore; it's distributed among multiple systems. One team wants to use it for operations, another team wants to use it for reporting, and another team wants to use it for data science. Organizations need a separate security action architecture that can work across all systems and centralize typical activities like authentication, key management and access management.
Unified auditing offers a service-oriented, centralized, converged system for data governance. Users can consume data from or to any type of system, no matter where the data is being processed, because the services are more abstract and not limited to a specific system. For example, a SQL server has its own users, authentication and modes. But if its data is being consumed in 50 different places by data scientists and ops teams, managing access system by system quickly becomes unwieldy.
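One way to picture a centralized approach is a single policy service that every consuming system calls, instead of each system keeping its own user list. The sketch below is a deliberately small, in-memory assumption of that idea. The class, role names, dataset names and system names are all hypothetical, and a real deployment would rely on an identity provider and durable audit storage.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """A single place to grant, check and audit data access."""

    grants: dict = field(default_factory=dict)     # role -> {(dataset, action)}
    audit_log: list = field(default_factory=list)  # every decision, any system

    def grant(self, role, dataset, action):
        self.grants.setdefault(role, set()).add((dataset, action))

    def is_allowed(self, system, role, dataset, action):
        allowed = (dataset, action) in self.grants.get(role, set())
        # Record the decision centrally, whichever system asked.
        self.audit_log.append((system, role, dataset, action, allowed))
        return allowed


policy = AccessPolicy()
policy.grant("data_scientist", "orders", "read")

# Two different systems consult the same policy.
notebook_ok = policy.is_allowed("notebook", "data_scientist", "orders", "read")
warehouse_ok = policy.is_allowed("warehouse", "data_scientist", "orders", "write")

print(notebook_ok, warehouse_ok)
print(len(policy.audit_log))
```

Because both systems ask the same service, a change to one grant takes effect everywhere, and the audit trail lives in one place rather than in 50.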
Rather than having one security policy in place, the responsibility of security will shift left toward a shared responsibility model for everybody.
Trust will be the mantra for 2021
As data becomes more of a differentiator, organizations are starting to see data as a valuable enterprise asset. But that data is a valuable asset only if it’s clean and trusted. To establish trust, you need to establish compliance, data privacy policies and security protocols, then infuse it all with intelligence and automation throughout your environment.
To build innovative customer experiences, businesses first need to win consumer trust. Misuse or mishandling of consumer data breaks a trust that can be difficult, if not impossible, to win back. Tasteful, non-invasive personalization is meant to help you out, but if users lack transparency into the digital value chain, it’s not likely they’ll share the data that you need to build the right experiences for them. Organizations need more transparency in how they collect, use, store and dispose of data, in addition to clear ways for consumers to control their own data.
Internally, to establish organizational trust, data teams need to become a trusted enabler. This means partnering closely with business teams to better understand what they need, then using that feedback loop to produce fast, accurate insights that enable decision making, product innovation and market-share gains. As AI and machine learning gain traction, how we assist adoption and enable it for services is, ultimately, tied to data trust. Dirty data yields faulty results. Data leaders can help ensure data owners understand how to support clean data that the organization can rely on to make big decisions.
Consumers newly aware of privacy issues via documentaries like The Social Dilemma and The Great Hack are focused on privacy and data use. Governments are stepping in with more privacy regulations that will roll out over the next few years. Where public policy fails, consumers will expect business to take the lead by differentiating with privacy and data trust. Imagine a fair-trade-practices type of stamp for your data that establishes and certifies strict protocols for its handling throughout its life in your organization.