Security is a Big Data Problem

By brad.schulteis

Every morning, you probably follow the same routine: log in to your workstation. Accept the corporate monitoring policy. Open your email, read a few, delete some spam. Load your web browser, open the intranet. Browse the internal news, check the latest headlines. Lock the screen and walk down the hall to grab a cup of coffee.

Each of these events is tagged and stored in audit logs. (Maybe not the coffee, but with the proliferation of IoT, you never know…) It’s only five minutes of your day, but already, you’ve created an audit trail of hundreds of events.

That’s a lot of floppies!

Each event is just 1KB of data, but takes up at least 4KB on disk. So in those five minutes, you’ve already created enough log data to fill an early-’90s 5.25” floppy disk. By the end of the day, you’ve filled a 100MB Zip disk. By the end of the week, you’ve filled a 700MB CD-R. By the end of the year, you’ve filled an entire modern hard drive, an enormous amount of data.
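To make the arithmetic concrete, here is a rough sketch in Python. The 4KB-per-event figure comes from the paragraph above; the rate of 300 events per five minutes (one reading of "hundreds") and the 8-hour day, 5-day week and 52-week year are assumptions chosen purely for illustration.

```python
# Back-of-the-envelope estimate of one user's audit-log footprint.
# ASSUMPTION: 300 events per five minutes (the article only says "hundreds").
# ASSUMPTION: 8-hour day, 5-day week, 52-week year.
BYTES_ON_DISK_PER_EVENT = 4 * 1024  # "at least 4KB on disk"
EVENTS_PER_5_MIN = 300              # hypothetical rate


def footprint_bytes(events_per_5_min: int, minutes: int) -> int:
    """Bytes written to disk for the given number of minutes of activity."""
    return events_per_5_min * (minutes // 5) * BYTES_ON_DISK_PER_EVENT


def human(n_bytes: float) -> str:
    """Format a byte count using binary units (1 KB = 1024 B)."""
    for unit in ("B", "KB", "MB", "GB"):
        if n_bytes < 1024:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024
    return f"{n_bytes:.1f} TB"


DAY_MINUTES = 8 * 60
periods = {
    "five minutes": 5,
    "day": DAY_MINUTES,
    "week": DAY_MINUTES * 5,
    "year": DAY_MINUTES * 5 * 52,
}

for label, minutes in periods.items():
    print(f"{label:>12}: {human(footprint_bytes(EVENTS_PER_5_MIN, minutes))}")
# five minutes -> 1.2 MB, day -> 112.5 MB under these assumptions
```

The totals scale linearly with the assumed event rate and working hours, so treat the output as an order-of-magnitude guide rather than a measurement.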

And that’s just YOUR audit history. In an organization of tens of thousands of people, this becomes an objectively massive quantity of data. It’s a big data problem for any organization, but for the government, it’s potentially an even bigger one once you factor in mandatory retention requirements.

This is just one source of data. You also badge into your garage, building, elevator and floor. Your mobile phone joins the office WiFi. Your PC renews its IP address. Each of these systems is storing thousands of these events per second at another 4KB per transaction. Of course, this is a bit of an oversimplification, as there are technologies like compression and de-duplication that dramatically reduce the footprint of this data and increase the density. Even so, this is a big data problem.

Your organization may also be paying for copies of others’ data, in the form of threat intelligence feeds that contain what are known as Indicators of Compromise, or IOCs. These feeds are meant to enrich the raw audit trails your employees generate and to expose similar patterns within them. Another big data problem.

Why do we do this?

Non-security professionals might be thinking, why do we need to do this?

Think back to your own routine. Ninety-nine percent of the time, you perform exactly that routine, between 08:53 and 09:17, like clockwork. Having immediate access to your historical record provides a very accurate means for comparison. It allows near real-time decisions to be made as to whether a certain set of actions came from you or from a bad actor pretending to be you. The more data points we have access to, the more confident we can be in distinguishing authorized from unauthorized actions, and normal from anomalous behaviors.
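As a simplified illustration of that comparison, the sketch below checks a new login time against the window seen in a user's history. The function names and the sample history are hypothetical, and real systems would use far richer statistical baselines than a simple earliest-to-latest range.

```python
from __future__ import annotations

from datetime import time


def login_window(history: list[time]) -> tuple[time, time]:
    """Return the earliest and latest login times seen so far."""
    return min(history), max(history)


def is_outside_baseline(login: time, history: list[time]) -> bool:
    """True when a login falls outside the user's historical window."""
    earliest, latest = login_window(history)
    return not (earliest <= login <= latest)


# Hypothetical history matching the 08:53 to 09:17 routine described above.
history = [time(8, 53), time(9, 2), time(9, 10), time(9, 17)]

print(is_outside_baseline(time(9, 5), history))   # inside the usual window
print(is_outside_baseline(time(8, 50), history))  # earlier than any prior login
```

On its own, a login a few minutes outside the window is a weak signal. Its value comes from being combined with other data points.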

Say your credentials are used to log in to the corporate system at 08:50, three minutes earlier than your previous earliest login. That’s not all that alarming, but the login was to your coworker’s PC. Your badge was not used in the garage this morning, but you take mass transit some of the time. So again, that’s not itself unusual, but this is getting more interesting as we get more data points. Building logs? You never badged into the front door. Elevator? Nothing. Your floor? Try as you may to follow corporate policy regarding tailgating, you still only manage to badge in about 80 percent of the time. (Whoops!)

Stitching it all together

When we stitch together this increasingly interesting, but still largely innocuous, series of events, it gets more compelling. Cross-referencing that data with the time-off reporting system makes it readily apparent that this cannot be you. Someone else is likely using your credentials, as you’ve been on vacation halfway around the world for the past week!
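One way to picture the stitching is as a score that adds up individually weak signals. The sketch below is purely illustrative: the field names, point values and blocking threshold are all hypothetical, not taken from any particular product, and a production system would learn such weights from data rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass
class LoginContext:
    outside_usual_hours: bool
    unfamiliar_workstation: bool
    badged_garage: bool
    badged_building: bool
    badged_floor: bool
    on_approved_leave: bool


# HYPOTHETICAL point values (out of 100) for each suspicious signal.
POINTS = {
    "outside_usual_hours": 10,
    "unfamiliar_workstation": 20,
    "no_garage_badge": 5,     # weak: the user sometimes takes transit
    "no_building_badge": 20,
    "no_floor_badge": 5,      # weak: the user tailgates ~20% of the time
    "on_approved_leave": 60,  # strong: the user should not be here at all
}
BLOCK_THRESHOLD = 70  # hypothetical cut-off


def risk_score(ctx: LoginContext) -> int:
    """Sum the points for every suspicious signal, capped at 100."""
    signals = {
        "outside_usual_hours": ctx.outside_usual_hours,
        "unfamiliar_workstation": ctx.unfamiliar_workstation,
        "no_garage_badge": not ctx.badged_garage,
        "no_building_badge": not ctx.badged_building,
        "no_floor_badge": not ctx.badged_floor,
        "on_approved_leave": ctx.on_approved_leave,
    }
    return min(100, sum(POINTS[name] for name, hit in signals.items() if hit))


def should_block(ctx: LoginContext) -> bool:
    return risk_score(ctx) >= BLOCK_THRESHOLD


# The scenario from the article: early login, coworker's PC, no badge
# events anywhere, and the time-off system says the user is on vacation.
scenario = LoginContext(True, True, False, False, False, True)
print(risk_score(scenario), should_block(scenario))
```

With these made-up weights, the same events without the vacation record score 60 and fall short of the threshold, which mirrors the point above: each clue is innocuous alone, and the cross-reference is what tips the decision.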

Within seconds of your credentials being used, your entire electronic life has been profiled. Thanks to gigabytes of your audit history analyzed in perpetuity, an algorithm can make an almost instantaneous decision with nearly complete confidence that any subsequent actions involving your credentials should be blocked. Vast amounts of data were processed in an incredibly short amount of time with highly sophisticated algorithms, massive distributed computation and volume after volume of block storage: a big data solution.

Science or art?

Putting this all together is often more art than science, but it does fall within the rapidly growing field of data science, which helps answer questions such as: which events are most useful? How do I normalize and correlate different types of events from different systems? How long should an event be indexed? When should an event be archived? When MUST an event be purged? How much threat intel should I subscribe to? How much information should I share? How much storage do I need? How much compute do I need? What are my legal, contractual and regulatory requirements?

Without your own data scientist on staff, going it alone is almost a non-starter.


If your organization is looking for help, Rackspace can assist. By turning to Rackspace, you get a team of unbiased experts across a range of leading cloud and infrastructure technologies — built on a compliance-ready framework and backed by ongoing managed operations, continuous monitoring, security services, living compliance documentation and audit assistance.

We are a web-scale managed service provider, delivering 24x7x365 hybrid-cloud management, operational support and security services as a packaged, on-demand, audited and pay-as-you-go service. You get the same commercial services that power the Fortune 100, in a compliance-ready state, with the additional security controls and governance necessary for your unique mission.