Thought Leadership

Aug 26, 2020

Training with Chaos Engineering


Listen to this podcast on Apple Podcasts, SoundCloud or wherever you find your favorite audio content.

About ten years ago, Netflix developed a type of simulation methodology called chaos engineering. Netflix introduced “deliberate faults to distributed software systems in production to test resilience in the face of turbulent or unexpected conditions,” as described by Tech HQ.

Chaos engineering is a valuable method for training security professionals as well as testing systems. Itzik Kotler, co-founder and CTO of SafeBreach, and Matthew Dobbs, Chief Integration Architect for the IBM Security Command Center, join the Security Intelligence Podcast for a discussion about building cyber resilience through “dynamic but controlled chaos.”


Preparing for the Expected and the Unexpected

Kotler makes the point that professionals in other high-stakes fields, such as pilots and doctors, undergo real-world training scenarios and simulations. A mistake can have dire consequences in security, too, so why not train with the same approach?

“Instead of saying ‘if’ we will get hacked, we’ll change it to ‘when’ we get hacked,” Kotler says, “and then we use this mindset [to] understand it’s just a matter of time.”

Chaos engineering training confers the major benefit of teaching participants how to react and adapt to the evolving threat environment. The muscle memory that comes from simulation training helps people handle shock and make decisions in the event of an actual breach.

“It brings a way to learn to adapt to many different things that could happen — the unexpected, the expected, the combination of both,” says Dobbs of the value of chaos engineering for teams. It’s a “way to practice policies and procedures and try and find that kink in the armor that you might not have realized before something chaotic happens.”

Everybody Needs Practice

What makes for a “good” simulation? One that shows gaps in a plan or process, Dobbs says, and allows a team to improve their incident response.

And the team that can benefit from this kind of simulation training isn’t limited to the technical side of the house. Everybody from business leaders to HR to members of the board has a role to play in incident response and can benefit from having a little chaos thrown their way.

Tune in or see the episode transcript below for the full conversation, and learn more about upgrading simulation exercises for more dynamic threat environments.

Read more of the guests’ perspective on chaos engineering

Episode Transcript:

COBB: David, have you seen those insurance commercials with Dean Winters?

MOULTON: Yes, oh, yes. The chaos guy. I think my favorite is the one where he’s the baby in the back whipping everything he can, because it spoke to me.

COBB: Yes. I thought of that series of commercials when we were doing the interview on chaos engineering, and it’s the idea of dynamic but controlled chaos where it’s just, it’s confounding. Just to think like, oh, yes, you can be going down your normal process and then all of a sudden, ka‑blam! Here comes some mayhem and some chaos.

This is the Security Intelligence Podcast where we discuss cybersecurity industry analysis, tips and success stories. I’m Pam Cobb.

MOULTON: And I’m David Moulton.

COBB: I had the chance to speak with Itzik Kotler, the co‑founder and CTO of SafeBreach, and Matthew Dobbs, our Chief Integration Architect for the Cyber Range with IBM Security. And we really dived deep into chaos engineering and its applications in security training. Here’s our conversation.

COBB: I’m excited about the guests we have on the podcast today. So, Matt, would you take a minute and tell us a little bit about who you are and what you want to talk about today?

DOBBS: My name is Matthew Dobbs. I am the Chief Integration Architect for the IBM Cyber Ranges. And we just wanted to talk today about chaos engineering when it comes to security simulations and how we use that to help train security practitioners around the world.

COBB: And Itzik, what’s your role?

KOTLER: Hi. My name is Itzik Kotler and I’m the CTO and co‑founder of SafeBreach, and I’ve been dealing with offensive security for the past 15 years.

COBB: Great. So, could one of you please tell me and our audience what exactly is chaos engineering?

KOTLER: So, the idea of chaos engineering is essentially to help test resilience before an accident happens. We all understand that in the engineering realm, mistakes and problems can occur at any point in a project.

And so, when you’re looking at elements such as scale, you know, the flow of an application, the approach of introducing chaos — and again, in a controlled, orchestrated way — can help prepare the company, the product, the unit to handle those problems better and so when they happen in production the team will be better equipped to handle it either by process or by technology.

DOBBS: In the Cyber Range, we take that and include the human element: we add chaos to the process from the human side, in addition to the technical engineering. We like to throw the chaos at the participants as people as well.
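
As a concrete illustration of the technical side of this idea, here is a minimal sketch of a scoped fault injection in Python. The service names and the systemctl call are hypothetical assumptions, not any real tool's behavior; production chaos tooling, such as Netflix's Chaos Monkey, works at the level of whole instances.

```python
import random
import subprocess

# Hypothetical scope: only these non-critical services may be disrupted.
ALLOWED_TARGETS = ["cache-worker", "thumbnail-service", "report-batch"]

def inject_fault():
    """Pick a random in-scope service and restart it to simulate an outage."""
    target = random.choice(ALLOWED_TARGETS)
    print(f"Chaos experiment: restarting {target}")
    # Restarting a service is one blunt way to simulate failure; tools like
    # Chaos Monkey terminate whole instances instead.
    subprocess.run(["systemctl", "restart", target], check=False)

if __name__ == "__main__":
    inject_fault()
```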

COBB: So, where did chaos engineering come from?

DOBBS: So, the original idea that I was first exposed to came from Netflix. As they were building out their cloud infrastructure to deliver content on a massive scale, they needed a way to test that all of their systems would stay resilient when, say, a brand-new movie or series came out and they knew there would be a massive spike in usage, or when they expanded into different regions. Basically, it's a way to test what's in production in unexpected ways.

COBB: So, it feels like anticipating unexpected things is the heart of cybersecurity and it feels like a very natural extension. Can you elaborate a little bit on that?

DOBBS: Yes. At any given moment there are threat actors ranging from state governments to organized crime units to people in a basement. It could be a few teenagers who manage to take down, or get access to, Twitter, one of the largest tech companies out there.

You never know who’s going to come after you. You never know what the result is going to be, how it will impact your business or how your processes will be affected. So, it’s just a good way for security practitioners to get a look at how things could unfold in ways they had never thought about.

COBB: So, I get that chaos is just going to happen naturally because, you know, cybersecurity. But how do we artificially create dynamic but controlled chaos for this kind of engineering environment?

KOTLER: So, I think that the secret here is to define the rules of engagement, because you want to have it controlled. Chaos could be, for instance, deleting the entire hard drive of a system; maybe that could be a side effect of a real problem, but not one that the team right now is prepared to handle.

So, it’s controlled in the sense that there are rules of engagement about what the chaos can contain, but it’s dynamic in the sense that the machine or the area that will suffer from it is unknown.

And that, again, is what helps prepare and train the team and the processes — and, to some degree, the technology that wraps around them — in the sense that it’s been proactive, it’s been engaged with, it’s been tested. It creates the experience so that when it happens in production, things will be in place.

DOBBS: Right. And the same for the people as well. From the participants’ point of view, things can look very chaotic, but that chaos was developed from a script that we use, or a set of systems that break in very specific ways that the cyber simulation participants weren’t expecting. So, to them, everything is chaotic, things change, things look very weird, but it’s all designed that way.

KOTLER: Yes, they don’t control the situation. The situation is being controlled, it’s being orchestrated; there is a logic to how things unravel. But as a participant in the process, you are following it rather than controlling it. So, trying to anticipate the next steps, trying to understand how to regain control of the situation, this is the challenge.
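
This "controlled but dynamic" idea can be made concrete in a few lines of code. Here is a minimal sketch with entirely hypothetical fault names and hosts: the rules of engagement fix the menu of allowed faults, machines and durations in advance, while the specific fault and target are drawn at random, so participants can't predict what will break.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class RulesOfEngagement:
    """What the chaos may do; everything outside this is off-limits."""
    allowed_faults: tuple   # e.g., latency or a killed process, never a wiped disk
    allowed_hosts: tuple    # the blast radius is limited to these machines
    max_duration_s: int     # every fault is time-boxed

ROE = RulesOfEngagement(
    allowed_faults=("add_latency", "kill_process", "drop_packets"),
    allowed_hosts=("staging-web-1", "staging-web-2", "staging-db-replica"),
    max_duration_s=300,
)

def plan_experiment(roe: RulesOfEngagement) -> dict:
    """Controlled: the choices come from a pre-approved menu.
    Dynamic: which fault hits which host is unknown until it runs."""
    return {
        "fault": random.choice(roe.allowed_faults),
        "host": random.choice(roe.allowed_hosts),
        "duration_s": random.randint(30, roe.max_duration_s),
    }

print(plan_experiment(ROE))
```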

COBB: So, what’s the value of this kind of training for cybersecurity teams?

DOBBS: It brings a way to learn to adapt to many different things that could happen — the unexpected, the expected, the combination of both — a way to look at things outside the box of day-to-day operations, for both technical and non-technical aspects, and a way to practice policies and procedures and try and find that kink in the armor that you might not have realized was there before something chaotic happens.

KOTLER: I think that when something as shocking as a breach happens, people who haven’t been trained or prepared for it are in a state of shock, and there’s latency; as Matt mentioned, it will take them time to find their go-to approaches and become efficient. But during that time, the adversary’s chaos is taking place.

And so, we eliminate the factor of time. Instead of saying “if” we will get hacked, we change it to “when” we get hacked, and we use this mindset: if we understand it’s just a matter of time, then let’s try to create this controlled chaos experience and understand how we can, a) adapt to it, and b) optimize the way that we’re running people, processes and technology. It has a lot of benefits. It goes to damage control, it goes to expertise, it goes to the outcome, the impact.

There’s this interesting idea that in the IT and security industry, people can go ahead and make decisions: they can purchase solutions and they can configure those solutions without necessarily being trained to do so. Not trained from the vendor perspective, like understanding how to configure those solutions, but in really understanding the lay of the land.

And the idea is that, for instance, if I would like to be a pilot and fly an airplane, whether a commercial airplane or a private one, before anyone would allow me to fly it, I would have to go through intensive training and simulations, right? I would fly that airplane in a simulation first.

And the reason is that the cost of a mistake is obviously very big, right? If I don’t know what I’m doing, I might get myself killed, I might get other people killed. And so, it doesn’t make sense that I would be able to fly an airplane, or become a medical doctor and perform surgeries, without that kind of training.

In many professions, this idea, not necessarily just simulations but the understanding of how you need to do the job, and training against different troubleshooting situations or edge cases, is part of the practice before you get to do the job itself.

And I think that today, when we’re looking at what companies stand to lose from a breach (anything from their business to their employees to the data of their customers, whose only quote-unquote mistake was to use that service), there’s definitely cause to ask: should we change this paradigm? Should we incorporate simulations before making changes in production, before making decisions that could have that kind of impact?

This anecdote actually came up when my partner and I went to raise funds for SafeBreach and I had to explain why you would want breach and attack simulation software. I said, because today, if you want to be an airplane pilot, you need to go through flight school, and there you will train in a simulator.

And we believe that this change of paradigm, using simulation, will also significantly change the security industry and its companies. So, it’s kind of interesting that in other professions this has always been table stakes, but not necessarily in our industry.

COBB: Who benefits from this kind of training? What kind of team do you need to have to do this?

DOBBS: So, obviously, everybody could benefit from it. But typically, we concentrate on the purely technical side, where you have your analysts and your typical security practitioners. We also try and include everybody from the business and the business process itself — technical leaders as well as business leaders, HR, the C-level, even board members from time to time.

Essentially, anyone who would be in charge of leading a response to the entire thing. So, you would want legal representatives, heads of business, public relations, HR, even call center managers and the like: the people that would lead the charge in response to some sort of cyber event.

COBB: I’d love to talk a little bit more about the environment that we’re training for. So, how does the threat environment today compare to where it was, say, a decade ago?

KOTLER: So, I think, as Matt mentioned, the adversary landscape in recent years has obviously grown. We always understood to some degree that nation-states and governments have a stake in this game, but now there are also cyber criminals and other types of operations, such as hacktivists.

Today, there are more eyes looking. Either they’re doing it from a monetary perspective, such as a ransomware attack or data leakage; or they’re doing it from an ideological perspective, because they don’t believe in the cause of the company; or it’s an insider threat, where perhaps the recent financial situation has pushed someone into doing these things to sustain their own survival.

I think that with the explosion of the technical environment, the IT, the VPNs, the laptops, the mobile phones, the attack surface has grown substantially, and so has the number of potential threat actors.

COBB: So, what kind of TTPs are our teams training against when they’re using chaos engineering or doing a Cyber Range exercise?

DOBBS: Well, so actually, that changes a lot depending on, you know, what the modern threat landscape is, what has happened most recently and what the company is most worried about.

So, for example, if you are a bank, you might not necessarily be interested in operational technology scenarios; or, if you’re a manufacturer, you don’t have any healthcare data. So, the TTPs will change based upon who is going through the exercise.

But there are a few things that are pretty common across the board that we try and get across: things like spearphishing, insider threats and misconfigurations, the common techniques that most corporations have to deal with.

And then you turn around and specialize the types of TTPs that you would see in the simulation based upon whom is in there…if whom is the correct word to say; I never know whom versus who. So, yes, we’re constantly changing up what’s in there based upon the needs of the participants.

COBB: When you’re talking about how, you know, the situation changes, one of the biggest changes that we’ve seen recently has been this, you know, shift to working from home in light of the situation with COVID‑19. So, how has that really changed the attack surface for organizations?

KOTLER: I think that working from home is a big digital transformation, and for some verticals it’s actually something completely new. Some employees didn’t have a laptop to take home; they had workstations, desktops, and now they need to work from home. Maybe they’re sharing a laptop with a family member. So, they may be doing very well on their own, but that computer has been shared, somebody downloaded software onto it, and that could be a way for the adversary to get in.

Not every company designed its infrastructure with VPN or zero trust access to those resources. Trying to expose those back-end services so the company can keep functioning is challenging. And of course, misconfiguration at these very critical junctions can prove very costly, because if it’s not only you accessing them but everyone on the Internet, then the cost of a mistake is very big.

And last but not least, if you look at companies’ traditional perimeter security and how they invest in different technologies, those defenses don’t necessarily come into play when you’re working from home with your own router and your own Internet provider; and so, again, those things are increasing your attack surface.

Those IoT devices may be the way that an adversary gets into your home network and then jumps from your laptop into the company’s back end. So, now that everybody works remotely, there’s a lot more incentive for adversaries to target end users rather than just raw infrastructure.

DOBBS: And I think one of the other things that has happened because of recent events is the massive acceleration of cloud use. When this first happened, we read stories about how certain cloud providers got massively swamped by everybody jumping in to get some sort of service started, running either stopgap or accelerated plans to move to the cloud, because everybody was working from home and companies didn’t necessarily have the people in a data center to spin up new projects or approve those sorts of things.

And naturally, as that cloud adoption accelerates (and it’s not just because of the current crisis, but a natural progression of where technology is going today), you’re seeing the ability to spin up new applications, networks and infrastructure hundreds of times faster, both technologically and through process, at a pace that security teams can’t necessarily keep up with.

In the old days, the olden days, there was a lengthy process for a development team to get new servers, new network capacity, new bandwidth allocations and things like that. Now, it’s literally within seconds that a development team can spin up a new environment, a new network, a new database.

And if it’s not done absolutely right, that database could accidentally be exposed to the Internet, or that application back end could be exposed to the Internet or to an insider who shouldn’t have access. So, it adds complexity: the attack surface of the end user working remotely is added to the new cloud infrastructure being spun up at lightning speed.

KOTLER: I want to add to what Matt is saying. This is absolutely correct, and I think that even before the work-from-home phenomenon became what it is today due to COVID, companies struggled with properly configuring their cloud infrastructure.

All the elements that Matt points out were problems even before. But now that companies are rushing into it, forced into it from a scaling perspective, the chances of those misconfigurations, or of doing things outside of best practice, are just increasing. So, definitely.
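
One way teams catch the kind of accidental exposure described here is to scan cloud configuration continuously. Below is a minimal sketch, assuming an AWS environment and the boto3 library with credentials already configured; it flags security groups that allow inbound traffic from the whole Internet. It is illustrative, not any specific product's check.

```python
import boto3  # assumes AWS credentials are already configured locally

def find_open_security_groups() -> None:
    """Flag security groups that allow inbound traffic from 0.0.0.0/0,
    the classic misconfiguration that exposes a database to the Internet."""
    ec2 = boto3.client("ec2")
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for rule in sg.get("IpPermissions", []):
            for ip_range in rule.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    # Rules covering all protocols carry no FromPort field.
                    port = rule.get("FromPort", "all")
                    print(f"{sg['GroupId']} ({sg['GroupName']}): "
                          f"port {port} open to the Internet")

if __name__ == "__main__":
    find_open_security_groups()
```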

COBB: Can you talk a little bit about the difference between chaos engineering and a traditional simulation exercise, like the ones we’ve done at the IBM Cyber Range up in the Boston area?

DOBBS: Well, I don’t necessarily know if they’re two different things as we try and practice some of those chaos engineering techniques within the Cyber Range.

So, the purest form of chaos engineering is that you have a system that’s in production and you try and break it in unique and creative ways that appear to be chaotic but are within a very well-defined scope. Whether that scope is huge or not, the chaos remains within that scope, because you don’t want to bring down an entire production system.

And we try and practice a lot of that engineering within the Cyber Range simulations themselves. We just expand it out beyond engineering and technology, into the whole of people, process and technology.

KOTLER: Yes, I would say that chaos engineering is a concept — again, traditionally brought up for testing IT resilience — and this concept is not confined to IT, it’s not confined even to computers. This concept can be applied to people, to technology, to procedures.

It works by introducing unexpected situations and getting people to react to a certain event, and then testing it and orchestrating it. That really is the Cyber Range implementing that concept for the security function.

COBB: So, what makes one of these experiences or simulations good? Like, what are the qualities of that?

DOBBS: Well, I think the best result of a simulation is when the company gets an idea of where they might have some gaps, where they thought they’d come in with a bulletproof plan, or a plan that they thought was pretty solid.

And then something happens, and they realize, oh, wait, there’s a glaring hole right here that we never noticed before. So, any time we can help a customer better their process or identify gaps, anything that helps the customer, we consider it a successful event.

KOTLER: I agree. I think that simulation eliminates personal bias in two ways. One, it could be that the team feels a particular event is not worth practicing because, in their minds, they believe they can handle it.

But then, when the simulation takes place, they become aware of how well they can really handle it and whether they really have a grip on the situation. So, one thing is the personal bias of, let’s not do that, let’s do something else.

And the second element is that things don’t necessarily go the way they believe they will. They may think that a breach will unravel along a certain path, but in reality that path is not guaranteed, there are multiple paths, and now there’s this unexpected kind of scenario that takes them into a new realm.

And to that end, I will also add that doing it continuously, practicing it and introducing it as a routine, is something that will eventually help build the resilience of the team and the organization.

DOBBS: Yes. And one of the fun things about running the Cyber Range is that it is dynamic. If someone is doing very well, there are always things we can do to throw in something bad, something chaotic, whether it’s something we’ve learned from a customer’s previous experience in a Cyber Range or something they mentioned to us that is now part of their best practice.

You know, we learn almost as much from the customer during each range event as the customer hopefully does from us, and we can take all of that and create as many different outcomes as humanly possible, probably infinite.

And as an event goes on and things are going really great for the customer, well, we can throw in a monkey wrench; and if they deal with that one, we can throw another, and we can just keep throwing them until they either run out of time for the day or we have broken something in their process, broken through the bias that Itzik was talking about.

So, it’s fun, because we can always change things up: the outcome is typically the same — the bad guys get your data — but the ways they get there differ, and so does how you respond to it.

COBB: So, Matt and Itzik, it’s been great to talk to you both. Thank you so much for coming on the podcast and sharing your expertise and helping us all learn a little bit more about chaos engineering.

DOBBS: Thank you so much for having us, and it was my pleasure.

KOTLER: Likewise, thank you very much, Pam.

COBB: So, coming out of that conversation, one of the things it really reminded me of was the idea of learning to fly a plane and getting a pilot’s license. I married into a family of aviators; my father-in-law, my mother-in-law and my husband all have private pilot’s licenses.

And the idea of training for that, where you’ve got the muscle memory — which we’ve even talked about on a previous podcast; we’ll have the link to that in the show notes — and the idea that you’re prepared in the event of: oh, an engine went out; oh, my flaps aren’t responding; oh, crap, what’s up with my rudder. You’ve trained for that, and that’s why pilots have to log so many hours before they get their license.

And interestingly enough, the idea of training for emergency response really comes into play in everyday life, too. So, plane adjacent: my husband is an aviation inspector and mechanic at a county airport. And oh gosh, about a year and a half ago, he had driven to work and parked outside his hangar. It’s the car that we got, gosh, 16 years ago; I was pregnant with our son when we got it.

And he went out to get something from the car in the middle of the day, and, you know, you’re used to airport sounds; you know what an airport sounds like. There’s buzzing, there’s flying.

MOULTON: Sure.

COBB: Yes. But then you hear something kind of weird and you’re like, oh, well, this does not sound good. And so, my husband turns around and there is a plane flying directly at his car.

MOULTON: Oh, no.

COBB: Oh, no. Now, it’s a smaller airport, so it’s typically four- and six-seater Cessnas, that kind of smaller model. And just, what do you do when a plane is flying at you, David? What should you do?

MOULTON: Run away, duck.

COBB: Exactly, you run, and then you yell to the other people that are in the hangar — I won’t get into the four-letter words that I’m sure he said — get out of the hangar, there’s a plane coming at us. And lo and behold, the plane crashes into the car.

MOULTON: Oh, no.

COBB: Now, again, we’ve had this car since I was pregnant with our first child, so like 6,000 petrified French fries flew out of it, and like half a dozen Matchbox cars. Just psssh-sh!

MOULTON: Was the mayhem guy flying the plane?

COBB: No, it was a student pilot with an instructor. I think the FAA investigation is closed; it was basically bad instruction on what to do. They were doing touch-and-goes, trying to nail the landing, and a wind gust hit right when they should have pulled up. But instead of pulling up, they kind of throttled down, and the gust carried them and it looped around. Anyway, my husband has a different car now.

MOULTON: I would think so. It’s not just going to buff out.

COBB: It did catch fire, but not in a glamorous Hollywood sort of way; fuel just spilled out of the plane and then caught on fire. Everyone is…

MOULTON: Oh, my God.

COBB: Okay. I mean, there was a concussion and some broken bones, but ultimately, everyone was okay. My husband really didn’t sleep well that night, though. So, yes, I don’t know that you need to practice having a plane come at you to know to run.

But just a pro tip, everyone, if you get nothing else from this podcast: if you see a plane coming at you, you should run. So, that was a little bit of weird news. Is there any good news for us today?

MOULTON: Well, it’s been a busy, busy time in cybersecurity. If you hadn’t noticed, there was the great Twitter hack. They’ve actually made some arrests, which I thought was amazing, because normally you don’t hear about a breach and then also hear about arrests in the same news cycle. So, a couple of folks there in Florida, and one guy, I believe, in the U.K.

COBB: Teenagers, wasn’t it?

MOULTON: Yes. It seems like maybe there’s some talent there for breaking into systems. It does strike me that even at Twitter’s level, no one’s immune from having some problem in their layered defense and their ability to respond.

And you know, this one, if I understand what I’ve read so far (and it’s still, in my opinion, early days for fully understanding what went on), was social engineering, right? It was coming in on that weak link of human trust, and they were able to really get the world’s attention.

But now they’ve got other things to do, namely court, and we’ll keep an eye on how that goes. Maybe this is a one-off where it’s that quick, but you’ve got to think that the way different teams worked together to pull this arrest off is a great model for how businesses and law enforcement can work together.

COBB: And I think the visibility on social engineering as an attack method matters; it’s serious stuff, and it can cause a lot of damage to individuals as well as businesses. Although it’s a horrible lesson for anyone to have learned, the visibility of something of this magnitude is helpful in cybersecurity practice to say: this matters, this is yet another area where we have to train our people and make sure they understand it and can defend against it.

MOULTON: Yes, it’s not the technology; it’s the process and the people that end up needing to stay apprised of what’s going on in their world and how to maybe be a little skeptical.

But it’s difficult, you know, we’re human. It’s our nature to reach out and connect with one another, so you feel for the folks that were bamboozled. But at the same time, maybe the rest of us can take a great lesson away from this and keep our guard up.

COBB: I’m going to bring it back around to January and my resolution. Don’t click stuff.

MOULTON: Don’t click stuff.

COBB: There you go. Well, that’s all we’ve got for this episode. Thanks again to Itzik and Matt for joining us for the show.

MOULTON: Subscribe wherever you get your podcasts. We’re on Apple Podcasts, Google Podcasts, SoundCloud and Spotify. Thanks for listening.
