8 March, 2021

Molly Struve is Lead Site Reliability Engineer at Kenna Security. Currently, her work revolves around Elasticsearch and MySQL databases, with Ruby and Ansible coming to the rescue. Learn about her experience and the difficulties of choosing a technology that truly works wonders!

What does a Lead Site Reliability Engineer do?

Before I talk about what I do as a lead, I first want to define what a Site Reliability Engineer is. Site Reliability Engineering (SRE) can mean a lot of different things depending on the company. The SRE team at Kenna is a group of developers who are focused on using software to optimize performance and ensure stability and reliability across all of our systems. When talking with our lead operations engineer, we decided that an SRE is a developer+. The plus stands for some bit of extra knowledge beyond just writing code. For me, the plus is my comprehensive understanding of how Elasticsearch works. For others, their plus might be the ability to work seamlessly with a framework like Ansible, or maybe they have a deep understanding of containers. The plus can be almost anything tech-related that would help an SRE with their job.

Another trait that I feel characterizes a good SRE is the ability to look at and understand how an entire system works. It is easy to understand small pieces of a system, but the ability to step back and conceptually understand how all the pieces fit together is key to being an SRE. Having a high-level understanding allows us to figure out a system’s weakest points and improve on them to ensure reliability across the entire system.

As the lead of the team, I am responsible not only for acting as an SRE myself by writing code, but also for deciding which projects the team will work on, based on what would benefit the company and our platform the most.

What technologies/languages do you use & prefer?

My primary and preferred language is Ruby, and that is what I have been using for all of my professional career. A lot of people give Ruby a bad rap for being slow, but it has gotten significantly faster recently. Also, how you use it can greatly affect how fast or slow it performs. It is an easy language to learn but, like all other programming languages, it takes a lifetime to master and really do well.

What project are you currently working on and how many people are in your team?

There are currently 2 other people on my team and we are looking for a 4th. The team itself has a few projects going on.

      • Upgrading Elasticsearch to 6.x. Our last upgrade was rough, so we have some extensive testing plans associated with this upgrade.
      • Defining service level objectives. Our customers are happy now, but what does that mean in terms of metrics? How fast do we need to load searches to keep customers happy? How fast does data processing need to happen? Our goal is to answer questions like these.
      • Wrangling all our new virtual private cloud (VPC) environments. A lot of our large clients want their own virtual private cloud for running Kenna. This means we have a lot of different environments. As you can imagine, working with all of them and keeping them in sync is a challenge. As our VPC numbers increase this year, my team is hoping to make working across all VPCs as seamless as possible.
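
The service level objective questions in the list above ("How fast do we need to load searches?") can be turned into a measurable check. The following is only a minimal Ruby sketch under stated assumptions: the method names, the 500 ms threshold, the 90% target, and the sample latencies are all hypothetical and are not Kenna's real objectives or data.

```ruby
# Hypothetical SLO check. The threshold, target, and sample data are
# illustrative assumptions, not figures from the interview.

# Returns the fraction of requests that finished within threshold_ms.
# An empty sample returns 0.0, so "no data" never counts as meeting the SLO.
def fraction_within(latencies_ms, threshold_ms)
  return 0.0 if latencies_ms.empty?

  latencies_ms.count { |ms| ms <= threshold_ms }.to_f / latencies_ms.size
end

# The objective is met when the observed fraction reaches the target fraction.
def slo_met?(latencies_ms, threshold_ms:, target:)
  fraction_within(latencies_ms, threshold_ms) >= target
end

# Made-up search latencies in milliseconds.
search_latencies_ms = [120, 340, 95, 800, 210, 150, 2300, 180, 99, 400]

puts fraction_within(search_latencies_ms, 500)
puts slo_met?(search_latencies_ms, threshold_ms: 500, target: 0.9)
```

Expressing the objective as "a target fraction of requests under a threshold" keeps it easy to explain to customers and easy to alert on; a real setup would more likely read these numbers from a monitoring tool than from an in-memory array.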

Is there a platform, tool, framework etc., in which you see a problem, but keep on using?

That is a GREAT question! I have so many examples of tools we used for the longest time at Kenna, and it wasn’t until we got a full-time SRE team that we finally replaced them. Each replacement has paid off tremendously! For example, a year ago we were having issues with New Relic because it was not retaining data long enough to be useful for us. Once the SRE team was formed, we took on the task of switching to Datadog, and it was the best decision we ever made. Another example of a tool that was used at Kenna for years, and that we are finally making the push to get rid of, is Resque. Resque is a background processing framework that uses Redis to process and track background jobs. Resque has become very dated and is currently not very performant for our use case, so we have recently made a push to move all our background jobs to a new framework called Sidekiq. Sidekiq is very well maintained, is constantly coming out with new features, and is much more performant than Resque. Looking ahead, the next system that will likely get replaced is our CI solution, CircleCI. As we scale, it has gotten tremendously more expensive, and we are on the hunt for a more practical solution.

You’re currently trying Ansible. Why did you choose to start this journey & what are the benefits of this tool?  

Three years ago, our Operations team made the switch from Chef to Ansible for managing all of our infrastructure. Now that I am an SRE, I work a lot more with the Operations team, so I have made it a point to learn more about Ansible so I can better understand what they do. It is also great to have another tool in my toolbox that I can use when it comes to building SRE features. Even though I am new to Ansible, I have found it very easy to understand at a high level what it is doing. It gives you the ability to run commands on multiple servers, which is incredibly convenient and powerful when you are working with and managing a lot of infrastructure.

What is the use of Elasticsearch?

Elasticsearch is the cornerstone of Kenna’s platform and is used extensively for comprehensive searching of a company’s assets and vulnerabilities. Elasticsearch is actually the reason I ended up becoming an SRE. When I joined Kenna, no one had taken ownership of it, so I decided to step up and learn everything I could about it. Becoming proficient in Elasticsearch and working a lot with it to improve performance and stability made the transition from Software Engineer to Site Reliability Engineer very natural. Elasticsearch is a great tool for when you have a lot of data that needs to be searched in complex ways very quickly. It is what allows Kenna to stand apart from its competitors, because no one else offers the kind of search speed that we do.

What’s the hardest tech task that you’ve encountered?

Oh man, it is so hard to pick just one. Usually, when I look back a few months later at something that was hard at the time, it seems so easy. I think the hardest part of my job in general is debugging a performance issue. For example, if a server that is running our application crashes, it is usually my job to figure out why. Figuring out the why involves combing through lots of logs and data and trying to piece together what exactly the server was doing at the time it crashed. Once I have figured out what it was doing, I have to take my best guess at which task caused it to crash. Putting all the pieces together involves a lot of problem solving and the ability to step back and really look at the big picture of how everything is running together. It also involves a bit of trial and error. Sometimes the cause is not easy to deduce and I have to take my best guess at fixing it. Sometimes it takes a couple of fixes before I finally solve the root cause of the problem.

Is there any question that every Lead Site Reliability Engineer should know the answer to?

One strategy that I read about in Google’s SRE book, and that I think is paramount to being a good SRE, is this: when a system breaks, the first thing you should always do is work on getting the system back online. Sometimes, as SREs, we immediately want to know WHY it happened. We need to fight the urge to figure out why until after we have the system back online.

What’s your motivational phrase that keeps your code running?  😉

Fail forward! This is a saying that our VP of Engineering has instilled in our culture at Kenna, and it has really hit home with me. Anytime we find ourselves with broken code or an upgrade gone bad, we always try to push forward. The mantra also reminds me that it’s OK to fail sometimes, but when you do, keep moving forward. Don’t ever let failure send you retreating.
