SRE Site Reliability Engineer | kloia

Description

Kloia is a recognized AWS Partner with a deep focus on Application Modernization and Digital Transition projects.

Our teams are growing rapidly and we’re hiring a Site Reliability Engineer, primarily for the managed services we provide to our customers, but also for internal projects that build a scalable and reliable platform of common services.

What does an SRE do?

At Kloia, the SRE team focuses on eliminating toil in production workloads. Our main aim is to deliver a 24x7 SLA with a ‘Follow-the-Sun’ support system and team.

Key parts of this role are to take part in the design and development process and help to make the right trade-offs between performance, cost, security and reliability, as well as to be a reliable escalation point supporting the system in production.

As an SRE you will:

- Eliminate toil through automation, re-architecting, and refactoring.
- Approach incidents with an “Automate Everything” mindset.
- Pair with software engineers to troubleshoot incidents.
- Drive complex infrastructure changes with a fantastic level of transparency and communication, with zero downtime.
- Design and implement self-healing, reliable and scalable infrastructure in a cloud-native environment.
- Guide and unblock developers across multiple teams and get the right stuff done to push their product forward.
- Define SLOs and error budgets for services destined to run in production.
- Support and be a critical part of our DevOps culture, including participation in our follow-the-sun on-call rota.
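
To make the SLO item above concrete, here is a minimal sketch of how an error budget can be derived from an availability SLO. The function name, the field names, and the sample numbers are illustrative assumptions, not a description of Kloia's tooling.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize the error budget for one SLO window.

    slo_target is the fraction of requests that must succeed (e.g. 0.999).
    All names here are hypothetical and chosen for illustration only.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")

    # The budget is the share of requests that may fail without breaching the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    consumed = failed_requests / allowed_failures if allowed_failures > 0 else 0.0

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "budget_consumed": consumed,  # 1.0 means the budget is fully spent
        "slo_breached": remaining < 0,
    }


if __name__ == "__main__":
    # Assumed sample window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget(0.999, 1_000_000, 250))
```

A team could use a summary like this to decide when to slow down risky changes, for example once most of the budget for the window has been consumed.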

Position: SRE (Site Reliability Engineer)

Location: Remote - LATAM / APAC

Level: Junior/Medior

What would an average day look like?

As part of the SRE team, you will proactively support production workloads and troubleshoot issues to identify their root causes. After an incident is fixed, you will write or review the related postmortem. You will also be expected to identify weaknesses in infrastructure and observability.

Here are a few technical challenges our team has solved. If you want an idea of what you would work on, give them a try:

What are the optimal resource allocation values in Kubernetes so that application performance is not affected?
How can we include API Gateway monitoring in APM so that we have full observability?
How can we decrease the number of database query hits?
How can we guide the development team to enable data layer caching?
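
As one possible starting point for the last two questions, the sketch below shows a small read-through cache with a time-to-live (TTL) placed in front of a data loader, so repeated reads of the same key do not reach the database. The class name, the loader function, and the TTL value are assumptions made for illustration; a production setup would more likely use a shared cache service.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ReadThroughCache:
    """Hypothetical in-process read-through cache with per-entry expiry."""

    def __init__(self, loader: Callable[[Hashable], Any],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # called only on a cache miss
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Missing or expired: load from the source and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    query_count = 0

    def load_user(user_id: Hashable) -> dict:
        # Stand-in for a real database query.
        global query_count
        query_count += 1
        return {"id": user_id}

    cache = ReadThroughCache(load_user, ttl_seconds=60.0)
    for _ in range(5):
        cache.get(42)
    print(f"database queries: {query_count}, cache hits: {cache.hits}")
```

The trade-off is staleness: a longer TTL removes more queries but serves older data, so the value should be agreed with the development team for each kind of data.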

Although it varies from customer to customer, the typical stack is entirely cloud-native and includes technologies such as AWS, Terraform, Docker/Kubernetes, Helm, ELK, Instana, OpsGenie, Node.js, Java, TypeScript, and Python.

While we don’t expect anybody to know our exact stack inside out, and you’ll be given training and help during your onboarding to become fully proficient with it, we expect you to already have a deep understanding of how Linux-based distributed systems work at scale, and to have held a similar role in the past.

Who should apply?

This role is ideal for somebody who wants to work with cutting-edge cloud infrastructure at scale and be part of a team always open to new ways of working. The ideal candidate will be passionate about automation and making infrastructure more effective, as well as have a natural flair for explaining complicated concepts in a simple and understandable way.

This all sounds great, what's it going to do for my career?

You will be exposed to new technologies in an environment that will allow you to use them at scale. All our products have a global reach, which means that everything we design has to take this into account. Our infrastructure is deployed in multiple AWS regions and it has to stay fast and reliable at all times.

We always try to solve problems at the right level of the stack, so you will have opportunities to develop both development and operations skills.

You will also be encouraged to invest in yourself and keep learning new things. For example, Friday afternoons can be used to work on different projects that are interesting to you. We also have hack days to disconnect from the day-to-day and explore new technologies and techniques.

Requirements

Fantastic communication skills
Deep familiarity with Linux-based distributed systems at scale
Experience with AWS or another cloud provider
Experience with SQL and/or NoSQL databases at scale
Experience with service lifecycle management and monitoring
Experience working as a software or platform engineer / SRE
Experience with DevOps practices and culture
A good understanding of Docker
An automation mindset

Nice to have

Experience with technologies in our stack is a strong plus, specifically:
A good understanding of Kubernetes
Experience with Terraform or other IaC tools