Site Reliability Engineer, Toronto
As a Senior Site Reliability Engineer (SRE), you will play a critical role in providing technical expertise and leadership to our engineers to ensure the reliability, scalability, and maximum uptime of OANDA’s systems. You will be responsible for building, integrating, and managing tools that continue to improve the performance of OANDA’s products in on-premise and cloud environments. The ideal candidate for this role has a systematic, data-driven approach and a strong passion for reliability best practices, including observability, automation, high availability, fault tolerance, and full-lifecycle ownership.
- Champion a culture of shared service ownership within your development team.
- Tap into your passion for eliminating repetitive manual processes (toil) through automation, applying infrastructure-as-code and configuration management tools (Ansible, Terraform, Helm).
- Enable your team to make data-driven decisions by pushing monitoring, instrumentation, and observability as core tenets of our development practice.
- Demonstrate best-in-class deployment and delivery methodologies, leveraging Kubernetes, Anthos, Cloudflare, and CNCF tools to drive cloud adoption and standardization across our on-premise and cloud (AWS, GCP) environments.
- Collaborate with product managers and business stakeholders to set and maintain Service Level Objectives (SLOs) and metrics that are representative of our customer experience. Tune our approach to alerting to reduce alert fatigue.
- Help scale our security function by advocating for security best practices within your team and by working with OANDA’s security team to apply DevSecOps practices within your workflows.
- Experiment with (and lead the implementation of) new technologies and methodologies gleaned from your involvement in the global SRE, DevOps, and Cloud communities. Attend and contribute to continuing education, conferences, and seminars to stay current with industry and community trends.
- Articulate the SRE ethos to your peers and stakeholders, and educate your colleagues on applying SRE principles to achieve a healthy balance between new feature development and reliability initiatives.
- Participate in a cross-functional on-call rotation to support your code in production. Ensure the health of the on-call rotation to avoid operational overload, lead the blameless post-mortem process, and feed remediation tasks back into our development pipeline.
- Draft playbooks and conduct tabletop and chaos engineering exercises to avoid operational underload and identify opportunities for improvement.
- Ensure that our development teams and applications adhere to OANDA’s engineering standards. This includes supporting and demonstrating compliance with OANDA’s security and privacy standards.
- Set a great example and encourage others to embody the company's culture and values in their interactions with other internal teams and the general public.
Experience & Skills:
- Experience as a software developer or in an SRE-related field; a solid development background and an understanding of software development practices are necessary to be successful in this role.
- The best candidates will have experience working in cloud-native and on-premise environments, across bare-metal, virtualized (VM), and containerized / orchestrated deployments.
OANDA Global Corporation is a diverse and global team with offices around the world. We value the unique skills and experiences each individual brings to OANDA. We are committed to creating and sustaining a collegial work environment in which all individuals are treated with dignity and respect, and one which reflects the diversity of the community in which we operate. We provide an inclusive and accessible environment for everyone. Candidates selected for an interview will be contacted directly. If you require accommodation during the recruitment and selection process, please let us know. We will work with you to provide as seamless a recruitment experience as possible.