As a Senior Site Reliability Engineer / DevOps Engineer, you will have end-to-end accountability for the reliability of IT services within the internal application portfolio. A prerequisite to the role will be a “build-to-manage”, problem-solving and innovative mindset applied to the design, build, test, deploy, change and maintenance of services drawing from deep engineering expertise. Key measures of success will include service stability, effective delivery and environment instrumentation, deployment quality, technical debt reduction, asset resiliency, risk/security compliance, cost efficiency, as well as proactive and preventative maintenance mechanisms.
We know great Site Reliability Engineers and DevOps Engineers come from diverse backgrounds, so no single individual may have all the desired skills on day one. But if you are the kind of software engineer who would have loved to engineer infrastructure solutions for Stripe or Twilio APIs, or the Slack or Zendesk app, or the Snowflake or MongoDB platform – we want to talk to you.
- Participate in the overall design and implementation of secure, scalable, and fault-tolerant infrastructure
- Design and implement observability tools used to optimize systems for uptime, performance, and reliability, and provide visibility to internal teams
- Automate infrastructure provisioning, demand forecasting, and capacity planning
- Refine and expand incident response best practices, ensuring that engineers, including yourself, are able to respond efficiently when incidents occur
- Propose initial technical implementations that support architectural changes to solve scaling and performance problems
- 3+ years in a Site Reliability Engineering or DevOps Engineering position at a web-scale company
- Experience creating and editing scripts with Python or Golang
- Hands-on experience with container technologies (Docker, ArgoCD, Helm, Borg, etc.) and microservice architectures
- Experience with monitoring and observability tools and applications, such as Splunk, Datadog, New Relic, AppDynamics, Elasticsearch, etc.
- Experience implementing AWS/GCP/Azure services in a variety of distributed computing environments
- Proven ability to debug and troubleshoot performance issues across the stack
- Experience working with development teams in a Scrum environment
- Fluent in spoken and written English communication