We’re looking for a Senior Site Reliability Engineer who wants to be at the technical core of an organization that’s completely reshaping how distributed applications on blockchains can reach massive audiences.
You will join a Site Reliability Engineering team that architects, builds, and iterates on resilient, scalable systems. SRE also guides the organization in areas of Observability, Reliability, and Incident Response. The support we provide to the other engineering teams enables them to deliver features that wow and delight our customers at a fast pace.
In this role, you can expect to help us launch reliable products and services with your experience and skills. You’ll join an established team with a focus on providing highly technical support to the rest of the Engineering organization. You will be leveraging infrastructure-as-code, submitting code changes via Pull Requests, and finding creative solutions for the unique and varying needs of each Engineering team. You’ll contribute to the improvement of our in-house systems by researching and applying the latest and greatest technology to our stack. You’ll be empowered to fully apply your experience, lessons learned, and technical abilities in an environment with little tech debt, no on-prem servers, and a strong foundation based on cloud-native technologies such as Kubernetes and industry-leading cloud platforms. Every day, you’ll collaborate with a world-class team both in our Vancouver office and distributed worldwide.
What we'll accomplish together
- Develop effective infrastructure (cloud platform services, networking, Kubernetes, etc.) for our projects to deploy onto, ensuring projects are scalable, resilient, and reliable in support of growing products.
- Build shared observability services, including metrics, logs, tracing, and dashboarding, and act as a center of excellence that partners with other teams to define SLOs and actionable error budgets for their services.
- Respond to infrastructure incidents and support the larger Engineering team with their product incident response strategy.
- Perform post-mortems and in-depth root cause analysis to ensure we are always improving.
- Enhance tools and automation to fill the gaps in our current systems as well as build entirely new ones as we face bigger and more complex challenges.
- On-call rotation: 1 week every 5 weeks.
A little about you:
- You execute on defined projects to achieve team-level goals and independently define the right solutions or use existing approaches to solve defined problems.
- You understand operating systems, networking, Kubernetes, and other cloud-native services, and can debug system issues and identify system bottlenecks.
- You have experience working with Infrastructure as Code systems like Terraform, Pulumi, or CloudFormation.
- You have experience collecting and processing metrics from tools such as Prometheus, Datadog, or New Relic, and are familiar with the concepts of SLIs and SLO targets.
- You are comfortable responding to production incidents and can fight fires with a calm and level head, leveraging post-mortems to apply lessons learned.
- You have experience coding and developing applications. Bonus points for Go experience.
- You are comfortable diving into an unfamiliar system and finding your way around.
- While you believe in processes and the power of planning, you understand that you will often have to roll with the punches and prioritize the most impactful tasks on the fly.
- You have a strong ability to collaborate with cross-functional teams and build solid working relationships with everyone in the organization, from individual contributors to the CEO.
- You have experience building and working on deployment systems.
- You have self-awareness about your strengths and areas for development.
At Dapper Labs, we're looking for people who are passionate about what they do.
You're encouraged to apply even if your experience doesn't precisely match the job description!
Compensation: $132,000 - $207,000 CAD base salary + stock options