Responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Creates a bridge between development and operations by applying a software engineering mindset to system administration topics. Splits time between operations/on-call duties and developing systems and software that help increase site reliability and performance.
Chaos engineering - thinks laterally about how systems might fail in theory, designs tests to demonstrate how they behave in practice, and then formulates and implements remediation plans as appropriate.
Using DevOps and GitOps practices to improve automation and processes so that self-service becomes possible.
Pushing our systems to their limits, then designing how to get them to the next performance tier.
Safeguarding reliability. Ensuring that our services are highly available, resilient against disasters, self-monitoring, and self-healing.
Running “game days” to test assumptions about reliability and learn what will break before it matters to customers.
Building systems to proactively monitor the health, performance and security of our production and non-production virtualized infrastructure.
Improving our monitoring and alerting systems to make sure engineers get paged when it matters (and don’t get paged when it doesn’t).
Troubleshooting system and network issues alongside our Technical Operations Team.
- BS in Computer Science, Information Technology, Business / Management Information Systems or related field
- No experience required. Typically has basic knowledge of programming in one or more languages, and of Unix/Linux system internals and administration (e.g. filesystems, inodes, system calls) or networking (e.g. TCP/IP, routing, network topologies and hardware, SDN).
Skills / Knowledge - Learns to use professional concepts. Applies company policies and procedures to resolve routine issues.
Job Complexity - Works on problems of limited scope. Follows standard practices and procedures in analyzing situations or data from which answers can be readily obtained. Builds stable working relationships internally.
Supervision - Normally receives detailed instructions on all work.