At AWS, we're pioneering the future of cloud computing and AI acceleration through innovative hardware-software co-design. Our teams within Annapurna Labs and AWS AI are building the foundation for next-generation cloud infrastructure that serves thousands of customers worldwide, from cutting-edge startups to global enterprises.
We operate at an unprecedented scale, designing custom silicon chips, advanced networking solutions, and ML accelerators that were unimaginable just a few years ago.
Our work spans from the lowest levels of hardware abstraction to high-performance distributed training systems, creating unique opportunities for early-career engineers to make a significant impact across multiple domains.
Key job responsibilities
- Develop and optimize software for custom hardware and ML infrastructure
- Collaborate with hardware teams to understand and leverage chip architecture
- Implement and improve networking, runtime, and system-level software
- Assist in building and maintaining tools for profiling, monitoring, and debugging ML workloads
- Contribute to the development of open-source ML frameworks and infrastructure projects
- Participate in code reviews and implement best practices for software development
- Learn and apply new technologies to solve complex engineering challenges
About the team
During the application process, candidates will be routed to specific teams based on their interests and our current needs:
- The Elastic Network Adapter (ENA) team revolutionizes EC2 core networking, enabling enhanced networking capabilities across AWS's most critical compute instances. Here, you'll work with networking protocols and high-performance drivers that power millions of cloud workloads.
- Our AWS Neuron SDK team develops the complete software stack for custom ML accelerators (Inferentia and Trainium), democratizing access to AI infrastructure. This team bridges the gap between popular ML frameworks and custom hardware.
- The Machine Learning Server Software team maintains and optimizes some of the world's most advanced ML servers, focusing on the system-level software that keeps AI workloads running at peak performance. While we don't work directly on ML algorithms, we build the critical infrastructure that makes ML possible at scale.
- The SoC Hardware Abstraction Layer (HAL) team works at the intersection of hardware and software, developing the crucial middleware that manages our custom silicon chips. This team ensures our innovative hardware designs translate into reliable, high-performance solutions.