The Software Development Engineer will lead the team's technical strategy and the design, build, and operation of infrastructure services, including the provisioning and availability of AWS Trainium-based AI servers. This role requires expertise in architecting large-scale systems and building microservices, as well as cross-functional collaboration with other teams, such as capacity management, hardware engineering, and datacenter teams, to manage AI/ML infrastructure.
Key job responsibilities
- Design and develop innovative technologies that power the infrastructure supporting AI workloads on UltraServers.
- Lead technical projects establishing EC2 as the pioneer in cloud computing for AI/ML workloads across diverse applications including LLMs, multimodal systems, and emerging model architectures.
- Collaborate with various teams to influence the architecture of provisioning systems and improve them to operate efficiently at scale.
- Build customer relationships by investigating complex performance challenges, developing solutions, and publishing actionable best practices through multiple channels.
About the team
The EC2 UltraServer Provisioning team is a high-performing engineering organization responsible for delivering AWS Trainium-based UltraServer infrastructure at scale. We manage end-to-end provisioning workflows from host ingestion through testing, repair, and recovery.