The EC2 Infrastructure Services organization is responsible for making EC2 instances available to our customers at all times. We are a key part of what makes EC2 elastic. AI infrastructure has taken a key place in EC2, and we are building systems, services, and automation to operate it at scale.
The Software Development Engineer will design, build, and maintain cloud-based provisioning and recovery systems for AWS Trainium-based AI UltraServers. This role requires expertise in AWS services, system architecture, and cross-functional collaboration with Capacity Management, Hardware Engineering, and Datacenter Operations to manage AI/ML infrastructure.
Key job responsibilities
- Build and maintain scalable microservices
- Design systems that solve the business problem efficiently
- Work in environments where the technology strategy is defined but the solution design is not
- Build cloud-based solutions using AWS native services for scaling infrastructure frameworks
- Create observable systems with appropriate metrics and alarming
- Collaborate with customers and stakeholders to convert business needs into technical designs
- Participate in code reviews and technical assessments
About the team
The EC2 UltraServer Provisioning team is a high-performing engineering organization responsible for delivering AWS Trainium-based UltraServer infrastructure at scale. We manage end-to-end provisioning workflows from host ingestion through testing, repair, and recovery.