At Annapurna Labs, we are at the forefront of hardware/software accelerator solutions, not only for Amazon Web Services (AWS) but across the industry. The Machine Learning Acceleration Systems Firmware team is looking for candidates interested in diving deep into our designs of Machine Learning servers and developing world-class firmware to support current and future generations of accelerator silicon.
Our team designs and builds Annapurna's fleet of accelerated servers using internally designed silicon. We solve systemic hardware issues, and we build hardware and software systems to detect failures and prevent their recurrence so that our customers can experience the highest quality of service possible!
In this role, you will lead an organization of software and firmware developers to build reliable server firmware deployed on millions of accelerators across EC2. You will build AI-driven software tooling that identifies the root causes of system failures, work that directly impacts how our customers leverage AWS Trainium for their machine learning workloads.
Key job responsibilities
In this role, you will lead a team of software and firmware developers to design and develop server software at AWS scale. You'll collaborate with hardware developers and software engineers to design validation strategies that ensure reliability across our entire product line. Your days will include mentoring your team through complex technical challenges, establishing operational procedures that scale across products, and working cross-functionally to integrate design-for-excellence principles into our development process. You'll also participate in technical discussions that shape how we approach system design and validation, ensuring we catch issues before they reach customers.
This is a fast-paced, intellectually challenging position, and you’ll work with thought leaders in multiple technology areas. You’ll have high standards for yourself and everyone you work with, and you’ll be constantly looking for ways to improve your product’s performance, quality and cost. Using data and key metrics, you will also drive and measure process improvements that enhance our operational effectiveness.
A day in the life
Your day-to-day responsibilities will include interfacing with our internal and external customers to understand project requirements and facilitate system development on top of your server design. You will be responsible for learning the operational challenges facing our existing fleet, with the goal of improving the current customer experience as well as developing improved systems for future designs. You will work directly with vendors and ODM/JDM design teams to develop and manufacture your product at scale.
About the team
Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge-sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, design reviews. We care about your career growth and strive to assign projects that help you develop your engineering expertise so you feel empowered to take on more complex tasks in the future.
We're a collaborative group of software engineers and hardware developers united by a shared mission: making AWS Trainium products more reliable and easier to troubleshoot. Our team values partnership across disciplines: your success depends on building strong relationships with hardware specialists, validation engineers, and other technical leaders. We're focused on establishing best-in-class operational procedures and diagnostic capabilities that set the standard for the industry. By joining us, you'll help shape the future of how we approach system reliability and contribute to products that power some of the most demanding machine learning applications in the world.