Job is no longer available

Software Developer - AWS Reliability Engineering


Dublin, Ireland


Amazon Web Services is the largest consumer cloud offering in the world, powering cutting-edge science, rapidly growing start-ups, and industry-leading companies.

The AWS Reliability Engineering team is building systems to ensure these AWS customers can rely on the highest-availability, lowest-latency cloud platform on the planet. We work closely with the teams who own the largest AWS products, building systems to detect and mitigate operational issues before they impact customers. We are looking for experienced and knowledgeable developers to help us achieve this mission.

As a Software Development Engineer on the AWS Reliability team, you will join other talented software developers in designing and implementing systems that reduce customer-impacting issues for all AWS products. You will work with teams across AWS to drive software and systems development practices for new and existing products. You will help service teams meet availability goals across AWS, and define strategies to make these goals attainable with minimal effort. Your goal will be to remove human error from the day-to-day operations of the massive, always-on, distributed systems that make up AWS. We succeed once these systems detect, diagnose, and repair operational defects without customer impact or human intervention.

Within your first year on the AWS Reliability team, you will have met developers from across AWS and contributed to the design and implementation of at least one new system. You will also have dived deep into the causes of at least one historical customer-impacting event and understood how to prevent a similar event from happening again. As your career develops, you will influence the growth and direction not only of the Reliability team, but of AWS as a whole.

If this sounds like the right challenge for you, then please apply today!


· 3 years’ experience in a large-scale software development environment
· Proficiency in Java, C/C++/C# or another high-level programming language
· Experience with distributed operational health and performance monitoring systems
· Ability to manage directly assigned tasks and on-call duties gracefully
· Ability to work in a diverse team environment


· Experience in Systems and Network Administration, DevOps or Site Reliability Engineering
· Experience specifying, designing, and/or implementing system health and performance monitoring tools
· Experience designing and/or implementing automated software testing, deployment and performance analysis systems
· Experience conducting failure mode analysis in complex distributed systems
· Experience conducting efficiency and duplication analysis across large organizations
· Experience reviewing and refining design and architecture documents presented by partner teams for operational readiness, fault tolerance and scalability
· Experience developing or furthering existing application and system management tools and processes that reduce manual efforts and increase overall efficiency
· Ability to adapt and improve operations management systems and processes to accommodate rapid and increasing growth in systems and traffic
· Experience monitoring the health of fleets, automating system health, maintenance tasks, and reporting systems as needed
· Experience with hardware load balancer administration, network optimization, or other related and demonstrable TCP-level experience

