Site Reliability Engineer 1
Redmond, WA 
Posted 10 days ago
Job Description
Overview
Come join us in shaping the future of education! Would you like to be part of the team that is building the next-gen apps and services to revolutionize the learning experience? Want to join one of the most customer-focused organizations? Do you want to build an inclusive team culture? If so, then we might have just the right opportunity for you!

The Education engineering team serves a mission-critical vertical and acts as an incubation ground for M365 initiatives. We are creating unique experiences for students, teachers, administrators, and parents, and our goal is to empower every learner in the world to achieve more.

We are looking for a Site Reliability Engineer (SRE) with the right mix of systems engineering, software development, online services experience, and passion for quality to envision, design, and deliver our highly scalable services that aim to serve millions of teachers and students across the globe.

Our engineering culture is data-driven, dynamic, and inclusive. Team members are encouraged to explore innovative ideas, establish hypotheses, and implement them iteratively to learn and adapt quickly. We are passionate about pursuing service agility and deploy frequently to production. Our services are distributed RESTful APIs deployed in Azure and built on top of the Office 365 Substrate layer and SharePoint. Our UI experiences are built on top of ReactJs.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale; reviews existing automation code and scripts to evaluate reusability, extensibility, and scalability within an organization.
Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale.
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting complex issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams, owners, and leadership to issues with major customer/business impact and escalates resolution of the overly complex, ambiguous, and impactful issues to include other engineering teams and/or subject matter experts as needed.
Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings with more experienced engineers and members of product engineering teams.
Embody our Culture and Values
Job Summary
Company
Start Date
As soon as possible
Employment Term and Type
Regular, Full Time
Required Experience
Open