Staff Site Reliability Engineer


<strong>About SHEIN<br><br></strong>SHEIN is a global online fashion and lifestyle retailer, offering SHEIN branded apparel and products from a global network of vendors, all at affordable prices. Headquartered in Singapore, with more than 15,000 employees operating from offices around the world, SHEIN is committed to making the beauty of fashion accessible to all, promoting its industry-leading, on-demand production methodology, for a smarter, future-ready industry.<br><br><strong>Position Summary<br><br></strong>We are seeking a Staff Site Reliability Engineer (Official Title: Staff Site Reliability Engineer I) with deep experience operating and evolving large-scale, mission-critical systems where availability and reliability are non-negotiable. At SHEIN, Site Reliability Engineers are hybrid software and systems engineers responsible for keeping production services always on while enabling the platform to scale rapidly and safely. In this role, you will own and support complex services and infrastructure, ensuring they consistently meet reliability and performance expectations. At the Staff level, you will also provide technical leadership, influencing platform architecture, reliability strategy, and operational standards across the organization. The SRE team owns and maintains critical open-source and in-house technologies that underpin the platform and serves as a core contributor to major engineering initiatives. We are accountable for driving platform operability forward by reducing incident frequency, minimizing MTTR, and improving system resilience, efficiency, and resource utilization.<br><br>You will work closely with global, cross-functional teams to design, build, and evolve observability and operational tooling—including metrics, logs, traces, alerting, and automation—providing deep visibility into system behavior. 
Through hands-on engineering and operational excellence, you will proactively identify risks and failure modes, help prevent incidents before they occur, and lead fast, effective responses when they do. To succeed in this role, you will combine strong software engineering skills, solid expertise in Linux, networking, and distributed systems, and a passion for solving problems of scale, complexity, and reliability. Your work will directly contribute to delivering a stable, scalable, and high-performing experience for customers worldwide.<br><br><strong>Job Responsibilities<br><br></strong><ul><li>Keep SHEIN’s mission-critical production systems running 24/7/365, participating in on-call rotations and acting decisively during incidents. </li><li>Triage and resolve production incidents, driving root cause analysis and contributing to continuous improvements that reduce MTTR and prevent recurrence. </li><li>Monitor and manage capacity planning and resource utilization, partnering with cross-functional teams to ensure systems scale safely while remaining cost-effective. </li><li>Own and operate core open-source infrastructure such as APISIX, Nginx, Kubernetes, Kafka, Elasticsearch, Redis, Consul, etcd, ZooKeeper, and other large-scale distributed systems. </li><li>Design, build, and maintain observability solutions (metrics, logs, traces, alerting) to improve system visibility, reliability, and resiliency. </li><li>Automate operational workflows and eliminate manual toil through scripting, tooling, and process improvements. </li><li>Develop and maintain technical documentation, including runbooks, architecture diagrams, operational procedures, and on-call playbooks. </li><li>Work closely with global engineering teams to improve infrastructure reliability and performance through better system design and operational discipline. </li><li>Mentor senior and mid-level SREs, raising the overall technical bar and operational maturity of the team. 
</li><li>Lead efforts to modernize the platform in alignment with industry best practices and evolving technology standards.<br><br></li></ul><strong>Job Requirements<br><br></strong><ul><li>Bachelor’s degree in Computer Science, Information Systems, or a related technical discipline, or equivalent practical experience. </li><li>6+ years of experience owning and operating large-scale, high-traffic, 24/7 production systems, ideally in cloud or cloud-native environments. </li><li>Solid foundations in Linux, networking, and distributed systems, with the ability to debug complex production issues end to end. </li><li>Hands-on experience with incident response, troubleshooting, and performance optimization in distributed systems. </li><li>Strong software engineering skills with experience building automation, tooling, or platforms in languages such as Python or Go. </li><li>Experience operating or supporting open-source infrastructure components such as APISIX, Nginx, Kubernetes, Kafka, Elasticsearch, Redis, Consul, etcd, and ZooKeeper. </li><li>Experience with observability and monitoring systems (Prometheus, Grafana, Zabbix, etc.) and performance analysis. </li><li>Familiarity with Git, CI/CD pipelines, and configuration management tools (e.g., Ansible). </li><li>A strong sense of ownership, a systematic approach to problem-solving, and a passion for making systems more reliable. 
</li><li>Strong communication skills and the ability to collaborate effectively with geographically distributed teams.<br><br></li></ul><strong>Nice to Have<br><br></strong><ul><li>Bilingual fluency in Mandarin and English.</li><li>Kubernetes Administrator certification or equivalent real-world experience.</li><li>Experience operating big data platforms (Hadoop, YARN, HBase, Hive, Spark).</li><li>Experience applying AI/LLM-powered tools to reliability engineering, including designing and building automation or internal tools using AI-assisted development platforms (e.g., Claude Code).<br><br></li></ul><strong>Benefits And Perks<br><br></strong><ul><li>Bonus and RSU eligible</li><li>Healthcare (medical, dental, vision, prescription drugs) </li><li>Health Savings Account with Employer Funding </li><li>Flexible Spending Accounts (Healthcare and Dependent care) </li><li>Company-Paid Basic Life/AD&amp;D insurance </li><li>Company-Paid Short-Term and Long-Term Disability </li><li>Voluntary Benefit Offerings (Voluntary Life/AD&amp;D, Hospital Indemnity, Critical Illness, and Accident) </li><li>Employee Assistance Program </li><li>Business Travel Accident Insurance </li><li>401(k) Savings Plan with discretionary company match and access to a financial advisor </li><li>Vacation, paid holidays, floating holiday, and sick days </li><li>Employee discounts </li><li>Free weekly catered lunch </li><li>Dog-friendly office (available at select locations) </li><li>Free gym access (available at select locations) </li><li>Free swag giveaways </li><li>Annual Holiday Party </li><li>Invitations to pop-ups and other company events </li><li>Complimentary daily office snacks and beverages<br><br></li></ul>Pay Range: $108,000 USD - $180,000 USD<br><br>

