Site Reliability Engineer

  2025-06-01     Berkeley Lab     Berkeley, CA
Description:

Lawrence Berkeley National Lab's (LBNL) NERSC Division has an opening for a Site Reliability Engineer to join the team.

The National Energy Research Scientific Computing Center (NERSC) has a mission to accelerate scientific discovery through high performance computing and data analysis for the DOE Office of Science programs. NERSC provides critical HPC and data systems and support for its 10,000 users researching alternative energy sources, climate science, energy efficiency, environmental science, and other DOE mission areas. As a Site Reliability Engineer in the Operations Group, you will be a member of a 24x7 team that uses our state-of-the-art OMNI data collection and monitoring system to help ensure that NERSC is accessible, reliable, secure, and available to our scientific users.

What You Will Do at Level 2

  • Work 5 shifts per week monitoring the NERSC HPC Facility, including 2 to 3 Owl shifts (midnight to 8 a.m.). Some shifts may be onsite and some offsite; the schedule will be determined by staffing needs.
  • Review and respond to alerts from computer systems, storage, network, and other data center/facility-related systems by triaging or calling appropriate on-call staff.
  • Create solutions to improve processes, prevent issue recurrence, and automate routine responses.
  • Respond to alerts from OMNI to ensure continuous data collection for real-time diagnostics.
  • Develop and maintain tools within the monitoring pipeline, collaborating with the Operations Team to create new software programs and configurations and to solve technical issues, enabling scalability and reliability.
  • Coordinate center-wide maintenance activities and manage diagnostic and notification software during maintenance.
  • Provide accurate information in the trouble ticketing system for outages, maintenance, and incidents.
  • Analyze data to resolve problems of diverse scope.

What You Will Do at Level 3

  • Provide leadership in developing OMNI monitoring and alerting pipelines.
  • Contribute to the design and deployment of the OMNI cluster.
  • Work closely with other groups to enhance monitoring systems.
  • Resolve complex issues requiring in-depth data analysis.
  • Coordinate activities of other personnel on new assignments.

Minimum Requirements at Level 2

  • Typically 5+ years of related experience with a Bachelor's degree, or 3+ years with a Master's degree, or equivalent experience.
  • Strong Linux command-line skills.
  • Experience with programming languages such as C, C++, Perl, Java, or Python.
  • Knowledge of large data networks and IT infrastructure supporting high-availability systems.
  • Motivated self-starter with interest in technologies like Kubernetes, Prometheus, VictoriaMetrics, Alertmanager, and building management software.
  • Effective communication skills and ability to work across technical teams.
  • Experience managing large data centers in a 24/7 onsite environment.
  • Knowledge of network security, firewalls, ACLs, and network protocols.
  • Relevant certifications or advanced education in computing science.

Minimum Requirements at Level 3

  • Typically 8+ years of related experience with a Bachelor's degree, or 6+ years with a Master's degree, or equivalent experience.
  • Expertise in programming languages like C, C++, Perl, Java, or Python.
  • Proven excellence in relevant tools and project leadership experience.
  • Ability to proactively address problems and issues.

Additional Notes

  • This is a full-time, exempt position with shift work (Owl shift, midnight to 8 a.m., onsite).
  • Salary ranges and targeted compensation are detailed for each level.
  • This position is subject to background checks and may involve hybrid work arrangements.
  • Employees on hybrid schedules must reside within 150 miles of Berkeley Lab.

Learn more about working at Berkeley Lab at: careers.lbl.gov

Berkeley Lab is an Equal Opportunity Employer committed to diversity and inclusion. All qualified applicants will be considered without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or veteran status.
