Job Details

System Infrastructure / Platform Engineer, HPC Technology Department

  2026-06-12     Berkeley Lab     San Francisco, CA  
Description:

System Infrastructure / Platform Engineer

The National Energy Research Scientific Computing Center (NERSC) is seeking a System Infrastructure / Platform Engineer to help build and manage HPC systems and Linux-based infrastructure. NERSC operates some of the world's largest supercomputers, supporting thousands of researchers tackling major scientific challenges.

In this role, you will manage high-performance computing environments, including HPC systems, containers, virtual machines, and core infrastructure services. You'll work with cutting-edge technologies such as CPU/GPU clusters, parallel storage, high-speed networking, Slurm, and Kubernetes, balancing innovation with reliability, performance, and security at scale.

Collaborating with engineers, researchers, vendors, and open-source communities, you will help develop scalable solutions that advance scientific discovery and the future of HPC. If you have Linux experience, an interest in science, and enjoy fast-paced collaborative environments, NERSC would love to hear from you.

We're here for the same mission, to bring science solutions to the world. Join our team and YOU will play a supporting role in our goal to address global challenges! Have a high level of impact and work for an organization associated with 17 Nobel Prizes!

Why join Berkeley Lab?

We invest in our employees by offering a total rewards package you can count on:

  • Exceptional health and retirement benefits, including pension or 401(k)-style plans
  • Opportunities to grow in your career - check out our Tuition Assistance Program
  • A culture where you'll belong - we are invested in our teams!
  • In addition to accruing vacation and sick time, we also have a Winter Holiday Shutdown every year.
  • Parental bonding leave (for both mothers and fathers)
  • Pet insurance

What You Will Do if hired at a Level 3:

  • Build and manage Linux systems and storage infrastructure
  • Troubleshoot complex technical issues with team members
  • Install, upgrade, and secure systems and services
  • Develop and maintain scripts and automation tools
  • Participate in a 24/7 on-call rotation
  • Lead small projects, upgrades, and service rollouts
  • Collaborate with vendors to improve technologies and user experience
  • Support reliable operations of NERSC's Perlmutter supercomputer and Spin Kubernetes platform
  • Develop and integrate services across NERSC and DOE facilities, including the upcoming Doudna supercomputer
  • Present technical work to the HPC community at conferences and industry events

Additional Responsibilities if hired at a Level 4:

  • Solve complex technical problems with independent judgment
  • Develop team strategies and project plans
  • Provide technical leadership and mentorship
  • Lead system improvements for performance, reliability, and security
  • Evaluate emerging HPC technologies and capabilities
  • Represent NERSC in HPC and DOE technical communities and advocacy groups

What is Required to be hired at a Level 3:

  • Typically, 8+ years of related experience with a Bachelor's degree; alternatively, 6+ years with a Master's degree; or equivalent career experience
  • 4+ years of experience managing large-scale Linux-based system deployments in a high-performance computing, cloud computing, or hyper-scale environment
  • Mastery of Linux concepts and operations (processes, networking, system logs, performance)
  • Proficiency with bash and Python scripting
  • Experience with some or all of our key technologies:
    • containers (such as Docker or Kubernetes)
    • virtualization (such as Proxmox or VMware)
    • cloud-based deployment (such as AWS, Azure or GCP)
    • identity and access management
    • database administration, tuning, and troubleshooting
    • storage systems technologies (such as iSCSI and NAS appliances)
    • parallel filesystems (such as Lustre, GPFS, or VAST)
    • high-speed networking/interconnect (such as InfiniBand, Slingshot, or RoCE)
    • advanced performance analysis and debugging tools (such as strace, lsof, eBPF, or gdb)
    • DevOps tools (such as GitLab or Jira) and processes (such as issues, merge requests, and API/automation)
  • Familiarity with automated provisioning systems (such as Chef, Foreman, or Terraform)
  • Familiarity with configuration management systems (such as Ansible or Puppet)
  • Working knowledge of Linux system engineering and security practices
  • Ability to resolve complex issues in creative and effective ways and derive technical solutions in a collaborative environment to meet end user requirements or needs
  • Demonstrated ability to work independently as well as collaboratively in large projects, and contribute to an active and respectful intellectual environment
  • Creative, positive, and collaborative work style
  • Excellent oral and written communication skills

Additional Requirements to be hired at a Level 4:

  • Typically, 12+ years of related experience with a Bachelor's degree; alternatively, 8+ years with a Master's degree; or equivalent career experience
  • Proven ability to lead troubleshooting and resolution of high-impact incidents in complex, large-scale environments
  • Demonstrated leadership in cross-team collaboration and mentoring
  • Experience in software engineering, Linux systems programming, or complex scripting
  • Experience managing one or more of the following:
    • data center networking (TCP/IP, Ethernet, BGP, ECMP)
    • batch workload managers (such as Slurm), including installation, configuration, routine operations, job lifecycle concepts, and troubleshooting common failure modes
    • Cray/HPE HPC ecosystems (e.g., CSM/COS, Slingshot interconnect, and related components)
  • Ability to lead and coordinate projects with traditional or Agile methodologies (such as Scrum or Kanban)
  • Ability to analyze and resolve significant and unique issues requiring evaluation of multiple intangible factors
  • Ability to exercise independent judgment in methods, techniques and evaluation criteria for obtaining results

Additional information:

  • Applications will be accepted until the job posting is removed.
  • Appointment type: This is a full-time, career appointment, exempt (monthly paid) from overtime pay.
  • Salary range:
    • Level 3: The expected salary for this position is $156,864 - $191,724, which falls within the full salary range of $139,440 - $235,308, depending upon the candidate's skills, knowledge, and abilities. This includes education, certifications, and years of experience.
    • Level 4: The expected salary for this position is $178,644 - $218,364, which falls within the full salary range of $158,808 - $267,996, depending upon the candidate's skills, knowledge, and abilities. This includes education, certifications, and years of experience.
  • Background check: This position is subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
  • Work modality: This position requires substantial on-site presence, but is eligible for a flexible work mode, and hybrid schedules may be considered. Hybrid work is a combination of performing work on-site at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA and some telework. Individuals working a hybrid schedule must reside within 150 miles of Berkeley Lab. Work schedules are dependent on business needs. In rare cases, full-time telework or remote work modes may be considered.
  • Multi-level Posting: This position will be hired at a level commensurate with the business needs and the skills, knowledge, and abilities of the successful candidate.
  • Export Control Access: This position will involve access to hardware, commodities, and technical information subject to export control regulations including, but not limited to, the Export Administration Regulations ("EAR") and/or International Traffic in Arms Regulations ("ITAR"). Accordingly,


Apply for this Job

Please use the APPLY HERE link below to view additional details and application instructions.

Apply Here
