Senior Systems Operations Engineer

S3
1 day ago
Contract
On-site
Charlotte, North Carolina, United States

Job Description

Job Title: Senior Systems Operations Engineer

Location: Charlotte, NC; Irving, TX; Chandler, AZ

Duration: 18 months

Pay Rate: $73.50

Job/Role Description:

  • Supports application and middleware production operations with a Site Reliability Engineering (SRE) mindset, shifting from reactive operations to proactive reliability engineering across VM-based and container-adjacent environments, including OpenShift (OCP).
  • Provides senior-level application and middleware support for complex, high-availability services and acts as an escalation point for L2/L3 incidents, leading disciplined troubleshooting, recovery, and stabilization.
  • Embeds SRE practices into day-to-day operations by defining reliability signals, improving alert quality, driving blameless post-incident learning, and prioritizing systemic fixes and toil reduction.
  • Implements and continuously improves observability across applications and middleware, including logs, metrics, traces, dashboards, and actionable alerting to enhance detection, diagnosis, and mean time to resolution (MTTR).
  • Designs, develops, and maintains infrastructure-as-code and configuration-as-code capabilities supporting VM-based and container-adjacent workloads, including OpenShift (OCP) enablement.
  • Builds and supports automation for operational actions across middleware components, such as standardized status checks and start/stop/restart patterns, to enable safer self-service and reduce dependency bottlenecks.
  • Designs and implements intelligent automation for platform and middleware operations, including integrating AI/agent-based approaches into workflows with appropriate guardrails for triage assistance, predictive signals, and automated remediation.
  • Monitors configuration drift, supports automated compliance checks, and implements remediation patterns aligned to enterprise change management, security, and risk controls.
  • Integrates infrastructure and operational automation with CI/CD pipelines to enable repeatable, auditable deployments and safer rollouts.
  • Supports core platform components that enable applications and container platforms, including ingress patterns, load balancing integration, and shared supporting services.
  • Develops and maintains runbooks, operational documentation, and validation/testing approaches for automation and platform procedures to ensure operational readiness and consistent execution.
  • Participates in on-call rotations and provides operational support coverage as required, with flexibility to work in a 24/7 environment including weekends and holidays.
  • Delivers assigned operational engineering and automation outcomes with a strong focus on stability, resiliency, and measurable toil reduction.
  • Follows enterprise change management, risk, and compliance processes while continuously improving platform reliability and automation maturity through standardization, documentation, and repeatable delivery.
  • Supports a large portfolio of mission-critical applications and platforms, contributing to capacity building and workload management in a dynamic environment.

Required Qualifications

  • 4+ years of Systems Engineering or Technology Infrastructure/Operations Engineering experience, or equivalent demonstrated through work experience, training, military experience, or education
  • 4+ years of application and/or middleware production support in complex, high-availability environments, including incident response and problem management with strong root cause discipline
  • 4+ years of hands-on automation and configuration management experience (Ansible preferred, or a similar tool), plus strong scripting skills (Python, Bash, PowerShell, or similar)
  • 4+ years of Linux administration (RHEL preferred) and/or Windows Server administration supporting enterprise production workloads
  • 4+ years of Git-based version control practices, including pull requests and peer review, with a focus on repeatability and code quality
  • Working experience with infrastructure-as-code concepts, including modular design and environment consistency
  • Experience supporting hybrid/private cloud platforms and container-adjacent hosting models; familiarity with OpenShift (OCP) or Kubernetes-based platforms
  • Experience implementing SRE operating practices, including reliability metrics, reduction of manual toil, and continuous improvement via post-incident learnings
  • Experience supporting common middleware platforms and shared services; ability to build automation patterns that standardize operational actions and reduce manual intervention
  • Familiarity with enterprise observability and operational support practices, including service health dashboards, alert engineering, and actionable telemetry
  • Exposure to responsible AI usage in operations, including security, validation, accuracy, and appropriate guardrails for automation and agents
  • Strong cross-functional communication skills and experience operating in regulated environments
  • Proven skills in troubleshooting, automation, observability, and scripting, with a solid understanding of architecture and experience in containerization and cloud platforms
  • Ability to understand capacity planning, identify bottlenecks, and implement effective solutions in production environments
  • Hands-on technical expertise with strong adaptability, learning agility, and a collaborative team-oriented mindset
  • Well versed in crisis management, root cause analysis techniques, and blameless post-incident reviews
  • Experience with tools such as Splunk, PowerShell, Bash, and Python; familiarity with Elastic or similar observability technologies is a plus