Job Description
Job Title: Senior Systems Operations Engineer
Location: Charlotte, NC; Irving, TX; Chandler, AZ
Duration: 18 months
Pay Rate: $73.50
Job/Role Description:
- This role supports application and middleware production operations with a Site Reliability Engineering (SRE) mindset, shifting from reactive operations to proactive reliability engineering across VM-based and container-adjacent environments, including OpenShift (OCP).
- Provides senior-level application and middleware support for complex, high-availability services and acts as an escalation point for L2/L3 incidents, leading disciplined troubleshooting, recovery, and stabilization.
- Embeds SRE practices into day-to-day operations by defining reliability signals, improving alert quality, driving blameless post-incident learning, and prioritizing systemic fixes and toil reduction.
- Implements and continuously improves observability across applications and middleware, including logs, metrics, traces, dashboards, and actionable alerting to enhance detection, diagnosis, and mean time to resolution (MTTR).
- Designs, develops, and maintains infrastructure-as-code and configuration-as-code capabilities supporting VM-based and container-adjacent workloads, including OpenShift (OCP) enablement.
- Builds and supports automation for operational actions across middleware components, such as standardized status checks and start/stop/restart patterns, to enable safer self-service and reduce dependency bottlenecks.
- Designs and implements intelligent automation for platform and middleware operations, including integrating AI/agent-based approaches into workflows with appropriate guardrails for triage assistance, predictive signals, and automated remediation.
- Monitors configuration drift, supports automated compliance checks, and implements remediation patterns aligned to enterprise change management, security, and risk controls.
- Integrates infrastructure and operational automation with CI/CD pipelines to enable repeatable, auditable deployments and safer rollouts.
- Supports core platform components that enable applications and container platforms, including ingress patterns, load balancing integration, and shared supporting services.
- Develops and maintains runbooks, operational documentation, and validation/testing approaches for automation and platform procedures to ensure operational readiness and consistent execution.
- Participates in on-call rotations and provides operational support coverage as required, with flexibility to work in a 24/7 environment including weekends and holidays.
- Delivers assigned operational engineering and automation outcomes with a strong focus on stability, resiliency, and measurable toil reduction.
- Follows enterprise change management, risk, and compliance processes while continuously improving platform reliability and automation maturity through standardization, documentation, and repeatable delivery.
- Supports a large portfolio of mission-critical applications and platforms, contributing to capacity building and workload management in a dynamic environment.
Required Qualifications
- 4+ years of Systems Engineering or Technology Infrastructure/Operations Engineering experience, or equivalent demonstrated through work experience, training, military experience, or education
- 4+ years of application and/or middleware production support in complex, high-availability environments, including incident response and problem management with strong root cause discipline
- 4+ years of hands-on automation and configuration management experience (Ansible preferred, or a similar tool), plus strong scripting skills (Python, Bash, PowerShell, or similar)
- 4+ years of Linux administration (RHEL preferred) and/or Windows Server administration supporting enterprise production workloads
- 4+ years of Git-based version control practices, including pull requests and peer review, with a focus on repeatability and code quality
- Working experience with infrastructure-as-code concepts, including modular design and environment consistency
- Experience supporting hybrid/private cloud platforms and container-adjacent hosting models; familiarity with OpenShift (OCP) or Kubernetes-based platforms
- Experience implementing SRE operating practices, including reliability metrics, reduction of manual toil, and continuous improvement via post-incident learnings
- Experience supporting common middleware platforms and shared services; ability to build automation patterns that standardize operational actions and reduce manual intervention
- Familiarity with enterprise observability and operational support practices, including service health dashboards, alert engineering, and actionable telemetry
- Exposure to responsible AI usage in operations, including security, validation, accuracy, and appropriate guardrails for automation and agents
- Strong cross-functional communication skills and experience operating in regulated environments
- Proven troubleshooting, automation, observability, and scripting skills, with a solid understanding of system architecture and experience in containerization and cloud platforms
- Ability to understand capacity planning, identify bottlenecks, and implement effective solutions in production environments
- Hands-on technical expertise with strong adaptability, learning agility, and a collaborative team-oriented mindset
- Well versed in crisis management, root cause analysis techniques, and blameless post-incident reviews
- Experience with tools such as Splunk, PowerShell, Bash, and Python; familiarity with Elastic or similar observability technologies is a plus