Job Description

Senior Site Reliability Engineer

Location: Ottawa ON (Canada)

Duration: Full-time

Position Overview:

The Instant Financial Issuance as a Service (IFIaaS) Cloud Service comprises a wide array of components, including web services, application servers, and databases hosted in an on-prem environment. The Sr. Site Reliability Engineer (SRE) will be responsible for ensuring that the SaaS platform is reliable, available, performant, scalable, secure, and cost-effective. Ultimately, the individual will be responsible for platform uptime; functional management of all IFIaaS cloud environments, applications, and networks; project scoping; and the resolution of application and network issues.

How You Can Make an Impact:

The Instant Financial Issuance as a Service (IFIaaS) Cloud Platform spans multiple on‑prem environments. The Senior Site Reliability Engineer (SRE) will play a critical role in ensuring the platform’s reliability, scalability, security, and operational excellence across these geographically distributed environments. Given the asymmetric nature of our data centers, the SRE will design and operate systems that prioritize local HA while ensuring effective, tested, and compliant failover for DR scenarios. This role includes responsibility for platform uptime, environment management, network and application reliability, observability, automation maturity, compliance, and operational excellence.

Responsibilities:

Own SLOs/SLIs for availability (99.9%), latency, error rate, and quality of service across microservices.
Design/operate end‑to‑end observability: metrics, logs, traces, synthetic checks, real‑user monitoring (RUM).
Instrument services (Windows services, APIs, background jobs) with structured logs and trace context.
Build health probes and SLA monitors for critical transactions and cross-service dependencies.
Monitor system health using metrics such as uptime, latency, error rate, throughput, and availability.
Deploy and maintain monitoring and on-call tools, e.g., Splunk On-Call, Prometheus, and Datadog.
Lead incident response (triage, comms, coordination, real-time mitigation) and conduct blameless postmortems with actionable follow-ups.
Maintain and continuously improve runbooks, escalation paths, on-call rotations, and paging policies.
Implement MTTA/MTTR reduction programs.
Stand up war room protocols and ensure stakeholder updates during incidents.
Forecast compute, storage, and network needs; track headroom against growth and peak patterns.
Conduct performance profiling and bottleneck analyses (CPU, memory, I/O, thread pools, connection pools).
Optimize resource allocation on VMware (DRS, affinity rules, reservations) and Windows VM tuning (kernel, TCP stack, NICs).
Validate scaling strategies (horizontal vs. vertical) and implement auto-scaling where supported.
Standardize gold images, configuration baselines, and desired state for Windows Server (PowerShell DSC or equivalent).
Manage patching (OS, middleware, runtime) with maintenance windows aligned to error budgets.
Ensure backup, snapshot, and restore strategies meet RPO/RTO; regularly test restores.
Maintain secure baselines (CIS benchmarks for Windows/VMware), vulnerability management, and patch cadence.
Support compliance audits (PCI-CP, PCI-DSS, SOC 2/ISO 27001), produce evidence (configs, logs, access reviews), and remediate gaps.
Automate provisioning (VM templates, DSC/Ansible for Windows, Terraform for VMware) and configuration drift detection/correction.
Build runbooks to reduce toil (deploy, scale, rollback, etc.).
Create reliability guardrails (pre‑flight checks, change freeze rules, policy controls) as code.
Continuously refactor scripts/runbooks into idempotent automation.
Collaborate with development teams and other stakeholders to identify potential risks, such as security vulnerabilities, performance bottlenecks, deployment issues, or configuration errors.
Implement risk mitigation strategies, such as patching, backup, redundancy, encryption, or testing.
Collaborate with product and other teams to understand user needs, expectations, and satisfaction.
Coach engineers on SRE principles, incident handling, and reliability-centric design.
Lead knowledge sharing, runbook quality, and postmortem culture (blameless, action-oriented).
Provide after-hours support for production issues on a rotational basis with other team members to ensure system availability 24/7/365.

Basic Qualifications:

5+ years of experience in SRE, DevOps, or Software Engineering roles supporting distributed, production-grade environments, with strong skills in troubleshooting microservices, Windows/VMware systems, and on‑prem/hybrid infrastructure.
Hands‑on experience with automation and observability, including Terraform/Ansible/DSC, CI/CD pipelines, logs/metrics/tracing systems, and enterprise monitoring tools such as Datadog, Prometheus, or Splunk.
Demonstrated capability with infrastructure automation tools (Terraform, Ansible, Jenkins, Octopus, PowerShell DSC, etc.).
Proficiency in VMware, Windows Server administration, networking fundamentals, and system‑level performance analysis.
Hands‑on experience operating and troubleshooting enterprise microservices, APIs, and distributed application stacks in on‑prem/hybrid infrastructure.
Must have: ability to provide after-hours production support on a rotational basis to ensure 24/7/365 system availability.

Preferred Qualifications:

Demonstrated integrity and accountability, including reliability, ownership of mistakes, and commitment to high operational standards across compliance-sensitive environments (PCI‑DSS, PCI‑CP, SOC 2).
High self‑confidence, strong presentation and communication abilities, and a history of leading through example, helping establish a culture of operational excellence and continuous improvement.
Leadership behaviors, including initiative, thoughtful risk‑taking, reflective decision‑making, and the ability to take action confidently amid uncertainty.

Job Tags

Full time, Local area

Senior Site Reliability Engineer Job at Themesoft Inc, Ottawa, IL