Site Reliability Engineer

SRE & Platform Roles

Failure That Never Reaches User

Core Disciplines

Reliability Engineering. Practiced with Precision.

Observability, automation, and incident engineering — applied to systems that businesses stake their reputation on. This is the work that prevents headlines, not the work that makes them.

Observability Architecture

Grafana · Zabbix · Custom Alert Pipelines

I don't build dashboards. I build intelligence layers that surface anomalies before they become incidents. Every alert is designed to be actionable. Every metric is chosen to eliminate noise, not add to it.

Incident Engineering

Runbooks · SLA Compliance · Blameless Post-Mortems

From first signal to post-mortem — I own the full incident lifecycle. Structured runbooks, escalation trees, and blameless reviews that extract the failure pattern permanently. Every incident leaves the system stronger than it found it.

Toil Elimination

Ansible · CI/CD Pipelines · Configuration-as-Code

Manual processes are reliability debt that compounds silently. I audit operational workflows for toil density, then automate them out of existence — converting engineer hours into system intelligence.

View Reliability Cases

How I Work

Reliability Engineered by Design.

Four principles that govern how I approach every system, every incident, and every automation decision. Not methodology for its own sake — engineering with measurable outcomes.

OBSERVABILITY STACK · LIVE

metrics ingested: 14,200/min

active alerts: 3 of 847 rules

noise ratio: < 2%

Observe Everything. Alert on What Matters.

Most monitoring systems drown engineers in noise. I build observability stacks that distinguish signal from symptom — so when an alert fires, it demands action, not investigation.

Grafana · Zabbix · Custom Threshold Engineering

Manual: Manual Task · 4 hours · Error-prone · Undocumented

Automated: Ansible Playbook · 12 minutes · Version-controlled · Runbook + CI/CD

Automate the Toil. Engineer the Exception.

I audit workflows for toil density and replace repetition with automation, freeing the team to focus on problems machines can't solve yet.

Ansible · CI/CD · Infrastructure-as-Code

DETECT: 4m 12s

TRIAGE: 3m 21s

RESOLVE: 17m 44s

POST-MORTEM: ✓ Closed

Blameless post-mortem · loop permanently closed

Every Incident Closes a Loop.

Incidents aren't failures — they're the system revealing its own blind spots. I treat every outage as a structured learning event that permanently eliminates the failure mode.

Incident Runbooks · SLA Compliance · Blameless Reviews

SLO Dashboard · Active

Availability SLO: 99.9% ✔ Within budget

Latency P95: 97.8% ✔ Within budget

Error Rate: 94.1% ⚠ Approaching limit

Error Budget Left: 68% · Burn rate: normal

SLOs Are Contracts. I Honor Them.

SLOs aren't internal metrics — they're promises to the business. I define error budgets, track burn rates in real time, and make the case to engineering leadership when reliability is at risk.

SLO Frameworks · Error Budget Tracking · Reliability Reporting
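
To make the error-budget and burn-rate terms concrete, here is a minimal sketch of the arithmetic in Python. It is an illustration only, not the tooling behind the dashboard above: the `SloWindow` class, both function names, and the request counts are hypothetical, and it assumes a simple request-based SLO measured over a single window.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO window (hypothetical shape, for illustration)."""
    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int
    failed_requests: int


def _allowed_error_rate(w: SloWindow) -> float:
    """Share of requests the SLO permits to fail."""
    allowed = 1.0 - w.slo_target
    if allowed <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return allowed


def error_budget_remaining(w: SloWindow) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    if w.total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = _allowed_error_rate(w) * w.total_requests
    return 1.0 - w.failed_requests / allowed_failures


def burn_rate(w: SloWindow) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; values above 1.0 mean it will run out before the window ends.
    """
    if w.total_requests == 0:
        return 0.0
    observed = w.failed_requests / w.total_requests
    return observed / _allowed_error_rate(w)


# Illustrative numbers only: 320 failures out of 1,000,000 requests at 99.9%.
window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=320)
print(f"budget left: {error_budget_remaining(window):.0%}")
print(f"burn rate:   {burn_rate(window):.2f}")
```

With these example counts, the SLO allows roughly 1,000 failures, so 320 failures leaves about two thirds of the budget unspent. In practice burn rate is usually evaluated over several window lengths at once, which this sketch leaves out.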

Reliability Stack

Built on Tools That Run Production.

Every tool in this stack has been used in live environments — not tutorials, not sandboxes. This is the ecosystem I operate in daily to keep infrastructure observable, automated, and resilient.

Reliability Maturity

Where Is Your Infrastructure Right Now?

Most teams don't have a reliability problem. They have a visibility problem. Here's how I diagnose where you are — and exactly what changes when I'm involved.

Stage 01

Reactive

Your team finds out about incidents when users do.

Monitoring exists but alerts are ignored

No defined SLOs or error budgets

Runbooks are undocumented or missing

Every incident starts from zero

High toil · High stress · Low trust in systems

Stage 02 · Most Common

Fragile

Your systems work — until they don't. And no one knows why.

Deployments are manual and anxiety-inducing

Configuration drift across environments

Incidents are resolved but never closed

Toil is normalized, not measured

Unpredictable · Unscalable · Unsustainable

Stage 03

Scaling

Your infrastructure is growing faster than your reliability practices.

SLOs defined but not enforced

Error budgets exist on paper, not in decisions

No reliability roadmap aligned to business goals

Dev velocity and ops stability in constant tension

Growing fast · Breaking often · Cost of failure rising

Credentials

Validated by the Industry. Not Just Claims.

Every certification here was earned through hands-on practice, not passive study. These represent the technical foundations I apply daily in production environments.

Databricks Data Engineer – Professional · Databricks

Key Skills

Lakehouse · Performance Tuning · Data Governance

AWS Solutions Architect Associate · Amazon Web Services

System Design · High Availability · Disaster Recovery

Verify Credential

AWS Cloud Practitioner · Amazon Web Services

Key Skills

Cloud Fundamentals · Compliance · Cost Optimization

Experience

The Work Behind The Metrics.

Two roles. One company. A clear trajectory — from building the observability foundation as an intern to owning reliability outcomes across production cloud environments as an engineer.

Apr 2025 – Present

CURRENT

Parkar

Platform Operations Engineer

SRE · AWS · Ansible · Incident Engineering · Grafana · Zabbix · CI/CD · Runbook Design

Full ownership of monitoring architecture, incident lifecycle management, and automation strategy across production cloud environments. Not a support function — a reliability engineering role with measurable outcomes and direct impact on SLA compliance and operational efficiency.

Impact

−40% · Incident Detection · MTTD improvement across all environments

−35% · Unplanned Downtime · Post observability overhaul

+50% · Operational Efficiency · Automated deployment and config workflows

−80% · Toil Eliminated · Ansible automation framework

Jan 2025 – Apr 2025

Parkar

Platform Operations Intern

Grafana · Zabbix · AWS · Linux Administration · Technical Documentation

Started with zero production access. Left with Zabbix monitoring deployed across Linux server fleets, Grafana dashboards live for development teams, and an AWS Cloud Practitioner certification earned mid-internship. Built the observability foundation the team still operates on. Promoted to full-time within four months, ahead of the regular review cycle, because the outcomes demanded it.

Sep 2021 – Jun 2025

Gujarat Technological University

B.E. Computer Engineering

Computer Engineering · Ansible · Python · Linux · Infrastructure Automation

The foundation. Final year project: an Ansible-based Linux server automation framework that cut manual deployment time by 80%. Not a student project — a production principle, prototyped early. Everything that followed was built on this thinking.

Quick Answers

Things You're Probably Wondering.

The questions hiring managers and engineering leads ask most. Answered directly, without the interview performance.

I'm open to all working arrangements - remote, hybrid, or in-office. My focus is on contributing meaningfully to the team and the systems we're responsible for, wherever that work happens best.

Blog

Insights and Updates

Thoughts on reliability engineering, infrastructure automation, and building systems that last.

Vivek Pillai · 01 Jan 2025

TRENDING · Zero-Downtime Deployments

How to ship code to production without your users noticing. Lessons learned from operating infrastructure at scale.

Vivek Pillai · 14 Sep 2024

POPULAR · On-Call Without Burnout

Practical strategies for sustainable on-call rotations that protect your team's health and your system's reliability.

Vivek Pillai · 24 Aug 2024

NEW · Automating Runbooks with Python

Turn repetitive incident response into automated workflows. Stop doing the same thing twice at 3 AM.

Learn more about SRE practices by reading my blog

View all blogs

Vivek Pillai

Let's Connect

Designing resilient cloud infrastructure and self-healing systems that prioritize reliability, scalability, and operational excellence.


Credentials: AWS Solutions Architect · Databricks Professional · Databricks Associate · AWS AI Practitioner · AWS Cloud Practitioner

Connect

Ahmedabad, India · Open to Remote & Global Opportunities
