Site Reliability Engineer
SRE & Platform Roles

Failure That Never Reaches the User

Core Disciplines
Reliability Engineering. Practiced with Precision.

Observability, automation, and incident engineering — applied to systems that businesses stake their reputation on. This is the work that prevents headlines, not the work that makes them.

Observability Architecture

Grafana · Zabbix · Custom Alert Pipelines

I don't build dashboards. I build intelligence layers that surface anomalies before they become incidents. Every alert is designed to be actionable. Every metric is chosen to eliminate noise, not add to it.

Incident Engineering

Runbooks · SLA Compliance · Blameless Post-Mortems

From first signal to post-mortem — I own the full incident lifecycle. Structured runbooks, escalation trees, and blameless reviews that extract the failure pattern permanently. Every incident leaves the system stronger than it found it.

Toil Elimination

Ansible · CI/CD Pipelines · Configuration-as-Code

Manual processes are reliability debt that compounds silently. I audit operational workflows for toil density, then automate them out of existence — converting engineer hours into system intelligence.

How I Work
Reliability Engineered by Design.

Four principles that govern how I approach every system, every incident, and every automation decision. Not methodology for its own sake — engineering with measurable outcomes.

OBSERVABILITY STACK · LIVE
metrics ingested: 14,200/min
active alerts: 3 of 847 rules
noise ratio: < 2%

Observe Everything. Alert on What Matters.

Most monitoring systems drown engineers in noise. I build observability stacks that distinguish signal from symptom — so when an alert fires, it demands action, not investigation.

Grafana · Zabbix · Custom Threshold Engineering
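One common way to keep an alert actionable is to fire only when a threshold is breached for several consecutive samples, so a single spike does not page anyone. The sketch below is illustrative only: the class name `SustainedThresholdAlert`, the 90% CPU threshold, and the three-sample window are assumptions, not values from any production rule set.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when `required` consecutive samples exceed `threshold`.

    Hypothetical helper for illustration; not tied to Grafana or Zabbix APIs.
    """

    def __init__(self, threshold: float, required: int) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self.threshold = threshold
        # Keep only the most recent `required` breach flags.
        self.window = deque(maxlen=required)

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the alert should fire."""
        self.window.append(value > self.threshold)
        window_full = len(self.window) == self.window.maxlen
        return window_full and all(self.window)


# Assumed example data: CPU utilisation percentages sampled at a fixed interval.
alert = SustainedThresholdAlert(threshold=90.0, required=3)
cpu_samples = [95.0, 40.0, 91.0, 93.0, 97.0]
fired = [alert.observe(sample) for sample in cpu_samples]
print(fired)  # only the last sample completes three breaches in a row
```

The isolated spike at the start never fires; only the final sample, which completes three consecutive breaches, does.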
Manual → Automated
Manual Task → Ansible Playbook
4 hours → 12 minutes
Error-prone → Version-controlled
Undocumented → Runbook + CI/CD

Automate the Toil. Engineer the Exception.

I audit workflows for toil density and replace repetition with automation, freeing the team to focus on problems machines can't solve yet.

Ansible · CI/CD · Infrastructure-as-Code
DETECT: 4m 12s
TRIAGE: 3m 21s
RESOLVE: 17m 44s
POST-MORTEM: ✓ Closed
Blameless post-mortem · loop permanently closed
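The phase durations in the timeline above can be rolled up into a single time-to-resolve figure. This is a minimal sketch that assumes the three phases run back to back and do not overlap; the `phases` dictionary and the `fmt` helper are illustrative names, not part of any incident tooling.

```python
from datetime import timedelta

# Phase durations copied from the timeline; assumed to be sequential.
phases = {
    "detect": timedelta(minutes=4, seconds=12),
    "triage": timedelta(minutes=3, seconds=21),
    "resolve": timedelta(minutes=17, seconds=44),
}


def fmt(duration: timedelta) -> str:
    """Render a duration in the same 'Xm YYs' style as the timeline."""
    minutes, seconds = divmod(int(duration.total_seconds()), 60)
    return f"{minutes}m {seconds:02d}s"


# Sum with an explicit timedelta start value, since sum() defaults to int 0.
total = sum(phases.values(), timedelta())
print("time to resolve:", fmt(total))  # time to resolve: 25m 17s

# Share of the total spent in each phase, useful for post-mortem review.
for name, duration in phases.items():
    share = duration / total
    print(f"{name:<8}{fmt(duration):>8}  {share:.0%}")
```

Breaking the total down by phase shows where the next improvement is likely to come from, for example whether detection or resolution dominates.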

Every Incident Closes a Loop.

Incidents aren't failures — they're the system revealing its own blind spots. I treat every outage as a structured learning event that permanently eliminates the failure mode.

Incident Runbooks · SLA Compliance · Blameless Reviews
SLO Dashboard · Active
Availability SLO: 99.9% ✔ Within budget
Latency P95: 97.8% ✔ Within budget
Error Rate: 94.1% ⚠ Approaching limit
Error Budget Left: 68% · Burn rate: normal

SLOs Are Contracts. I Honor Them.

SLOs aren't internal metrics — they're promises to the business. I define error budgets, track burn rates in real time, and make the case to engineering leadership when reliability is at risk.

SLO Frameworks · Error Budget Tracking · Reliability Reporting
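For readers unfamiliar with error budgets, the arithmetic behind figures like "99.9% availability" and "68% budget left" can be sketched as follows. The 30-day window and the function names are assumptions chosen for illustration; the page does not state which window or tooling is actually used.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1.0 - slo)


def budget_remaining(consumed_minutes: float, slo: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    return 1.0 - consumed_minutes / error_budget_minutes(slo, window_days)


# Assumed example: a 99.9% SLO over 30 days allows about 43.2 minutes down.
budget = error_budget_minutes(0.999)
print(f"budget: {budget:.1f} min")

# Spending 13.824 of those minutes leaves roughly 68% of the budget.
print(f"remaining: {budget_remaining(13.824, 0.999):.0%}")

# An observed error ratio of 0.2% against a 0.1% allowance burns at about 2x.
print(f"burn rate: {burn_rate(0.002, 0.999):.1f}x")
```

A burn rate that stays above 1.0 means the budget will run out before the window ends, which is the point at which the reliability case goes to engineering leadership.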
Reliability Stack
Built on Tools That Run Production.

Every tool in this stack has been used in live environments — not tutorials, not sandboxes. This is the ecosystem I operate in daily to keep infrastructure observable, automated, and resilient.

AWS
Ansible
Prometheus
GitHub
Zabbix
Terraform
Grafana
Reliability Maturity
Where Is Your Infrastructure Right Now?

Most teams don't have a reliability problem. They have a visibility problem. Here's how I diagnose where you are — and exactly what changes when I'm involved.

Stage 01

Reactive

Your team finds out about incidents when users do.

Monitoring exists but alerts are ignored
No defined SLOs or error budgets
Runbooks are undocumented or missing
Every incident starts from zero
High toil · High stress · Low trust in systems
Stage 02 · Most Common

Fragile

Your systems work — until they don't. And no one knows why.

Deployments are manual and anxiety-inducing
Configuration drift across environments
Incidents are resolved but never closed
Toil is normalized, not measured
Unpredictable · Unscalable · Unsustainable
Stage 03

Scaling

Your infrastructure is growing faster than your reliability practices.

SLOs defined but not enforced
Error budgets exist on paper, not in decisions
No reliability roadmap aligned to business goals
Dev velocity and ops stability in constant tension
Growing fast · Breaking often · Cost of failure rising
Credentials
Validated by the Industry. Not Just Claims.

Every certification here was earned through hands-on practice, not passive study. These represent the technical foundations I apply daily in production environments.

Databricks Data Engineer – Professional · Databricks
Key Skills: Lakehouse · Performance Tuning · Data Governance
AWS Solutions Architect Associate · Amazon Web Services
Key Skills: System Design · High Availability · Disaster Recovery
AWS Cloud Practitioner · Amazon Web Services
Key Skills: Cloud Fundamentals · Compliance · Cost Optimization
Experience
The Work Behind The Metrics.

Two roles. One company. A clear trajectory — from building the observability foundation as an intern to owning reliability outcomes across production cloud environments as an engineer.

April 2025 – Present
CURRENT

Parkar

Platform Operations Engineer

SRE · AWS · Ansible · Incident Engineering · Grafana · Zabbix · CI/CD · Runbook Design

Full ownership of monitoring architecture, incident lifecycle management, and automation strategy across production cloud environments. Not a support function — a reliability engineering role with measurable outcomes and direct impact on SLA compliance and operational efficiency.

Impact

−40% Incident Detection: MTTD improvement across all environments
−35% Unplanned Downtime: Post observability overhaul
+50% Operational Efficiency: Automated deployment and config workflows
−80% Toil Eliminated: Ansible automation framework
Jan 2025 – Apr 2025

Parkar

Platform Operations Intern

Grafana · Zabbix · AWS · Linux Administration · Technical Documentation

Started with zero production access. Left with Zabbix monitoring deployed across Linux server fleets, Grafana dashboards live for development teams, and an AWS Cloud Practitioner certification earned mid-internship. Built the observability foundation the team still operates on. Promoted to full-time within four months, ahead of the regular review cycle, because the outcomes demanded it.

Sep 2021 – Jun 2025

Gujarat Technological University

B.E. Computer Engineering

Computer Engineering · Ansible · Python · Linux · Infrastructure Automation

The foundation. Final year project: an Ansible-based Linux server automation framework that cut manual deployment time by 80%. Not a student project — a production principle, prototyped early. Everything that followed was built on this thinking.

Quick Answers
Things You're Probably Wondering.

The questions hiring managers and engineering leads ask most. Answered directly, without the interview performance.

I'm open to all working arrangements: remote, hybrid, or in-office. My focus is on contributing meaningfully to the team and the systems we're responsible for, wherever that work happens best.

Blog
Insights and updates

Thoughts on reliability engineering, infrastructure automation, and building systems that last.

Vivek Pillai · 01 Jan 2025
TRENDING · Zero-Downtime Deployments

How to ship code to production without your users noticing. Lessons learned from operating infrastructure at scale.

Vivek Pillai · 14 Sep 2024
POPULAR · On-Call Without Burnout

Practical strategies for sustainable on-call rotations that protect your team's health and your system's reliability.

Vivek Pillai · 24 Aug 2024
NEW · Automating Runbooks with Python

Turn repetitive incident response into automated workflows. Stop doing the same thing twice at 3 AM.


Vivek Pillai
Let's Connect

Designing resilient cloud infrastructure and self-healing systems that prioritize reliability, scalability, and operational excellence.

Credentials: AWS Solutions Architect · Databricks Professional · Databricks Associate · AWS AI Practitioner · AWS Cloud Practitioner
Ahmedabad, India · Open to Remote & Global Opportunities
© 2026 Vivek Pillai · Built with intention, not just code.