Bryon Keck

Scottsdale, AZ 85257 • (520) 220-1171
hi@brk.dev • www.linkedin.com/in/bryonkeck

Site Reliability Engineer

Engineering Leader specializing in Reliability and Incident Management for distributed, internet-scale environments. Expert in Traffic Shifting, Distributed Systems Recovery, and AI/ML infrastructure. Delivered 95% reduction in SLA payouts ($400k to $4k) through incident response, problem management, and root cause analyses (RCAs) in on-prem and cloud settings. Demonstrated strong problem-solving and prioritization in high-stakes, ambiguous environments, with effective communication across teams to drive continuous improvement.

Key Qualifications

Software: JavaScript, Python, Go, Ruby, Node.js, Linux, Bash, Docker
Cloud & Infrastructure: AWS, GCP, Terraform, Kubernetes, Helm, Prometheus, Grafana, Datadog, PagerDuty, Ansible
Leadership & SRE: Incident Management, Problem Management, Root Cause Analyses (RCAs), SRE Principles, Error Budgets, Blameless Postmortems, SLO/SLI/SLA, AI-Driven Incident Analysis, Traffic Shifting, Cross-Functional Governance, On-Call Rotations, AI/ML Infrastructure, HTTP

Professional Experience

Head of Incident Response & Prevention Engineering • July 2025 to Present
MX Technologies – Lehi, UT

Reports directly to the VP of Engineering. Leads prevention-focused incident management in distributed environments, emphasizing SRE principles like error budgets, toil reduction, and blameless postmortems. Automates detection/mitigation, uplifts observability, and enforces governance to minimize MTTR/MTTD and prevent issues. Owns client SLA reporting, reliability enhancements, and AI-driven analysis in on-prem and cloud infrastructures.

Selected Contributions:

Commanded 50+ critical incidents (SEV1/SEV2) in on-call rotations, coordinating 15+ engineers for rapid mitigation via rolling restarts and traffic shifts, cutting MTTR by 50% and reducing engineering fatigue
Owned incident response and problem management, conducting thorough RCAs to prevent recurrence, resulting in 15% reduction in recurring incidents
Mastered Traffic Shifting strategies in distributed systems, using multi-region load balancing to mitigate failures (e.g., 90% traffic diversion), maintaining 99.99% uptime
Enforced deployment governance (e.g., N-1 compatibility checks), preventing critical failures in framework upgrades (e.g., Rails 7) across on-prem and cloud environments
Designed and tested Disaster Recovery (DR) plans for infrastructure resilience, ensuring continuity during outages
Led observability initiatives with Golden Signals (Latency, Traffic, Errors) to differentiate vendor vs. internal failures
Automated toil with AI tooling for incident timelines, reducing RCA draft time by 75% and freeing 20+ engineering hours weekly
Collaborated with cross-functional teams using clear communication to resolve complex distributed system failures, reducing recurrence by 15%
Advised the CTO through Operational Reviews, identifying and prioritizing risks and securing resources for debt remediation in high-stakes settings
Regularly engaged with clients regarding platform reliability, uptime improvement roadmaps, AWS cloud migration, maintenance timing, and enhanced incident communication, improving customer satisfaction and alignment

Senior Software Engineer, Site Reliability Engineering • March 2023 to June 2025
MX Technologies – Lehi, UT

Owned ingress infrastructure for applications in distributed, internet-scale environments, ensuring 99.99% availability via monitoring and multi-region traffic shifting in on-prem and cloud (AWS/GCP) setups. Led observability efforts, driving APM adoption and tracing. Enforced governance in incident response, serving as the escalation SME for RCAs. Oversaw the Datadog rollout, improving reliability and reducing SLA payouts.

Selected Contributions:

Drove 95% reduction in SLA credit payouts, from $400,000 in FY23 to $4,000 in Q1 2024 through effective incident management
Executed multi-region traffic shifting to mitigate outages in <5 minutes in high-stakes environments
Implemented real-time Load Balancer Golden Signals Dashboard, reducing MTTD from days to minutes
Onboarded 8+ applications into Datadog APM, enhancing telemetry in distributed systems
Enabled span attribution, improving RCA and reducing MTTR in AI-related workloads
Achieved 45% of days with 99.99%+ availability via ingress monitoring and canary testing
Established escalation paths in on-call rotations, reducing MTTM by 50%
Contributed to RCAs as SME, driving availability improvements across hybrid infrastructures
Troubleshot bottlenecks in AI workloads in fast-paced settings to maintain 99.99% availability

Staff Site Reliability Engineer, Embedding (MTS1) • February 2020 to March 2023
PayPal – Scottsdale, AZ

Member of a tiger team redesigning Command Center observability and alerting around key SLIs in distributed environments. Translated SLIs into actionable visualizations, reducing Mean Time to Detect by five minutes via monitoring and troubleshooting.

Selected Contributions:

Reduced Mean Time to Detect by five minutes through enhanced alerting and observability
Delivered effective alerting and observability to Command Center in internet-scale setups
Identified key incidents, enabling mitigation before large TPV impacts in high-stakes scenarios

Staff Software Engineer, Front End (MTS1) • June 2014 to February 2020
PayPal – Scottsdale, AZ

Designed and engineered user-facing web applications for real-time processing and alerts on merchant performance in distributed systems. Oversaw code reviews and release cycles, and maintained a reliable codebase.

Selected Contributions:

Led team to debug merchant-impacting issues in fast-paced environments
Provided company-wide JavaScript support and organized community events
Promoted to Staff after exceeding expectations for 3 years

Web Developer • March 2013 to June 2014
Biz Anytime (now SquadPod) – Tucson, AZ

Designed and architected products in the digital collaboration space, focusing on scalability.

Senior Engineer • October 2006 to March 2013
Brink Media – Tucson, AZ

Managed developers and implemented MVC on the front end with Backbone.js for reliable applications.

Selected Contributions:

Interfaced with clients like PayPal and MPAA, exceeding expectations

Education

Associate of Applied Science in Computer Programming/Analytics (2009) - Pima Community College, Tucson, AZ
This foundational education, combined with 15+ years of hands-on expertise, equates to a Bachelor's-level proficiency in Computer Science as required for advanced SRE roles, including distributed systems and AI/ML infrastructure.
Honors: Webby Award Honoree 2013 • CIW Site Development Associate 2012 • Addy Tucson Gold & Best of Show 2011 • Addy Tucson Silver (2) 2011 • PSAid.org First place 2010 • Webby Award Honoree 2010 • Boy Scouts of America, Eagle Scout 2007