Scottsdale, AZ 85257 • (520) 220-1171 • hi@brk.dev • www.linkedin.com/in/bryonkeck
Engineering Leader specializing in Reliability and Incident Management for distributed, internet-scale environments. Expert in Traffic Shifting, Distributed Systems Recovery, and AI/ML infrastructure. Delivered 95% reduction in SLA payouts ($400k to $4k) through incident response, problem management, and root cause analyses (RCAs) in on-prem and cloud settings. Demonstrated strong problem-solving and prioritization in high-stakes, ambiguous environments, with effective communication across teams to drive continuous improvement.
Key Qualifications
Software: JavaScript, Python, Go, Ruby, Node.js, Linux, Bash, Docker
Cloud & Infrastructure: AWS, GCP, Terraform, Kubernetes, Helm, Prometheus, Grafana, Datadog, PagerDuty, Ansible
Leadership & SRE: Incident Management, Problem Management, Root Cause Analyses (RCAs), SRE Principles, Error Budgets, Blameless Postmortems, SLO/SLI/SLA, AI-Driven Incident Analysis, Traffic Shifting, Cross-Functional Governance, On-Call Rotations, AI/ML Infrastructure, HTTP
Head of Incident Response & Prevention Engineering • July 2025 to Present • MX Technologies – Lehi, UT
Reports directly to the VP of Engineering. Leads prevention-focused incident management in distributed environments, emphasizing SRE principles like error budgets, toil reduction, and blameless postmortems. Automates detection/mitigation, uplifts observability, and enforces governance to minimize MTTR/MTTD and prevent issues. Owns client SLA reporting, reliability enhancements, and AI-driven analysis in on-prem and cloud infrastructures.
Selected Contributions:
- Demonstrated strong problem-solving by commanding 50+ critical incidents (SEV1/SEV2) in on-call rotations, coordinating 15+ engineers for rapid mitigation via rolling restarts and traffic shifts, reducing engineering fatigue and MTTR by 50%
- Owned incident response and problem management, conducting thorough RCAs to prevent recurrence, resulting in 15% reduction in recurring incidents
- Mastered Traffic Shifting strategies in distributed systems, using multi-region load balancing to mitigate failures (e.g., 90% traffic diversion), maintaining 99.99% uptime
- Enforced deployment governance (e.g., N-1 compatibility checks), preventing critical failures in framework upgrades (e.g., Rails 7) across on-prem and cloud environments
- Designed and tested Disaster Recovery (DR) plans for infrastructure resilience, ensuring continuity during outages
- Led observability initiatives with Golden Signals (Latency, Traffic, Errors) to differentiate vendor vs. internal failures
- Automated toil with AI tooling for incident timelines, reducing RCA draft time by 75% and freeing 20+ engineering hours weekly
- Collaborated with cross-functional teams using clear communication to resolve complex distributed system failures, reducing recurrence by 15%
- Applied prioritization skills to advise CTO via Operational Reviews, identifying risks and securing resources for debt remediation in high-stakes settings
- Regularly engaged with clients regarding platform reliability, uptime improvement roadmaps, AWS cloud migration, maintenance timing, and enhanced incident communication, improving customer satisfaction and alignment
Senior Software Engineer, Site Reliability Engineering • March 2023 to June 2025 • MX Technologies – Lehi, UT
Owned ingress infrastructure for applications in distributed, internet-scale environments, ensuring 99.99% availability via monitoring and multi-region shifting in on-prem and cloud (AWS/GCP) setups. Led observability efforts, driving APM adoption and tracing. Enforced governance in incident response, serving as the escalation SME for RCAs. Oversaw Datadog rollout, improving reliability and reducing payouts.
Selected Contributions:
- Drove 95% reduction in SLA credit payouts, from $400,000 in FY23 to $4,000 in Q1 2024 through effective incident management
- Executed multi-region traffic shifting to mitigate outages in <5 minutes in high-stakes environments
- Implemented real-time Load Balancer Golden Signals Dashboard, reducing MTTD from days to minutes
- Onboarded 8+ applications into Datadog APM, enhancing telemetry in distributed systems
- Enabled span attribution, improving RCA and reducing MTTR in AI-related workloads
- Achieved 45% of days with 99.99%+ availability via ingress monitoring and canary testing
- Established escalation paths in on-call rotations, reducing MTTM by 50%
- Contributed to RCAs as SME, driving availability improvements across hybrid infrastructures
- Thrived in fast-paced settings by troubleshooting bottlenecks in AI workloads, showcasing work ethic and motivation to maintain 99.99% availability
Staff Site Reliability Engineer, Embedding (MTS1) • February 2020 to March 2023 • PayPal – Scottsdale, AZ
Part of a tiger team redesigning Command Center observability and alerting around key SLIs in distributed environments. Turned SLIs into actionable visualizations, reducing Mean Time to Detect by five minutes via monitoring and troubleshooting.
Selected Contributions:
- Reduced Mean Time to Detect by five minutes through enhanced alerting and observability
- Delivered effective alerting and observability to Command Center in internet-scale setups
- Identified key incidents, enabling mitigation before large TPV impacts in high-stakes scenarios
Staff Software Engineer, Front End (MTS1) • June 2014 to February 2020 • PayPal – Scottsdale, AZ
Designed and engineered user-facing web applications for real-time processing and alerts on merchant performance in distributed systems. Oversaw code reviews and release cycles, and maintained a reliable codebase using systems programming.
Selected Contributions:
- Led team to debug merchant-impacting issues in fast-paced environments
- Provided company-wide JavaScript support and organized community events
- Promoted to Staff after exceeding expectations for 3 years
Web Developer • March 2013 to June 2014 • Biz Anytime (now SquadPod) – Tucson, AZ
Designed and architected products in the digital collaboration space, focusing on scalability.
Senior Engineer • October 2006 to March 2013 • Brink Media – Tucson, AZ
Managed developers and implemented front-end MVC with Backbone.js for reliable applications.
Selected Contributions:
- Interfaced with clients like PayPal and MPAA, exceeding expectations
Associate of Applied Science in Computer Programming/Analytics (2009) - Pima Community College, Tucson, AZ
This foundational education, combined with 15+ years of hands-on expertise, equates to a Bachelor's-level proficiency in Computer Science as required for advanced SRE roles, including distributed systems and AI/ML infrastructure.
Honors: Webby Award Honoree 2013 • CIW Site Development Associate 2012 • Addy Tucson Gold & Best of Show 2011 • Addy Tucson Silver (2) 2011 • PSAid.org First place 2010 • Webby Award Honoree 2010 • Boy Scouts of America, Eagle Scout 2007