3.8 Release retrospective

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.” –Norm Kerth, Project Retrospectives: A Handbook for Team Review

Previous retrospective: 3.7 Retrospective

Action Items

  • [x] @loic Add to release captain docs:
    • [ ] Loic: Release captain should aggressively revert features that may cause delays.
    • [ ] All: Default to tagging regressions as release blockers.
    • [ ] All: Add high-level details like customer impact on issues so Christina can help determine whether or not it’s a release blocker.
  • [ ] @thorsten create an issue with the debugging information that would have been helpful
  • [ ] @nick figure out how to prioritize work on flaky alerts
  • [ ] Individual teams: look at the release testing grid, and decide whether further automated testing work should be prioritized & tracked in 3.9.
  • [x] @nick @christina: flip the default for RFCs to make them public by default.
  • [x] @loic update release captain docs so that release testing items are updated earlier in the milestone

Discussion Items

Loic

  • Release testing assignments
    • I assigned items in the testing grid randomly to engineers on the team that owned the item. It turned out some engineers did not have the bandwidth to go through their assigned tests (because of customer work or other obligations), and items had to be reassigned after day 1 or 2 of testing. Could we go through the testing grid more quickly if it were up to each team to distribute the items they own among team members before release testing started? (+3 Farhan, loic +2, Hadrian +3) 8
      • Discussion:
        • Assignment was random, but some engineers didn’t have bandwidth, so items had to be reassigned after 1-2 days of testing.
        • Would it be better to have the teams do these assignments themselves ahead of time?
        • Thorsten: It's better to have something assigned by default. Depending on who was available on my team, we reassigned items. If the initial assignment had been up to us, it would have been harder.
        • With everything assigned by default, what do we need to do to make sure items are reassigned early enough (before release testing day)?
        • Create the Monday grid for 3.9 now; we can fill in most of it so people know in advance what they are going to test.
    • It seems like it's currently implicit that @distribution will pin the k8s.sgdev.org docker images to the RC version at the start of release testing day. Should this be made an explicit step in the release issue template instead, owned by the release captain, so that release testing can start async (in #best-timezone, for instance) without a dependency on the distribution team? Similarly, should the release captain own unpinning the images after the release? (A sketch of what that step could look like follows below.) (+2 Thorsten, +1 Kevin) 3
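
If that pinning/unpinning step does move into the release issue template, a small helper in the release tooling could make it a one-command action for the release captain. This is only a sketch under assumptions: the deployment names, container names, image repository, and the use of `kubectl set image` are illustrative, not the actual k8s.sgdev.org setup.

```typescript
// Hypothetical helper for the release tooling (names are illustrative, not the
// real k8s.sgdev.org manifests). Pins the listed deployments to a given RC tag
// by shelling out to kubectl; passing the released tag later "unpins" them.
import { execFileSync } from 'child_process'

const DEPLOYMENTS = ['sourcegraph-frontend', 'searcher', 'gitserver'] // illustrative

export function pinImages(tag: string): void {
    for (const name of DEPLOYMENTS) {
        // e.g. kubectl set image deployment/searcher searcher=sourcegraph/searcher:3.8.0-rc.1
        execFileSync(
            'kubectl',
            ['set', 'image', `deployment/${name}`, `${name}=sourcegraph/${name}:${tag}`],
            { stdio: 'inherit' }
        )
    }
}

// Usage by the release captain at the start of release testing day:
// pinImages('3.8.0-rc.1')
```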

Thorsten

  • CI is flaky (Nick +4, Chris +10, Uwe +3, Kevin +5, loic +3, Farhan +1, Eric +10, Stephen +10, Joe +5, Hadrian +1, Felix +1) 53
    • We're slowly hammering out the kinks in the e2e test suite, but there still seem to be e2e flakes that produce a lot of noise and are really hard to debug and thus to fix (example: timeouts when reading from the database)
    • In addition to the e2e test suite, there are intermittent golangci-lint errors
    • Discussion:
      • We all want this to be fixed; it feels like more issues showed up this iteration than ever before.
      • Do we have a list of the flaky tests?
      • Two things to do: make the tests not flaky, and make the error messages clearer
      • Causes:
        • Many test failures are solvable.
        • In the past we had e2e tests that were flaky and we didn't know why. Now we have a setup that allows us to fix the flaky issues.
        • There are docs on how to write good, non-flaky e2e tests (link, please: is it this one?)
        • Could we write ESLint tools to help with this?
        • Felix helped Tomas write e2e tests and it was really helpful for him. Perhaps we can do this on a team scale
        • Get to the bottom of why these are hard to debug
          • Many are related to Jest and to using Mocha. Jest keeps running when a test fails, so there is a lot of noise from subsequent failures that weren't the root cause, which adds up to it being very confusing and frustrating. (A possible mitigation is sketched after this list.)
          • Logs from the docker container are hard to find.
      • Another thought Keegan posed: Was there ever a valid test failure? Was there ever a case where an e2e test caught something? Is the ratio in favor of real test failures vs. flakes?
        • Thorsten: My experience is that only noise has come from these tests.
        • Felix: Maybe frontend failures happen more often than backend ones. I find the tests valuable even when they all pass, because with dependency upgrades a passing build saves me a lot of manual testing.
        • Chris: We would end up with the same amount of pain, just distributed differently. Right now we have a ~5% failure rate, so we can more easily correlate changes with failures.
        • Thorsten: Have test clusters that report changes in failure rate, to separate real issues from flaky tests.
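
Since much of the noise comes from Jest continuing after the first failure, a configuration tweak could help. This is a minimal sketch, assuming the e2e suite runs under Jest; `bail`, `verbose`, and `testTimeout` are standard Jest options, but the values here are illustrative, and stopping at the first failing suite is a trade-off (you lose information about other failures in the same run).

```typescript
// jest.config.ts (sketch): reduce noise from cascading e2e failures.
import type { Config } from '@jest/types'

const config: Config.InitialOptions = {
    bail: 1, // stop the run after the first failing test suite so the root cause isn't buried
    verbose: true, // report individual test results, making the failing assertion easier to spot
    testTimeout: 30_000, // generous per-test timeout for slow e2e interactions (illustrative value)
}

export default config
```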

Stephen

  • Setting up Zoom for retrospectives is really difficult. I first noticed this as release captain several months ago and saw Loic run into the same problem today. The release captain documentation says to do it 15 minutes beforehand, but in reality only a few people know how to do it or have the ability to.
  • OpsGenie alerts are flaky (Uwe is working orthogonally on standard monitoring, which would eventually address this) (Thorsten +5, Uwe +2, Nick +2, Joe +3, Hadrian +3) 15
    • Discussion:
      • This is confusing for new team members, who don't yet have intuition for which failures matter.
      • Is this from Site24x7 or from e2e test failures?
        • Site24x7
      • We should reevaluate other monitoring solutions there.
      • Clearly there is some downtime happening, probably as a result of deploys. Fundamentally, are we OK with that? Or is this something we can/need to fix?
      • Downtime is largely due to deploying gitserver changes. We don't think it can be a rolling change.
        • Verify whether gitserver's code has actually changed before deploying it.
      • The only solution is to spend time investigating.
        • It's time to prioritize this, since it hasn't gotten much love in the last year or so. If anyone wants to volunteer, let Nick know.
  • About half of all new RFCs are not being made public. Making them public has obvious benefits (and is something we have already said we want to do), so how do we get there? (+3 Christina, +2 Nick, Hadrian +3) 8
    • Discussion:
      • What can we do to make sure everyone creates these RFCs as public by default?
      • Nick: Should RFCs still have “public” in the title?

Christina

  • There are not a lot of retrospective items added ahead of time (for the last two retrospectives). Is that because things are going well? Or is it a symptom of another problem? (+3 Farhan, +3 loic, +2 Nick, +4 Christina, +2 Joe, +1 Felix) 15
    • Discussion:
      • For the last two milestones, did we not have a calendar event for when feedback was due?
      • We did eventually get feedback before the retrospective, so apparently not everything is going well.
      • People don't have the opportunity to think about the feedback proactively.
      • The more people we have, the fewer items are relevant to the whole group here rather than to a specific team.
        • There have been some retrospectives that are team-based.
        • There are fewer things that concern every team member.
      • This is a side effect of growing the organization. Since feedback is still coming through, this appears to be valuable still.
      • Get to a point where we don’t have to vote and we can just discuss everything.
      • Keep monitoring this until we reach a size where we need to formalize separate team retrospectives.

From action items (Uwe +5, +3 Farhan, +3 Christina, loic +2) 13

  • Do we still need to explicitly track automated tests in tracking issues, or are we at the point now where we can just do this during release testing week?
    • Discussion:
      • A lot of the release testing grid is automated. Is there value in specifying the tests we want to tackle in the iteration, or is there enough time during release testing week to tackle the few remaining tests?
      • How many manual items remain on the release testing grid?
        • None of the auth tests are automated (~12 tests)
        • Most of code intel is automated
        • A lot of search is automated, but 7 items still need to be automated
        • About 25 items are not automated in total
      • It is hard to automate the class of tests around authentication. I don't see how they could be reliably automated.
      • Some teams should continue focusing heavily on automated testing and seeing what is possible there. For authentication, for example, it is unclear whether we can automate it; is there something else we can do to gain confidence so that we don't have to automate it? Things around external services, site admin pages, and extensions could use more tests.
      • It might make sense to take a look at the testing grid at a team-wide planning level and decide whether that work makes sense as part of a release.
      • Part of what can be added to a tracking issue: do we need to build something to make auth easier to automate? Are we missing the right test harness to make automation easier?
      • When teams look at the grid and decide what to work on this milestone, don't say “we're not going to automate this because it is hard”; instead ask “what would it take to automate this?” and consider doing that work.