3.4 retrospective

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.” –Norm Kerth, Project Retrospectives: A Handbook for Team Review

Prev retro action items

* (x) (Quinn) Document user flows and owners of those user flows (3.2 retrospective)
  * Should we just use the release testing owners for this? Yes
* (x) (Stephen) Update PR template to remind devs to update docs (see the template sketch below)
* (x) (Quinn) When Quinn sends PRs, attach a line clarifying priority (“This should be reviewed as if it came from any other team member” or “This is a priority because XYZ”), so reviewers don’t implicitly assume the wrong priority
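For reference, the template reminder can be a single extra checklist item in the PR template. A minimal sketch (the exact wording and file contents Stephen shipped may differ):

```markdown
<!-- .github/PULL_REQUEST_TEMPLATE.md (sketch) -->
## Checklist

- [ ] Updated CHANGELOG.md if this change is user-visible
- [ ] Updated the relevant docs (user, admin, or API docs) to match this change
```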

Dev process

* (Beyang) Remove the PR approval requirement (annoying for docs, small changes, P1 patches) (+1 Ryan)(+1 Geoffrey)(+1 Issac (discuss))(+1 Loic)(+2 Beyang)(+3 Keegan)
  * Ideas:
    * Require approval for some areas of the code base, not for others (see the CODEOWNERS sketch after this section)
    * Consensus: make it easier to contribute to certain areas of the code base (docs), but the general desire is to keep it the way it is
    * Honor system: “you need approval” is enforced by culture, not tooling
      * Old retrospective docs show issues that caused problems, e.g. deploying buggy behavior to customers
    * Make it explicit in the template
  * Proposals:
    * Keep the requirement to create a PR (no pushing directly to master); PRs can be merged before being reviewed, but you MUST still get a review
    * Preserve the requirement, but with a technical exception for docs
* (Beyang) What is the meaning of milestones on GitHub? (+3 Felix)(+1 Christina)(+1 Issac)(+1 SQS)
  * Should all the issues in a given milestone be treated as release blocking?
  * Which repository’s issues can block a release (just sourcegraph/sourcegraph, or others as well)? Different teams appear to have different definitions here, which makes it hard for the release captain to determine whether a release is “ready”.
    * What about the browser extension and Sourcegraph extensions?
  * Is the tracking issue the source of truth, or is the milestone?
* (Beyang) Gaps in team ownership (+1 Thorsten)(+1 Ryan)(+1 Beyang)(+1 Issac)(+1 SQS)(+1 Loic):
  * Who should frontend web app issues like https://github.com/sourcegraph/sourcegraph/issues/4063 be assigned to?
* (Stephen) Team-specific channels have become hidden gold mines: https://sourcegraph.slack.com/archives/C07KZF47K/p1558126598217900 (+2 Christina)(+4 Stephen)(+1 Geoffrey)(+1 Ryan)(+1 SQS)
  * Options:
    * Keep important discussions in team channels, with someone taking responsibility for surfacing them in the dev chat channel
    * Require everyone to join the 4-5 channels, putting the responsibility on each person to handle notifications, reading, etc.
  * We used to have a lot of channels, and over time these got shut down. The conversations that used to happen publicly started moving to private conversations
  * Project-specific channels
    * You don’t have to subscribe to or read all the notifications; the point of having the channel is that things are available to everyone without making dev chat too noisy
  * Don’t think we need to force people into different channels
    * If it is something the team should know, share it in dev chat or product
    * This part is unclear: when to share
  * Continue discussion in a breakout group
* (Ryan) If your PR requires an addition to the changelog, also consider whether new/updated documentation is required, e.g. #3786, #4017
* (Ryan) For easier blog post writing, communicate in Slack and update the tracking issue (and roadmap if applicable) if something planned will no longer make it into the release. (+1 Issac)(+2 Christina)(+2 Ryan)(+2 SQS)
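One way to implement the “require approval for some areas, not others” idea is GitHub’s CODEOWNERS mechanism combined with a “require review from code owners” branch protection rule. A minimal sketch (the paths and team names are hypothetical, and GitHub’s branch protection may still impose a baseline approval, so treat this as the shape of the idea rather than a drop-in config):

```
# .github/CODEOWNERS (sketch; team names are hypothetical)
# Paths listed here require a review from their owners; paths left
# unlisted (e.g. docs) would carry no code-owner review requirement.
/cmd/      @sourcegraph/core-services
/web/      @sourcegraph/web-team
# /doc/ intentionally unowned so docs-only PRs can merge without review
```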
QA

* How to make release testing more efficient/automatic? (+1 Christina)(+1 Stephen)(+1 Beyang)(+2 Loic)(+2 Ryan)(+2 Issac)(+2 Geoffrey)(+2 Thorsten)(+3 Chris)(+2 SQS)
  * (Felix) How do we convert manual test plans into automated scripts? Manual testing wastes time, but we want automated scripts to be as good as manual tests. Should devs QA-test their own features, so they are incentivized to write good tests?
  * (Stephen) The massive number of manual release testing processes needed for confidence shows how lacking our automated tests are.
  * (Loic) The 3.4 test cycle was great, both for quality and knowledge sharing. However, I don’t think repeating it at the same scale every iteration is sustainable.
    * Is there a way we could spread the necessary manual testing throughout the release cycle, rather than have a testing “crunch time” at the end?
    * Ideally, manual testing would be limited to new features as much as possible, with regression testing being automated.
* (Stephen) Worried about how quickly we can get to a state of “stable releases AND regular feature development” and what we prioritize first. What is the right tradeoff here? (+2 Loic)(+3 Christina)(+2 Stephen)(+3 Geoffrey)(+3 Chris)(+1 Beyang)(+1 Ryan)(+3 SQS)(+1 Issac)
* (Farhan) There is an interesting tradeoff, as we approach the release branch being cut, between code quality and getting your feature into the release candidate. I had my saved search UI changes up in a PR, arguably ready to be merged one day before the release branch was cut, but I did not merge it because the PR had not been approved. It ended up getting merged at ~2pm the day release testing started, and since it was a low-risk change, I expected I might be able to get it into the next release candidate. Understandably, my request was rejected. But in the future, I am incentivized to push the reviewer to approve my change and then add fix-up commits after the release candidate is cut. Is this something we are OK with? How do we balance this? (+3 Stephen)(+1 Christina)(+2 Chris)(+2 Felix)(+1 Issac)
  * Keegan: missing the branch cut shouldn’t be a big deal. Even releases are major, odd releases aren’t?
  * Push customers to use insiders?

Ops & monitoring

* OpsGenie noise (+1 Beyang)(+1 Chris)(+1 Geoffrey)(+2 Keegan)(+1 Loic)(+1 Thorsten)(+1 Issac)
  * (Stephen) Flappy OpsGenie alerts everywhere: gitserver disk issues, latency warnings firing and auto-resolving, etc. (see the alert-rule sketch after this section)
  * (Thorsten) The signal-to-noise ratio for production alerts is bad
    * “Production is down” does not mean production is down
    * Alerts often “fix themselves”
    * Alerts come in batches that all fix themselves at the same time
* (Stephen) A strong need for standard monitoring in all deployment environments still exists. (+2 Geoffrey)(+2 Ryan)
* (Thorsten) Deployment workflow: automatic deployment is great but could be even better (+1 Chris)(+5 Keegan)
  * There’s a significant delay between merging something into the master branch and it being deployed
  * Unclear what will be deployed when; hard to understand what renovate/pulumi are currently doing
  * How do I easily find out which commit is currently deployed to which environment?
  * Manual deployment/rollback is not a first-class feature in the current workflow
  * A lot of noise in #bots-production due to seemingly duplicate deploy-sourcegraph-dot-com (release) messages
* (Thorsten) “Gitserver disk full”, “CI running out of disk space”
  * Happened multiple times in 2 weeks
  * Let’s automate this
* (Thorsten) Unclear which tools are best used for debugging production/on-call issues
  * Some links on sgdev.org are broken
  * Discovered Lightstep “by accident”, not on sgdev.org
  * “What’s the best way to find out which requests are causing a latency spike?”
  * (Thorsten) Stephen just linked me the on-call.md doc
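For the flappy alerts specifically, a common damping technique is to require the condition to hold for a sustained window before it pages. A sketch, assuming Prometheus-style alerting rules feed OpsGenie (the metric name and threshold are hypothetical):

```yaml
groups:
  - name: gitserver
    rules:
      - alert: GitserverDiskSpaceLow
        # Hypothetical metric; the real signal would come from whatever
        # exporter reports gitserver disk usage.
        expr: gitserver_disk_free_percent < 10
        # Stay in "pending" for 30m before firing, so warnings that
        # fire and auto-resolve within minutes never reach the pager.
        for: 30m
        labels:
          severity: page
        annotations:
          summary: gitserver has had less than 10% free disk for 30 minutes
```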
Action items:

* ( ) (Felix) Add to the iteration plan that supporting experimentation in the browser extension is a goal
* ( ) (Beyang) Better define code owners
  * Identify gaps in the team structure as well, and make sure everything is owned
* ( ) (Beyang and Christina) Document the agency, ownership, and expectations of engineering owners
* ( ) (Beyang and Christina) Create a checklist that product delivers to eng; eng’s responsibility is to verify all requirements are met
* ( ) (Beyang) Discuss how to improve QA/release testing: Chris, Loic, Ryan, Geoffrey, Stephen, Issac, Felix, Christina
* ( ) (Christina) Figure out how to prioritize features vs. testing automation/stability (Stephen)
* ( ) (Christina) Release cadence, and how to handle features not making the release cut in time
* ( ) (Keegan, then Beyang and Geoffrey) Look into improving alerting, stability, “ops love”
* ( ) (Christina) Make a decision and communicate what we’re going to do about the PR pain point challenges
* (x) (Beyang) Make people aware of the Slack channels that exist and make it part of onboarding; add a bot in announce that announces new channels
* (x) (Christina) What do milestones mean?
* ( ) (Christina) Keep the blog post up to date over the course of the iteration; keep tracking issues up to date
* ( ) (Distribution team) Ops noise and monitoring
* ( ) (Thorsten) Deployment workflow: make it shorter and easier to deploy things

Retro retro:

* Started 5 minutes late; AV issues (Beyang using Linux) burned another 5 minutes
* Ended 8 minutes late
* Each point should have a proposal, even a strawman, to kick off the discussion
* Should root-cause before going forward in a discussion (otherwise we talk in circles)
* Some issues should just be assigned to an owner whom the team trusts to run it; others benefit from more discussion (e.g., team Slack structure)
