Closed Bug 633161 Opened 13 years ago Closed 11 years ago

TryServer repository: Investigate (again) dealing with older heads

Categories

(Developer Services :: General, task, P3)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 770811

People

(Reporter: sgautherie, Assigned: hwine)

References

Details

(Whiteboard: [tryserver][cleanup])

Attachments

(3 files)

From bug 611030 comment 12:

"hg commit --close-branch" could probably be used at least, if it does help.
"hg strip --nobackup" might be possible and even better.
moving to the appropriate component for hg.m.o server side work
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
What is there to investigate here?  More details in this request please.
Assignee: server-ops → nmeyerhans
See bug 554656 for historical information about the discussions that have happened about this issue. Right now we're at the point again of needing to trim back the heads - I'll file a separate bug for that.
Blocks: 554656
So we need to know what happens when we run 'hg strip -n REV' [1] on the live try repo so that we might be able to do this automatically at regular intervals to keep try from getting so slow and full of heads.  

In order to determine if this is a viable option we will need to:

1. Enable the mq extension on the hg.mozilla.org/try .hgrc
2. Announce a time window to stakeholders, explaining that the strip command will be run during that window and that pushes to Try can (and should) continue, so we learn how a live repo behaves. If there is burning during the window it may be caused by the head cleanup, so anyone who pushes to Try during that time may need to push again later for best results.
3. Have the "Resetting Try Repo" instructions at the ready in case stripping the repo explodes somehow and the repo needs to be reset.

This way, if anything goes wrong during the window we set and we have to reset the try repo, devs will already be aware that pending builds may get lost in the reset.

So the next step, as I see it, is deciding when to do this and who from ServerOps will be on point for trying the hg strip on try repo - I would be happy to see it happen this week if possible.  A four hour window is best based on previous experience with try. Early Friday morning perhaps?


[1] http://mercurial.selenic.com/wiki/PruningDeadBranches#Using_strip
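
For reference, a minimal sketch (assumptions only, not the production command) of how the "older than 10 days" heads might be selected before any strip runs; the repo path and cutoff are hypothetical, and the heads/hgdate template bits are stock Mercurial:

#!/usr/bin/env python
# Sketch only: list try heads older than a cutoff, i.e. candidate arguments
# for "hg strip --nobackup". Repo path and 10-day window are illustrative.
import subprocess
import time

REPO = '/repo/hg/mozilla/try'            # hypothetical local clone path
CUTOFF = time.time() - 10 * 24 * 3600    # "older than 10 days"

out = subprocess.check_output(
    ['hg', '-R', REPO, 'heads', '--template', '{node|short} {date|hgdate}\n'],
    universal_newlines=True)
for line in out.splitlines():
    node, unixtime, _offset = line.split()
    if float(unixtime) < CUTOFF:
        print(node)   # a head old enough to be a strip candidate
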
(In reply to comment #4)
> So we need to know what happens when we run 'hg strip -n REV' [1] on the
> live try repo so that we might be able to do this automatically at regular
> intervals to keep try from getting so slow and full of heads.

For which values of N will we run this command?  Everything output by "hg heads"?

> 1. Enable the mq extension on the hg.mozilla.org/try .hgrc

I can do that any time.

> 3. Have the "Resetting Try Repo" instructions at the ready in case stripping
> the repo explodes somehow and the repo needs to be reset.

Re-cloning try from the mozilla-central repo is scripted, so it's essentially a single command. (/repo/hg/scripts/reset_try.sh from dm-svn01 or 02)

> So the next step, as I see it, is deciding when to do this and who from
> ServerOps will be on point for trying the hg strip on try repo - I would be
> happy to see it happen this week if possible.  A four hour window is best
> based on previous experience with try. Early Friday morning perhaps?

I can't promise that I'll be able to keep a close eye throughout the process, but I could kick it off around 7:30 or 8:00.  Somebody else from my team could probably do it earlier if we can provide them with a specific command to run.
(In reply to comment #5)
> (In reply to comment #4)
> > So we need to know what happens when we run 'hg strip -n REV' [1] on the
> > live try repo so that we might be able to do this automatically at regular
> > intervals to keep try from getting so slow and full of heads.
> 
> For which values of N will we run this command?  Everything output by "hg
> heads"?

Anything older than 10 days?  That's how long we keep builds for so that would seem to be a safe N to choose.

> I can't promise that I'll be able to keep a close eye throughout the
> process, but I could kick it off around 7:30 or 8:00.  Somebody else from my
> team could probably do it earlier if we can provide them with a specific
> command to run.

7:30 sounds fine, it's early enough for try to likely be slow. I'll be able to keep an eye on the try results and watch #developers for issues or questions.  If you're good to commit to this, I'll draft an announcement to the appropriate lists.
(In reply to comment #6)
> 7:30 sounds fine, it's early enough for try to likely be slow. I'll be able
> to keep an eye on the try results and watch #developers for issues or
> questions.  If you're good to commit to this, I'll draft an announcement to
> the appropriate lists.

OK, 7:30 is fine. Go ahead and draft the announcement.
(In reply to comment #4)

> to know what happens when we run 'hg strip -n REV' [1] on the live try repo

At first (at least), you probably want to use --time and maybe --verbose --debug --traceback.

Was investigation done on a Try repo (or a part of it at least) clone first?
Are results published (or attached here)?

> 3. Have the "Resetting Try Repo" instructions at the ready in case stripping
> the repo explodes somehow and the repo needs to be reset.

Could it be possible to clone it beforehand, so ServerOps could later reproduce and debug (if need be), should any such breakage happen?


(In reply to comment #6)
> Anything older than 10 days?  That's how long we keep builds for so that
> would seem to be a safe N to choose.

Yes, "as long as builds (logs) are available" should be the right timeframe to keep.

Yet, there is no requirement to clean all old heads at once:
first we (just) need to know how (long) it behaves [and account for current backlog (repo) size], then we can schedule/automate this task.

As an assumed perf optimization, strip loop should run in reverse local rev order.
(And hopefully, the live repo has an empty working directory, to avoid updates.)
> At first (at least), you probably want to use --time and may be --verbose
> --debug --traceback.

I don't see these options you mention in http://mercurial.selenic.com/wiki/PruningDeadBranches#Using_strip; however, if Noah can find a way to use those, that's certainly a good idea for tracking what's happening.
> 
> Was investigation done on a Try repo (or a part of it at least) clone first?
> Are results published (or attached here)?

This is that 'investigation'.
> 
> > 3. Have the "Resetting Try Repo" instructions at the ready in case stripping
> > the repo explodes somehow and the repo needs to be reset.
> 
> Could it be possible to clone it beforehand, so ServerOps could later
> reproduce and debug (if need be), should any such breakage happen?

There's nothing about this experiment that is really so different from our usual downtimes to reset the try repo so that level of preparation seems unnecessary - the win is if we discover we can strip on a live repo; if that's not possible, we can take that information and set up regular reset blackouts on try.

> 
> Yet, there is no requirement to clean all old heads at once:
> first we (just) need to know how (long) it behaves [and account for current
> backlog (repo) size], then we can schedule/automate this task.
> 

Good point, and that is indeed what bug 554656  will be aiming to do once we learn from this experiment.

> As an assumed perf optimization, strip loop should run in reverse local rev
> order.
> (And hopefully, the live repo has an empty working directory, to avoid
> updates.)

Noah, I trust you can make the call on this as we monitor the results? Not sure how we'd know when we hit the "try is no longer slow/enough heads have been stripped" point though which is why I figured we should just go for _all_ up until 10 days ago.
(In reply to comment #9)

> I don't see these options you mention in
> http://mercurial.selenic.com/wiki/PruningDeadBranches#Using_strip

These additional options are global: 'hg -v help'.

> > Was investigation done on a Try repo (or a part of it at least) clone first?
> > Are results published (or attached here)?
> 
> This is that 'investigation'.

Well, I still have the feeling that the details you give look more like a direct-to-production action than a (brief) staging-first one.

> There's nothing about this experiment that is really so different from our
> usual downtimes to reset the try repo so that level of preparation seems
> unnecessary

But if you're used to doing it that way, then let's do it.
(In reply to comment #9)
> > As an assumed perf optimization, strip loop should run in reverse local rev
> > order.
> > (And hopefully, the live repo has an empty working directory, to avoid
> > updates.)
> 
> Noah, I trust you can make the call on this as we monitor the results? Not
> sure how we'd know when we hit the "try is no longer slow/enough heads have
> been stripped" point though which is why I figured we should just go for
> _all_ up until 10 days ago.

My plan is to write a program that iterates over the heads in chronological order, running hg strip. It'll stop either when it finds a head that is less than 10 days old, or when we send it a sigterm.  I'll write this script today and post dry-run output if you're interested.
I feel like I've discussed this somewhere, but just stripping heads isn't going to fix things, unless you do it repeatedly. If someone pushes two MQ patches to try, then one of them will be a head, but its parent will still be a changeset that doesn't exist in mozilla-central. If you strip the head, the parent will just wind up as a new head.

Presumably what you really want to do is something like:
for each head H that's older than 10 days:
  X = the common ancestor of this head and the newest head
  Y = the child of X that's an ancestor of H
  strip Y (and its descendants)

This probably isn't too hard to do from a shell script or a Python script using the Mercurial API.
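
For illustration, a minimal sketch of that loop driving the hg CLI with revsets; this is an assumption-laden stand-in rather than the attached script, and it needs the mq extension enabled for "hg strip":

#!/usr/bin/env python
# Sketch of the ancestor-based strip described above, using revsets via the
# hg CLI. The caller supplies the repo path, the newest head, and the list
# of old heads; everything here is illustrative.
import subprocess
import sys

def hg(repo, *args):
    # run an hg command against the given repo and return its output
    return subprocess.check_output(('hg', '-R', repo) + args,
                                   universal_newlines=True).strip()

def strip_old_head(repo, old_head, newest_head):
    # X = the common ancestor of this head and the newest head
    x = hg(repo, 'log', '-r', 'ancestor(%s, %s)' % (old_head, newest_head),
           '--template', '{node}')
    # Y = the child(ren) of X that are ancestors of H
    ys = hg(repo, 'log', '-r',
            'children(%s) and ancestors(%s)' % (x, old_head),
            '--template', '{node}\n').split()
    if not ys:
        return  # nothing try-specific left to strip for this head
    # strip Y and its descendants (requires the mq extension)
    hg(repo, 'strip', '--nobackup', *ys)

if __name__ == '__main__':
    repo, newest = sys.argv[1], sys.argv[2]
    for head in sys.argv[3:]:
        strip_old_head(repo, head, newest)

Invocation would look something like: python stripheads_sketch.py /path/to/try NEWEST_HEAD OLD_HEAD1 OLD_HEAD2 ...
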
Here's a script that does what I just described. It works with my test repo, and I think it will do the right thing with try, but I am not about to clone try to find out.

You can have the script do the stripping too; I left that line in but commented it out. Otherwise it will print the ancestor changesets that you should feed into strip.
Here's the script I used to make a test repo that should sort of look like try.
Attached file try simulator script
Here's an expanded version of the mkheads.py script that sort of simulates the try server repo, with active pushes every 5 seconds. Just run mkheads.py, and it will populate /tmp/testrepo with 100 pushes, and then clone it to /tmp/clonerepo, and push new changes from there to /tmp/testrepo every 5 seconds. I left it running for a little bit, then used the stripheads.py script (attached here, but with the strip line uncommented) like so:
hg -R /tmp/testrepo heads --template="{node|short}\n" | tail -n50 | xargs python /home/luser/build/stripheads.py /tmp/testrepo

It took a few seconds to run, but it'd strip the oldest 50 heads. If a push happened while that was in progress, the pushing script would simply wait for the lock to become available:
pushing to /tmp/testrepo
waiting for lock on repository /tmp/testrepo held by 'cuatro:22567'
searching for changes
adding changesets
adding manifests                                                                
adding file changes
added 3 changesets with 3 changes to 1 files (+1 heads)

You can check the number of heads in testrepo easily using:
hg -R /tmp/testrepo heads --template="{node|short}\n" | wc -l
Comment on attachment 531989 [details]
strip off heads and their ancestor changesets

Suggestion for production usage:

>for h in sys.argv[2:]:

"for h in reversed(sys.argv[2:]):"
assuming nodes are listed from oldest to newest on command line.

>    if h not in repo:
>        continue
>    [...]
>    if ancestor == nullid:
>        continue

"print" a (explicit) warning/error in these cases, ftr.
(In reply to comment #15)
> try simulator script

Thanks for making this script and doing this test!
Confirmation that it behaves as expected is always nice to have ;-)

> hg -R /tmp/testrepo heads --template="{node|short}\n" | tail -n50 | xargs
> python /home/luser/build/stripheads.py /tmp/testrepo

Just in case, should we use "heads --topo" in real life?
(I'm not familiar with non-topological heads...)

We may even want to use "heads --closed".

We still need code to select the heads older than the last 10 days.
Ftr, the output of 'hg heads --template="{rev} {node|short} {date|age}\n"' looks like
{
7762 9e5f4a666601 20 hours ago
7532 87ac8d87621b 4 weeks ago
7348 ba2638749f51 2 months ago
7114 0db87b9adc18 3 months ago
6964 3662ba4e22e5 3 months ago
5527 715997c56aaa 12 months ago
}
Note that revs are listed in reverse order (= most recent first), so the stripheads.py loop would be fine as is, but your (test) command should use "head" instead of "tail".

> You can check the number of heads in testrepo easily using:
> hg -R /tmp/testrepo heads --template="{node|short}\n" | wc -l

Also, I wonder if it could be possible not to strip heads which are still open in the "parent" repository (i.e. mozilla-central).
Yet, it's probably (much) easier and no big deal to just remove them too.
(In reply to comment #17)
> Just in case, should we use "heads --topo" in real life?
> (I'm not familiar with non-topological heads...)

I think in real life we should probably query the pushlog db to get the list of heads to strip. It'd be a pretty simple SQL query.

> Also, I wonder if it could be possible not to strip heads which are still
> open in the "parent" repository (i.e. mozilla-central).
> Yet, it's probably (much) easier and no big deal to just remove them too.

It doesn't matter either way, if we strip something that's alive in m-c, the next person to push to try will just push it back anyway.
(In reply to comment #18)
> It doesn't matter either way, if we strip something that's alive in m-c, the
> next person to push to try will just push it back anyway.
But which head gets built then?  I'd hate to see that person have to push again because the wrong head was built.
We enforce a single head per named branch on m-c. The only possible other heads are on named branches.

It doesn't really matter, if we're picking heads to strip I think we should be pulling them out of the pushlog, which should only ever contain heads of stuff people have pushed to the try server.
(In reply to comment #18)

> I think in real life we should probably query the pushlog db to get the list
> of heads to strip. It'd be a pretty simple SQL query.

Looks very promising :-)

> if we strip something that's alive in m-c, the
> next person to push to try will just push it back anyway.

Yeah :-|
The try server is acting up again, apparently, so I thought I'd revisit this. I think this fell by the wayside last time because try exploded and it got completely wiped out.

I believe the attached script above should work fine. We might need to test it on a more realistic try repo, I've only tested it on my simulated try repo (produced using the other two attached scripts).

I toyed with a local pushlog, and I think this query should let us get a list of heads to strip directly from the pushlog db:
SELECT node from changesets INNER JOIN pushlog ON pushid = id WHERE date < strftime('%s','now','-10 days') GROUP BY pushid ORDER BY rev DESC;

You should be able to feed that right to the stripheads.py script like:
sqlite3 /path/to/try/.hg/pushlog2.db "SELECT ..." | xargs python stripheads.py /path/to/try
I think we ought to go ahead and declare a sort of "soft downtime" to try this out. The Try repo is sucking anyway, we can't make it that much worse. If things really go bad we'll just trash the repo and regroup.
Adding coop and dustin to keep an eye on coordinating this with IT.
(In reply to comment #23)
> I think we ought to go ahead and declare a sort of "soft downtime" to try
> this out. The Try repo is sucking anyway, we can't make it that much worse.
> If things really go bad we'll just trash the repo and regroup.

I'm happy to do this any time.  Say the word, either here or on #ops.
stripheads ran successfully. It took 124 minutes to complete on a try repository that was last cloned from mozilla-central 5 or 6 weeks ago.  Assuming nobody reports any problems resulting from this maintenance, it's probably worth it to automate a stripheads on a roughly weekly basis.
It turns out that subsequent runs of stripheads still take a long time, because pushlog still contains references to all the heads that were stripped in the previous run.  Assuming we're using the SQL from Comment 22 to identify the heads to strip, is it ok to delete those pushlog entries after stripheads runs?
Yeah, you can probably just "DELETE from pushes WHERE date < strftime('%s','now','-10 days')" after you run stripheads. Alternately, might make more sense to wrap this in a script so you use the exact same timestamp to do the select and then the delete.
Er, you'll want to delete the linked entries in the changesets table too. My SQL-fu is weak.
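
Following up on that, a sketch of the "wrap it in a script" idea: compute one cutoff and reuse it for the SELECT and both DELETEs. Table and column names are taken from the queries quoted in this bug; the db path and 10-day window are illustrative, and the real pushlog schema should be double-checked before running anything like this.

#!/usr/bin/env python
# Sketch: one timestamp shared by the head-selection query and the pruning
# deletes, so nothing slips between the select and the delete.
import sqlite3
import time

DB = '/path/to/try/.hg/pushlog2.db'          # illustrative path
cutoff = int(time.time()) - 10 * 24 * 3600   # one timestamp for every step

conn = sqlite3.connect(DB)
heads = [row[0] for row in conn.execute(
    "SELECT node FROM changesets INNER JOIN pushlog ON pushid = id "
    "WHERE date < ? GROUP BY pushid ORDER BY rev DESC", (cutoff,))]

# ... feed `heads` to stripheads.py here, then prune with the same cutoff ...

conn.execute("DELETE FROM changesets WHERE pushid IN "
             "(SELECT id FROM pushlog WHERE date < ?)", (cutoff,))
conn.execute("DELETE FROM pushlog WHERE date < ?", (cutoff,))
conn.commit()
conn.close()
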
Blocks: 676420
Apparently I never tested this script on a repository with the pushlog hook enabled. Due to the way "hg strip" works, it causes the pushlog to throw an error. We'll need to figure out if that's something we can work around.
Is the only outstanding issue here what SQL statements to run to prune out the pushlog?
This should work:

delete from changesets where pushid in (select id from pushlog where date < strftime('%s','now','-10 days'))

although I agree with Ted, better to generate a single timestamp to use for all 3 delete operations.
No, the issue is that "hg strip" fiddles with changesets (to preserve continuity of revids, I think?), so it actually triggers a pretxnchangegroup hook, and the pushlog hook fails because it tries to insert a row with a duplicate 'rev'.
Trying to triage some bugs that Noah left behind: what is the next action for IT on this one?
Assignee: nmeyerhans → cshields
It sounds like there's a remaining issue regarding pushing during the pruning process.

Ted, would a process like the following work:
 1. install a hook (what sort?) that rejects pushes with a message saying "We're pruning the try repository .. try again soon" (maybe even with an ETA)
 2. run the SQL above
 3. remove the hook

Do you or catlee have a rough idea how long the pruning takes? seconds? hours? minutes?

We could try that manually the first time, and then run a script the next time, and then build a crontask.

Corey, I can drive this if you'd like.  I need to learn more about hg anyway.
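
For step 1 above, one possible shape for the reject hook (purely hypothetical; the hook name, file path, and message are assumptions, not an existing Mozilla hook):

# reject_pushes.py -- hypothetical in-process hook sketch: while pruning is
# in progress, refuse incoming pushes with a short message. It would be
# enabled/disabled by adding/removing an entry in the repository's hgrc:
#
#   [hooks]
#   prechangegroup.trymaintenance = python:/path/to/reject_pushes.py:hook

def hook(ui, repo, **kwargs):
    ui.warn("Try is closed while old heads are pruned; "
            "please push again in about two hours.\n")
    return 1  # a non-zero return from a pre* hook rejects the push
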
dustin: the pushlog hook actually needs changes. Due to the way "hg strip" works, it unfortunately winds up triggering the pushlog hook while it's running.
Ah, thanks for correcting me.  Who can we put on the hook (har har) to make those changes?
Assignee: cshields → dustin
This looks like a releng/dev issue for the moment.
Assignee: dustin → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
(In reply to Dustin J. Mitchell [:dustin] from comment #35) 
> Do you or catlee have a rough idea how long the pruning takes? seconds?
> hours? minutes?

Noah answered this in comment #26: ~2 hours

(In reply to Dustin J. Mitchell [:dustin] from comment #37)
> Ah, thanks for correcting me.  Who can we put on the hook (har har) to make
> those changes?

Don't know, but I'm pretty sure it's not us, given lack of expertise and access to the hg machines. 

Who handles hg on the IT side now that Noah is gone? bkero?

Can IT make a clone of the existing try repo and test the stripheads+pushloghook changes+delete solution?
Assignee: nobody → server-ops
Severity: enhancement → normal
Component: Release Engineering → Server Operations
Priority: -- → P3
QA Contact: release → cshields
Whiteboard: [tryserver][cleanup]
There were some outstanding issues with stripheads last time, which is why I had to prune try the old way (vs. using stripheads). I think Ted has more info.
See comment 30 and comment 33. As it stands, the pushlog hook and my stripheads script will not work properly together.
Assignee: server-ops → shyam
(In reply to Ted Mielczarek [:ted, :luser] from comment #28)
> Alternately,
> might make more sense to wrap this in a script so you use the exact same
> timestamp to do the select and then the delete.

Alternately you may use strftime('%s','now','localtime','start of day','utc'), that is unique enough to allow you avoiding an external timestamp.
Er, I forgot the 10 days; so it should be strftime('%s','now','localtime','start of day','-10 days','utc').
(In reply to Marco Bonardo [:mak] from comment #42)
> Alternately you may use strftime('%s','now','localtime','start of
> day','utc'), that is unique enough to allow you avoiding an external
> timestamp.

To be explicit, that's true except when the run straddles (UTC) midnight.
Ted, so what's the course of action here?
Assignee: shyam → server-ops-devservices
Component: Server Operations → Server Operations: Developer Services
QA Contact: cshields → shyam
Assignee: server-ops-devservices → shyam
Right now the pushlog's changesets table uses 'rev' as a primary key:
http://hg.mozilla.org/hgcustom/hghooks/file/1e7a365890ab/mozhghooks/pushlog.py#l40
http://hg.mozilla.org/hgcustom/hghooks/file/1e7a365890ab/mozhghooks/pushlog.py#l73

'rev' is an integer that increments for each changeset (this is provided by the Hg repo).

The problem right now boils down to:
1) Running "hg strip" modifies the repository in such a way that it winds up calling the pushlog hook on groups of changesets that were previously committed to the repo.
2) As part of this operation, the rev number of those changesets is modified (presumably renumbered to remove the gap of the stripped changesets).
3) The pushlog hook attempts to insert the existing changesets, but blows up because it creates a duplicate in the 'rev' primary key field.

We probably need to make two fixes:
1) Stop using 'rev' directly as the primary key; just use an autoincrement integer key.
2) Make the pushlog ignore changes that happen as a result of 'hg strip', since it already knows about these changesets anyway.
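
To illustrate fix 2, a hypothetical fragment of what that skip could look like (a sketch against the schema discussed in this bug, not a patch to the real mozhghooks/pushlog.py):

# Before inserting a row, skip any changeset the pushlog database already
# knows about, so the hook invocations triggered by "hg strip" become no-ops.
def record_changeset(conn, pushid, rev, node):
    seen = conn.execute(
        "SELECT 1 FROM changesets WHERE node = ?", (node,)).fetchone()
    if seen:
        return  # re-seen during a strip; nothing new to record
    conn.execute(
        "INSERT INTO changesets (pushid, rev, node) VALUES (?, ?, ?)",
        (pushid, rev, node))
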
Why not use the alphanumeric changeset ID as the primary key? Adding the autoincrement integer key seems somewhat pointless.
We use the rev for ordering in queries:
http://hg.mozilla.org/hgcustom/pushlog/file/e99a36d3fd4a/pushlog-feed.py#l138

I guess we could handle that post-query by sorting them in Python using the changeset data from Hg, but that makes things a little more complicated.
Assignee: shyam → ted.mielczarek
I believe Hal is taking care of this in RelEng now.
Assignee: ted.mielczarek → hwine
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #49)
> I believe Hal is taking care of this in RelEng now.

Actually not for a long time. We tried this for a while, and found that it was non-deterministic in improving things. See, e.g. the discussion in bug 734225 comment 24.

The gotcha with all of these approaches is that performance is more dependent on "depth" than "width", and "depth" is controlled by where the bugs are (cf. bug 770811 comment 4).

At this point, should this bug be closed in favor of bug 770811, which has slightly newer information?
Bug 770811 has more current info and actions - work is happening there.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services