The offending changeset that broke hgchanges yesterday turns out to be a merge from an ancient branch to current tip. That makes the diff insanely huge which is why things like hgweb were tripping over it. Kwierso point out that just ignoring those changesets would solve the problem. It’s not ideal but since in this case they aren’t useful changesets I’ve gone ahead and done that and so hgchanges is now updating again.
My handy tool for tracking changes to directories in the mozilla mercurial repositories is going to be broken for a little while. Unfortunately a particular changeset seems to be breaking things in ways I don’t have the time to fix right now. Specifically trying to download the raw patch for the changeset is causing hgweb to timeout. Short of finding time to debug and fix the problem my only solution is to wait until that patch is old enough that it no longer attempts to index it. That could take a week or so.
Obviously I’ll happily accept patches to fix this problem sooner.
Nearly a year ago I showed off the first version of my webapp for displaying recent changes in mercurial repositories. I’ve heard directly from a number of people that use it but it’s had a few problems that I’ve spent some of my spare time trying to solve. I’m now pleased to say that a new version is up and running. I’m not sure it’s accurate to call this a rewrite since it was entirely incremental but when I look back over of the code changes there really isn’t much that wasn’t touched in some way or other. So what’s new? For you, a person using the site absolutely nothing! So what on earth have I been doing rewriting the code?
Hopefully I’ve been making the thing a whole lot more stable. Eagle eyed users may have noticed that periodically the site would stop updating, sometimes for days at a time. The site is split into two pieces. One is a django website to display the changes. The other is the important bit that reads the change data out of mercurial and puts it into a database that the website can use. I do this caching because it would be too expensive to work it out on the fly. The problem is that the process of reading the data out of mercurial meant keeping a local version of every repository on the server (some 5GB for the four currently tracked) and reading the changes using mercurial’s python API was slow and used a lot of memory. The shared hosting environment that I run this out of kept killing the process when it used too many resources. Normally it could recover but occasionally it would need a manual kick. Worse there is some mercurial bug where sometimes pulling a repository will end up with an inconsistent database. It’s rare so many people don’t see it but when you’re pulling repositories every ten minutes it starts happing every couple of weeks, again requiring manual work to fix the problem.
I did dabble with getting this into shape so I could run it on a mozilla server. PAAS seemed like the obvious option but after a lot of work it turned out that the database size restrictions there made this impossible. Since then I’ve seem so many horror stories about PAAS falling over that I think I’m glad I never got there. I also tried getting a dedicated server from Mozilla to run this on but they were quite rightly wary when I mentioned that the process used a bunch of memory and wanted harder figures before committing, unfortunately I never found out a way to give those figures.
The final solution has been ripping out the mercurial API dependency. Now rather than needing local mercurial repositories the import process instead talks to them over the web. The default mercurial web APIs aren’t great but luckily all of the repositories I care about have pushlog installed which has a good JSON API for retrieving recently pushed changesets. So now the process is to pull the recent pushlog, get the patches for every changeset and parse them to work out which files have changed. By not having to load the mercurial structures there is far less memory used and since the process now has to delay as it downloads each patch file it keeps the CPU usage low.
I’ve also made the database itself a lot more efficient. Previously I recorded every changeset for every repository, but many repositories have the same changeset in them. So by noting that the database gets a little more complicated but uses much less space and since you only have to parse the changeset once the import process becomes faster. It took me a while to get the website performing as fast as before with these changes but I think it’s there now (barring some strange slowness in Django that I can’t identify). I’ve also turned on some quite aggressive caching both on the client and server side, if someone has visited the same page as you’re trying to access recently then it should load super fast.
So, the new version is up and running at the same location as the old. Chances are you’re using it already so if you’re noticing any unexpected slowness or something not looking right please let me know.
Back in the old days when we used CVS of all things for our version control we had a wonderful tool called bonsai to help query the repository for changes. You could list changes on a per directory basis if you needed which was great for keeping an eye on certain chunks of code. I recall there being a way of getting an RSS feed from it and I used it when I was the module owner of the extension manager to see what changes were landed that I hadn’t noticed in bugs.
Fast forward to today and we use mercurial instead. When we switched there was much talk of how we’d get tool parity with CVS but bonsai is something that has never been replaced fully. Oh hgweb is decent at looking at individual files and browsing the tree, but you can’t get that list of changes per directory from it. I believe you can use the command line to do it but who wants to do that? Lately I’ve been finding need of those directory RSS feeds more and more. We’re now periodically uplifting the Add-on SDK repository to mozilla-central, it’s really important to spot changes that have been made to that directory in mozilla-central so we can also land them in our git repository and not clobber them the next time we uplift. I’m also the module owner of toolkit, which is a pretty big sprawling set of files. It seems like everytime I look I find something that landed without me noticing. I don’t make for a good module owner if I’m not keeping an eye on things so I’d really like to see when new files are added there.
So I introduce the Hg Change Feed, the result of mostly just a few days of work. Every 10 minutes it pulls new changes from mozilla-central and mozilla-inbound. A mercurial hook looks over the changes and adds information about them to a MySQL database. Then a simple django app displays that information. As you browse through the directories in the tree it shows only changesets that affected files beneath that directory. For any directory you can also get an RSS feed of the same. Plug that into IFTTT and you have an automated system to notify you in pretty much any way you’d like about new changes you’d be interested in.
Some simple examples. For tracking changes to Add-on SDK I’m watching http://hgchanges.fractalbrew.com/mozilla-inbound/file/addon-sdk/source. For toolkit I’m looking at http://hgchanges.fractalbrew.com/mozilla-inbound/file/toolkit?types=added. Types takes a comma separated list of “added”, “removed” and “modified” to filter which changes you’re interested in. There’s no UI on the site for changing that right now, you’re welcome to add some!
One other neat trick that this does is mostly ignore merge changesets. Only if a merge actually makes a change not already present in either of the merge parents (mostly happens when resolving merge conflicts) will it show up in the list of changes, because really you don’t need to hear about changes twice.
So play with it, let me know if you find it useful or if you think things are missing. I can also add other mercurial repositories if people want. Some caveats:
- It only retains the last 2000 changesets from any repository in an effort to keep the DB size small and fast, it also only shows the last 200 changesets for each page, or just the last 20 in the feeds. These can be tweaked easily enough and I’ve done basically no benchmarking to say those are the right values.
- The site isn’t as fast as I’d like, particularly listing changes for the top level directory takes nearly 5 seconds. I’ve thrown some basic caching in place to help alleviate that for now. I bet someone who has more MySQL and django experience than me could tell me what I’m doing wrong.
- I’m off on vacation tomorrow so I guess I’m announcing this then running away, sorry if that means it takes me a while to respond to comments.
As I mentioned in my last post and as I know everyone is well aware, tinderbox is starting to feel the strain. You already can’t just glance at it to see what is going on. The Firefox tree is currently showing build and test results from a total of 23 machines. The interesting thing is that not all of these machines are actually doing any compiling. The rest are merely running tests on builds produced by the other machines. That isn’t to say that those test results are less important but I wonder whether it is worth treating these differently.
Thinking about this difference I’ve spent a short time knocking together an experimental view of tinderbox. The idea is simple, the test boxes that aren’t actually compiling don’t get a dedicated column. We still want those results though, and luckily there is somewhere to put them, on the build that they are testing. So every actual compile ends up as a box within which can be multiple results from testers. This leaves us a very manageable 9 columns, 3 per platform.
This is quite early stages, the physical perf and test numbers aren’t visible yet, there are a couple of bugs I know about, it doesn’t auto update and I wouldn’t recommend you really use this, but I am interested to know what people think before I proceed any further with it. There are a few questions in my mind like how to colour the main box of the build (currently the worse result from the compile and all the test results), how to colour the header (I suspect the worst of the most recent results from all of the machines reporting in that column), where to even put the full perf/test numbers (there isn’t really room in some of the fast nightly boxes). More importantly is this even readable? I think it is good for some things, often tinderbox gets confusing when you see a talos result fail long after the nightly box went green, here that talos red would be back on the nightly box that failed. I can imagine problems trying to figure out why the header is red when it’s actually the talos results from build a couple of times ago that is causing it.
Anyway without further ado, take a look at Tindermerge and let me know your thoughts.
Well it has been a mad few weeks. Between landing the new Get Add-ons pane, sheriffing the Firefox 3 beta 3 freeze, decimating old parts of the code and working through follow-up issues from the Get Add-ons pane and not to mention two 10 hour flights it really has been busy. Thankfully I’ve not just been totally buried under code, I’ve had some small spare time to tinker with a few other things (I need to to keep my sanity).
We’re getting better and better at adding quality control to our code. The QA team do their awesome work, we have some 50,000 automated unit tests now (depending how you count) and our tinderbox now has more machines running performance tests than it does machines building the Fox. One problem that all this brings though is a bit of information overload. The tinderbox is now so large that you can no longer glance to see if you can check in or not. Goodness knows how many pages you need to view to truly monitor all of the performance graphs. So, spurred on by Jonathan’s performance dashboard, I’ve been tinkering with some ways to present some of this data.
First off there is the old-style performance machines, pre-talos. These often find regressions first, because they cycle fast. Apparently that is going to improve but until then these old machines have useful data. Unfortunately the graphs they produce are … well lets just say pig ugly! It didn’t take too long to knock up something very basic, but nicer looking and allowing multiple results on the same graph. You can see it graphing the live data (yes it will appear blank for the age it takes to draw). Unfortunately the graph kit I’m using seems a little slow and not all that functional, but that could be changed. Then again I have just found out that new work on the graphs server pretty much blows this out of the water!
So that’s all pretty and good but doesn’t bring a much new to the table. Graphs are good but what we want to be able to do is relate the performance changes to the code that caused it. At this point Johnathan was kind enough to point me at Timeplot. This is a rather neat tool that lets you visualise numerical data on the same graph as defined events. That’s kinda perfect for this. It took me a little longer to hook this one up given the multiple sources and the kinda weird API that can’t normally load from anything other than a remote text file, but here is the result. You can see the performance data of course (Ts from the linux perf box), then each vertical band is a checkin which you can click on for more details.
In the end the timeplot is not massively useful. When you get down to it we tend to check in far more frequently than we gather performance information. I’ll be interested to see if the fast run talos boxes will improve matters. If we had many of them then we can start correlating the regressions across them and thus far reduce the range in which to look for potential causes. I’m interested in other ideas people might have for visualising these kinds of data, I think finding the right way to visualise our information is the key to managing things in the future.
Another thing that needs help is the tinderbox of course. While I was in Mountain View last week Johnathan and I did a quick brainstorm of what the tinderbox is used for. We think it is being used for too many different things and if we split the display out into just those tasks that individuals needed then it would be far more manageable. Maybe you have your own comments on uses we didn’t come up with.
Just finally, after playing with timeplot I was still in a tinkering mood. One thing I periodically have to do is look back over what work I did on specific days. Bugmail is a good indicator for this. Now though I am working in my own Mercurial repository which means I have a full checkin log of my own to look over. Given that it is interleaved with the full Mozilla CVS checkin log it still a little arduous to scan through. But know I have the tools and after not long I can know skim over my previous checkins and even see in depth what each involved (yes, sorry boss it doesn’t look like I did much coding today!).