My little tool to help folks track when changes are made to files or directories in Mozilla’s mercurial repositories has gone down again. This time an influx of some 8000 changesets from the servo project are causing the script that does the updating to fail so I’ve turned off updating. I no longer have any time to work on this tool so I’ve also taken it offline and don’t really have any intention to bring it back up again. Sorry to the few people that this inconveniences. Please go lobby the engineering productivity folks if you still need a tool like this.
It has been over eleven years since I first wrote a patch for Firefox. It was reviewed by the then-Firefox module owner, Mike Connor. If you had told me then that at some point in the future I was going to be the module owner I probably would have laughed at you. I didn’t know at the time how much Mozilla would shape my life. Yet yesterday Dave Camp handed over the reigns to me and here we are.
When Dave proposed me as the new module owner he talked about how he saw Firefox as a code module, responsible for the code that ships with the app, rather than the decisions about what the app needs to do. Those are delegated to other teams with better grasps of the situation like Product and UX. I agree with this wholeheartedly. It isn’t my role as an engineer to make those kinds of decisions. The Firefox module owner is focused on the implementation of the code in the browser and pretty much nothing else.
But in the Firefox module even the implementation decisions have always been heavily delegated to the peers of the module. I don’t intend to change that. I’m here to help guide those working on the code when they need a broader view, not to put my foot down and insist on how things should happen. I’m here to direct you to the peer most able to help you when you need a reviewer or run into problems. On some occasions I will also be here to listen and advise when you think the peers are wrong. In fact I see this role as less about being a module owner and more being a module steward, staying out of the way mostly but applying a gentle hand to the tiller when needed.
Those of you keeping an eye on things will note that I’m also the Toolkit module owner. That hasn’t changed and while I have always approached that module in much the same way as I plan to approach Firefox there are a few differences. I’ll talk about those another day.
I’m delighted to announce that I’ve just made Andrew Swan (aswan on IRC) a reviewer for Firefox code. That also reminds me that I failed to announce when I did the same for Rob Helmer (rhelmer). Please inundate them with patches.
There are a few key things I look for when promoting folks to reviewers, surprisingly none of them are an understanding of the full breadth of code in Firefox or Toolkit:
- Get to reviews soon. You don’t have to complete the review necessarily, do a first pass or if you really have no time hand off to someone else quickly.
- Be courteous, you are a personal connection to Mozilla and you should be as welcoming as Mozilla is supposed to be.
- Know your limits. Don’t review anything if you don’t understand the code it deals with, help the reviewee find an alternate reviewer.
- You don’t get to demand the reviewee re-write the patch in your style. If their code meets style guidelines, fixes the bug and is efficient enough don’t waste people’s time by re-architecting it.
Pretty much all of this is demonstrated by seeing what code the potential reviewer is writing and letting them review smaller patches by already well-versed contributors.
The add-ons manager has a dirty secret. It uses an awful lot of synchronous file I/O. This is the kind of I/O that blocks the main thread and can cause Firefox to be janky. I’m told that that is a technical term. Asynchronous file I/O is much nicer, it means you can let the rest of the app continue to function while you wait for the I/O operation to complete. I rewrote much of the current code from scratch for Firefox 4.0 and even back then we were trying to switch to asynchronous file I/O wherever possible. But still I used mostly synchronous file I/O.
Here is the problem. For many moons we have allowed other applications to install add-ons into Firefox by dropping them into the filesystem or registry somewhere. We also have to do things like updating and installing non-restartless add-ons during startup when their files aren’t in use. And we have to know the full set of non-restartless add-ons that we are going to activate quite early in startup so the startup function for the add-ons manager has to do all those installs and a scan of the extension folders before returning back to the code startup up the browser, and that means being synchronous.
The other problem is that for the things that we could conceivably use async I/O, like installs and updates of restartless add-ons during runtime we need to use the same code for loading and parsing manifests, extracting zip files and others that we need to be synchronous during startup. So we can either write a second version that is asynchronous so we can have nice performance at runtime or use the synchronous version so we only have one version to test and maintain. Keeping things synchronous was where things fell in the end.
That’s always bugged me though. Runtime is the most important time to use asynchronous I/O. We shouldn’t be janking the browser when installing a large add-on particularly on mobile and so we have taken some steps since Firefox 4 to make parts of the code asynchronous. But there is still a bunch there.
Performances is pretty important for the add-ons manager startup code, the longer we spend in startup the more it hurts us. Would this switch slow things down? I assumed that there would be some losses due to other things happening during an event loop tick that otherwise wouldn’t have but that the file I/O operations should take around the same time. And here is the clever bit. Because it is asynchronous I could fire off operations to run in parallel. Why check the modification time of every file in a directory one file at a time when you can just request the times for every file and wait until they all complete?
There are really a tonne of things that could affect whether this would be faster or slower and no amount of theorising was convincing me either way and last night this had finally been bugging me for long enough that I grabbed a bottle of wine, fired up the music and threw together a prototype.
It took me a few hours to switch most of the main methods to use Task.jsm, switch much of the likely hot code to use OS.File and to run in parallel where possible and generally cover all the main parts that run on every startup and when add-ons have changed.
The challenge was testing. Default talos runs don’t include any add-ons (or maybe one or two) and I needed a few different profiles to see how things behaved in different situations. It was possible that startups with no add-ons would be affected quite differently to startups with many add-ons. So I had to figure out how to add extensions to the default talos profiles for my try runs and fired off try runs for the cases where there were no add-ons, 200 unpacked add-ons with a bunch of files and 200 packed add-ons. I then ran all those a second time with deleting extensions.json between each run to force the database to be loaded and rebuilt. So six different talos runs for the code without my changes and then another six with my changes and I triggered ten runs per test and went to bed.
The first thing I did this morning was check the performance results. The first ready was with 200 packed add-ons in the profile, should be a good check of the file scanning. How did it do? Amazing! Incredible! A greater than 50% performance improvement across the board! That’s astonishing! No really that’s pretty astonishing. It would have to mean the add-ons manager takes up at least 50% of the browser startup time and I’m pretty sure it doesn’t. Oh right I’m accidentally comparing to the test run with 200 packed add-ons and a database reset with my async code. Well I’d expect that to be slower.
Ok, let’s get it right. How did it really do? Abysmally! Like incredibly badly. Across the board in every test run startup is significantly slower with the asynchronous I/O than without. With no add-ons in the profile the new code incurs a 20% performance hit. In the case with 200 unpacked add-ons? An almost 1000% hit!
Ok so that wasn’t the best result but at least it will stop bugging me now. I figure there are two things going on here. The first is that OS.File might look like you can fire off I/O operations in parallel but in fact you can’t. Every call you make goes into a queue and the background worker thread doesn’t start on one operation until the previous has completed. So while the I/O operations themselves might take about the same time you have the added overhead of passing messages between the background thread and promises. I probably should have checked that before I started! Oh, and promises. Task.jsm and OS.File make heavy use of promises and I have to say I’m sold on using them for async code. But. Everytime you wait for a promise you have to wait at least one tick of the event loop longer than you would with a simple callback. That’s great if you want responsive UI but during startup every event loop tick costs time since other code might be running that you don’t care about.
I still wonder if we could get more threads for OS.File whether it would speed things up but that’s beyond where I want to play with things for now so I guess this is where this fun experiment ends. Although now I have a bunch of code converted I wonder if I can create some replacements for OS.File and Task.jsm that behave synchronously during startup and asynchronously at runtime, then we get the best of both worlds … where did that bottle of wine go?
So a lot of the work done so far has been removing all this non-standard stuff so that ESLint can pass with only a very small set of style rules defined. Soon we’ll start increasing the rules we check in browser and toolkit.
How do I lint?
From the command line this is simple. Make sure and run
./mach eslint --setup to install eslint and some related packages then just
./mach eslint <directory or file> to lint a specific area. You can also lint the entire tree. For now you may need to periodically run setup again as we add new dependencies, at some point we may make mach automatically detect that you need to re-run it.
You can also add ESLint support to many code editors and as of today you can add ESLint support into hg!
- Linting can be used to enforce the sorts of style rules that keep our code consistent. Imagine no more nit comments in code review forcing you to update your patch. You can fix all those before submitting and reviewers don’t have to waste time pointing them out.
- Linting can catch real bugs. When we turned on one of the basic rules we found a problem in shipping code.
- With only standard JS code to deal with we open the possibility of using advanced like AST transforms for refactoring (e.g. recast). This could be very useful for switching from Cu.import to ES6 modules.
- ESLint in particular allows us to write custom rules for doing clever things like handling head.js imports for tests.
Where are we?
- Removed #include preprocessing from browser.js moving all included scripts to global-scripts.inc
- Added an ESLint plugin to allow linting the JS parts of XBL bindings
- Fixed basic JS parsing issues in lots of b2g, browser and toolkit code
- Created a hg extension that will warn you when committing code that fails ESLint
- Turned on some basic linting rules
- Mozreview is close to being able to lint your code and give review comments where things fail
- Work is almost complete on a linting test that will turn orange on the tree when code fails to lint
I’m grateful to all those that have helped get things moving here but there is still more work to do. If you’re interested there’s really two ways you can help. We need to lint more files and we need to turn on more lint rules.
We also need to turn on more rules. We’ve got a rough list of the rules we want to turn on in browser and toolkit but as you might guess they aren’t on because they fail right now. Fixing up our JS to work with them is simple work but much appreciated. In some cases ESLint can also do the work for you!
ESLint becomes the most useful when you get warnings before even trying to land or get your code reviewed. You can add support to your code editor but not all editors support this so I’ve written a mercurial extension which gives you warnings any time you commit code that fails lint checks. It uses the same rules we run elsewhere. It doesn’t abort the commit, that would be annoying if you’re working on a feature branch but gives you a heads up about what needs to be fixed and where.
To install the extension add this to a hgrc file, I put it in the .hg/hgrc file of my mozilla-central clone rather than the global config.
[extensions] mozeslint = <path to clone>/tools/mercurial/eslintvalidate.py
After that anything that creates a commit, so that includes mq patches, will run any changed JS files through ESLint and show the results. If the file was already failing checks in a few places then you’ll still see those too, maybe you should fix them up too before sending your patch for review? 😉
As Firefox increasingly switches to support running in multiple processes we’ve been finding common problems. Where we can we are designing nice APIs to make solving them easy. One problem is that we often want to run in-content pages like about:newtab and about:home in the child process without privileges making it safer and less likely to bring down Firefox in the event of a crash. These pages still need to get information from and pass information to the main process though, so we have had to come up with ways to handle that. Often we use custom code in a frame script acting as a middle-man, using things like DOM events to listen for requests from the in-content page and then messaging to the main process.
We recently added a new API to make this problem easier to solve. Instead of needing code in a frame script the RemotePageManager module allows special pages direct access to a message manager to communicate with the main process. This can be useful for any page running in the content area, regardless of whether it needs to be run at low privileges or in the content process since it takes care of listening for documents and hooking up the message listeners for you.
There is a low-level API available but the higher-level API is probably more useful in most cases. If your code wants to interact with a page like
about:myaddon just do this from the main process:
Components.utils.import("resource://gre/modules/RemotePageManager.jsm"); let manager = new RemotePages("about:myaddon");
The manager object is now something resembling a regular process message manager. It has
addMessageListener methods but unlike the regular e10s message managers it only communicates with
about:myaddon pages. Unlike the regular message managers there is no option to send synchronous messages or pass cross-process wrapped objects.
about:myaddon is loaded it has
The module documentation has more in-depth examples showing message passing between the page and the main process.
The RemotePageManager module is available in nightlies now and you can see it in action with the simple change I landed to switch
about:plugins to run in the content process. For the moment the APIs only support exact URL matching but it would be possible to add support for regular expressions in the future if that turns out to be useful.
The offending changeset that broke hgchanges yesterday turns out to be a merge from an ancient branch to current tip. That makes the diff insanely huge which is why things like hgweb were tripping over it. Kwierso point out that just ignoring those changesets would solve the problem. It’s not ideal but since in this case they aren’t useful changesets I’ve gone ahead and done that and so hgchanges is now updating again.
My handy tool for tracking changes to directories in the mozilla mercurial repositories is going to be broken for a little while. Unfortunately a particular changeset seems to be breaking things in ways I don’t have the time to fix right now. Specifically trying to download the raw patch for the changeset is causing hgweb to timeout. Short of finding time to debug and fix the problem my only solution is to wait until that patch is old enough that it no longer attempts to index it. That could take a week or so.
Obviously I’ll happily accept patches to fix this problem sooner.
Nearly a year ago I showed off the first version of my webapp for displaying recent changes in mercurial repositories. I’ve heard directly from a number of people that use it but it’s had a few problems that I’ve spent some of my spare time trying to solve. I’m now pleased to say that a new version is up and running. I’m not sure it’s accurate to call this a rewrite since it was entirely incremental but when I look back over of the code changes there really isn’t much that wasn’t touched in some way or other. So what’s new? For you, a person using the site absolutely nothing! So what on earth have I been doing rewriting the code?
Hopefully I’ve been making the thing a whole lot more stable. Eagle eyed users may have noticed that periodically the site would stop updating, sometimes for days at a time. The site is split into two pieces. One is a django website to display the changes. The other is the important bit that reads the change data out of mercurial and puts it into a database that the website can use. I do this caching because it would be too expensive to work it out on the fly. The problem is that the process of reading the data out of mercurial meant keeping a local version of every repository on the server (some 5GB for the four currently tracked) and reading the changes using mercurial’s python API was slow and used a lot of memory. The shared hosting environment that I run this out of kept killing the process when it used too many resources. Normally it could recover but occasionally it would need a manual kick. Worse there is some mercurial bug where sometimes pulling a repository will end up with an inconsistent database. It’s rare so many people don’t see it but when you’re pulling repositories every ten minutes it starts happing every couple of weeks, again requiring manual work to fix the problem.
I did dabble with getting this into shape so I could run it on a mozilla server. PAAS seemed like the obvious option but after a lot of work it turned out that the database size restrictions there made this impossible. Since then I’ve seem so many horror stories about PAAS falling over that I think I’m glad I never got there. I also tried getting a dedicated server from Mozilla to run this on but they were quite rightly wary when I mentioned that the process used a bunch of memory and wanted harder figures before committing, unfortunately I never found out a way to give those figures.
The final solution has been ripping out the mercurial API dependency. Now rather than needing local mercurial repositories the import process instead talks to them over the web. The default mercurial web APIs aren’t great but luckily all of the repositories I care about have pushlog installed which has a good JSON API for retrieving recently pushed changesets. So now the process is to pull the recent pushlog, get the patches for every changeset and parse them to work out which files have changed. By not having to load the mercurial structures there is far less memory used and since the process now has to delay as it downloads each patch file it keeps the CPU usage low.
I’ve also made the database itself a lot more efficient. Previously I recorded every changeset for every repository, but many repositories have the same changeset in them. So by noting that the database gets a little more complicated but uses much less space and since you only have to parse the changeset once the import process becomes faster. It took me a while to get the website performing as fast as before with these changes but I think it’s there now (barring some strange slowness in Django that I can’t identify). I’ve also turned on some quite aggressive caching both on the client and server side, if someone has visited the same page as you’re trying to access recently then it should load super fast.
So, the new version is up and running at the same location as the old. Chances are you’re using it already so if you’re noticing any unexpected slowness or something not looking right please let me know.