Improving the performance of the add-ons manager with asynchronous file I/O

The add-ons manager has a dirty secret: it uses an awful lot of synchronous file I/O. This is the kind of I/O that blocks the main thread and can cause Firefox to be janky. I’m told that that is a technical term. Asynchronous file I/O is much nicer; it lets the rest of the app continue to function while you wait for the I/O operation to complete. I rewrote much of the current code from scratch for Firefox 4.0, and even back then we were trying to switch to asynchronous file I/O wherever possible. But still I used mostly synchronous file I/O.

Here is the problem. For many moons we have allowed other applications to install add-ons into Firefox by dropping them into the filesystem or registry somewhere. We also have to do things like updating and installing non-restartless add-ons during startup, when their files aren’t in use. And we have to know the full set of non-restartless add-ons that we are going to activate quite early in startup, so the add-ons manager’s startup function has to do all those installs and a scan of the extension folders before returning to the code starting up the browser, and that means being synchronous.

The other problem is that for the things that could conceivably use async I/O, like installs and updates of restartless add-ons at runtime, we need the same code for loading and parsing manifests, extracting zip files and so on that has to be synchronous during startup. So we could either write a second, asynchronous version for nice performance at runtime, or use the synchronous version everywhere so there is only one version to test and maintain. In the end keeping things synchronous is where things fell.

That’s always bugged me though. Runtime is the most important time to use asynchronous I/O; we shouldn’t be janking the browser when installing a large add-on, particularly on mobile, and so we have taken some steps since Firefox 4 to make parts of the code asynchronous. But there is still a bunch of synchronous I/O left.

The plan

The thing is that there isn’t actually a reason we can’t use the asynchronous I/O functions. All we have to do is make sure that startup is blocked until everything we need is complete. It’s pretty easy to spin an event loop from JavaScript to wait for asynchronous operations to complete, so why not do that in the startup function and then start making things asynchronous?
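
The spin itself is only a few lines. A minimal sketch (the helper name is mine, and rejection handling is omitted for brevity):

Components.utils.import("resource://gre/modules/Services.jsm");

// Block until an asynchronous operation settles by spinning the
// main thread's event loop.
function awaitPromise(promise) {
  let done = false;
  let result;
  promise.then(value => {
    result = value;
    done = true;
  });
  let thread = Services.tm.currentThread;
  // processNextEvent(true) waits for and runs the next main-thread
  // event, so other pending work keeps happening while we block here.
  while (!done)
    thread.processNextEvent(true);
  return result;
}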

Performance is pretty important for the add-ons manager startup code; the longer we spend in startup the more it hurts us. Would this switch slow things down? I assumed that there would be some losses due to other things happening during an event loop tick that otherwise wouldn’t have, but that the file I/O operations themselves should take around the same time. And here is the clever bit: because it is asynchronous I could fire off operations to run in parallel. Why check the modification time of every file in a directory one file at a time when you can just request the times for every file and wait until they all complete?
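
To sketch what I mean (the helper name is mine, and I’m assuming OS.File, Task.jsm and Promise.all behave as advertised):

Components.utils.import("resource://gre/modules/osfile.jsm");
Components.utils.import("resource://gre/modules/Task.jsm");

// Stat every entry of a directory in parallel rather than one at a time.
function getModificationTimes(directory) {
  return Task.spawn(function*() {
    let iterator = new OS.File.DirectoryIterator(directory);
    try {
      // Get the directory listing, then fire off all the stat
      // requests at once and wait for them all to complete.
      let entries = yield iterator.nextBatch();
      let infos = yield Promise.all(entries.map(e => OS.File.stat(e.path)));
      return infos.map(info => info.lastModificationDate);
    } finally {
      iterator.close();
    }
  });
}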

There are really a tonne of things that could affect whether this would be faster or slower, and no amount of theorising was convincing me either way. Last night this had finally been bugging me for long enough that I grabbed a bottle of wine, fired up the music and threw together a prototype.

The results

It took me a few hours to switch most of the main methods to use Task.jsm, switch much of the likely hot code to use OS.File and run in parallel where possible, and generally cover all the main parts that run on every startup and when add-ons have changed.

The challenge was testing. Default talos runs don’t include any add-ons (or maybe one or two) and I needed a few different profiles to see how things behaved in different situations. It was possible that startups with no add-ons would be affected quite differently to startups with many add-ons. So I had to figure out how to add extensions to the default talos profiles for my try runs, and fired off runs for the cases where there were no add-ons, 200 unpacked add-ons with a bunch of files, and 200 packed add-ons. I then ran all those a second time, deleting extensions.json between each run to force the database to be loaded and rebuilt. So that’s six different talos configurations for the code without my changes and another six with my changes; I triggered ten runs per test and went to bed.

The first thing I did this morning was check the performance results. The first read was with 200 packed add-ons in the profile, which should be a good check of the file scanning. How did it do? Amazing! Incredible! A greater than 50% performance improvement across the board! That’s astonishing! No really, that’s pretty astonishing. It would have to mean the add-ons manager takes up at least 50% of the browser startup time, and I’m pretty sure it doesn’t. Oh right, I’m accidentally comparing against the test run with 200 packed add-ons and a database reset with my async code. Well, I’d expect that to be slower.

Ok, let’s get it right. How did it really do? Abysmally! Like, incredibly badly. Across the board, in every test run, startup is significantly slower with the asynchronous I/O than without. With no add-ons in the profile the new code incurs a 20% performance hit. In the case with 200 unpacked add-ons? An almost 1000% hit!

What happened?

Ok, so that wasn’t the best result, but at least it will stop bugging me now. I figure there are two things going on here. The first is that OS.File might look like it lets you fire off I/O operations in parallel, but in fact it doesn’t. Every call you make goes into a queue, and the background worker thread doesn’t start on one operation until the previous one has completed. So while the I/O operations themselves might take about the same time, you have the added overhead of passing messages back and forth to the background thread. I probably should have checked that before I started! The second is promises. Task.jsm and OS.File make heavy use of promises, and I have to say I’m sold on using them for async code. But every time you wait for a promise you have to wait at least one tick of the event loop longer than you would with a simple callback. That’s great if you want responsive UI, but during startup every event loop tick costs time, since other code you don’t care about might be running.
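
A trivial illustration of that promise overhead, nothing Mozilla-specific, just how promise resolution is scheduled:

// With a plain callback the result is delivered in the same tick;
// a promise handler always runs in a later tick, even when the
// value is already available.
function withCallback(callback) {
  callback("done");
}

function withPromise() {
  return Promise.resolve("done");
}

withCallback(value => console.log("callback:", value));
withPromise().then(value => console.log("promise:", value));
console.log("end of tick");
// Logs: "callback: done", "end of tick", "promise: done"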

I still wonder whether giving OS.File more threads would speed things up, but that’s beyond where I want to play with things for now, so I guess this is where this fun experiment ends. Although now that I have a bunch of code converted, I wonder if I could create some replacements for OS.File and Task.jsm that behave synchronously during startup and asynchronously at runtime, so we get the best of both worlds … where did that bottle of wine go?

Linting for Mozilla JavaScript code

One of the projects I’ve been really excited about recently is getting ESLint working for a lot of our JavaScript code. If you haven’t come across ESLint or linters in general before, they are automated tools that scan your code and warn you about syntax errors. They can usually also be set up with a set of rules to enforce code styles and warn about potential bad practices. The devtools and Hello folks have been using ESLint for a while already, and Gijs asked why we weren’t doing this more generally. This struck a chord with me and a few others, so we’ve been spending some time over the past few weeks getting our in-tree support for ESLint to work more generally and fixing issues with browser and toolkit JavaScript code in particular to make them lintable.

One of the hardest challenges with this is that we have a lot of non-standard JavaScript in our tree. This includes things like preprocessing as well as JS features that have either since been standardized with a different syntax (generator functions, for example) or been dropped from the standardization path (array comprehensions). This is bad for developers, as editors can’t make sense of our code and provide good syntax highlighting and refactoring support when it doesn’t match standard JavaScript. There are also plans to remove the non-standard JS features, so we have to remove that stuff anyway.
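
To make that concrete, here is the flavour of syntax in question (a contrived example); the first half only parses in old SpiderMonkey, the second half is the standard ES6 equivalent:

// Non-standard SpiderMonkey syntax that standard parsers reject:
function range(n) {                // legacy generator: no function*
  for (let i = 0; i < n; i++)
    yield i;
}
let squares = [i * i for (i of range(5))];   // array comprehension

// The standard ES6 equivalent:
function* standardRange(n) {
  for (let i = 0; i < n; i++)
    yield i;
}
let standardSquares = [...standardRange(5)].map(i => i * i);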

So a lot of the work done so far has been removing all this non-standard stuff so that ESLint can pass with only a very small set of style rules defined. Soon we’ll start increasing the rules we check in browser and toolkit.

How do I lint?

From the command line this is simple. Make sure to run ./mach eslint --setup to install ESLint and some related packages, then just ./mach eslint <directory or file> to lint a specific area. You can also lint the entire tree. For now you may need to periodically run setup again as we add new dependencies; at some point we may make mach automatically detect when you need to re-run it.
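
In short, the workflow looks like this (browser/base is just an example directory, and I’m assuming a bare ./mach eslint with no path lints everything that isn’t ignored):

./mach eslint --setup        # install ESLint and related packages
./mach eslint browser/base   # lint a specific directory or file
./mach eslint                # lint the entire tree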

You can also add ESLint support to many code editors and as of today you can add ESLint support into hg!

Why lint?

Aside from just ensuring we have standard JavaScript in the tree, linting offers a whole bunch of benefits.

  • Linting spots JavaScript syntax errors before you have to run your code. You can configure editors to run ESLint to tell you when something is broken so you can fix it even before you save.
  • Linting can be used to enforce the sorts of style rules that keep our code consistent. Imagine no more nit comments in code review forcing you to update your patch. You can fix all those before submitting and reviewers don’t have to waste time pointing them out.
  • Linting can catch real bugs. When we turned on one of the basic rules we found a problem in shipping code.
  • With only standard JS code to deal with we open up the possibility of using advanced tools like AST transforms for refactoring (e.g. recast). This could be very useful for switching from Cu.import to ES6 modules.
  • ESLint in particular allows us to write custom rules for doing clever things like handling head.js imports for tests (there’s a sketch of what a custom rule looks like below).
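
For a taste of what a custom rule involves, here is a minimal sketch in ESLint’s rule format (the rule itself is made up for illustration, not one of our real rules):

// Hypothetical rule: warn about calls to dump(), which are usually
// left-over debugging output in Mozilla code.
module.exports = function(context) {
  return {
    "CallExpression": function(node) {
      if (node.callee.type === "Identifier" &&
          node.callee.name === "dump") {
        context.report(node, "Unexpected dump() call.");
      }
    }
  };
};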

Where are we?

There’s a full list of everything that has been done so far in the dependencies of the ESLint metabug, but here are some highlights:

  • Removed #include preprocessing from browser.js moving all included scripts to global-scripts.inc
  • Added an ESLint plugin to allow linting the JS parts of XBL bindings
  • Fixed basic JS parsing issues in lots of b2g, browser and toolkit code
  • Created a hg extension that will warn you when committing code that fails ESLint
  • Turned on some basic linting rules
  • MozReview is close to being able to lint your code and give review comments where things fail
  • Work is almost complete on a linting test that will turn orange on the tree when code fails to lint

What’s next?

I’m grateful to all those who have helped get things moving here, but there is still more work to do. If you’re interested, there are really two ways you can help: we need to lint more files, and we need to turn on more lint rules.

The .eslintignore file shows which files are currently ignored in the lint checks. Removing files and directories from that involves fixing JavaScript standards issues generally, but every extra file we lint is a win for all the above reasons, so it is valuable work. It’s also mostly straightforward once you get the hang of it; there are just a lot of files.
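
For illustration, the entries are just path patterns, one per line (these particular paths are made up, not the real contents of the file):

# Files and directories listed here are skipped by lint runs;
# deleting a line means that code has to pass ESLint from then on.
browser/components/some-legacy-feature/**
toolkit/content/some-old-widget.js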

We also need to turn on more rules. We’ve got a rough list of the rules we want to turn on in browser and toolkit, but as you might guess they aren’t on because they fail right now. Fixing up our JS to work with them is simple but much appreciated work. In some cases ESLint can even do the work for you!
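
Turning a rule on is a one-line change in the configuration; fixing the failures it reveals is the real work. A hypothetical example (not necessarily rules from our list), and the automated help I’m alluding to is ESLint’s --fix option, which can rewrite some classes of failure for you:

{
  "rules": {
    "no-unused-vars": 2,    // error on unused variables
    "semi": [2, "always"]   // require semicolons everywhere
  }
}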

Running ESLint on commit

ESLint becomes most useful when you get warnings before even trying to land or get your code reviewed. You can add support to your code editor, but not all editors support this, so I’ve written a Mercurial extension which gives you warnings any time you commit code that fails lint checks. It uses the same rules we run elsewhere. It doesn’t abort the commit, which would be annoying if you’re working on a feature branch, but it gives you a heads up about what needs to be fixed and where.

To install the extension add this to an hgrc file; I put it in the .hg/hgrc file of my mozilla-central clone rather than the global config:

[extensions]
mozeslint = <path to clone>/tools/mercurial/eslintvalidate.py

After that anything that creates a commit, which includes mq patches, will run any changed JS files through ESLint and show the results. If the file was already failing checks in a few places then you’ll still see those as well; maybe you should fix them up before sending your patch for review? 😉

Delivering Firefox features faster

Over time Mozilla has been trying to reduce the amount of time between developing a feature and getting it into users’ hands. Some time ago we did around one feature release of Firefox every year; more recently we’ve moved to one feature release every six weeks. But it still takes at least 12 weeks for a feature to get to users. In some cases we can speed that up by landing new things directly on the beta/aurora branches, but the more we do this the harder it is for release managers to track the risk of shipping a given release.

The Go Faster project is investigating ways that we can speed up getting changes to users. System add-ons are one piece of this that will let us deliver updates to core Firefox features more often than the regular six week releases. Instead of being embedded in the rest of the code, certain features will be developed as standalone system add-ons.

Building features as add-ons gives us more flexibility in how we deliver them to users. System add-ons will ship in two different ways. First, every Firefox release will include a default set of system add-ons: the latest versions of the features at the time the Firefox build was produced. Later, at runtime, Firefox will contact Mozilla’s update servers to ask for the current list of system add-ons. If there are new or updated versions listed, Firefox will download and install them, giving users access to the newest features without needing to update the entire application.

Building a feature as an add-on gives developers a lot of benefits too. Developers will be able to work on and test new features without doing custom Firefox builds, and users can even try out new features by just installing the add-ons. Once the feature is ready it ships as an add-on, with no code changes necessary for integration into Firefox. This is something we’ve attempted before with things like Test Pilot and pdf.js, but system add-ons make the process much smoother and reduce the differences between how the feature runs as an add-on and how it runs when shipped in the application.

The basic support for system add-ons is already included in current nightly builds, and Firefox 44 should be the first release that we could use to deliver features like this if we choose. If you’re interested in the details you can read the client implementation plan or follow along with the tracking bug for the client side of the feature.

Making communicating with chrome from in-content pages easy

As Firefox increasingly switches to running in multiple processes we’ve been finding common problems, and where we can we are designing nice APIs to make solving them easy. One problem is that we often want to run in-content pages like about:newtab and about:home in the child process without privileges, making them safer and less likely to bring down Firefox in the event of a crash. These pages still need to get information from and pass information to the main process though, so we have had to come up with ways to handle that. Often we use custom code in a frame script acting as a middle-man, using things like DOM events to listen for requests from the in-content page and then messaging the main process.
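
That middle-man pattern looks roughly like this sketch of a frame script (the event and message names are invented for the example):

// Frame script: relay requests from the in-content page to the main
// process and pass responses back to the page as DOM events.
const { utils: Cu } = Components;

addEventListener("MyPageRequest", function(event) {
  sendAsyncMessage("MyPage:Request", { detail: event.detail });
}, false, true);

addMessageListener("MyPage:Response", function(message) {
  let event = content.document.createEvent("CustomEvent");
  event.initCustomEvent("MyPageResponse", true, false,
                        Cu.cloneInto(message.data, content));
  content.document.dispatchEvent(event);
});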

We recently added a new API to make this problem easier to solve. Instead of needing code in a frame script, the RemotePageManager module gives special pages direct access to a message manager to communicate with the main process. This can be useful for any page running in the content area, regardless of whether it needs to run at low privileges or in the content process, since it takes care of listening for documents and hooking up the message listeners for you.

There is a low-level API available, but the higher-level API is probably more useful in most cases. If your code wants to interact with a page like about:myaddon, just do this from the main process:

Components.utils.import("resource://gre/modules/RemotePageManager.jsm");
let manager = new RemotePages("about:myaddon");

The manager object is now something resembling a regular process message manager. It has sendAsyncMessage and addMessageListener methods, but unlike the regular e10s message managers it only communicates with about:myaddon pages. Also unlike the regular message managers, there is no option to send synchronous messages or pass cross-process wrapped objects.
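
For example, the main process side of a conversation might look like this (the message names are hypothetical):

// Reply to status requests from any about:myaddon page by
// broadcasting the answer back to all of them.
manager.addMessageListener("MyAddon:RequestStatus", function(message) {
  manager.sendAsyncMessage("MyAddon:Status", { enabled: true });
});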

When about:myaddon is loaded it has sendAsyncMessage and addMessageListener functions defined in its global scope for regular JavaScript to call. Anything that can be structured-cloned can be passed between the processes.
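
The page side of the same hypothetical conversation is then plain JavaScript (I’m assuming the listener receives the payload in a data property, as with other message managers):

// In the about:myaddon page itself: ask for the status and update
// the UI when the answer arrives.
addMessageListener("MyAddon:Status", function(message) {
  document.getElementById("status").textContent =
    message.data.enabled ? "enabled" : "disabled";
});
sendAsyncMessage("MyAddon:RequestStatus", {});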

The module documentation has more in-depth examples showing message passing between the page and the main process.

The RemotePageManager module is available in nightlies now, and you can see it in action with the simple change I landed to switch about:plugins to run in the content process. For the moment the APIs only support exact URL matching, but it would be possible to add support for regular expressions in the future if that turns out to be useful.

hgchanges is back up

The offending changeset that broke hgchanges yesterday turns out to be a merge from an ancient branch to the current tip. That makes the diff insanely huge, which is why things like hgweb were tripping over it. Kwierso pointed out that just ignoring those changesets would solve the problem. It’s not ideal, but since in this case they aren’t useful changesets I’ve gone ahead and done that, so hgchanges is now updating again.

hgchanges is down for now

My handy tool for tracking changes to directories in the Mozilla Mercurial repositories is going to be broken for a little while. Unfortunately a particular changeset seems to be breaking things in ways I don’t have the time to fix right now. Specifically, trying to download the raw patch for the changeset is causing hgweb to time out. Short of finding time to debug and fix the problem, my only solution is to wait until that changeset is old enough that hgchanges no longer attempts to index it. That could take a week or so.

Obviously I’ll happily accept patches to fix this problem sooner.

Welcome to the new Toolkit peers – Paolo, Matt, Jared and Irving

Slightly belated in some cases, but I’d like to formally welcome four new Toolkit peers. Paolo Amadini, Matthew Noorenberghe, Jared Wein and Irving Reid have all shown themselves to be well capable of reviewing patches in any of the Toolkit code. Paolo, Matt and Jared were actually added a few months ago, but apparently I failed to make an announcement at the time; Irving was added just last week. Please congratulate them all and don’t go too hard on their review queues!

Also if you think there are others who should be peers of Toolkit (or current peers that are no longer relevant) then please let me know.

An update for hgchanges

Nearly a year ago I showed off the first version of my webapp for displaying recent changes in Mercurial repositories. I’ve heard directly from a number of people who use it, but it’s had a few problems that I’ve spent some of my spare time trying to solve. I’m now pleased to say that a new version is up and running. I’m not sure it’s accurate to call this a rewrite, since the work was entirely incremental, but when I look back over the code changes there really isn’t much that wasn’t touched in some way or other. So what’s new? For you, a person using the site, absolutely nothing! So what on earth have I been doing rewriting the code?

Hopefully I’ve been making the thing a whole lot more stable. Eagle-eyed users may have noticed that periodically the site would stop updating, sometimes for days at a time. The site is split into two pieces: one is a Django website to display the changes, the other is the important bit that reads the change data out of Mercurial and puts it into a database that the website can use. I do this caching because it would be too expensive to work the changes out on the fly. The problem is that reading the data out of Mercurial meant keeping a local clone of every repository on the server (some 5GB for the four currently tracked), and reading the changes using Mercurial’s Python API was slow and used a lot of memory. The shared hosting environment that I run this out of kept killing the process when it used too many resources. Normally it could recover, but occasionally it would need a manual kick. Worse, there is some Mercurial bug where sometimes pulling a repository will end up with an inconsistent database. It’s rare enough that many people never see it, but when you’re pulling repositories every ten minutes it starts happening every couple of weeks, again requiring manual work to fix.

I did dabble with getting this into shape so I could run it on a Mozilla server. PaaS seemed like the obvious option, but after a lot of work it turned out that the database size restrictions there made this impossible. Since then I’ve seen so many horror stories about PaaS falling over that I think I’m glad I never got there. I also tried getting a dedicated server from Mozilla to run this on, but they were quite rightly wary when I mentioned that the process used a bunch of memory and wanted harder figures before committing; unfortunately I never found a way to produce those figures.

The final solution has been ripping out the Mercurial API dependency. Rather than needing local Mercurial repositories, the import process now talks to them over the web. The default Mercurial web APIs aren’t great, but luckily all of the repositories I care about have pushlog installed, which has a good JSON API for retrieving recently pushed changesets. So now the process is to pull the recent pushlog, get the patch for every changeset and parse it to work out which files have changed. By not having to load the Mercurial structures there is far less memory used, and since the process now has to pause as it downloads each patch file it keeps the CPU usage low.

I’ve also made the database itself a lot more efficient. Previously I recorded every changeset separately for every repository, but many repositories share the same changesets. Noting that makes the database a little more complicated, but it uses much less space, and since each changeset only has to be parsed once the import process becomes faster. It took me a while to get the website performing as fast as before with these changes, but I think it’s there now (barring some strange slowness in Django that I can’t identify). I’ve also turned on some quite aggressive caching on both the client and server side; if someone has recently visited the same page you’re trying to access then it should load super fast.

So, the new version is up and running at the same location as the old. Chances are you’re using it already, so if you notice any unexpected slowness or something not looking right, please let me know.

Developer Tools meet-up in Portland

Two weeks ago the developer tools teams and a few others met in the Portland office for a very successful week of discussions and hacking. The first day was about setting the stage for the week and working out what everyone was going to work on. Dave Camp kicked us off with a review of the last six months in developer tools and talked about what is going to be important for us to focus on in 2014. We then had a little more in-depth information from each of the teams. After lunch a set of lightning talks went over some projects and ideas that people had been working on recently.

After that everyone got started prototyping new ideas, hacking on features and fixing bugs. The amount of work that happens at these meet-ups is always mind-blowing and this week was no exception, even one of our contributors got in on the action. Here is a list of the things that the team demoed on Friday:

This only covers the work demoed on Friday; a whole lot more went on during the week, as a big reason for doing these meet-ups is so that groups can split off to have important discussions. We had Darrin Henein on hand to help out with UX designs for some of the tools, and Kyle Huey joined us for a couple of days to help work out the final kinks in the plan for debugging workers. Lots of work went on to iron out some of the kinks in the new add-on SDK widgets for Australis, there were discussions about memory and performance tools, and there was some talk about how to simplify child processes for Firefox OS and electrolysis.

Of course there was also ample time in the evenings for the teams to socialise. One of the downsides of being a globally distributed team is that getting to know one another and building close working relationships can be difficult over electronic forms of communication, so we find it’s very important to all come together in one place to meet face to face. We’re all looking forward to doing it again in about six months’ time.