A simple command to open all files with merge conflicts

When I get merge conflicts in a rebase I find it irritating to open the problem files in my editor; I couldn’t find anything better than copying and pasting each file path or locating it in the source tree. So I wrote a simple hg alias to open all the unresolved files in my editor. Maybe this is useful to you too?

[alias]
unresolved = !$HG resolve -l "set:unresolved()" -T "{reporoot}/{path}\0" | xargs -0 $EDITOR
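
Add that to your ~/.hgrc and then, whenever a rebase stops with conflicts, a single command opens everything that still needs attention (assuming your $EDITOR accepts multiple file arguments, as most editors do):

hg unresolved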

Please watch your character encodings

I started writing this as a newsgroup post for one of Mozilla’s mailing lists, but it turned out to be too long, and since this part was mainly aimed at folks who either don’t know about character encodings or want a quick refresher, I decided to blog it instead. Please let me know if there are errors in here; I am by no means an expert on this stuff either and I do get caught out sometimes!

Text is tricky. Unicode supports the notion of 1,114,112 distinct characters, far more than one or even two bytes of memory can hold. So to store a character we have to use a way of encoding its value into bytes in memory. A straightforward encoding would just use three bytes per character. But (roughly) the larger a character’s value the less often it is used, and memory is precious, so variable length encodings are often used. These use fewer bytes for characters early in the range at the cost of using a little more memory for the rarer characters. Common encodings include UTF-8 (one byte for ASCII characters, up to four bytes for other characters) and UTF-16 (two bytes for most characters, four bytes for less used ones).
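
To make that concrete, here is a quick JavaScript sketch using the standard TextEncoder API (which always encodes to UTF-8; JavaScript strings themselves are stored as two-byte UTF-16 code units):

const utf8 = new TextEncoder();
utf8.encode("a").length;  // 1: ASCII characters need a single byte in UTF-8
utf8.encode("ñ").length;  // 2: two bytes in UTF-8
utf8.encode("🐮").length; // 4: four bytes in UTF-8
"a".length * 2;           // 2: two bytes in UTF-16
"🐮".length * 2;          // 4: four bytes in UTF-16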

What does this mean?

It may not be possible to know the number of characters in a string purely by looking at the number of bytes of memory used.

When a string is encoded with a variable length encoding the number of bytes used by a character will vary. If the string is held in a byte buffer just dividing its length by some number will not always return the number of characters in a string. Confusingly many string implementations expose a length property, that often only tells you the number of code points, not the number of characters in a string. I bet most JavaScript developers don’t know that JavaScript suffers from this:

let test = "\u{1F42E}"; // This is the Unicode cow 🐮 (https://emojipedia.org/cow-face/)
test.length; // This returns 2!
test.charAt(0); // This returns "\ud83d"
test.charAt(1); // This returns "\udc2e"
test.substring(0, 1); // This returns "\ud83d"

Fun!

More modern versions of JavaScript do give better options, though they are probably slower than the length property (because they must decode the characters in order to count them):

Array.from(test).length; // This returns 1
test.codePointAt(0).toString(16); // This returns "1f42e"

When you encode a character into memory and pass it to some other code, that code needs to know the encoding so it can decode it correctly. Using the wrong encoder/decoder will lead to incorrect data.

Using the wrong decoder to convert a buffer of memory into characters will often fail. Take the character “ñ”. In UTF-8 this is encoded as C3 B1. Decoding that as UTF-16 will result in “쎱”. In UTF-16 however “ñ” is encoded as 00 F1. Trying to decode that as UTF-8 will fail as that is an invalid UTF-8 sequence.
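
You can reproduce this in JavaScript with the standard TextDecoder API. One caveat for this sketch: by default TextDecoder quietly substitutes U+FFFD for invalid sequences rather than failing, so the last line passes fatal: true to make the failure visible:

// “ñ” encoded as UTF-8 is the two bytes C3 B1.
const utf8Bytes = new Uint8Array([0xc3, 0xb1]);
new TextDecoder("utf-8").decode(utf8Bytes);    // "ñ"
new TextDecoder("utf-16be").decode(utf8Bytes); // "쎱": same bytes, wrong decoder

// “ñ” encoded as big-endian UTF-16 is the two bytes 00 F1.
const utf16Bytes = new Uint8Array([0x00, 0xf1]);
new TextDecoder("utf-8", { fatal: true }).decode(utf16Bytes); // throws TypeError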

Many languages thankfully use string types that have fixed encodings; in Rust for example the str primitive is UTF-8 encoded. In these languages, as long as you stick to the normal string types everything should just work. It isn’t uncommon though to do manipulations based on the byte representation of the characters, %-encoding a string for a URL for example, so knowing the character encoding is still important.
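
JavaScript’s encodeURIComponent is a handy illustration: it is defined to %-encode the UTF-8 bytes of its input, so knowing the encoding tells you exactly what to expect:

encodeURIComponent("ñ");  // "%C3%B1": the two UTF-8 bytes of the character
encodeURIComponent("🐮"); // "%F0%9F%90%AE": all four UTF-8 bytes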

Some languages though have string types where the encoding may not be clear. In Gecko C++ code for example a very common string type is nsCString. It is really just a sequence of bytes with no defined encoding and no way of specifying one at the code level. The only way to know for sure what the string is encoded as is to trace back to where it was created. If you’re unlucky it gets created in multiple places using different encodings!

Funny story: this blog post contains a couple of the larger Unicode characters. While working on the post I kept coming back to find that a character had been lost somewhere along the way and replaced with a “?”. It seems likely that there is a bug in WordPress that isn’t correctly handling character encodings. I’m not sure yet whether those characters will survive publishing this post!

These problems disproportionately affect non-English speakers.

Pretty much all of the characters that English speakers use (mostly the Latin alphabet) live in the ASCII character set, which covers just 128 characters (some of them control characters). ASCII characters are very popular, and though I can’t find references right now it is likely that the majority of strings used in digital communication are made up of only ASCII characters, particularly when you consider strings that humans don’t generally see. HTTP request and response headers, for example, generally only use ASCII characters.

Because of this popularity, when the Unicode character set was first defined it mapped the 128 ASCII characters to the first 128 Unicode characters. UTF-8 also encodes those 128 characters as a single byte each; any other character is encoded as two or more bytes.

The upshot is that if you only ever work with ASCII characters, encoding or decoding as UTF-8 or ASCII yields identical results. Each character only ever takes up one byte in memory, so the length of a string is just the number of bytes used. English-speaking developers, and indeed many others, may only ever develop and test with ASCII characters, and so can become blind to the problems above and not notice that they aren’t handling non-ASCII characters correctly.
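
A quick sketch of that identity: for a pure-ASCII string the UTF-8 bytes are exactly the ASCII character codes, one byte per character:

const bytes = new TextEncoder().encode("Hello");
bytes.length;                       // 5: one byte per character
bytes[0] === "Hello".charCodeAt(0); // true: both are 72, the code for "H"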

At Mozilla, where we try hard to make Firefox work in all locales, we still routinely come across bugs where non-ASCII characters haven’t been handled correctly. Quite often the issue stems from a user having non-ASCII characters in their username or file paths, causing breakage if we end up decoding the path incorrectly.

This issue may start getting rarer. With the rise in emoji popularity developers are starting to see and test with more and more characters that encode as more than one byte. Even in UTF-16 many emoji encode to four bytes.

Summary

If you don’t care about non-ASCII characters then you can ignore all this. But if you care about supporting the 80% of the world that uses non-ASCII characters, take care when you are doing anything with strings. Make sure you check their lengths correctly when needed. If you are working with data structures that don’t have an explicit character encoding, make sure you know what encoding your data is in before doing anything with it other than passing it around.

Taming Phabricator

So Mozilla is going all-in on Phabricator and Differential as a code review tool. I have mixed feelings about this, not least because its support for patch series is more manual than I’d like. But since this is the choice Mozilla has made I might as well start to get used to it. One of the first things you see when you log into Phabricator is a default view full of information.

A screenshot of Phabricator's default view

It’s a little overwhelming for my tastes. The Recent Activity section in particular is more than I need; it seems to list anything anyone has done with Phabricator recently. Sorry Ted, but I don’t care about that review comment you posted. Likewise the Active Reviews section seems very full when it is barely listing any reviews.

But here’s the good news. Phabricator lets you create your own dashboards to use as your default view. It’s a bit tricky to figure out so here is a quick crash course.

Click on Dashboards on the left menu. Click on Create Dashboard in the top right, make your choices then hit Continue. I recommend starting with an empty Dashboard so you can just add what you want to it. Everything on the next screen can be modified later but you probably want to make your dashboard only visible to you. Once created click “Install Dashboard” at the top right and it will be added to the menu on the left and be the default screen when you load Phabricator.

Now you have to add searches to your dashboard. Go to Differential’s advanced search. Fill out the form to search for what you want. A quick example. Set “Reviewers” to “Current Viewer”, “Statuses” to “Needs Review”, then click Search. You should see any revisions waiting on you to review them. Tinker with the search settings and search all you like. Once you’re happy click “Use Results” and “Add to Dashboard”. Give your search a name and select your dashboard. Now your dashboard will display your search whenever loaded. Add as many searches as you like!

Here is my very simple dashboard that lists anything I have to review, revisions I am currently working on and an archive of closed work:

A Phabricator dashboard

Like it? I made it public and you can see it and install it to use yourself if you like!

Searchfox in VS Code

I spend most of my development time flipping back and forth between VS Code and Searchfox. VS Code is a great editor, but it has nowhere near the speed needed to do searches over the entire tree, at least on my machine. Searchfox on the other hand is pretty fast. But there’s something missing: I usually want to search Searchfox for something I found in the code, and then I want to open the file I found in Searchfox in my editor.

Luckily VS Code has a decent extension system that allows you to add new features, so I spent some time yesterday evening building an extension to integrate some of Searchfox’s functionality into VS Code. With the extension installed you can search Searchfox for something from the code editor or pop open an input box to write your own query. The results show up right in VS Code.

A screenshot of Searchfox displayed in VS Code
Searchfox in VS Code

Click on a result in Searchfox and it will open the file in an editor in VS Code, right at the line you wanted to see.

It’s pretty early code so the usual disclaimers apply: expect some bugs and don’t be too surprised if it changes quite a bit in the near term. You can check out the fairly simple code (rendering the Searchfox page is the hardest part of it) on GitHub.

If you want to give it a try, install the extension from the VS Code Marketplace or find it by searching for “Searchfox” in VS Code itself. Feel free to file issues for bugs or improvements that would be useful or of course submit pull requests of your own! I’d love to hear if you find it useful.

How do you become a Firefox peer? The answer may surprise you!

So you want to know how someone becomes a peer? Surprisingly the answer is pretty unclear. There is no formal process for peer status, at least for Firefox and Toolkit. I haven’t spotted one for other modules either. What has generally happened in the past is that from time to time someone will come along and say, “Oh hey, shouldn’t X be a peer by now?” to which I will say “Uhhh maybe! Let me go talk to some of the other peers that they have worked with”. After that magic happens and I go and update the stupid wiki pages, write a blog post and mail the new peers to congratulate them.

I’d like to formalise this a little bit and have an actual process that new peers can see and follow along to understand where they are. I’d like feedback on this idea, it’s just a straw-man at this point. With that I give you … THE ROAD TO PEERSHIP (cue dramatic music).

  1. Intro patch author. You write basic patches, request review and get them landed. You might have level 1 commit access, probably not level 3 yet though.
  2. Senior patch author. You are writing really good patches now, not just simple stuff: patches that touch multiple files, maybe even multiple areas of the product. Chances are you have level 3 commit access. Reviewers rarely find significant issues with your work (though it can still happen). Attention to details like maintainability and efficiency is important. If your patches are routinely getting backed out or failing tests then you’re not here yet.
  3. Intro reviewer. Before being made a full peer you should start reviewing simple patches, either by being the sole reviewer for a patch written by a peer or by doing an initial review before a peer does a final sign-off. Again, paying attention to maintainability and efficiency is important, as is being clear and polite in your instructions to the patch author and being open to discussion where disagreements happen.
  4. Full peer. You, your manager or a peer reach out to me showing me cases where you’ve completed the previous levels. I double-check with a couple of peers you’ve worked with. Congratulations, you made it! Follow up on review requests promptly. Be courteous. Redirect reviews that are outside your area of expertise.

Does this sound like a reasonable path? What criteria am I missing? I’m not yet sure what length of time we would expect each step to take, but I imagine that more senior contributors could skip straight to step 2.

Feedback welcome here or in private by email.

New Firefox and Toolkit module peers in Taipei!

Please join me in welcoming three new peers to the Firefox and Toolkit modules. All of them are based in Taipei and I believe that they are our first such peers which is very exciting as it means we now have more global coverage.

  • Tim Guan-tin Chien
  • KM Lee Rex
  • Fred Lin

I’ve blogged before about the things I expect from the peers and while I try to keep the lists up to date myself please feel free to point out folks you think may have been passed over.

Thoughts on the module system

For a long long time Mozilla has governed its code (and a few other things) as a series of modules. Each module covers an area of code in the source and has an owner and a list of peers, folks that are knowledgeable about that module. The full list of modules is public. In the early days the module system was everything. Module owners had almost complete freedom to evolve their module as they saw fit including choosing what features to implement and what bugs to fix. The folks who served as owners and peers came from diverse areas too. They weren’t just Mozilla employees, many were outside contributors.

Over time things have changed somewhat. Most of the decisions about what features to implement and many of the decisions about the priority of bugs to be fixed are now decided by the product management and user experience teams within Mozilla. Almost all of the module owners and most of the peers are now Mozilla employees. It might be time to think about whether the module system still works for Mozilla and if we should make any changes.

In my view the current module system provides two things worth talking about: a list of folks who are suitable reviewers for code, and an escalation path for when disagreements arise over how something should be implemented. Currently both are handled on a per-module basis: the module owner is the escalation path, and the module peers are the reviewers for the module’s code.

The escalation path is probably something that should remain as-is. We’re always going to need experts to defer decisions to, those experts often need to be domain specific as they will need to understand the architecture of the code in question. Maintaining a set of modules with owners makes sense. But what about the peers for reviewing code?

A few years ago both the Toolkit and Firefox modules were split into sub-modules with peers for each sub-module. We were saying that we trusted folks to review some code but didn’t trust them to review other code. This frequently became a problem when code changes touched more than one sub-module: a developer would have to get review from multiple peers, which made reviews go slower than we liked. So we dropped the sub-modules, and instead every peer was trusted to review any code in the Firefox or Toolkit module. The one stipulation we gave the peers was that they should turn away reviews if they didn’t feel they knew the code well enough to review it themselves. This change was a success. Of course for complicated work certain reviewers will always be more knowledgeable about a given area of code, but for simpler fixes the number of available reviewers is much larger than it would be otherwise.

I believe that we should make the same change for the entire code-base. Instead of having per-module peers, simply designate every existing peer as a “Mozilla code reviewer”, able to review code anywhere in the tree so long as they feel that they understand the change well enough. They should escalate architectural concerns to the module’s owner.

This does bring up the question of how patch authors find reviewers for their code if there is just one massive list of reviewers. That is where Bugzilla’s suggested reviewers feature comes in: we have per-component lists in Bugzilla of reviewers who are appropriate choices for that component. Patches that cover more than one component can choose whoever they like to do the review.