Background
This post assumes that you know roughly what HTTPS is, among other things. ‘We’ is used liberally here and refers to the Wikimedia technical community in general – I am a volunteer and don’t work for the foundation or any national chapters.
Wikimedia has several canonical domain names, the ones everyone knows about – wikipedia.org, wiktionary.org, wikibooks.org, wikiquote.org, and so on. These are fine, and HTTPS has been used to secure connections on them for a few years now.
Unfortunately, over the years we’ve also accumulated many non-canonical domains that simply redirect to the canonical ones. That mess includes domains owned by the foundation but not set up in the foundation’s name servers, domains owned by entities other than the foundation and pointed at Wikipedia via external hosting, and – here’s the group we’re interested in today – domains that are owned by the foundation, are set up in its name servers, and have redirects served for them by the foundation’s web servers.
Historically, you had to spend a lot of money to get certificates for your domain, and Wikimedia had enough of these redirect-only domains sitting around that the cost of buying HTTPS certificates to cover them all would be prohibitive. So these domains are accessed over plain HTTP only.
Fortunately, in April 2016 an HTTPS certificate provider named Let’s Encrypt launched, offering free certificates through a fully automated API named ACME. Wikimedia has since begun to use the service for some obscure ‘simple’ production services such as Gerrit (the code review system for developers) and some private systems, and in August 2016 I used it to finally implement trusted HTTPS on the beta.wmflabs.org sites. To keep this process simple, we use a script named acme_tiny.py, written by Daniel Roesler.
The problem
In all the cases where we’ve implemented this so far, decryption only needs to happen on a single server. That is good enough for certain obscure services with just one server handling encryption, but it’s no good for services that need to be handled by multiple such servers – e.g. anything that needs to be spread out across the data centres that Wikimedia rents space in. This is because of two things:
- When you ask the ACME API for a certificate, you need to prove that you control the domain(s) the certificate will cover (Domain Validation, or DV). The common way of doing this (and the only one we use right now) is to place a file, whose name and content the ACME server specifies, into a /.well-known/acme-challenge/ directory on the web server for your domain. Unfortunately this means that if one server requests a certificate for the domain, all the other servers must be able to serve a challenge that only the requesting server received – so the file would need to be distributed between them, which is non-trivial without opening up various security cans of worms.
- The ACME server we’re using (Let’s Encrypt) applies rate limiting, so multiple servers independently requesting certificates for a domain would likely trigger the limits (and we’d probably trip them anyway with the sheer number of different domains we need to cover).
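To make the challenge mechanism concrete, here is a rough sketch (standard-library Python only) of how the value served from /.well-known/acme-challenge/ is built: the token comes from the ACME server, and the thumbprint is derived from the account key’s JWK. The JWK values below are made-up placeholders, not a real key.

```python
import base64
import hashlib
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as ACME requires."""
    return base64.urlsafe_b64encode(data).decode().rstrip("=")

def key_authorization(token: str, jwk: dict) -> str:
    """Build the HTTP challenge response: token.thumbprint.

    The thumbprint is the base64url SHA-256 digest of the account
    key's JWK, serialized with sorted keys and no whitespace.
    """
    canonical = json.dumps(jwk, sort_keys=True, separators=(",", ":"))
    thumbprint = b64url(hashlib.sha256(canonical.encode()).digest())
    return f"{token}.{thumbprint}"

# Placeholder account key; a real one would be the RSA/EC public key's JWK.
jwk = {"e": "AQAB", "kty": "RSA", "n": "abc123"}
token = "example-token"

# This string is what must be returned at
#   http://<domain>/.well-known/acme-challenge/<token>
print(key_authorization(token, jwk))
```

The point of the sketch is that the response depends on the account key of whichever machine made the request – which is exactly why every front-end server would otherwise need access to material only one server holds.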
So, if we want to start protecting our redirect domains with free certificates, and serve them in a way that can handle our traffic volume, we have to come up with a way of centralising certificate generation while securely distributing the certificates’ private parts to the authorised internal servers.
The solution
Brandon Black, Faidon Liambotis and I finally had the opportunity to sit down at the Hackathon and discuss exactly how we were going to do this. The basic plan (which I’ve begun to implement) is to have a central server that is responsible for requesting certificates and that runs an internal-facing service serving the certificates’ private/public parts to the authorised hosts, with the front-end servers forwarding HTTP challenge requests through to the central service that made the request. Some of the details are far more complicated and technical, but that’s the basic idea.
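The forwarding part of the plan can be sketched as a simple routing decision on each external-facing web server: anything under the ACME challenge path goes to the central certificate machine, and everything else is handled locally (for a redirect domain, that means serving the redirect). The central hostname here is hypothetical.

```python
# Hypothetical hostname for the central certificate machine.
CENTRAL_HOST = "certcentral.example.internal"
CHALLENGE_PREFIX = "/.well-known/acme-challenge/"

def route(path: str) -> str:
    """Decide where an incoming request should be handled.

    ACME HTTP challenges are forwarded to the central host that
    requested the certificate; everything else is served locally.
    """
    if path.startswith(CHALLENGE_PREFIX):
        return f"http://{CENTRAL_HOST}{path}"
    return "local"

print(route("/.well-known/acme-challenge/tok"))  # forwarded to the central host
print(route("/wiki/Main_Page"))                  # handled locally
```

In practice this would be a proxy rule in the web server configuration rather than application code, but the decision logic is the same.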
I’ve already got a basic setup running at https://krenair.hopto.org/ (this actually points to a server running in Cloud VPS; I didn’t use wmflabs.org because domain name editing for it is broken right now). You can track the work on this system in Phabricator at https://phabricator.wikimedia.org/T194962.
What next?
Right now my solution relies on some bad code that I need to clean up, and the ‘client’ end (the external-facing web server) also needs a bit of work to make the Puppet configuration easy and to remove some hard-coded assumptions. At some point we will need to determine exactly what the configuration should be for the redirect domains, and Wikimedia Foundation Operations, should they decide the system is good, will need to determine how best to deploy it in production.
Another thing we plan to do is move to our configuration management system’s built-in support for pulling files securely from a central machine. Then of course there’s the recently-added support for wildcard certificates, which can only be validated via DNS (the DNS-01 challenge). To support that we’ll need a customised acme_tiny script, and for production we’ll need to build support for dynamic record creation into our name server, gdnsd (in labs this is handled by OpenStack Designate, which already has an API to achieve this, once permissions etc. have been sorted). In the distant future, after the above is done, it may actually be possible to add this as one of the certificate options for the canonical domains (wikipedia.org and friends) – Wikimedia keeps certificates from multiple authorities just in case there are problems with one of them – which would mean we could serve those domains using certificates provided through an ACME API.
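For the DNS-based validation that wildcard certificates require, the name server has to publish a TXT record at _acme-challenge.&lt;domain&gt; whose value is the base64url-encoded SHA-256 digest of the key authorization – which is why dynamic record creation is needed in gdnsd. A rough sketch, with a made-up key authorization and domain:

```python
import base64
import hashlib

def b64url(data: bytes) -> str:
    """Base64url-encode without padding."""
    return base64.urlsafe_b64encode(data).decode().rstrip("=")

def dns01_txt_value(key_authorization: str) -> str:
    """Value to publish in the _acme-challenge TXT record:
    the base64url SHA-256 digest of the key authorization."""
    return b64url(hashlib.sha256(key_authorization.encode()).digest())

# Made-up key authorization, for illustration only.
value = dns01_txt_value("example-token.example-thumbprint")
print(f'_acme-challenge.example.org. 300 IN TXT "{value}"')
```

The central service would compute this value and ask the name server’s API (Designate in labs, a yet-to-be-built mechanism in gdnsd) to publish the record, then tell the ACME server to validate.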