Outages Happen

It took less than a week to cause an outage. Astute readers will have spotted the mistake in the last post. Here it is again:

docker run --detach \
      --name nginx-proxy \
      --publish 80:80 \
      --publish 443:443 \
      --volume /etc/nginx/certs \
      --volume /etc/nginx/vhost.d \
      --volume /usr/share/nginx/html \
      --volume /var/run/docker.sock:/tmp/docker.sock:ro \

docker run --detach \
      --name letsencrypt \
      --volumes-from nginx-proxy \
      --volume /var/run/docker.sock:/var/run/docker.sock:ro \
      --env "DEFAULT_EMAIL=encrypt@jackpenson.dev" \

Standard copy-paste without thinking, in a rush to get stuff working. It’s also the pain of not having any peer-review process. What I hadn’t done is mount the certs directory from the host system, so every new nginx-proxy container got a fresh, empty anonymous volume for /etc/nginx/certs. I’d been restarting the containers repeatedly while pushing updates to the first post and playing with different themes and Jekyll configs, so I had been requesting new certificates the whole time.
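In lieu of a reviewer, a small check could have flagged this. The sketch below is not part of the original setup: anonymous_volumes is a hypothetical helper that scans a startup script for --volume arguments that have no host path or volume name in front of the container path. It only understands the long --volume form used above, not -v or --mount.

```shell
# Print every --volume argument in a script that names only a container
# path (no ':' anywhere). Docker creates a new anonymous volume for each
# of these whenever a new container is created with `docker run`.
# Exit status is 0 if at least one was found, non-zero otherwise.
anonymous_volumes() {
    # $1: path to the startup script to check (hypothetical, e.g. ./start.sh)
    grep -Eo -- '--volume[ =]+[^ \\]+' "$1" \
        | sed -E 's/^--volume[ =]+//' \
        | grep -v ':'
}
```

Run against the two docker run commands above, this should list /etc/nginx/certs, /etc/nginx/vhost.d and /usr/share/nginx/html, and skip the docker.sock mounts because they already name a host path.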

I then ran headlong into the Let’s Encrypt rate limiter. So on my final update I was greeted with this lovely page:

Danger ahead!

Oh no! How peculiar, what on earth had happened to the certs?

$ docker logs letsencrypt

Creating/renewal jackpenson.dev certificates... (jackpenson.dev)
2020-10-27 18:58:37,595:INFO:simp_le:1359: Generating new account key
2020-10-27 18:58:42,414:INFO:simp_le:1387: By using simp_le, you implicitly agree to the CA's terms of service: https://letsencrypt.org/documents/LE-SA-v1.2-November-15-2017.pdf
2020-10-27 18:58:42,767:INFO:simp_le:1450: Generating new certificate private key
ACME server returned an error: urn:ietf:params:acme:error:rateLimited :: There were too many requests of a given type :: Error creating new order :: too many certificates already issued for exact set of domains: jackpenson.dev: see https://letsencrypt.org/docs/rate-limits/

By default letsencrypt-nginx-proxy-companion will supply a self-signed cert if it can’t obtain a valid one from Let’s Encrypt. You could see this in the browser:

self-signed certificate

When I looked back at my hacky startup script it was obvious that I was throwing the certs away. What I should have done is:

docker run --detach \
      --name nginx-proxy \
      --volume /local/path/certs:/etc/nginx/certs \

However, I’ve been throwing all these certs away and now I really need them. I can’t make the site http only after lambasting it in my first post!

Docker has to keep these volumes somewhere on the disk, so where? /var/lib/docker/volumes:

$ ls /var/lib/docker/volumes/


Ah. That’s not super helpful.

That’s truncated as there are apparently 105 orphaned volumes:

$ find /var/lib/docker/volumes/* -maxdepth 0 -type d | wc -l

So, it’s probably not a great idea to manually walk through this sea of 64-character hex IDs trying to find a valid cert.


I know unix

The plan:

  1. Spin up the containers with the correct mount
  2. Find those files in /var/lib/docker/volumes
  3. Rescue the files and restart the container to get https working again

Step 1

$ ./start.sh

$ ls /local/path/certs

accounts  jackpenson.dev  default.crt  default.key  dhparam.pem

Step 2

$ find /var/lib/docker/volumes -name default.crt


Now that we’ve found the certs we can check whether they’re the correct ones using openssl:

$ openssl x509 -in default.crt -noout -issuer
issuer=CN = letsencrypt-nginx-proxy-companion

Um, that’s not it chief. It looks like the default.crt isn’t the correct file.

What do these files look like?

$ cat default.crt


Hmm, well we know that certificate files will have CERTIFICATE as the header, so we could grep for that and see if we find other files:

$ grep -rnw -e 'CERTIFICATE' /var/lib/docker/volumes

/var/lib/docker/volumes/6ef9ac7cde6b4cac3d211cd9a8509ea0c5b550eae90743d400f6480e92c18a50/_data/default.crt:30:-----END CERTIFICATE-----
/var/lib/docker/volumes/c4c50b3dd0779e13b1a9abe254ef280a58740c8698d72cc8a56e4e23d4ca4d69/_data/jackpenson.dev/chain.pem:1:-----BEGIN CERTIFICATE-----

That’s the ticket. jackpenson.dev/chain.pem certainly sounds more like it.

Oh wait, we’ve got a jackpenson.dev directory in /local/path/certs. I wonder what’s in there:

$ ls /local/path/certs/jackpenson.dev
account_key.json  account_reg.json

Aha. It looks like chain.pem is part of the real certificate setup.

If we go have a look in that _data directory we find a jackpenson.dev.crt too, excellent news!

And if we open that one up:

$ openssl x509 -in jackpenson.dev.crt -noout -issuer
issuer=C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
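The grep and openssl steps above can be combined into one pass. This is only a sketch: list_cert_issuers is a made-up helper name, and it assumes openssl is installed and that you can read the volume directory (which usually means running as root).

```shell
# Print "<path>: <issuer line>" for every PEM certificate found under a directory.
list_cert_issuers() {
    # $1: directory to search; defaults to the volume root used in this post
    root="${1:-/var/lib/docker/volumes}"
    grep -rl -e '-----BEGIN CERTIFICATE-----' "$root" 2>/dev/null |
    while IFS= read -r file; do
        # openssl reads the first certificate in the file; skip anything it cannot parse
        issuer=$(openssl x509 -in "$file" -noout -issuer 2>/dev/null) || continue
        printf '%s: %s\n' "$file" "$issuer"
    done
}
```

Piping the output through grep for "Let's Encrypt" would then leave only the volumes holding a real certificate rather than the self-signed default.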

Step 3

So we’ll grab that mount, replace our certs and restart all the things.

$ cp -R /var/lib/docker/volumes/c4c50b3dd0779e13b1a9abe254ef280a58740c8698d72cc8a56e4e23d4ca4d69/_data/. /local/path/certs/
$ ./kill.sh && ./start.sh
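If I were doing this again I’d probably wrap the copy in something a little more careful. A minimal sketch, where rescue_certs is a hypothetical helper: it saves whatever is currently in the destination, then copies the contents of the volume’s _data directory rather than the directory itself, so the files do not end up nested under certs/_data.

```shell
# Copy the contents of a recovered volume into the bind-mounted certs
# directory, keeping a timestamped copy of what was there before.
rescue_certs() {
    # $1: the volume's _data directory; $2: the certs directory on the host
    src="${1%/}"
    dest="${2%/}"
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    # Keep the current contents around, just in case
    if [ -d "$dest" ]; then
        cp -a "$dest" "$dest.bak.$(date +%s)" || return 1
    fi
    mkdir -p "$dest" &&
    # The trailing "/." copies what is inside src, not src itself
    cp -a "$src"/. "$dest"/
}
```

It would be called with the _data path found by grep as the first argument and /local/path/certs as the second, ideally while the containers are stopped.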

jackpenson.dev working



All in all, it took longer to deconstruct the outage and turn it into a post than it took to resolve it. Luckily, as this had all been on the same host, the Docker volumes had stuck around. Had these been ephemeral hosts, I would have lost the volumes and had to wait an entire week for the rate limit to clear before getting the site back on https. I suppose at that point I really would have had to hang my head in shame and host the site as http for the time being.

Luckily, in my laziness I hadn’t torn down the host yet. I will need to do so soon when I write Terraform for the DigitalOcean environment. Hopefully this will remind me to take backups before I delete it all.
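For the record, the backup could be as small as this. It is a sketch, not something I have wired into the setup: backup_certs and the backup location are made up, and the certs directory is whatever path you bind-mounted.

```shell
# Archive the certs directory into a timestamped tarball and print its path.
backup_certs() {
    # $1: certs directory on the host; $2: directory to write the archive into
    src="$1"
    dest_dir="$2"
    archive="$dest_dir/certs-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$dest_dir" || return 1
    # umask 077 keeps the archive private, since it contains private keys;
    # -C stores paths relative to the certs directory
    ( umask 077 && tar -czf "$archive" -C "$src" . ) || return 1
    printf '%s\n' "$archive"
}
```

The archive still has to be copied off the host before the host is destroyed, otherwise it goes down with everything else.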