diff --git a/README.md b/README.md
index 520d8ff..121deec 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,128 @@
 # ryu
 
+[![GoDoc](https://godoc.org/github.com/cespare/ryu?status.svg)](https://godoc.org/github.com/cespare/ryu)
+
 This is a Go implementation of [Ryu](https://github.com/ulfjack/ryu), a fast
 algorithm for converting floating-point numbers to strings.
 
-TODO: more description.
+The API is:
+
+```
+func AppendFloat32(b []byte, f float32) []byte
+func AppendFloat64(b []byte, f float64) []byte
+func FormatFloat32(f float32) string
+func FormatFloat64(f float64) string
+```
+
+These functions are equivalent to calling strconv.FormatFloat or
+strconv.AppendFloat with format `'e'` and precision `-1`:
+
+```
+// These produce the same string:
+const f float32 = 1.234
+s1 := ryu.FormatFloat32(f)
+s2 := strconv.FormatFloat(float64(f), 'e', -1, 32)
+```
+
+## Benchmarks
+
+These benchmarks were taken with Go 1.12beta1 on Linux/amd64 using an
+Intel i7-8700K. "Old" is strconv and "new" is ryu.
+
+```
+name                                     old time/op    new time/op    delta
+FormatFloat32-12                            128ns ± 1%      50ns ± 2%  -60.82%  (p=0.000 n=7+8)
+FormatFloat64-12                            129ns ± 4%      65ns ± 5%  -49.54%  (p=0.000 n=7+8)
+AppendFloat32/0e+00-12                     24.4ns ± 1%     3.0ns ± 1%  -87.88%  (p=0.000 n=8+8)
+AppendFloat32/1e+00-12                     26.5ns ± 1%    13.2ns ± 3%  -49.98%  (p=0.000 n=8+8)
+AppendFloat32/3e-01-12                     52.2ns ± 1%    32.5ns ± 2%  -37.73%  (p=0.000 n=8+8)
+AppendFloat32/1e+06-12                     41.2ns ± 1%    17.9ns ± 1%  -56.45%  (p=0.000 n=8+7)
+AppendFloat32/-1.2345e+02-12               83.3ns ± 2%    34.2ns ± 1%  -58.90%  (p=0.000 n=8+8)
+AppendFloat64/0e+00-12                     24.5ns ± 2%     3.3ns ± 2%  -86.50%  (p=0.000 n=8+8)
+AppendFloat64/1e+00-12                     26.9ns ± 1%    14.5ns ± 1%  -46.06%  (p=0.001 n=8+6)
+AppendFloat64/3e-01-12                     53.0ns ± 1%    42.5ns ± 0%  -19.75%  (p=0.001 n=8+6)
+AppendFloat64/1e+06-12                     41.4ns ± 1%    21.1ns ± 1%  -49.05%  (p=0.000 n=8+8)
+AppendFloat64/-1.2345e+02-12               83.8ns ± 1%    43.3ns ± 1%  -48.32%  (p=0.000 n=8+8)
+AppendFloat64/6.226662346353213e-309-12    25.5µs ± 1%     0.0µs ± 1%  -99.84%  (p=0.000 n=8+8)
+```
+
+The test `TestRandomBenchmark` gathers statistics about the distribution of call
+latencies for random float64 values. Here is the summary for one sample of 50,000
+random floats:
+
+```
+    ryu_test.go:279: after sampling 50000 float64s:
+        ryu:               min = 2ns  max = 90ns     median = 41ns   mean = 41ns
+        strconv (stdlib):  min = 8ns  max = 25845ns  median = 106ns  mean = 154ns
+```
+
+The `strconv.FormatFloat` latency is bimodal because of an infrequently taken
+slow path that is orders of magnitude more expensive
+(https://golang.org/issue/15672).
+
+## Size optimization
+
+The Ryu algorithm requires several lookup tables. Ulf Adams's C library
+implements a size optimization (`RYU_OPTIMIZE_SIZE`) which greatly reduces the
+size of the float64 tables in exchange for a little more CPU cost.
+
+I have a WIP implementation of this optimization on the `size` branch. A binary
+built using that version is 7.96 kB smaller. Several float64 benchmarks slow
+down by up to 18% compared with the non-size-optimized build:
+
+```
+name                                     old time/op    new time/op    delta
+FormatFloat32-12                           50.0ns ± 2%    49.4ns ± 1%     ~     (p=0.183 n=8+8)
+FormatFloat64-12                           65.0ns ± 5%    72.1ns ± 5%  +10.96%  (p=0.000 n=8+8)
+AppendFloat32/0e+00-12                     2.95ns ± 1%    2.98ns ± 1%     ~     (p=0.072 n=8+8)
+AppendFloat32/1e+00-12                     13.2ns ± 3%    13.1ns ± 1%     ~     (p=0.275 n=8+8)
+AppendFloat32/3e-01-12                     32.5ns ± 2%    32.4ns ± 1%     ~     (p=0.742 n=8+8)
+AppendFloat32/1e+06-12                     17.9ns ± 1%    17.6ns ± 1%   -2.12%  (p=0.001 n=7+8)
+AppendFloat32/-1.2345e+02-12               34.2ns ± 1%    34.4ns ± 1%     ~     (p=0.426 n=8+8)
+AppendFloat64/0e+00-12                     3.31ns ± 2%    3.29ns ± 1%     ~     (p=0.394 n=8+8)
+AppendFloat64/1e+00-12                     14.5ns ± 1%    14.6ns ± 4%     ~     (p=0.641 n=6+8)
+AppendFloat64/3e-01-12                     42.5ns ± 0%    50.0ns ± 1%  +17.44%  (p=0.001 n=6+8)
+AppendFloat64/1e+06-12                     21.1ns ± 1%    21.1ns ± 2%     ~     (p=0.452 n=8+8)
+AppendFloat64/-1.2345e+02-12               43.3ns ± 1%    50.9ns ± 1%  +17.57%  (p=0.000 n=8+8)
+AppendFloat64/6.226662346353213e-309-12    40.6ns ± 1%    47.7ns ± 1%  +17.38%  (p=0.000 n=8+8)
+```
+
+However, the size-optimized build is still faster than strconv in these benchmarks:
+
+```
+name                                     old time/op    new time/op    delta
+FormatFloat32-12                            129ns ± 2%      49ns ± 1%  -61.72%  (p=0.000 n=8+8)
+FormatFloat64-12                            130ns ± 3%      72ns ± 5%  -44.32%  (p=0.000 n=7+8)
+AppendFloat32/0e+00-12                     24.5ns ± 2%     3.0ns ± 1%  -87.83%  (p=0.000 n=8+8)
+AppendFloat32/1e+00-12                     26.4ns ± 1%    13.1ns ± 1%  -50.26%  (p=0.000 n=7+8)
+AppendFloat32/3e-01-12                     52.6ns ± 2%    32.4ns ± 1%  -38.43%  (p=0.000 n=8+8)
+AppendFloat32/1e+06-12                     41.3ns ± 2%    17.6ns ± 1%  -57.51%  (p=0.000 n=8+8)
+AppendFloat32/-1.2345e+02-12               83.5ns ± 1%    34.4ns ± 1%  -58.82%  (p=0.000 n=8+8)
+AppendFloat64/0e+00-12                     24.6ns ± 2%     3.3ns ± 1%  -86.63%  (p=0.000 n=8+8)
+AppendFloat64/1e+00-12                     26.7ns ± 1%    14.6ns ± 4%  -45.51%  (p=0.000 n=8+8)
+AppendFloat64/3e-01-12                     52.7ns ± 1%    50.0ns ± 1%   -5.17%  (p=0.000 n=8+8)
+AppendFloat64/1e+06-12                     41.2ns ± 1%    21.1ns ± 2%  -48.61%  (p=0.000 n=7+8)
+AppendFloat64/-1.2345e+02-12               83.7ns ± 1%    50.9ns ± 1%  -39.17%  (p=0.000 n=8+8)
+AppendFloat64/6.226662346353213e-309-12    25.8µs ± 2%     0.0µs ± 1%  -99.81%  (p=0.000 n=8+8)
+```
+
+## Notes
+
+This package is a fairly direct Go translation of Ulf Adams's C library at
+https://github.com/ulfjack/ryu. This code is also licensed with Apache 2.0 as a
+derived work of that code.
+
+This package requires Go 1.12 (expected to be released February 2019).
+
+For a small fraction of inputs, Ryu gives a different value than strconv does
+for the last digit. This is due to a bug in strconv: https://golang.org/issue/29491.
+
+## Future work
+
+My plan is to incorporate this into strconv (see
+https://golang.org/issue/15672). Then everyone will benefit from the faster
+algorithm and there will be no need for this library.
+
+If you would like to contribute, I'm interested in any bugfixes or clear-cut
+optimizations, but given the above I don't intend to add more features or APIs
+to this package.
diff --git a/go.mod b/go.mod
index 877495b..a24fc5d 100644
--- a/go.mod
+++ b/go.mod
@@ -1,5 +1,3 @@
 module github.com/cespare/ryu
 
 go 1.12
-
-require github.com/kr/pretty v0.1.0
diff --git a/go.sum b/go.sum
index a1aa49e..e69de29 100644
--- a/go.sum
+++ b/go.sum
@@ -1,5 +0,0 @@
-github.com/kr/pretty v0.1.0 h1:L/CwN0zerZDmRFUapSPitk6f+Q3+0za1rQkzVuMiMFI=
-github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
-github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
-github.com/kr/text v0.1.0 h1:45sCR5RtlFHMR4UwH9sdQ5TC8v0qDQCHnXt+kaKSTVE=
-github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
diff --git a/ryu.go b/ryu.go
index bbedc43..f149243 100644
--- a/ryu.go
+++ b/ryu.go
@@ -15,6 +15,8 @@
 // Ulf Adams which may be found at https://github.com/ulfjack/ryu. That source
 // code is licensed under Apache 2.0 and this code is derivative work thereof.
 
+// Package ryu implements the Ryu algorithm for quickly converting
+// floating-point numbers into strings.
 package ryu
 
 import (
diff --git a/ulfjack/.gitignore b/ulfjack/.gitignore
deleted file mode 100644
index a1675c9..0000000
--- a/ulfjack/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-/bench
diff --git a/ulfjack/bench.c b/ulfjack/bench.c
deleted file mode 100644
index ec121f8..0000000
--- a/ulfjack/bench.c
+++ /dev/null
@@ -1,45 +0,0 @@
-#include <stdint.h>
-#include <stdio.h>
-#include <time.h>
-#include <unistd.h>
-
-#include "ryu/ryu.h"
-
-int64_t time_sub(const struct timespec *t0, const struct timespec *t1) {
-  int64_t nsec = (int64_t)t0->tv_sec * 1000000000 + (int64_t)t0->tv_nsec;
-  nsec -= (int64_t)t1->tv_sec * 1000000000 + (int64_t)t1->tv_nsec;
-  return nsec;
-}
-
-int main(int argc, char **argv) {
-  printf("%s\n", f2s((float)(6.400023450830159e+08)));
-  return 0;
-
-  struct timespec start, end;
-  int64_t elapsed;
-  int64_t iters = 0;
-
-  char buf[40];
-  int sink;
-
-  clock_gettime(CLOCK_MONOTONIC, &start);
-  for (;;) {
-    for (int i = 0; i < 10000; i++) {
-      d2s_buffered(1.0, buf);
-      sink += buf[2];
-    }
-    clock_gettime(CLOCK_MONOTONIC, &end);
-
-    iters += 10000;
-    elapsed = time_sub(&end, &start);
-    if (elapsed >= 1000000000) {
-      break;
-    }
-  }
-
-  double secs = (double)elapsed / 1000000000.0;
-  printf("%lu iters in %lf secs: %.2lf ns/iter\n", iters, secs, (double)elapsed / (double)iters);
-  if (argc == 1000) {
-    printf("%d\n", sink);
-  }
-}
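
Note on the README hunk above: it states that the ryu functions are equivalent to strconv.FormatFloat and strconv.AppendFloat with format `'e'` and precision `-1`. The sketch below shows what that output looks like, using only strconv so that it runs without this module. The helper names `formatE32` and `appendE64` are hypothetical and are not part of the ryu API; per the README, `ryu.FormatFloat32` and `ryu.AppendFloat64` should produce the same text, apart from the rare last-digit difference mentioned under Notes.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatE32 is a hypothetical stand-in for ryu.FormatFloat32. It calls
// strconv, which the README describes as producing the same result: the
// shortest 'e'-format string that parses back to the same float32.
func formatE32(f float32) string {
	return strconv.FormatFloat(float64(f), 'e', -1, 32)
}

// appendE64 is a hypothetical stand-in for ryu.AppendFloat64. It appends
// the 'e'-format text of f to b and returns the extended slice.
func appendE64(b []byte, f float64) []byte {
	return strconv.AppendFloat(b, f, 'e', -1, 64)
}

func main() {
	fmt.Println(formatE32(1.234)) // 1.234e+00

	// Append-style calls let one buffer be reused across conversions,
	// which avoids allocating a new string per value.
	buf := make([]byte, 0, 32)
	for _, f := range []float64{0, 1, 0.3, 1e6, -123.45} {
		buf = appendE64(buf[:0], f)
		fmt.Printf("%s\n", buf)
	}
}
```

The five values in the loop are the inputs named in the benchmark tables, and their formatted forms (`0e+00`, `1e+00`, `3e-01`, `1e+06`, `-1.2345e+02`) match the benchmark sub-test names.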