Performance hints #1

Yawning · 2021-10-25T15:30:29Z

Shrek is the fastest .onion vanity address generator written in Go (at time of writing), but it's still very slow compared to native C programs like mkp224o.

As fast as my SSE2/AVX2 scalar-basepoint multiply implementation is, it is not going to compare to mkp244o because it cheats and only does one scalar-basepoint multiply per worker in the common case.

The gist of it is that instead of regenerating the entire key (expensive) each iteration of the search loop, you instead do something like:

eightTimesBasePoint = Point.Mul(Basepoint, 8) # Precompute this, though this is only called once/worker, so you could not bother.

priv, pub = GenerateKey()
scalar = ParseScalar(priv[:32])
point = ParsePoint(pub)
for {
  if Match(Encode(point) {
    // blah blah blah
  }

  scalar = scalar.Add(scalar, 1)
  point = point.Add(point, eightTimesBasePoint)
}

This does have the downside of not supporting RFC-style seed based private key representations, but IIRC tor supports/uses the expanded scalar | nonce form so keys generated by this algorithm work.

The relevant datastructures and math routines are in the curve and curve/scalar package of your ed25519 import.

nb: mkp244o does the scalar addition somewhat differently because of how I originally implemented it in the code they derived their implementation from, but the addition routine in the scalar package is sufficient (and arguably better because it is computationally impractical for wrapping to happen, so you can omit the check).

The text was updated successfully, but these errors were encountered:

innix · 2022-01-04T18:47:01Z

Hey, thanks for the tips and apologies for replying so late. I finally had some free time over Xmas to work on some side projects 😄

Using your pseudo-code above and using horse25519's code as a reference, I've been able to write up a Go implementation that skips doing the full key generation / scalar multiplication on every iteration. I had some problems with your scalar.Add method in the curve25519-voi library, so I borrowed the addsztoscalar32 function from mkp244o and got it working that way. Maybe I was doing something wrong; crypto isn't really my area of expertise.

This new implementation is much faster; around 4x faster than before. I think it's getting close to being the same performance as mkp2440p when running it using the -z flag. Unfortunately my app still gets obliterated by mkp2440p when it uses -B batching mode. It's so much faster it's kinda unbelievable.

innix added the performance New feature or change to improve software performance label Apr 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance hints #1

Performance hints #1

Yawning commented Oct 25, 2021

innix commented Jan 4, 2022

Performance hints #1

Performance hints #1

Comments

Yawning commented Oct 25, 2021

innix commented Jan 4, 2022