Good news – I’ve finally made some progress with performance improvements for v8. I’m not yet where I want to be (i.e., where I think the true “speed of light” for the KNLs should be), but as of last night I finally got a newly vectorized version that was already running around 30% faster than the previous one. There’s more potential for optimization in that version – it’s a complete rewrite, so a lot is still in flux – but 30-ish% is already enough to at least share this version. I’ll probably need tonight – and maybe tomorrow – for some more testing, burn-in, packaging, etc., but “something” should be coming up soon.
BTW: Just to explain where the entire v8 performance issue came from: In the past, the cryptonight family always stressed memory performance almost exclusively, with relatively little “compute” thrown in … yes, there was the AES encoding step, and the 64-bit multiply, but both are hardware-supported on CPUs and KNL, so the “true” cost was exclusively in the memory system. With v8, this has changed – there’s now some pretty nasty 64-bit division and double-precision floating-point multiply (plus a lot of additional gunk) in the inner loop, and these are pretty compute-intensive. To get these pieces fast I had to completely change the vectorization pattern in the inner core, and doing that is a pain if ever there was one: get a single bit wrong and you get a wrong result, and since all intermediary numbers are completely meaningless semi-random bit patterns it’s near impossible to debug in any reasonable way … all you can do is write gigabytes of log files of every operation performed, and compare them bit by bit.
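To make that debugging approach concrete, here’s a minimal sketch: run a scalar reference and the rewritten implementation of the same inner-loop step side by side, emit every intermediate as a raw hex bit pattern, and stop at the first mismatch. `step_ref`, `step_new`, `debug_compare`, and the constants are illustrative stand-ins I made up for this sketch, not the miner’s actual code.

```c
#include <stdio.h>
#include <stdint.h>

/* scalar reference: one step involving a 64-bit divide
 * (the |1 guards against division by zero, in the spirit of
 * cryptonight's own divisor masking) */
static uint64_t step_ref(uint64_t a, uint64_t b) {
    uint64_t q = a / (b | 1);
    return q ^ b;
}

/* restructured version that must stay bit-identical to step_ref */
static uint64_t step_new(uint64_t a, uint64_t b) {
    uint64_t d = b | 1;
    return b ^ (a / d);   /* same bits, different code shape */
}

/* run n chained steps, log both results, and return the index of
 * the first mismatch, or -1 if all intermediates agree */
static int debug_compare(FILE *log, int n) {
    uint64_t a = 0x9e3779b97f4a7c15ULL, b = 0x2545f4914f6cdd1dULL;
    for (int i = 0; i < n; i++) {
        uint64_t r0 = step_ref(a, b), r1 = step_new(a, b);
        if (log)   /* in practice these lines become gigabytes of logs */
            fprintf(log, "%d ref=%016llx new=%016llx\n", i,
                    (unsigned long long)r0, (unsigned long long)r1);
        if (r0 != r1) return i;   /* pinpoints the first broken operation */
        /* scramble the inputs so each step feeds the next */
        a = r0 * 6364136223846793005ULL + 1442695040888963407ULL;
        b ^= a >> 17;
    }
    return -1;
}
```

The payoff of hex-dumping raw bit patterns (rather than printing decimal values) is that a diff of the two logs lands you directly on the first operation whose output diverges, even when none of the numbers mean anything on their own.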
Anyway – that restructured code is now working. There’s lots of opportunity for further low-level optimization, and probably even a reasonable way of porting all of it to KNC, too (which needs the same kind of vectorization) … but at least in the short term I had to change a lot, and probably broke a lot (including the regular CPU version :-/), etc. It’ll take a day to clean up and release a first version, but from then on we’re back on an upward ramp. Happy!
With that – happy mining!
PS: Just to give you guys an idea of some of the things I had to deal with on this rewrite: The newly vectorized code needs to do some 64-bit integer divisions, and though KNL can do that in AVX512F, the respective intrinsic (_mm512_div_epu64) isn’t even supported in clang or gcc (not even in the latest top-of-tree builds, let alone in released versions); and though the Intel compiler does support this operation, you need the very latest Intel compiler to even run on Ubuntu 18; and …..
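When the intrinsic is missing, the obvious fallback is to spill the eight 64-bit lanes of the 512-bit register, divide them scalarly, and reload. Here’s a hedged sketch of that idea in portable C: a plain struct stands in for a `__m512i` so it compiles without AVX-512 support, and `u64x8` / `div_u64x8` are made-up names for this sketch, not real intrinsics.

```c
#include <stdint.h>

/* stand-in for a 512-bit register holding eight 64-bit lanes */
typedef struct { uint64_t lane[8]; } u64x8;

/* lane-wise unsigned 64-bit division -- the semantics that
 * _mm512_div_epu64 would provide if the compiler supported it */
static u64x8 div_u64x8(u64x8 a, u64x8 b) {
    u64x8 r;
    for (int i = 0; i < 8; i++)
        r.lane[i] = a.lane[i] / b.lane[i];   /* caller must guarantee b != 0 */
    return r;
}
```

In real AVX-512 code the same shape would be a store to a stack buffer, eight scalar `div` instructions, and a reload; the hardware divider isn’t vectorized either way, so the cost is mostly in the shuffling around it.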