The vast majority of time during training is spent in the dot product and scaled additions. We have been doing unaligned loads so far. I have made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads. The compiled machine code looks OK, and the compiler even performs some loop unrolling.
Unfortunately, using aligned data/loads does not seem to have a measurable impact on running time. This is probably because those functions are constrained by memory bandwidth.
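For reference, this is roughly the kind of change I mean. A minimal sketch in Rust with the `std::arch` SSE intrinsics; the function name and loop structure here are made up and not the actual code, the aligned vs. unaligned load is the relevant part:

```rust
use std::arch::x86_64::*;

/// Dot product of two f32 slices whose length is a multiple of 4.
///
/// Safety: the caller must ensure that `a` and `b` have equal length and
/// that both are 16-byte aligned (SSE itself is always available on x86_64).
unsafe fn dot_sse_aligned(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    debug_assert_eq!(a.len() % 4, 0);
    debug_assert_eq!(a.as_ptr() as usize % 16, 0);
    debug_assert_eq!(b.as_ptr() as usize % 16, 0);

    let mut acc = _mm_setzero_ps();
    let mut i = 0;
    while i < a.len() {
        // _mm_load_ps is the aligned load; the unaligned variant is
        // _mm_loadu_ps, which is the only difference to the old code path.
        let va = _mm_load_ps(a.as_ptr().add(i));
        let vb = _mm_load_ps(b.as_ptr().add(i));
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
        i += 4;
    }

    // Horizontal sum of the four accumulator lanes.
    let mut lanes = [0f32; 4];
    _mm_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}
```

Since the only difference to the unaligned path is `_mm_load_ps` vs. `_mm_loadu_ps`, it is consistent that the change does not help when the loop is limited by memory bandwidth rather than by the loads themselves.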
I just wanted to jot down two possible opportunities for reducing cache misses that might have an impact on performance.
Some papers that replace the core word2vec computations with kernels sample a set of negatives per sentence, rather than per token. In the best case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings.
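To make the contrast concrete, a minimal, self-contained sketch of per-token versus per-sentence sampling (hypothetical names, a toy RNG, and the actual parameter updates elided; this is not the real training loop):

```rust
/// Tiny xorshift RNG so the sketch has no external dependencies.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Draw `n` negatives from a precomputed unigram table (as in word2vec).
fn sample_negatives(table: &[usize], rng: &mut XorShift, n: usize) -> Vec<usize> {
    (0..n)
        .map(|_| table[rng.next() as usize % table.len()])
        .collect()
}

/// Per-token sampling: fresh negatives for every focus token, so every
/// token touches `n_negatives` mostly cold rows of the output matrix.
fn train_sentence_per_token(sentence: &[usize], table: &[usize], rng: &mut XorShift, n_negatives: usize) {
    for &_focus in sentence {
        let negatives = sample_negatives(table, rng, n_negatives);
        // ... update the focus/context pairs against `negatives` ...
        let _ = negatives;
    }
}

/// Per-sentence sampling: one shared set of negatives, so the same few
/// output rows stay hot in cache for the whole sentence.
fn train_sentence_per_sentence(sentence: &[usize], table: &[usize], rng: &mut XorShift, n_negatives: usize) {
    let negatives = sample_negatives(table, rng, n_negatives);
    for &_focus in sentence {
        // ... update the focus/context pairs against the shared `negatives` ...
        let _ = &negatives;
    }
}
```

The cache benefit comes entirely from reusing the shared `negatives`: their output-matrix rows are touched repeatedly instead of being re-drawn for every pair.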
The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, the lookups for subword units are randomized by the hash function. Maybe something can be gained by ordering the embeddings in the subword matrix by hash code frequency. However, in the most obvious implementation this would add an indirection (hash code -> index).
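If we tried it, the most obvious implementation would look something like the following sketch (hypothetical names, not the actual subword storage): count how often each hash bucket is hit, give the frequently hit buckets the lowest row indices, and keep a remapping table from bucket to row.

```rust
use std::collections::HashMap;

/// Build a remapping from hash bucket to row index so that frequently hit
/// buckets get the lowest rows. Frequent subword rows then cluster at the
/// start of the matrix, which may improve locality.
fn build_bucket_remap(bucket_counts: &HashMap<u64, u64>, n_buckets: u64) -> Vec<u64> {
    // Sort all buckets by descending hit count; unseen buckets count as 0.
    let mut by_freq: Vec<u64> = (0..n_buckets).collect();
    by_freq.sort_by_key(|b| std::cmp::Reverse(bucket_counts.get(b).copied().unwrap_or(0)));

    // remap[bucket] = row in the reordered subword matrix.
    let mut remap = vec![0u64; n_buckets as usize];
    for (row, bucket) in by_freq.into_iter().enumerate() {
        remap[bucket as usize] = row as u64;
    }
    remap
}

/// Lookups now pay the extra indirection: hash -> bucket -> row.
fn subword_row(remap: &[u64], hash: u64, n_buckets: u64) -> usize {
    remap[(hash % n_buckets) as usize] as usize
}
```

The second function is the indirection mentioned above; whether the locality gain pays for the extra table lookup (plus a pass over the corpus to count bucket hits) would have to be measured.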
The vast majority of time during training is spent in the dot product and scaled additions. We have been doing unaligned loads so far. I have made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads. The compiled machine code looks OK, and the compiler even performs some loop unrolling.
That's also what I found. Time spent on scaled additions explodes even more when using complex context schemes.
Some papers that replace the core word2vec computations with kernels sample a set of negatives per sentence, rather than per token. In the best case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings.
It would still be interesting to see what changes this brings, although I'm not sure what to expect. With rather short sentences and a large window, we might get cases where it's not even possible to draw any negatives without using something from the actual context window. Also, we would use tokens from the same sentence as negative examples, which could actually be related to the meaning of the focus word.
The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, the lookups for subword units are randomized by the hash function. Maybe something can be gained by ordering the embeddings in the subword matrix by hash code frequency. However, in the most obvious implementation this would add an indirection (hash code -> index).
Can't add much to this, but this sounds like something to test empirically?