Google TPU details, comparison


Postby dobkeratops » Wed Apr 05, 2017 7:08 pm

https://drive.google.com/file/d/0Bx4haf ... xtcEk/view

.. so similarities: 28 MB of on-chip 'software-managed' (scratchpad?) memory.

Differences: they seem to focus exclusively on 8-bit multiplies, and on inference only; it seems far more focused on one algorithm. I would much rather have the versatility of Epiphany-style RISC cores.

So the TPU seems to be some sort of huge 8-bit matrix multiplier with one giant scratchpad? Would the Epiphany's memory architecture still offer an advantage for Convolutional Neural Networks (keeping filters in distinct scratchpads, closer to individual ALUs), plus multi-chip scalability? Maybe they can still keep some coefficients in the matrix array (I haven't quite read the details there..)
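
To make the comparison concrete, here is a minimal sketch of the kind of operation the TPU's matrix unit is built around: an 8-bit matrix multiply with wide (32-bit) accumulation. The function name and shapes are my own illustration, not anything from the paper.

```python
import numpy as np

def matmul_int8(a, b):
    """8-bit matrix multiply with 32-bit accumulation --
    roughly the single operation a big matrix unit is specialized for."""
    assert a.dtype == np.int8 and b.dtype == np.int8
    # Widen before multiplying so the accumulators don't overflow int8.
    return a.astype(np.int32) @ b.astype(np.int32)

a = np.array([[1, -2], [3, 4]], dtype=np.int8)   # activations
w = np.array([[5, 6], [7, 8]], dtype=np.int8)    # weights
print(matmul_int8(a, w))  # [[-9 -10] [43 50]]
```

A programmable many-core design would build the same result out of many small multiply-accumulates, which is where per-core scratchpads holding filter slices could matter.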

Also it seems to be driven by the host, with big CISC instructions (doing entire matrix-multiply ops?)

Epiphany would be equally useful for training, I would guess, which the TPU isn't.

Any details yet on what they meant by the deep-learning instructions in the E5? I would have guessed that means some low-precision support. I would personally be happy even if it just had popcount (for cryptography?), because there are 1-bit neural-net techniques out there. (Does 'communications' also refer to low-precision support, i.e. encoding of signals?)
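
For reference, the 1-bit trick mentioned above reduces a dot product of ±1 vectors to XNOR plus popcount, which is why a popcount instruction matters. A small sketch (the packing convention, bit 1 = +1 and bit 0 = -1, is my own choice for illustration):

```python
def binary_dot(x_bits, w_bits, n):
    """Dot product of two n-element vectors of +/-1 values, each packed
    into an int (bit 1 = +1, bit 0 = -1): XNOR counts matching signs,
    popcount tallies them, and we map back to the +/-1 range."""
    matches = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

# x = [+1, +1, -1, +1] -> 0b1011 (bit 0 first), w = [+1, -1, +1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))  # (+1)(+1) + (+1)(-1) + (-1)(+1) + (+1)(+1) = 0
```

With a hardware popcount, each word-width chunk of this costs only a couple of instructions.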

I suppose this also confirms that a relatively simple software library can be useful; they claim the back end is <1500 lines of code.
This hardware seems limited to a few functions.

Would the TPU have been a simpler chip to design and implement (e.g. more dependent on a host to drive it)? Whilst I can imagine it would outperform the Epiphany in 8-bit matrix performance, I still hope you could take an Epiphany-style design and skew it for different workloads, varying the number of functional units and custom instructions, whilst still having programmability (e.g. like there are so many variations of ARM out there: some with FP, some without, some with NEON, some without). I hope the right kind of SIMD can close the gap ('each instruction doing a load of multiply-accumulates').
Last edited by dobkeratops on Thu Apr 06, 2017 5:29 am, edited 1 time in total.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: Google TPU details, comparison

Postby jar » Thu Apr 06, 2017 1:35 am

It appears that one of the key efficiencies of the TPU is running in batch. The neural networks are too large to fit all the weights within the 28 MB of on-chip SRAM, so to increase on-chip weight reuse they run many inputs in a batch and store the intermediate/partial results before loading the next set of network weights. You can do this on other architectures, but apparently not within their latency requirements.
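
The weight-reuse argument can be sketched with a toy cost model (everything here, including the per-batch "weight load" counter, is my own illustration, not the TPU's actual scheduling):

```python
import numpy as np

def run_batched(layers, inputs, batch):
    """Process inputs in batches so each layer's weights are fetched
    once per batch rather than once per sample (illustrative cost model)."""
    weight_loads = 0
    outs = []
    for i in range(0, len(inputs), batch):
        x = np.stack(inputs[i:i + batch])
        for w in layers:
            weight_loads += 1          # one (expensive) weight fetch per layer per batch
            x = np.maximum(x @ w, 0)   # matmul + ReLU across the whole batch
        outs.extend(x)
    return outs, weight_loads

layers = [np.eye(3), np.eye(3)]        # two dummy layers
inputs = [np.ones(3)] * 8              # eight samples
_, loads = run_batched(layers, inputs, batch=4)
print(loads)  # 2 batches x 2 layers = 4 weight loads, vs 16 with batch=1
```

Larger batches amortize the weight traffic but add latency, which is presumably where their latency requirement bites.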

It seems they're still using GPUs for training and then quantizing the results to 8 bits. IBM has taken the strategy of constrain-then-train for TrueNorth. This strategy seems to yield better results for TN, so maybe some software improvements can be made with regard to the TPU.
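
The train-then-quantize step mentioned above amounts to mapping float weights onto an int8 grid after training. A minimal sketch, assuming simple symmetric linear quantization (one scheme among several; the paper doesn't specify Google's exact recipe):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of float weights to int8:
    pick a scale so the largest magnitude maps to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.0, 0.25], dtype=np.float32)
q, s = quantize_int8(w)
print(q)  # [  64 -127   32]
# q * s approximately recovers w; the rounding error is the quantization cost
```

Constrain-then-train instead bakes the low-precision constraint into training, so the network learns around the rounding error rather than absorbing it afterwards.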

There's no doubt that lower precision than 32 bits is useful. I'm not sure how Google arrived at 8 bits rather than anything else.

Is there any chance we will ever see these commercially available? I was surprised that they even published the results today.

There is a technique to perform 4 synaptic operations (SOPs) per 3 clocks for 1 bit neural networks using instruction-level parallelism with the E32 ISA. Just for comparison, TN achieves 268 GSOPs in ~300 mW (but the host is higher power). The TPU is 92 TOPS (8-bit) in ~384W. SOPS and TOPS aren't comparable here, but I did it anyway.
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Google TPU details, comparison

Postby dobkeratops » Thu Apr 06, 2017 5:42 am

jar wrote:
There is a technique to perform 4 synaptic operations (SOPs) per 3 clocks for 1 bit neural networks using instruction-level parallelism with the E32 ISA. Just for comparison,


Sounds interesting (I'm guessing what you might have been up to from the other thread..). I'm hopeful the E5 has popcount, which might accelerate that further.

jar wrote: TN achieves 268 GSOPs in ~300 mW (but the host is higher power). The TPU is 92 TOPS (8-bit) in ~384W. SOPS and TOPS aren't comparable here, but I did it anyway.


Yes; if the host has to supply the instruction stream and swap data in and out, I think you have to consider the ensemble.
I also suspect full programmability will give other options, like sparsity (via indexing) and symmetry / rotations.
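
The sparsity-via-indexing point is the kind of thing a fixed matrix unit can't exploit but a programmable core can: store only nonzero weights with their indices and skip the zero multiplies. A minimal sketch (the index/value layout is my own illustration):

```python
def sparse_dot(idx, vals, dense):
    """Dot product of a sparse vector (parallel index/value lists)
    with a dense vector -- only nonzero entries cost a multiply-accumulate."""
    return sum(v * dense[i] for i, v in zip(idx, vals))

# the vector [0, 3, 0, -2, 0] stored sparsely: indices [1, 3], values [3, -2]
print(sparse_dot([1, 3], [3.0, -2.0], [1.0, 2.0, 3.0, 4.0, 5.0]))  # 3*2 + (-2)*4 = -2.0
```

With mostly-zero weights, the work scales with the nonzero count instead of the full matrix size.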

I'm sure the E5 could function with a very simple host. Even if the TPU outperforms it for NNs, the E5 would still be a better choice for drones and AI work (training..)

