Google TPU details, comparison


Postby dobkeratops » Wed Apr 05, 2017 7:08 pm

https://drive.google.com/file/d/0Bx4haf ... xtcEk/view

.. so similarities: 28 MB of on-chip 'software-managed' (scratchpad?) memory.

Differences: they seem to focus exclusively on 8-bit multiplies, and on inference only; it seems far more focused on one algorithm. I would much rather have the versatility of Epiphany-style RISC cores.

So the TPU seems to be some sort of huge 8-bit matrix multiplier with one giant scratchpad? Would the Epiphany's memory architecture still offer an advantage for Convolutional Neural Networks (keeping filters in distinct scratchpads, closer to individual ALUs), plus multi-chip scalability? Maybe they can still keep some coefficients in the matrix array (I haven't quite read the details there..)
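
To make the comparison concrete, here is a minimal sketch of the kind of operation the TPU's matrix unit is built around: an 8-bit matrix multiply with wide (32-bit) accumulation. The function name and shapes are my own illustration, not anything from the paper.

```python
import numpy as np

def matmul_int8(a, b):
    """8-bit matrix multiply with 32-bit accumulation --
    roughly the single operation a big matrix unit is specialized for."""
    assert a.dtype == np.int8 and b.dtype == np.int8
    # Widen before multiplying so the accumulators don't overflow int8.
    return a.astype(np.int32) @ b.astype(np.int32)

a = np.array([[1, -2], [3, 4]], dtype=np.int8)   # activations
w = np.array([[5, 6], [7, 8]], dtype=np.int8)    # weights
print(matmul_int8(a, w))  # [[-9 -10] [43 50]]
```

A programmable many-core design would build the same result out of many small multiply-accumulates, which is where per-core scratchpads holding filter slices could matter.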

Also it seems to be driven by the host, with big CISC instructions (doing entire matrix-multiply ops?)

Epiphany would be equally useful for training, I would guess, which the TPU isn't.

Any details yet on what they meant by the deep-learning instructions in the E5? I would have guessed that means some low-precision support. I would personally be happy even if it just had popcount (for cryptography?), because there are 1-bit neural-net techniques out there. (Does 'communications' also refer to low-precision support, i.e. encoding of signals?)
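
For reference, the 1-bit trick mentioned above reduces a dot product of ±1 vectors to XNOR plus popcount, which is why a popcount instruction matters. A small sketch (the packing convention, bit 1 = +1 and bit 0 = -1, is my own choice for illustration):

```python
def binary_dot(x_bits, w_bits, n):
    """Dot product of two n-element vectors of +/-1 values, each packed
    into an int (bit 1 = +1, bit 0 = -1): XNOR counts matching signs,
    popcount tallies them, and we map back to the +/-1 range."""
    matches = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

# x = [+1, +1, -1, +1] -> 0b1011 (bit 0 first), w = [+1, -1, +1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))  # (+1)(+1) + (+1)(-1) + (-1)(+1) + (+1)(+1) = 0
```

With a hardware popcount, each word-width chunk of this costs only a couple of instructions.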

I suppose this also confirms that a relatively simple software library can be useful; they claim the back end is <1500 lines of code.
This hardware seems limited to a few functions.

Would the TPU have been a simpler chip to design and implement (e.g. more dependent on a host to drive it)? Whilst I can imagine it would outperform the Epiphany in 8-bit matrix performance, I still hope you could take an Epiphany-style design and skew it for different workloads, varying the number of functional units and custom instructions, whilst still having programmability (e.g. like there are so many variations of ARM out there: some with FP, some without, some with NEON, some without). I hope the right kind of SIMD can close the gap ('each instruction doing a load of multiply-accumulates').
Last edited by dobkeratops on Thu Apr 06, 2017 5:29 am, edited 1 time in total.
dobkeratops
 
Posts: 189
Joined: Fri Jun 05, 2015 6:42 pm
Location: uk

Re: Google TPU details, comparison

Postby jar » Thu Apr 06, 2017 1:35 am

It appears that one of the key efficiencies of the TPU is running in batch. The neural networks are too large to fit all the weights within the 28 MB of on-chip SRAM, so to increase on-chip weight reuse they run many inputs in a batch and store the intermediate/partial results before loading the next set of network weights. You can do this on other architectures, but apparently not within their latency requirements.
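
The weight-reuse argument can be sketched with a toy cost model (everything here, including the per-batch "weight load" counter, is my own illustration, not the TPU's actual scheduling):

```python
import numpy as np

def run_batched(layers, inputs, batch):
    """Process inputs in batches so each layer's weights are fetched
    once per batch rather than once per sample (illustrative cost model)."""
    weight_loads = 0
    outs = []
    for i in range(0, len(inputs), batch):
        x = np.stack(inputs[i:i + batch])
        for w in layers:
            weight_loads += 1          # one (expensive) weight fetch per layer per batch
            x = np.maximum(x @ w, 0)   # matmul + ReLU across the whole batch
        outs.extend(x)
    return outs, weight_loads

layers = [np.eye(3), np.eye(3)]        # two dummy layers
inputs = [np.ones(3)] * 8              # eight samples
_, loads = run_batched(layers, inputs, batch=4)
print(loads)  # 2 batches x 2 layers = 4 weight loads, vs 16 with batch=1
```

Larger batches amortize the weight traffic but add latency, which is presumably where their latency requirement bites.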

It seems they're still using GPUs for training and then quantizing the results to 8 bits. IBM has taken the strategy of constrain-then-train for TrueNorth. This strategy seems to yield better results for TN, so maybe some software improvements can be made with regard to the TPU.
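
The train-then-quantize step mentioned above amounts to mapping float weights onto an int8 grid after training. A minimal sketch, assuming simple symmetric linear quantization (one scheme among several; the paper doesn't specify Google's exact recipe):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of float weights to int8:
    pick a scale so the largest magnitude maps to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.0, 0.25], dtype=np.float32)
q, s = quantize_int8(w)
print(q)  # [  64 -127   32]
# q * s approximately recovers w; the rounding error is the quantization cost
```

Constrain-then-train instead bakes the low-precision constraint into training, so the network learns around the rounding error rather than absorbing it afterwards.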

There's no doubt that lower precision than 32 bits is useful. I'm not sure how Google arrived at 8 bits rather than anything else.

Is there any chance we will ever see these commercially available? I was surprised that they even published the results today.

There is a technique to perform 4 synaptic operations (SOPs) per 3 clocks for 1 bit neural networks using instruction-level parallelism with the E32 ISA. Just for comparison, TN achieves 268 GSOPs in ~300 mW (but the host is higher power). The TPU is 92 TOPS (8-bit) in ~384W. SOPS and TOPS aren't comparable here, but I did it anyway.
jar
 
Posts: 295
Joined: Mon Dec 17, 2012 3:27 am

Re: Google TPU details, comparison

Postby dobkeratops » Thu Apr 06, 2017 5:42 am

jar wrote:
There is a technique to perform 4 synaptic operations (SOPs) per 3 clocks for 1 bit neural networks using instruction-level parallelism with the E32 ISA. Just for comparison,


Sounds interesting (I'm guessing what you might have been up to from the other thread..). I'm hopeful the E5 has popcount, which might accelerate that further.

jar wrote: TN achieves 268 GSOPs in ~300 mW (but the host is higher power). The TPU is 92 TOPS (8-bit) in ~384W. SOPS and TOPS aren't comparable here, but I did it anyway.


Yes; if the host has to supply the instruction stream and swap data in and out, I think you have to consider the ensemble.
I also suspect full programmability will give other options, like sparsity (via indexing) and symmetry / rotations.
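
The sparsity-via-indexing point is the kind of thing a fixed matrix unit can't exploit but a programmable core can: store only nonzero weights with their indices and skip the zero multiplies. A minimal sketch (the index/value layout is my own illustration):

```python
def sparse_dot(idx, vals, dense):
    """Dot product of a sparse vector (parallel index/value lists)
    with a dense vector -- only nonzero entries cost a multiply-accumulate."""
    return sum(v * dense[i] for i, v in zip(idx, vals))

# the vector [0, 3, 0, -2, 0] stored sparsely: indices [1, 3], values [3, -2]
print(sparse_dot([1, 3], [3.0, -2.0], [1.0, 2.0, 3.0, 4.0, 5.0]))  # 3*2 + (-2)*4 = -2.0
```

With mostly-zero weights, the work scales with the nonzero count instead of the full matrix size.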

I'm sure the E5 could function with a very simple host. Even if the TPU outperforms it for NNs, the E5 would still be a better choice for drones and AI work (training..)

