Epiphany-V: A 1024-core 64-bit RISC processor


I am happy to report that we have successfully taped out a 1024-core Epiphany-V RISC processor chip at 16nm. The chip has 4.5 Billion transistors, 36% more than Apple’s latest 4-core A10 processor at roughly the same die size. Compared to leading HPC processors, the chip demonstrates an 80x advantage in processor density and a 3.6x advantage in memory density.

Epiphany-V Summary:

  • 1024 64-bit RISC processors
  • 64-bit memory architecture
  • 64-bit and 32-bit IEEE floating point support
  • 64 MB of distributed on-chip SRAM
  • 1024 programmable I/O signals
  • Three 136-bit wide 2D mesh NOCs
  • 2052 separate power domains
  • Support for up to One Billion shared memory processors
  • Support for up to One Petabyte of shared memory
  • Binary compatibility with Epiphany III/IV chips
  • Custom ISA extensions for deep learning, communication, and cryptography
  • TSMC 16FF process
  • 4.56 Billion transistors, 117mm^2 silicon area
  • DARPA funded
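
The scalability claims in the list are self-consistent if one assumes a flat address map in which the upper address bits select a core and the lower bits address that core's local 1MB slice, the scheme used by earlier Epiphany generations. The sketch below uses assumed field widths for illustration; the technical report is the authoritative source for the actual Epiphany-V layout.

```python
# Back-of-the-envelope check of the scalability claims, assuming a flat
# address map: upper bits = core ID, lower bits = offset into that core's
# local 1MB slice. The exact Epiphany-V field widths are an assumption.

CORE_ID_BITS = 30   # 2**30 cores, just over one billion processors
LOCAL_BITS = 20     # 2**20 bytes = 1MB address slice per core

max_cores = 2 ** CORE_ID_BITS
total_addressable = max_cores * 2 ** LOCAL_BITS  # bytes of shared memory

def core_of(addr):
    """Core ID encoded in the upper bits of a flat address (hypothetical layout)."""
    return addr >> LOCAL_BITS

def local_offset(addr):
    """Offset within the selected core's local SRAM slice."""
    return addr & ((1 << LOCAL_BITS) - 1)

print(f"max cores: {max_cores:,}")                         # 1,073,741,824
print(f"shared memory: {total_addressable // 2**50} PiB")  # 1 PiB
```

With 30 bits of core ID and 20 bits of local offset, the map covers just over one billion cores and exactly one pebibyte of shared memory, matching the two claims above.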

Chips will come back from TSMC in 4-5 months. We will not disclose final power and frequency numbers until silicon returns, but based on simulations we can confirm that they should be in line with the 64-core Epiphany-IV chip adjusted for process shrink, core count, and feature changes. For more information, see the report below:

Epiphany-V Technical Report

Cheers,

Andreas


This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

60 Comments

  • Deano says:

    Impressive achievement from such a small team!

    I’d love to hear more about the deep learning ISA extensions; potentially a killer application for such a chip.

    Good luck with the phase from tape-out to working chips in your hand.

  • Petar says:

    just one word: WOW

  • Stéphane Zuckerman says:

    Congratulations! I hope I’ll be able to get my hands on this very very VERY soon. 🙂

  • valache says:

    I would like to have one in the first lot…

  • Richard G Wiater says:

    How much will the Epiphany-V 1024-core 64-bit RISC processor cost?

  • Kevin Leary says:

    Congrats. Impressive.

  • Mx says:

    Will it support the Arduino IDE?

  • dspx90 says:

    Is there any pricing out yet? I would like to have one for graphics development, so I hope it’s not top-notch pricing.

  • Arumaraj says:

    So happy to hear this +

  • Witek says:

    Finally!

    Congratulations. Awesome job.

    Doubling the local SRAM from 32KB to 64KB per core is a really good thing. But I also hope bigger programs/kernels can be loaded into a few neighbours and used as a “mini cluster” of semi-local memory. It would be nice to have a separate network for that (other than the one used for off-chip communication).

  • Witek says:

    Assuming 1GHz for the I/O clock, 192 bytes/cycle translates to 192GB/s of total external bandwidth, or about 179MiB/s per core. In big systems I would expect half of the I/O to be used for memory and half for interconnecting other CPUs, giving roughly 90MiB/s per core on average, which isn’t very high for memory-intensive applications (machine learning, for example).

    I know 1000+ I/O pins is a lot, but it might still not be enough.
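
The arithmetic in this comment is easy to reproduce; a quick sketch, noting that the 1GHz I/O clock and the 50/50 split between memory and inter-chip links are the commenter's assumptions, not disclosed specifications:

```python
# Reproducing the per-core external bandwidth estimate. The 1GHz I/O
# clock and the 50/50 memory/interconnect split are assumptions.

IO_BYTES_PER_CYCLE = 192
IO_CLOCK_HZ = 1e9
CORES = 1024

total_bw = IO_BYTES_PER_CYCLE * IO_CLOCK_HZ   # aggregate external bandwidth
per_core = total_bw / CORES / 2 ** 20         # MiB/s per core
memory_share = per_core / 2                   # half of the I/O for memory

print(f"total: {total_bw / 1e9:.0f} GB/s")       # 192 GB/s
print(f"per core: {per_core:.0f} MiB/s")         # 179 MiB/s
print(f"memory only: {memory_share:.0f} MiB/s")  # 89 MiB/s
```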

  • saeed says:

    it is good!!!!!

  • saeed says:

    it is good!!

  • Robert Fontaine says:

    Congratulations! Will you continue to provide an FPGA alongside your lovely little chips?

    You may have been a bit too forward-thinking with the Parallella, but it seems that the HPC community and even big data are catching on to the fact that some of their algorithms can be fundamentally more efficient on a slow FPGA than on a fast CPU/GPU.

    Well done!

  • Nancy says:

    Congratulations! Way to go!

  • Gregory Fowler says:

    Congratulations!!! This is an outstanding achievement!! Well done!!!

  • roberto says:

    After so much time I no longer gave a dime for the possibility of seeing this 1024-core chip, and I had marked Parallella as “good effort, but failed”.
    I am *SO* happy to see I was wrong.

    chapeau!

  • Jonah Probell says:

    Impressive. Congratulations.

  • Maximiliam Luppe says:

    I’m very happy to be part of this achievement as a Kickstarter backer. Congratulations!

  • Brian Miller says:

    I am backer #2262. When I received my cluster, I was expecting to be able to buy the 64-core variant later on. And so I have waited for over two years. Now that you’ve announced the 1024-core Epiphany V, I look on with enthusiasm. However, since I can’t buy the 64-core variant, I’m sure I won’t be able to purchase this.

    How do I feel? I am on the outside of the candy store looking in. Back in 2007, Intel showed off an 80-core CPU. And I thought to myself, it would be so great to work with that. After all, I had worked with the Celerity, the first minicomputer to use a 32-bit microprocessor (two processor boards, each with two FPUs and an integer coprocessor). And so here comes the Parallella, and I thought to myself, “Yeah! This will be fun!” And … No interconnects. No 64-core, except for the initial Kickstarter. Supercomputer.io came and went.

    Yes, I have Pi’s, I have the Nvidia Jetson TK1, and I have other small strange things. I’ve supported Seti@Home and BOINC. Even though over 10,000 boards have shipped (how many now?), it’s still as if there’s a chicken-and-egg problem.

  • daniel says:

    Awesome, Congratulations!

    Any plans to deliver a Parallella around this new chip?

  • Thomas says:

    Promises are nice. I hope this will be a stable product this time… I burned my hand designing with the 64-core version when it went EOL before it got into mass production…

  • Santosh says:

    Fantastic……can’t wait to get hold of this

  • Jack says:

    Checking the blog for months waiting for the big news! =)

    Awesome, but don’t forget to add a decent memory controller on your next board.
    All competing boards either have a massive amount of fixed memory or SODIMM slots alongside their CPUs.
    It would be awesome to have this amount of computational resources and a place to park a ton of data.

  • edward says:

    This is a huge step forward for the Parallella project, as the smaller chip didn’t really offer much of a difference relative to Intel’s offerings, which go up to around 20 cores; 64 limited-power cores vs. 20 full-power cores was not a compelling proposition. What this chip finally offers is a resounding price/performance advantage over Intel’s rather expensive “big iron”. For example, in the press release for Intel’s E7-8890 v4 chip (which costs over $12k per chip), they talk about 8-socket boards with 196 cores that “start at $200k”. If you imagine 10 of the Parallella 1024-core chips, suddenly you have something very competitive: a 10,000-core system at under $20k would be a boon to research everywhere.

    Really, at this stage of the evolution of massively parallel machines, the key thing is to get these systems into the hands of graduate schools around the world, so that new programming languages can be worked on. Tucker Taft’s parallel language project, ParaSail, hasn’t gotten much traction because hardly anyone has 1000 cores to program, so I am hoping the best for this new chip. The Nvidia massive-core systems are not easy to program, having been retrofitted from 3D graphics chips, and they present many weird quirks to the programmer. The Parallella architecture is much cleaner and tremendously simpler.

    It remains to be seen how many algorithms can live within 64MB, but I suspect that the doubling of the RAM in this 5th-gen chip will do the trick for most people. Once you start manipulating images, the memory gets eaten up fast. This whole process is going to take years to play out, but it is a great step forward.

  • bob says:

    I am a bit worried that the money for development came from DARPA; they are well known not to be a charity, and everything they do is because they have some “grey” plans for it. On the other side, the community was not able to provide all the money needed, so the “necessary evil” had to be accepted by Parallella. All in all, this 1024-core chip seems to be “the next big thing” in the “number crunching for everyone” field, and we may come to count the time BEFORE it and AFTER it, as we started to do in 2012 with the Raspberry Pi in the embedded-systems field. The small size, low energy, and 16nm are good; the small amount of RAM is a question mark for many kinds of calculation, but it is a good beginning. The scaling of efficiency is also unknown; it depends on many factors (type of algorithm, data to be moved inside the mesh network, etc.), and only testing in the field will answer that question. What we need to know now is the price of the chip: at $1000 it is not a democratization of massive calculation; at $100 it is. Of course Parallella must get back, as a reward, all the effort they put into the project. I hope Parallella will strike the right balance between reward and attention to the community when they set the final price, and I hope that with part of the revenue from selling the 1024-core chip they can autonomously develop a Parallella-VI with 16384 cores and 1MB of RAM dedicated to every single core 🙂

    We can speculate a bit: if the 64-core Parallella-IV at 28nm ran at 500MHz, 1024 cores at 16nm should reach 800-900MHz. For sure it will need at least a heatsink, and in the worst scenario also a fan. Aside from the single-chip board you will make, are any luxury versions planned, e.g. a mainboard with 4 sockets for Parallella-V S.O.M.s, to build a system scalable from 1K to 4K (or more) cores?

    I hope we will get some advance news before spring 2017.

  • Andy says:

    If it has “extensions” for deep learning (I’m not a hardware/chip design person myself), how will programmers make use of them? I have searched for ways to use the Epiphany with TensorFlow or Theano, but have gotten nowhere. At least for me, developing machine learning models in C++ on bare metal is not an option.

  • Mike Ross says:

    I share bob’s concerns over the DARPA funding, and also daniel’s question about a new board based on this chip. If this announcement is for real, then a board based on this chip would be simply astonishing. The potential for advanced algorithm development would be phenomenal, especially if it could be made low-cost and affordable. However, I really doubt that DARPA or the NSA would like to see this technology in the public domain; I suspect that any production version made available to the general public would be scaled down in some way. Anyway, I don’t get too excited about these announcements anymore. I still remember when they said 64-core boards would be available, but it seems only a few got those; the rest of us had to make do with 18. Sorry to be a killjoy, folks, but all this just sounds too good to be true.

  • bob says:

    mike ross, you are right. Too often I have heard “we will do” instead of “we already did”; too often people promise “miracles” (I’m not referring to Parallella, just talking in general) that fail in reality. We must be realistic: let’s wait for the chip to arrive, wait until the board is ready, wait until the benchmarks are done. Only then can we say “it is a success”.

  • Neel Gupta says:

    Finally !
    So, when can we buy it ?

  • Neel Gupta says:

    wait… “DARPA funded” ?
    What would be the repercussions of that ?
    Will we actually be able to buy it ?
    Will it have backdoors, like all Microsoft products ?

  • kamikaze says:

    no real board in stores – no trust

  • […] McKee). This work continued in 2016 as we needed a way to validate our design decisions for the 1024-core Epiphany-V.  Debugging with the simulator is an order of magnitude easier than with hardware, so you should […]

  • jeff says:

    wow, great job

  • James Preisig says:

    Andreas,

    I work at another end of the spectrum from what this chip seems to be designed for (TFLOPS of processing power). My systems are embedded, and I need them to come in at about 2 watts for the core processing functions. Within that, I need about 150 to 160 GFLOPS (single-precision floating point) of real processing capability. This may be at the low end of what your new chip can do; if so, I would hope that cores can be disabled to save power when they are not needed. You are calling this an SOC. Will there be a control processor, or even a soft-core processor on an FPGA, that can “run” the system, or will this chip need a companion chip that handles its interface to the outside world?

    Looking forward to seeing the new chip and its performance and power consumption. What is the size of the new chip?

  • James says:

    I really like the improvements. I was expecting Adapteva to go to the full 4096 cores and shrink to 14nm, but that wouldn’t leave much room for debugging, optimisation, and other improvements, due to the substantial increase in cost.

    I’d love to see how well this scales up in performance compared to the previous generation and would be happy to test it 🙂

  • Roger says:

    congrats man. you guys have come a long way and this sounds really impressive.

  • Ali Azarian says:

    Congratulations! That’s great news!

  • Carlos Perez says:

    DARPA funded the internet. They fund a lot of stuff.

    As I understand it, this is 64KB per processor. So that’s 64KB for both code and data?

    It is an interesting architecture, and we will have to warp our minds to figure out how to use it in embedded applications.

    It is not going to work well for training deep learning because of the memory bandwidth bottleneck. However, it should work okay for inference, assuming that we can exploit the estimated 1 teraflops capability.

  • Eugen Leitl says:

    The 64KB of embedded SRAM in each node doesn’t have much of a bottleneck. Accessing remote node memory is penalized by latency commensurate with distance. So this assumes access locality, which is quite often a given for many problems.
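
The distance-dependent cost described in this comment can be sketched with Manhattan-distance hop counts, assuming the 1024 cores are arranged as a 32x32 2D mesh (1024 = 32 x 32) with dimension-ordered routing; the per-hop latency below is a purely illustrative number, not an Epiphany-V specification.

```python
# Sketch of distance-dependent remote-access cost on a 2D mesh NoC.
# MESH_DIM follows from 1024 cores = 32x32; HOP_CYCLES is hypothetical.

MESH_DIM = 32
HOP_CYCLES = 1.5  # hypothetical latency added per router hop

def hops(src, dst):
    """Manhattan distance in router hops between two core indices."""
    sx, sy = src % MESH_DIM, src // MESH_DIM
    dx, dy = dst % MESH_DIM, dst // MESH_DIM
    return abs(sx - dx) + abs(sy - dy)

print(hops(0, 1))     # nearest neighbour: 1 hop
print(hops(0, 1023))  # opposite corner: 62 hops
print(hops(0, 1023) * HOP_CYCLES)  # cycles under the assumed hop cost
```

The 62:1 spread between corner-to-corner and nearest-neighbour traffic is why access locality matters so much on a mesh of this size.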

  • Amit says:

    This might work very well for training deep learning too… just need to push the envelope enough 🙂 Waiting for the actual product that we can buy and experiment with.

  • Victor says:

    Congrats! Anything you can release on the new deep learning capabilities will be widely appreciated.

  • dast says:

    What would the price be!? Can we buy it now!?

  • dast says:

    Where can we buy it now!?

  • Kie says:

    I want one too!!!

  • Traroth says:

    Any news yet?

  • SeyedRamin says:

    Where can we buy it now!? How much does it cost?

  • Tom says:

    Are we there yet? I feel like a 5 year old waiting for this!

  • David says:

    Any chance of getting a status update? Especially on availability and estimated cost?

    It has been 6+ months since the announcement of the tape-out. At the very least, when should we check back for an announcement? Perhaps there is a mailing list we could join to make sure we get the announcement when it comes out?

  • name says:

    So? Almost 7 months have gone by since the announcement!
    No update? No info? Nothing at all?
    Shall we reclassify this post from “good news” to “vaporware”?
    If there is a delay, that can be acceptable, but at least tell us!

  • R.L. Flores says:

    We are finalizing the resurrection of our S&L banking, financial, and investment A.I. engine and application. Our system was developed and implemented using Texas Instruments’ “Explorer” Lisp engine and accelerated by AMD’s 2900 bit-slice processor family. We are building a new engine that will run our inference engine, pattern recognition engine, etc. We’d like to begin design-in of an Epiphany-V cluster of 128/256 and need an availability date. Please advise.

Leave a Reply