Parallel Lines

Here are some parallel lines of thought that will be converging on the horizon for me over the next week:
– Moore’s Law, and 1995-2005 versus 2005-2015
– CPUs versus GPUs, and a surprising return to 16 bit floating point numbers
– Perils of parallelism versus Scala and Golang FTW
– Blockchains and Bitcoin
– Deep Learning Neural Networks
– the True North, strong and free
– NVIDIA’s GTC 2018

Let’s see if I can connect them together logically in this article, for you!

Do you remember Moore’s Law? It used to deliver such huge speed increases. You really had to buy a new computer every couple of years because your old one could not run the new programs. Ah. Good times.

CPUs — the big brains of our computers (at least until recently)

If you bought a really great computer in 1995 you might have a single-core, 32bit Central Processing Unit (CPU), with perhaps 3 million transistors, and clocked at about 100MHz. That’s 100 million cycles per second. Back in the day, that was cherry.

Moore’s Law told us to expect double the number of transistors about every 2 years. From experience we expected about that much improvement in performance as well, because as the transistors shrank they would also run faster. Now let’s look ahead 10 years and see what we get. Doubling every 2 years means five doublings in a decade, so we should see about 2 x 2 x 2 x 2 x 2 = 32 times the number of transistors, and we generally hoped for a similar improvement in performance.

Ten years later, in 2005, you might have a single-core 64bit CPU, with perhaps 125 million transistors, clocked at about 3GHz. That’s 3,000 million cycles per second. So in ten years the word size doubled, the number of transistors increased about 40 times, and the clock speed increased 30 times. It’s still a single-core CPU, so it’s single-threaded at the hardware layer, but the larger word size improved performance on some tasks more than the clock rate improvement would suggest. So Moore’s Law held pretty well through that decade.

Now let’s look at the following ten years. In 2015, most CPUs were still 64 bit, but they usually had multiple execution cores (e.g., 4 or 8) and the number of transistors had grown to perhaps 4 billion. However, the clock rate was still only around 3.6GHz. Wait. What? In this second 10 year period, even though the number of transistors on our CPUs again increased by about 32 times, the clock rate only nudged forward by about 20%. That’s 20 percent over a whole decade!

Even though Moore’s Law held true, performance increases were disappointing. What happened? The laws of physics started to intrude. As transistors shrank below various thresholds, new design challenges presented themselves, and major leaps forward have been elusive.

There is general acknowledgement that we need to look elsewhere for CPU speed increases. On modern CPUs, hyper-threading and multiple cores have kept us going up, up, up. However, let’s look at what’s going on elsewhere, with Graphics Processing Units (GPUs). GPUs are the chips that traditionally have been partnered with CPUs to drive the graphical displays attached to our computers. Computer gaming and graphical design applications have contributed to a lot of advancement in GPU chips. Recently though, two completely new applications are driving demand for GPUs and funding their performance increases. What are those new applications? I’ll get to that later. For now, let’s look at how GPUs are outperforming CPUs.

GPUs (based on just one company, NVIDIA)

Note that I decided to only talk about NVIDIA GPUs here (for a couple of reasons that I promise to get to a bit later in the article).

In 1995, NVIDIA launched their first product, the STG-2000 (Diamond Edge). These were the very early days of IBM PC (and compatible hardware) graphics cards. It was clocked at 12MHz, and had a single vertex-pixel pipeline. It could compute about 12 million floating point operations per second (12 MFLOPS).

Fast forward to 2005 and we see NVIDIA GeForce 6 series devices clocking at perhaps 500MHz, with 16 pixel shaders and 6 vertex shaders operating in parallel. Similar to CPU clock rate growth during the period, NVIDIA’s GPU clock rate increased about 40 times. Additionally, the number of parallel operations increased over 20 times as well. It could compute about 7 billion (billion is “giga”) floating point operations per second (7 GFLOPS). That is, its performance improved by a factor of almost 600, which is much better than the CPU performance gains in the period.

By 2015 NVIDIA was selling the GeForce 900 series. Clock speeds were then around 1GHz. The number of cores, though, had jumped to over 3000 vertex/pixel shader processors (CUDA cores). At this point GPUs had become quite sharply distinguished from CPUs. Their clock rates were only about a third of CPU clock rates, but GPUs were capable of a couple of orders of magnitude more parallelism. These NVIDIA GPUs could compute about 9 trillion floating point operations per second (9 TFLOPS). That is, while CPU performance gains were modest over this ten year period (mostly coming from additional cores adding parallelism), GPU performance again improved enormously, by a factor of over 1000. That’s a 20% gain versus a gain of roughly 100,000%.
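The GPU growth factors can be worked out the same way. This is just a sketch over the approximate FLOPS figures I quoted above; growthFactor and percentGain are illustrative names, not anything from NVIDIA.

```go
package main

import "fmt"

// Approximate peak throughput figures quoted in the article, in FLOPS.
const (
	flops1995 = 12e6  // 12 MFLOPS
	flops2005 = 7e9   // 7 GFLOPS
	flops2015 = 9e12  // 9 TFLOPS
)

// growthFactor returns how many times larger `after` is than `before`.
func growthFactor(before, after float64) float64 {
	return after / before
}

// percentGain converts a growth factor into a percentage gain,
// e.g. a factor of 1.2 is a 20% gain.
func percentGain(factor float64) float64 {
	return (factor - 1) * 100
}

func main() {
	first := growthFactor(flops1995, flops2005)
	second := growthFactor(flops2005, flops2015)
	fmt.Printf("1995-2005: %.0fx\n", first)
	fmt.Printf("2005-2015: %.0fx (%.0f%% gain)\n", second, percentGain(second))
	fmt.Printf("CPU clock 2005-2015: %.0f%% gain\n", percentGain(3.6/3.0))
}
```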

For further reading
NVIDIA GPUs over time

The newest NVIDIA Volta chips appearing now (2018) claim to be able to deliver 125 TFLOPS (an order of magnitude greater than the GeForce 900 series in 2015) from their tensor cores, which are matrix multipliers that take 16 bit floating point inputs and accumulate the results in 16 or 32 bits. But wait, why 16 bit floating point numbers? That’s taking us back almost to the dawn of the personal computer era. What use are those imprecise calculations? Well, here’s where it gets interesting. Remember that I mentioned there are some new kids on the block, driving demand for GPUs? One of those new kids needs very little precision at all. More about that, soon.

Parallelism is hard

I hope you are starting to accept that to get better performance we really need to start getting our programs doing multiple things at once. Unfortunately most of the computer programming languages in widespread use today were designed in an era when CPUs were single-core, and most (not all) algorithms ran sequentially. Most of the computer programmers working today also came from that era. Yeah. I’m one of them.

It’s easy to write sequential programs:
Do this first.
Do this next.
Do this N times.
If this is true do such-and-such, otherwise do something else.

Further, it’s also often easy to be complacent about software performance when the computer you buy in two years’ time will be twice as fast as the one you are using to write the program today. But that’s just not the case any more. The easiest way to squeeze out more performance today is to have your code do multiple things in parallel. But there’s a big gotcha in doing that.

Humans think mostly sequentially. Some of us even have trouble chewing gum while walking. Try patting your head, while rubbing your tummy, and dancing. Or try juggling 3 objects of different shapes and weights. Or 5 objects. Of course, some of us can manage parallelism better than others.

I taught Computer Science at the University of Victoria in Canada for 13 years early in my career, and I have led many software development teams since then. I can tell you that in general, designing correct parallel execution algorithms is difficult for most programmers.

Sharing information between multiple threads of execution often requires keen insight into the working of the program, careful planning and execution, and usually elaborate precautions to prevent problems. These things are often problematic, especially for beginners, or anyone unfamiliar with safe parallel programming idioms. Googling for information about common parallel programming problems like “race conditions” and “deadlock” will fill your reading list for weeks. Parallel programming can also be difficult when older programming languages provide only crude support for parallelism.
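To make the “elaborate precautions” concrete, here is a minimal Go sketch of the classic shared counter. Many threads of execution add to one variable, and a mutex guards each update. The function name countWithMutex is my own invention for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// countWithMutex starts `workers` goroutines that each add 1 to a shared
// counter `perWorker` times, guarding every update with a mutex.
func countWithMutex(workers, perWorker int) int {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		counter int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				mu.Lock()
				// counter++ is a read, an add and a write. Without the lock,
				// two goroutines can read the same old value and one of the
				// two updates is lost: that is a race condition.
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait() // block until every goroutine has finished
	return counter
}

func main() {
	fmt.Println(countWithMutex(100, 1000)) // prints 100000
}
```

If you delete the Lock and Unlock calls, the program still compiles and often prints a number a little below 100000, and a different one on each run. That is exactly the kind of intermittent problem that is so hard to reproduce.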

For further reading
Fun intro to race conditions and deadlock

Without appropriate development tools and methodologies, adopting parallelism will often introduce reliability issues. For example, it may cause intermittent problems that are difficult or impossible for developers to detect, reproduce, or reliably fix. Parallel programs have problems like that when things being done in parallel happen at slightly different times, and events take place in subtly different orders than they did during testing.

Let’s take a slight detour into modern parallel programming languages now…

Modern languages make parallel programming easier

Scala, for example, encourages a functional programming paradigm where data is immutable. Data that never changes can be shared between threads without locks, which removes the possibility of race conditions (and of the deadlocks that locking can cause) on those data elements. Unfortunately developers trained to use the much more common imperative programming paradigm (e.g., C, C++, Java, Python, etc.) often find it difficult to transition to the functional programming paradigm.

Go (or “golang”, to make it easier to Google) is a modern imperative programming language designed from the beginning to support parallel programming. Go has a syntax in the C family of languages. If there are any old-timers reading this, its declarations will also remind you of Pascal or Modula-2 (these once popular languages are now mostly dead). Go has a few features that make it a pleasure to use for parallel development. For example, goroutines simplify the creation of new execution threads in your code. These threads are more efficient to start up than the underlying operating system’s threads, and they can be scheduled on multiple threads in the underlying OS. Go channels enable safe communication between execution threads without the complexity of hand-written synchronization using older tools like mutex locks. The Go select statement is another very powerful tool that simplifies synchronization of multiple threads. Go also provides a great learning and sharing tool called the Go Playground.

It’s usually easy for traditional imperative programmers to transition to using the Go language, although you will need to spend much more time working with Go before you will master it. I want to emphasize that the programming idioms Go encourages help to prevent some of the more common disasters of parallel programming. I encourage you to try it out and I plan to write more about Go on this site someday.

Can we get back to who those new kids on the block are?

Okay, well, one set of those new kids on the block is better described as being on the block-chain! I’m talking about blockchain miners. What’s a blockchain and what’s a blockchain miner? A blockchain is a distributed consensus ledger. You can use it like a database, but there is no single central authority holding the credentials for the database. Instead, it is completely distributed and the only source of truth in the data is based upon the current consensus of the blockchain participants. Only blockchain miners are able to write data to the blockchain, and there are elaborate protocols in place to decide which miners can write and when, in order for all of the blockchain participants to add the newly written material into their shared consensus. Suffice to say that whichever miner can solve a large amount of computation fastest is most likely to be able to write. And that makes the latest, fastest GPUs very attractive to blockchain miners.
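What does that “large amount of computation” look like? Here is a heavily simplified sketch of a proof-of-work style puzzle: keep trying numbers (nonces) until the hash of the data plus the nonce starts with enough zero bits. This is not Bitcoin’s actual block format or difficulty rule, and the names (mine, hashBlock, leadingZeroBits) are my own.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the zero bits at the start of a hash.
func leadingZeroBits(hash [32]byte) int {
	count := 0
	for _, b := range hash {
		if b == 0 {
			count += 8
			continue
		}
		count += bits.LeadingZeros8(b)
		break
	}
	return count
}

// hashBlock hashes the block data together with a candidate nonce.
func hashBlock(data string, nonce uint64) [32]byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, nonce)
	return sha256.Sum256(append([]byte(data), buf...))
}

// mine tries nonces one after another until the hash has at least
// `difficulty` leading zero bits. Each extra bit doubles the expected work.
func mine(data string, difficulty int) uint64 {
	for nonce := uint64(0); ; nonce++ {
		if leadingZeroBits(hashBlock(data, nonce)) >= difficulty {
			return nonce
		}
	}
}

func main() {
	data := "alice pays bob 5" // hypothetical ledger entry
	nonce := mine(data, 16)
	fmt.Printf("nonce %d gives hash %x\n", nonce, hashBlock(data, nonce))
}
```

Every candidate nonce can be tested independently of every other one, so the search splits perfectly across thousands of cores. That is why this kind of work maps so well onto parallel hardware.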

The most famous blockchain is Bitcoin. With the amazing recent rise in value of Bitcoin, it has become more lucrative to be a Bitcoin blockchain miner. As a result, Bitcoin miners have been buying up large quantities of high-end GPUs to use their incredible computational power. This has driven up the prices and reduced the availability of high-end graphics cards.

I should add that Bitcoin is not the only blockchain doing this. It’s just the first and most famous. There are other (newer) blockchain technologies that rely upon the same concept of “proof of work” so they have computationally intensive miners as well — Ethereum for example. Ethereum is cool because code can run on its blockchain! That is, you can store a smart contract object, with state on the Ethereum blockchain, and then methods in the smart contract can update the state of that object right on the blockchain. Bitcoin has much more awkward ways to accomplish similar things.

Further reading
A great book on how blockchains (not just Bitcoin) work
GPU shortage explanation

And the other new kid? Who is that?

The other new kid on the block, who is also eager for high performance GPUs, is Machine Learning (ML), and in particular a type of software design called a Neural Network (NN), especially the Deep Learning (DL) variety. DL has enabled huge strides forward in Artificial Intelligence (AI) over the last 5 or 6 years, in no small part due to the hardware advances in GPUs. Generally DL does not require high precision floating point arithmetic. What it does require is large amounts of parallelism. As we discussed above, GPUs have come to excel at parallelism, and the most recent GPUs have been able to improve upon that even more by reducing the size of their floating point numbers to 16 bits. This is perfect for DL.

DL NNs are designed as crude approximations of human brains, by simulating individual neurons, lots of them. Input data sources are attached to the input (simulated) neurons in the NN. In turn, the input neurons are connected to many other (simulated) neurons which are connected to many other (simulated) neurons and so on, layer upon layer, many layers deep. Eventually the final output layer of simulated neurons sums up what has been learned in the network. Each simulated neuron between the input layer and the output layer takes input from a bunch of neurons, weighs each of them individually, then renders its judgement on to the next layer. These NNs are not usually explicitly programmed with this myriad of weights at every connection. Instead, these NNs are trained.
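A simulated neuron is simpler than it sounds. Here is a minimal sketch of one neuron and one layer in Go. The sigmoid “squashing” function and all of the weights are arbitrary choices of mine for illustration, not values from any real network.

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid squashes any number into the range (0, 1).
func sigmoid(z float64) float64 {
	return 1 / (1 + math.Exp(-z))
}

// neuron weighs each input individually, adds a bias, and passes its
// "judgement" (a number between 0 and 1) on to the next layer.
func neuron(inputs, weights []float64, bias float64) float64 {
	sum := bias
	for i, x := range inputs {
		sum += weights[i] * x
	}
	return sigmoid(sum)
}

// layer feeds the same inputs to every neuron in the layer.
// weights[j] holds the input weights for neuron j.
func layer(inputs []float64, weights [][]float64, biases []float64) []float64 {
	out := make([]float64, len(weights))
	for j := range weights {
		out[j] = neuron(inputs, weights[j], biases[j])
	}
	return out
}

func main() {
	inputs := []float64{0.5, -1.0}

	// A hidden layer of 3 neurons, then an output layer of 1 neuron.
	hidden := layer(inputs,
		[][]float64{{0.8, 0.2}, {-0.4, 0.9}, {0.3, 0.3}},
		[]float64{0, 0.1, -0.2})
	output := layer(hidden,
		[][]float64{{1.0, -1.0, 0.5}},
		[]float64{0})

	fmt.Println("hidden:", hidden)
	fmt.Println("output:", output)
}
```

Every neuron in a layer can be computed independently of its neighbors, and the whole layer is really one matrix multiplication. That is where the parallelism comes from.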

Training involves presenting input data to the NN, letting it produce output, and then reinforcing appropriate output. When an output is reinforced, it is back propagated in reverse through the NN reinforcing the weights of neurons that contributed to the correct rendering and reducing the weights of those that did not. In this way, through training over repeated trials, the weights are refined and reliability improves. Large datasets (consisting of pairs of input data sets, and desired output states) are generally developed and curated by humans for training the NN.
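Here is what training looks like in the smallest case I can think of: a single simulated neuron learning the logical OR function from four examples. This is a sketch of the weight-nudging idea only. Real back propagation pushes the same kind of correction backwards through many layers, and the learning rate and epoch count here are arbitrary choices of mine.

```go
package main

import (
	"fmt"
	"math"
)

// example pairs some input data with the desired output.
type example struct {
	inputs []float64
	target float64
}

// orGate is a tiny hand-curated training set: the logical OR function.
var orGate = []example{
	{[]float64{0, 0}, 0},
	{[]float64{0, 1}, 1},
	{[]float64{1, 0}, 1},
	{[]float64{1, 1}, 1},
}

func sigmoid(z float64) float64 {
	return 1 / (1 + math.Exp(-z))
}

// predict computes the neuron's output for one set of inputs.
func predict(weights []float64, bias float64, inputs []float64) float64 {
	sum := bias
	for i, x := range inputs {
		sum += weights[i] * x
	}
	return sigmoid(sum)
}

// train repeatedly shows every example to the neuron and nudges each
// weight in the direction that shrinks the error on that example.
func train(data []example, epochs int, rate float64) ([]float64, float64) {
	weights := make([]float64, len(data[0].inputs))
	bias := 0.0
	for e := 0; e < epochs; e++ {
		for _, ex := range data {
			delta := ex.target - predict(weights, bias, ex.inputs)
			for i, x := range ex.inputs {
				weights[i] += rate * delta * x
			}
			bias += rate * delta
		}
	}
	return weights, bias
}

func main() {
	weights, bias := train(orGate, 5000, 0.5)
	for _, ex := range orGate {
		fmt.Printf("%v -> %.3f (want %v)\n",
			ex.inputs, predict(weights, bias, ex.inputs), ex.target)
	}
}
```

Nobody typed in the final weights. They started at zero and emerged from repeated trials.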

Notice that you don’t exactly program DL NNs. You curate training sets and you train them. In fact, we don’t generally know how or why they make the decisions they make. We don’t know why each simulated synapse has the weight it has, and there are usually a lot of them (e.g., millions). The layers between the input layer and the output layer, where the decision-making occurs, are called “hidden layers” in the literature. This naming highlights our lack of visibility into what is going on inside. Is that important? Well, I have no idea what is going on at the neuron or synapse level inside my son’s brain. Nevertheless I observe his performance, and reward/reinforce appropriate behaviors and punish/reduce any unfortunate behaviors. It’s not important (or feasible) for me to know how or why he does what he does. It’s sufficient that he behaves appropriately. Perhaps appropriate behavior should be a sufficient test for artificial intelligences too? I realize that this makes some people uncomfortable.

ML has been able to accomplish many amazing things in the last few years, from computers that are very good at understanding spoken human language, to self-driving cars. And these successes are poised to accelerate in the next few years.

Further reading
Great book on Deep Learning

I recently came across this beautiful rendering of Neural Networks in operation. I suspect if we could view the electrical activity of the neurons inside our skulls, they would look similar to these neural network visualizations.

Some have suggested artificial intelligence is a slippery slope, and we are destined to be superseded on planet earth by artificial intelligences. Check out Nick Bostrom for example. I don’t share Dr. Bostrom’s fears, but he makes some interesting arguments in his TED talk, and he is a very entertaining speaker.

What is “the True North, strong and free”?

“The True North, strong and free!” is a line from the Canadian national anthem (this is apparently the correct capitalization, for some reason), and Canada is my birth country. But that’s not the True North that I wanted to talk with you about. True North (IBM styles the name as one word, TrueNorth) is also the name of IBM’s latest “brain inspired” computer chip. I have been observing the development of this hardware series for years with awe and anticipation. I am currently on a wait list to get access to one of these chips, and I am hopeful that I may be able to get one by the end of 2018. Currently only a select few, like Lawrence Livermore National Laboratory, DARPA, and various university collaborators, have access to these chips.

True North is neuromorphic, meaning that its hardware design attempts to mimic that of the human brain’s cortex. Some info:
– human brain has about 86 billion neurons
– human brain power dissipation is about 20W
– True North has only a million (simulated) neurons and 256 million individually programmable synapses
– True North power dissipation is about 70mW (so by my math it consumes about 300 times as much power per neuron as the brain)
– in contrast, modern GPUs consume around 400W under load, or over 5000 times as much as a True North chip
– True North has a non-von Neumann architecture, has no clock, has memory within, and is designed to minimize bus traffic
– True North is stackable, and some very large machines have been built from large numbers of them (e.g., 100 trillion synapses)
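Here is the “by my math” part written out, as a sketch that uses only the approximate figures from the list above. The constant and function names are mine.

```go
package main

import "fmt"

// Approximate figures from the list above.
const (
	brainNeurons = 86e9  // neurons in a human brain
	brainWatts   = 20.0  // brain power dissipation
	chipNeurons  = 1e6   // simulated neurons on one chip
	chipWatts    = 0.070 // 70mW
	gpuWatts     = 400.0 // a modern GPU under load
)

// wattsPerNeuron divides total power by the number of neurons.
func wattsPerNeuron(watts, neurons float64) float64 {
	return watts / neurons
}

func main() {
	brain := wattsPerNeuron(brainWatts, brainNeurons)
	chip := wattsPerNeuron(chipWatts, chipNeurons)

	fmt.Printf("brain: %.2e W per neuron\n", brain)
	fmt.Printf("chip:  %.2e W per neuron\n", chip)
	fmt.Printf("chip uses about %.0f times the power per neuron\n", chip/brain)
	fmt.Printf("GPU uses about %.0f times the power of one chip\n", gpuWatts/chipWatts)
}
```

So the brain is still ahead by a couple of orders of magnitude per neuron, but the chip is several thousand times more frugal than a GPU.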

To me this is a very exciting concept. I am eager to get my hands on the hardware so I can try to develop something on it. The programming paradigm is apparently very different from everything else, so that just makes it even more interesting to me. I am hopeful that this sort of radical architectural thought could someday lead to even greater performance than GPUs for DL tasks, and also be less costly in power consumption.

Further reading
IBM’s brain-inspired chip, True North

You promised to explain why you only talked about NVIDIA-brand GPUs

I alluded to 2 reasons that I had for talking about only NVIDIA GPUs. The first of these is that it seems to me, as a person trying to teach myself about Deep Learning, that NVIDIA is far ahead of all the other GPU manufacturers in their support of Deep Learning. NVIDIA very early on realized that DL was a great fit for GPUs, and they invested in libraries and toolkits and they nurtured this market. Today their CUDA toolkit works with everything (all the major frameworks, like TensorFlow, Caffe, etc.). Although other GPU manufacturers support OpenCL (the Open Computing Language, a way to perform non-graphics tasks on GPUs), none of those de-facto standard frameworks (to my knowledge) run on OpenCL. Even though there are other great GPUs that appear poised to perform better on low precision floating point and cost less than the NVIDIA GPUs, NVIDIA just has a huge head start on them in software support. What I read everywhere is that if you want to avoid troubles, use NVIDIA hardware. In my team at work we all use NVIDIA hardware. We have GeForce 10 series and Jetson TX1 and Jetson TX2 development kits.

And the other reason is that NEXT WEEK is the NVIDIA GPU Technology Conference (GTC) right here in sunny San Jose, CA! I attended this conference last year, and it was fantastic. There were many very inspirational presentations! I’ll be attending again this year, and some of my teammates are presenting. I am guessing that back in the day, these NVIDIA GTCs were probably all about cool graphics algorithms for the hottest computer games and other 3D rendering applications. But last year, almost everything I saw was about Machine Learning, and I’m hoping for more of the same this year. I am very excited about this conference.

Well, how did I do? Did I tie all of these seemingly disparate ideas together as I hoped? I think I did!

I tied together Moore’s Law, 1995-2005 versus 2005-2015, CPUs versus GPUs and the surprising return to 16 bit floating point numbers. I talked about the perils of parallel programming and I digressed a little into two modern programming languages, Scala and Go, and talked about how they mitigate those issues. I also talked about the two new kids on the block: blockchains (name-dropping Bitcoin and Ethereum), and Deep Learning, and how they are driving up the price of GPUs. Then I slipped in a little bit about the True North chip, and ended with my excitement about NVIDIA’s GTC 2018. This ended up being a long post!

I hope you enjoyed this article and I also hope you will share your comments on any or all of the above.

Published by

mosquito

An Insignificant Annoyance.

2 thoughts on “Parallel Lines”

  1. Fascinating blog! Is your theme custom made or did you download it from somewhere? A theme like yours with a few simple tweeks would really make my blog stand out. Please let me know where you got your design. Many thanks

    1. Thanks, Tracey. It’s just the standard (free) “Kleen Blog” theme. It was one of the choices on the first page when I created this WordPress site.
