
35 Years of Neural Networks

By DrProton

My hope is that this post serves as a ‘getting us up to now’ primer on the technology behind AI and neural networks (terms that I use interchangeably). I’ve approached this from the perspective of someone who was familiar with primitive neural networks from long ago but hasn’t followed the new developments in the field closely until recently. Future posts will explore modern developments in AI more deeply, assuming some of the background covered here.

Ancient History

I was very interested in neural networks when they entered the public eye in a big way in the late 80’s and early 90’s. In those days, the available computing power only made it feasible to train neural networks that are laughably tiny by today’s standards. I remember that a friend and I back then trained a neural network to predict the outcome of NFL football games using the stats from past games. I don’t remember the details, but that neural network probably had something like 10 neurons total. It learned that the most important stat for predicting football success was turnovers, which is true enough but not exactly a groundbreaking result.

Circa 1991 I got a grant of time on a Cray Y-MP supercomputer. I trained neural networks that had a total of 15 neurons to solve a pattern recognition problem as part of my Ph.D. thesis research. My thesis compared this neural network technique with other more conventional techniques, and it turned out that even this rudimentary neural network performed admirably by comparison.

In those days, AI and neural networks were largely separate fields. Most AI research was pursuing approaches other than neural networks in areas like natural language, vision, pattern recognition, and knowledge representation. Neural networks were an interesting sideshow because they were based on how actual neurons in brains work, but cutting-edge neural networks back then didn’t seem nearly as capable as other approaches to AI. It seemed like rule-based approaches like Cyc held more promise - they just had to be given the proper set of rules that codified knowledge and common sense. But progress in AI was slow, giving rise to the old saying that “AI is 10 years away and always will be”. Meanwhile, thanks to steady increases in computing power, larger and larger neural networks could be created and trained with each passing year.

Breakthroughs

Around 2012, several things converged to allow neural networks to break out and come to the forefront:

  • Steady increases in computing power, and GPUs (developed to enable graphics-intensive computer games) were used to greatly speed up training neural networks.
  • Advances in neural network architecture resulted in Convolutional Neural Networks (CNNs) that involved layers specialized for image processing.
  • Quality training data sets such as ImageNet were created. AlexNet was a breakthrough CNN trained on the ImageNet data set using Nvidia GPUs, and it took the AI world by storm by achieving impressive success rates in standardized image recognition challenges. AlexNet contained 650 thousand neurons and 60 million connection weights, which is obviously a far cry from the 15 or so neurons that I used back in the day.

Follow-on developments showed that AlexNet was definitely not some kind of fluke, as researchers found that adding more and more neurons and layers produced even more capable neural networks. Neural networks came to almost completely take over the field of AI. Few AI researchers are now working with anything other than neural networks. At some point, the “deep learning” terminology came into vogue to describe modern neural networks with millions of neurons and many layers, mostly just to differentiate them from the less capable techniques used in the past.

Approaching Modernity

Around 2018, further advances in neural network architectures produced Transformers, which enabled Generative AI and Large Language Models (LLMs) that can legitimately be said to understand natural language. Similarly, circa 2022, more innovations in neural network architecture and training strategies enabled image generators like Stable Diffusion. Future posts will explore these fascinating advances in more detail.

A measure of the speed of progress in the field is that the GPT-3 LLM, released in 2020, had around 175 billion connection weights. The number of connection weights in the GPT-4 LLM, released in 2023, is proprietary, but is thought to be approximately 1.76 trillion. We are getting into the realm of large numbers here that humans have trouble comprehending.

In addition to ever-increasing size and scale of the neural networks enabled by advancing technology, these latest developments have come from very clever new concepts, approaches, and architectures. For example, just by training a neural network to predict the next word in text given some number of previous words, and feeding in a large portion of the internet as training data, you get an LLM that for all intents and purposes understands natural language. Which seems almost miraculous. But underneath these innovative approaches are just basic neural networks, scaled up immensely and arranged in clever ways.

Neural Network Basics

This video and its follow-on video from the always-excellent 3Blue1Brown are good primers on the topic. If you prefer a non-video treatment, this link is quite good. Or just ask ChatGPT to “Give a basic primer on how neural networks function”.

Most of the information presented in those links hasn’t changed much in 35 years. In fact, I created this PDF of an Appendix to my Ph.D. thesis from 1992 that was intended to be a primer on the basics of neural networks for high-energy physicists. I was surprised at how well the content held up over the years, other than the sizes of the networks discussed. The appendix includes plots of how most of the connection weights in the neural network evolved during training. You won’t find that plot in modern treatments of neural networks that have trillions of connection weights.

The main reason that I made a PDF from printed dead tree pages from 1992 was not because I am overly proud of my writing, but because I wanted to do a quick exploration of how to turn photos of pages from paper books into decent modern PDFs or web pages. Frankly I was sorely disappointed with the (free) tools that I found to do this, and with the quality of the result. The PDF ended up just being slightly retouched photos of the pages - I was hoping for much more. This seems like an area where AI-based image processing could not only correct the image distortions endemic to taking photos of pages, but also recognize the text, math equations, and diagrams from the photos and turn them into ’native’ PDF (or HTML) text, equations, and diagrams, all laid out as they were on the original paper page. The tools I tried did a reasonable job of recognizing text, but they didn’t come close to preserving the page layout or equations or diagrams (they didn’t even try…). This seems like something to pursue further.

In the thesis, the following assertion is made:

"[...]neural networks have a reputation of being incomprehensible "black boxes" whose actions defy explanation. However, this situation is changing, and the incorporation of neural networks into both data analysis and data acquisition is becoming widespread."

As it happens, the situation has not changed all that much - we still don’t have a good understanding of just how large neural networks do what they do. (But to be fair, neural networks have gotten many millions of times larger, so there is a lot more to understand now.) This will be the topic of a future article, exploring the state of the art, and the potential for the AIs to take over the world while we’re not watching.

For the remainder of this post, I’m going to assume that you’ve watched the 3Blue1Brown videos, and just add on a few comments.

Network Architectures

Note that the architecture of a neural network is strongly tied to the data set that will be used to train it. The decision to use the MNIST data set for training the handwriting recognition neural network discussed in the video determines the number of inputs (a 28 x 28 grayscale image ‘unrolled’ into 28 x 28 = 784 inputs) and the number of outputs (one output each for the digits 0 - 9). Determining the number and size of hidden layers that will yield the best performance is still somewhat of a black art though.

Also note that this same training data, slightly reformatted, could be used to train a neural network to instead output a 4-bit binary representation of the digits 0 - 9 (e.g. 0000, 0001, 0010, … 1001), in which case the number of neurons in the output layer would be 4 instead of 10. Or it could even output a single value ranging from 0.0 to 9.0. But some of these choices for the output layer might make the network harder to train or less performant. They might also lose a nice feature of the one-output-per-digit scheme: the ability for the network to indicate how confident it is in its answer. E.g. if the input image was a sloppily-written ‘7’ that could easily be mistaken for a ‘1’, the network might output values near 1.0 for both the ‘1’ and ‘7’ output neurons.
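Those two output encodings are easy to sketch in a few lines of Python. This is purely illustrative - the helper names are my own, not from any real library:

```python
# Two ways to encode the digit labels 0 - 9 as training targets.

def one_hot(digit):
    """One output neuron per digit: the network can express confidence
    in several digits at once (e.g. a sloppy '7' that looks like a '1')."""
    target = [0.0] * 10
    target[digit] = 1.0
    return target

def four_bit_binary(digit):
    """Only 4 output neurons, but per-digit confidence is lost, and
    nearby bit patterns don't correspond to similar-looking digits."""
    return [float(bit) for bit in format(digit, "04b")]

print(one_hot(7))          # 1.0 in position 7, 0.0 everywhere else
print(four_bit_binary(7))  # [0.0, 1.0, 1.0, 1.0]
```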

Matrix Math

As was pointed out in the video, picturing a neural network in terms of layers of interconnected neurons is useful conceptually. In practice, real-life neural network implementations boil this representation down to the underlying math. E.g. if you represent the n inputs to a layer of neurons as an n x 1 column vector, and the n weights of each of the m neurons in the layer as a row in an m x n matrix, the multiply-by-weights-and-sum portion of the calculation that each neuron does is simply a matrix multiplication that results in an m x 1 column vector containing the sum for each of the m neurons. That column vector of summed values is fed to the non-linear activation function to ‘squash’ it and get the final outputs of the layer. This removes the need to have a class Neuron that handles the calculation for a single neuron at a time. It is much easier to do the math for a whole layer at a time using linear algebra.

It so happens that GPUs are very adept at multiplying matrices, which is why a GPU is practically a prerequisite for running neural networks quickly, and especially for training them.

NPUs (Neural Processing Units) - hardware even more specialized for the matrix operations needed to train and evaluate neural networks - are becoming available in home computers. It is interesting that the performance of these NPUs is often measured in TOPS - trillions of operations per second. Microsoft requires a minimum of 40 TOPS to label a PC as a ‘Copilot+ PC’. Ponder the magnitude of those numbers for a while if you can!

Activation Functions

The non-linearity introduced to neural networks via the activation function is crucial to making neural networks work. Without this non-linearity, layers of neurons would only be able to perform boring linear transformations on the inputs to get the outputs, no matter how many layers the neural network contained. This is easy to see if you consider the action of a neural layer sans a non-linear activation function as just a (linear) matrix multiplication - any number of (linear) matrix multiplications associated with any number of layers can always be reduced to a single (linear) matrix multiplication. So having multiple layers in a neural network architecture would buy nothing if it weren’t for the non-linearity of the activation function.
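You can see this collapse numerically. With made-up layer sizes and random weights, passing an input through two activation-free layers gives exactly the same result as one precomputed combined matrix:

```python
import numpy as np

# Without a non-linear activation, stacking layers buys nothing:
# two linear layers are exactly equivalent to one combined linear layer.

rng = np.random.default_rng(1)
W1 = rng.standard_normal((8, 5))    # layer 1: 5 inputs -> 8 outputs
W2 = rng.standard_normal((3, 8))    # layer 2: 8 inputs -> 3 outputs
x = rng.standard_normal((5, 1))

two_layers = W2 @ (W1 @ x)          # pass through both layers in turn
one_layer = (W2 @ W1) @ x           # a single precomputed 3 x 5 matrix

print(np.allclose(two_layers, one_layer))  # True
```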

As mentioned in an addendum to the first 3Blue1Brown video: back in the day, the use of sigmoid activation functions (usually via a lookup table) was de rigueur, because that is how actual biological neurons behave. But at some point it was discovered that the much simpler (and easier to compute) ReLU activation function still introduces enough non-linearity to make the magic happen. The ReLU activation function simply takes an input x and outputs 0 if x < 0, or x otherwise.
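Both activation functions are one-liners. These are sketches of the standard definitions, not any particular library’s implementation:

```python
import math

def sigmoid(x):
    """Smooth squash to (0, 1); the standard choice in late-80s networks."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified Linear Unit: far cheaper to compute, yet still non-linear."""
    return x if x > 0 else 0.0

print(sigmoid(0.0))  # 0.5
print(relu(-2.5))    # 0.0
print(relu(3.0))     # 3.0
```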

Training

As covered in the second video, a neural network must be trained. During training, the network is presented with a training set that contains many example (input values, desired output values) pairs. The weights for each input to each neuron are adjusted according to an algorithm called backpropagation. The emergence of backpropagation caused the early surge in interest in neural networks in the late 1980s that I was caught up in. The training proceeds in passes through the training set. Each pass incrementally adjusts the weights to make the outputs of the neural network better match those in the training set. Backpropagation is at the heart of how neural networks function. Training is a very computationally expensive operation compared to just using a trained neural network to produce an output from some inputs.
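As a minimal sketch of the idea, here is gradient-descent training for the single-neuron special case (backpropagation proper extends this chain-rule step backwards through many layers). The toy task - learn whether an input is positive - and the learning rate are made up for illustration:

```python
import math
import random

random.seed(0)
w, b, lr = random.random(), 0.0, 0.5

# Toy training set: desired output is 1.0 when the input is positive.
data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

for epoch in range(2000):               # repeated passes through the set
    for x, target in data:
        out = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid neuron
        err = out - target              # from the squared-error loss
        # Chain rule: d(loss)/dw = err * sigmoid'(z) * x
        grad = err * out * (1.0 - out)
        w -= lr * grad * x              # nudge each weight downhill
        b -= lr * grad

prediction = 1.0 / (1.0 + math.exp(-(w * 3.0 + b)))
print(prediction)  # prediction for x = 3.0, well above 0.5
```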

After training, the neural network can be presented with inputs not seen during training. It will, hopefully, produce reasonable output values because it has learned to generalize the salient underlying features from the training set. However, there is also the possibility that the network just memorized the training data. It would, therefore, perform poorly when presented with inputs that were not explicitly present in its training data. This is known as overfitting, and is prone to happen when a network has too many neurons (and thus too many connection weights) for the size of the training data set.

Convolutional Neural Networks

As hinted at in the second 3Blue1Brown video, neural networks can do better on the digit recognition task by tweaking the architecture of the network to help it recognize certain common features in the image, like lines, edges and loops. This can be accomplished by adding in pre-built detectors for these features, instead of forcing the neural network to learn about the features on its own. A way to do this is to build in a convolution kernel that can detect a certain feature like an edge, then run it over the image, usually at multiple different scales. The neural network then uses the output of running this convolution kernel over the image as one of its inputs. If the convolution kernel ‘gets excited’ about some area of the image, it has found an edge there, and the neural network then has access to this ‘higher level’ information - e.g. “I found an edge oriented in this direction at position x,y in the image”, rather than just the lower level “here are the raw pixel values for the area around x,y in the image”. This is the idea behind the Convolutional Neural Networks mentioned above, and it was already in use by the AlexNet network in 2012. This approach is also somewhat inspired by biology, as our own visual system seems to have neural circuitry to detect edges.
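The kernel-sliding idea can be sketched directly. The tiny ‘image’ and hand-built vertical-edge kernel below are made up for illustration:

```python
import numpy as np

# A 6x6 'image' with a sharp left/right boundary at column 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0                  # dark left half, bright right half

kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])   # responds to left-to-right edges

# Slide the 3x3 kernel over every position in the image.
h, w = image.shape
out = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out[0])  # the kernel 'gets excited' only around the edge
```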

High Dimensions

The video also mentions the fact that neural networks (even the small simple ones) operate in very high-dimensional spaces. It is difficult for us humans to really understand the things that can happen in these ridiculously high dimensional spaces - our intuitions fail us beyond just 3 or 4 dimensions. I tend to agree with some modern AI researchers that think that much of the ‘magic’ behind large neural networks comes from behavioral aspects of these high dimensional spaces, and that if we come to understand more about how things behave in these trillion-dimensional spaces, we will gain more than just a better understanding of how neural networks do what they do. As with many things mentioned in this post, I hope to explore some of this high-dimension magic in more detail in future posts.

Summary

It amazes me that the concepts behind neural networks are still relevant after so many years, yet neural networks have come so very far and increased in scale so dramatically. I cannot think of any other areas where humans have scaled a technology from roughly 50 connection weights (from my thesis neural network, which was not cutting edge at the time but still kept a supercomputer busy) to 1.76 trillion, and still counting.

Author
DrProton
Mostly-retired Software Engineer, ex-Physicist, and lifelong learner.
