Deep learning refers to a family of machine learning algorithms that can be used for supervised, unsupervised and reinforcement learning.

These algorithms have become popular after many years in the wilderness. The name comes from the realization that adding more layers, typically in a neural network, enables a model to learn increasingly complex representations of the data.

Artificial neural network models are inspired by the biological circuitry that makes up the human brain. They were first created to show how biological circuitry could compute the truth values of statements in propositional logic.

Warren S. McCulloch and Walter Pitts showed how to construct a series of logic gates that could compute binary truth values. The neurons in their model are individual units that integrate activity from other neurons. Each connection between neurons is weighted to simulate synaptic efficacy – the ability of a presynaptic neuron to activate a post-synaptic neuron.
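The kind of threshold unit described above can be sketched in a few lines of Python. This is a simplified illustration rather than the original formalism: the function names, weights and thresholds are assumptions chosen for this example, and inhibition is modeled here with a negative weight.

```python
def threshold_neuron(inputs, weights, threshold):
    """Return 1 if the weighted sum of binary inputs reaches the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


# Hypothetical gate definitions: weights and thresholds are picked by hand.
def logic_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return threshold_neuron([a, b], [1, 1], threshold=2)


def logic_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return threshold_neuron([a, b], [1, 1], threshold=1)


def logic_not(a):
    # An active input contributes -1 and pushes the sum below the threshold of 0.
    return threshold_neuron([a], [-1], threshold=0)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logic_and(a, b), "OR:", logic_or(a, b))
```

Note that every weight and threshold is fixed by hand, so nothing in this sketch is learned from data.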

Even though they considered the temporal sequence of processing and wanted to include feedback loops, their models were unable to learn or adapt.

*Left: A drawing of neurons and their interconnections by the early neuroanatomist Ramón y Cajal. Right: Drawings of “neuron nets” by McCulloch and Pitts, designed to process logical statements.*

Although various learning rules exist to train a neural network, at its most basic the learning can be thought of as follows: A neural network is presented with some input, and activity propagates throughout its series of interconnected neurons until reaching a set of output neurons.

These output neurons determine the kind of prediction the network makes. For example, to recognize handwritten digits, we could have 10 output neurons in the network, one for each of the digits 0 through 9. The neuron with the highest activity in response to an image of a digit denotes the prediction of which digit was seen.
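As a minimal sketch of this readout step, assume the activities of the 10 output neurons are available as a list ordered by digit. The function name `predict_digit` and the example activities are hypothetical.

```python
def predict_digit(output_activities):
    """Return the digit whose output neuron is most active."""
    if len(output_activities) != 10:
        raise ValueError("expected one activity per digit, 0 through 9")
    # The index of the largest activity is the predicted digit.
    return max(range(10), key=lambda digit: output_activities[digit])


# Made-up activities for one image; the neuron for digit 7 responds most strongly.
activities = [0.01, 0.02, 0.05, 0.03, 0.01, 0.04, 0.02, 0.75, 0.05, 0.02]
print(predict_digit(activities))
```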

At first, the weights between the neurons are set to random values, so the first predictions about which digit is in an image will be essentially random. As each image is presented, the weights can be adjusted so that the network is more likely to output the correct answer the next time it sees a similar image.

By adjusting the weights in this manner, a neural network can learn which features and representations are relevant for correctly predicting the class of the image, rather than requiring this knowledge to be predetermined by hand.

While any procedure for updating the weights in this manner can suffice – for example, biological learning laws, evolutionary algorithms and simulated annealing – the primary method used today is known as backpropagation.

The backprop algorithm, discovered several times by different researchers from the 1960s onward, effectively applies the chain rule to mathematically derive how the output of a network changes with respect to changes in its weights. This allows a network to adapt its weights according to a weight update rule based on gradient descent.
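To make the chain rule concrete, here is a small sketch for a toy network with a single sigmoid hidden unit and a linear output. The architecture, the squared-error loss, the learning rate and all names are assumptions made for illustration; real networks apply the same idea across many units and layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    """Toy network: hidden h = sigmoid(w1 * x), output y = w2 * h."""
    h = sigmoid(w1 * x)
    y = w2 * h
    return h, y


def loss(w1, w2, x, target):
    """Squared error between the output and the target."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def gradients(w1, w2, x, target):
    """Apply the chain rule backward from the loss to each weight."""
    h, y = forward(w1, w2, x)
    dL_dy = y - target                   # derivative of the loss w.r.t. the output
    dL_dw2 = dL_dy * h                   # y = w2 * h, so dy/dw2 = h
    dL_dh = dL_dy * w2                   # y = w2 * h, so dy/dh = w2
    dL_dw1 = dL_dh * h * (1.0 - h) * x   # sigmoid'(z) = h * (1 - h), dz/dw1 = x
    return dL_dw1, dL_dw2


def train(w1, w2, x, target, learning_rate=0.5, steps=200):
    """Plain gradient descent: step each weight against its gradient."""
    for _ in range(steps):
        g1, g2 = gradients(w1, w2, x, target)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2


before = loss(0.3, -0.2, 1.0, 0.8)
w1, w2 = train(0.3, -0.2, 1.0, 0.8)
after = loss(w1, w2, 1.0, 0.8)
print("loss before:", before, "loss after:", after)
```

Each line in `gradients` multiplies one more local derivative onto the running product, which is all the chain rule amounts to in this setting.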

Despite the rules being in place for neural networks to operate and learn effectively, a few more mathematical tricks were required to really push deep learning to state-of-the-art levels.

One of the things that made learning in neural networks difficult, especially in deep or multilayered networks, was mathematically described by Sepp Hochreiter in 1991. This problem is referred to as the *vanishing gradient* problem, with a dual issue now referred to as the *exploding gradient* problem.
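The vanishing effect can be illustrated with a rough sketch. Assume a chain of layers with one sigmoid unit each and a shared weight (a deliberately simplified setup, with hypothetical names). The backpropagated gradient is a product of one factor per layer, and because the sigmoid's derivative never exceeds 0.25, with a weight of 1 that product shrinks at least geometrically with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_through_chain(depth, weight=1.0, x=0.5):
    """Derivative of the final activation with respect to the input x
    for a chain of `depth` one-unit sigmoid layers sharing one weight."""
    activation = x
    grad = 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: multiply by this layer's local derivative,
        # weight * sigmoid'(z), where sigmoid'(z) = a * (1 - a) <= 0.25.
        grad *= weight * activation * (1.0 - activation)
    return grad


for depth in (1, 5, 10, 20):
    print(depth, gradient_through_chain(depth))
```

Conversely, when the per-layer factors are larger than 1 in magnitude, the same product grows with depth instead of shrinking, which is the exploding case.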

Hochreiter’s analysis motivated the development of a class of recurrent neural network (RNN) known as a long short-term memory (LSTM) model, which is deep in time rather than deep in space. LSTMs overcame many difficulties faced by RNNs, and today remain among the state-of-the-art approaches for modeling temporal or sequential data.

Parallel developments for feedforward and convolutional neural networks would similarly advance their ability to outperform traditional machine learning techniques across a wide range of tasks.

In addition to hardware advances like the proliferation of graphics processing units (GPUs) and the ever-increasing availability of data, smarter weight initializations, novel activation functions, and better regularization methods have all helped neural networks perform as well as they now do.

**About the Author**

*Sohrob Kazerounian is a senior data scientist at Vectra, where he specializes in artificial intelligence, deep learning, recurrent neural networks and machine learning. Before Vectra, he was a post-doctoral researcher with Jürgen Schmidhuber at the Swiss AI Lab, IDSIA. Sohrob holds a Ph.D. in cognitive and neural systems from Boston University and bachelor of science degrees in cognitive science and computer science from the University of Connecticut.*