We say that necessity is the mother of invention, and that certainly was the case between 1942 and 1944 at Bletchley Park, the secret headquarters of British codebreakers during the Second World War. In the 2014 historical drama The Imitation Game, Benedict Cumberbatch plays the role of Alan Turing, one of the most brilliant mathematicians and cryptanalysts of his time. Turing, along with a crack team of codebreakers, was tasked by the British military with deciphering the supposedly unbreakable Nazi code encrypted by the infamous Enigma machine. Whilst thousands of Allied soldiers and civilians died as a result of the secret messages and co-ordinates sent to German U-boats and the Luftwaffe via Enigma, Turing and his team had to race against time to come up with a solution.
Within just a few weeks of working on the Enigma code, Turing had radically altered the course of the military’s efforts. The plan he proposed was to make use of a cryptanalytic machine that could help break the German cypher. Whilst in the film it is insinuated that Turing and his team conceptualised and built the machine from scratch, it was in fact modelled on a Polish machine called the Bomba – albeit with some very important alterations insisted upon by Turing. This code-breaking machine was named the Bombe – as a nod to its predecessor and because of the ominous ticking sound made by the dozens of indicator drums continuously testing possible outcomes – and it would significantly change the course of history.
With the help of this electromechanical cryptanalytic machine, which effectively automated and optimised the trial of different possibilities in the code-breaking process, Turing and his team managed to crack the previously unbreakable Enigma code. Many historians argue that this breakthrough was critical for the Allies to eventually go on to win the war, with intercepted German messages being decoded to give their forces a distinct strategic advantage going into battle. After the war, Turing made great strides in advancing early computing developments, and to this day, many call him the father of modern computing.
In many ways, the history of AI begins with the very first manifestations of the digital electronic computer, dating back to Turing’s earliest research in the 1930s. But in the strictest sense, the highly effective and decorated Bombe machine built by Turing’s team at Bletchley Park could not truly be called a computer. For one thing, the Bombe could only solve one problem. And secondly, it could not store or retrieve data, these being the critical functions that allow modern computers to achieve the level of programmability that makes them so powerful today.
Despite the Bombe not quite being classified as the first ever computer, Turing’s truly visionary work before the war had already demonstrated incredible foresight into the future of computing. In a paper called “On Computable Numbers, with an Application to the Entscheidungsproblem” (1936), Turing described a theoretical machine that could carry out any conceivable computation, provided that it was representable in the form of an algorithm. These theoretical machines came to be called Universal Turing Machines (UTM), a seminal idea that would later inform John von Neumann’s design for the Electronic Discrete Variable Automatic Computer (EDVAC), which was delivered in 1949. Built at the University of Pennsylvania for the US Army’s Ballistic Research Laboratory, EDVAC was one of the earliest electronic stored-program computers and, unlike previous machines, used a binary numbering system as opposed to a decimal system – the format still used in modern computing today.
As was the case with the EDVAC, the first ever machine intended to “learn” was also funded by the US Military, this time through the Office of Naval Research, and was built by Frank Rosenblatt at the Cornell Aeronautical Laboratory in 1957. The Perceptron, as it was called, was an early prototype for machine learning, making use of a rudimentary neural network for image recognition. Unlike modern AI, the Perceptron was a machine, not a program. The “learning” aspect of the machine worked similarly to the neural networks of today, with neurons processing incoming data and altering the weights (or relative importance of inputs) attached to these neurons depending on the resultant output. In the Perceptron, however, these weightings were altered physically (as opposed to digitally) via small electrical motors. This early form of AI was called connectionism. But what seemed at first to be a significant breakthrough in machine learning and artificial intelligence would ultimately, if unintentionally, become a massive burden to the entire field of study.
After a very promising and fruitful period for artificial intelligence research and development from the mid-1950s to late-1960s, what ensued was to be called the “AI Winter”, largely catalysed by the reception and review of the Perceptron machine by one single book in particular – Perceptrons: An Introduction to Computational Geometry (1969). The famous work – produced by American cognitive scientist Marvin Minsky and the South African-born American mathematician Seymour Papert – focused on the limitations of the Perceptron system, specifically providing mathematical proofs that such a neural network was not capable of learning an exclusive disjunction (XOR) function.
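The XOR limitation can be illustrated with a short sketch. It is not Rosenblatt’s hardware or Minsky and Papert’s proof; it assumes a single neuron with a threshold activation trained by the classic perceptron learning rule, and the function names are hypothetical. The same neuron learns the AND function but never settles on weights that reproduce XOR.

```python
from itertools import product


def step(z):
    """Threshold activation: fire (1) if the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, epochs=100, learning_rate=1):
    """Train one two-input neuron with the perceptron learning rule.

    Returns True if an entire pass over the samples produced no errors.
    """
    w1, w2, bias = 0, 0, 0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            error = target - step(w1 * x1 + w2 * x2 + bias)
            if error != 0:
                # Nudge the weights towards the correct answer.
                w1 += learning_rate * error * x1
                w2 += learning_rate * error * x2
                bias += learning_rate * error
                errors += 1
        if errors == 0:
            return True
    return False


inputs = list(product([0, 1], repeat=2))
AND = [(x, x[0] & x[1]) for x in inputs]
XOR = [(x, x[0] ^ x[1]) for x in inputs]

and_learned = train_perceptron(AND)  # AND can be separated by a single line
xor_learned = train_perceptron(XOR)  # XOR cannot, so training never settles
print("AND learned:", and_learned)
print("XOR learned:", xor_learned)
```

No straight line separates the XOR cases, so raising the number of epochs does not help; a hidden layer is needed.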
So influential was this book that it would change the course of AI research for decades to come. The result was a significant slowdown in sponsorships and a general feeling of pessimism around the discipline, with most experts on the matter emphasising the limited capabilities of connectionist systems such as the Perceptron, the earliest forms of neural networks. Between the release of Perceptrons in 1969 and the eventual revival of AI research in the mid-1980s, funding for connectionism-type projects was near-impossible to attain. It would not be until the advent of multi-layered neural networks (capable of deep learning) that artificial intelligence research and optimism surrounding machine learning would make a revival, thanks in no small part to a few stubborn and dedicated researchers on the ground who battled through the AI Winter without much support and often under much criticism.
Unfortunately, the inventor of the Perceptron, Frank Rosenblatt, would not live to see the revival of his field – having died in a boating accident not long after the release of Perceptrons – but the late Marvin Minsky would, living long enough at least to swallow his words and completely change his mind. Minsky, who had co-founded the Massachusetts Institute of Technology’s (MIT) AI laboratory in 1959 and who for many years doubted the capability of neural networks, remained one of the foremost experts in the field and became a great believer in the massive potential of the “learning machine.”
What Marvin Minsky and Seymour Papert did not account for in their industry-altering book was that neural networks would become multi-layered – a breakthrough that, along with the significant developments in the processing power of computers, would eventually end the AI Winter and open up a world of possibilities for the implementation of artificial intelligence and machine learning. Like many of mankind’s greatest technological triumphs, the most significant technique propelling artificial intelligence into the future is inspired by nature. The concept of the artificial neural network (ANN), as perhaps the most advanced system in the realm of machine learning at present, is loosely based on the neural circuits that occur naturally in the brain. In a biological neural network, chemical and electrical synapses connect unimaginably intricate circuits of neurons that link together to make up the central nervous system. Each of these neurons has dendrites (receptors) and axons (transmitters) that respectively receive and send signals across a network of neurons, each of which translates various signals and stimuli into meaningful information for use in the brain.
Whilst much simpler in design compared to their biological counterparts, artificial neural networks work in much the same way. At the most conceptual level, neural networks can “learn” through considering many inputs via their neurons and adjusting the translation or processing of the data, based on the relevance of the output to the desired result – which may or may not be dictated by the user. This is what makes this system of machine learning so powerful – the ability to learn and self-correct without the need for continuous manual intervention by the programmer. And critical to the evolution of neural networks in the quest for true self-actualising artificial intelligence has been the advent of deep learning.
Deep learning involves the training of artificial neural networks that are several layers of neurons deep – known as deep nets. Instead of one node processing all incoming data and producing a final result, deep nets rely on sequentially filtering data through multiple layers to refine the output. One can think of these layers of neurons as a stack of sieves or nets, each with a different sized mesh, allowing some particles through whilst blocking others. In natural neural networks this filtering process is similar – individual neurons decide which stimuli are most relevant, and which are not, in determining whether or not the synaptic connection will fire to pass on the signal to the next layer of neurons. Crucially, however, this filtering process is not a binary yes-no system, but rather relies on the calibration of a finely-tuned weighting mechanism for each neuron. The adjustment of these weights on the inputs to the neuron is the key capability that allows a neural network to “learn”.
To better understand this concept, let us imagine a common use for neural networks in the real world – image recognition. To narrow this down even further, let us just focus on a system that can recognise handwritten numbers. In this example, as in all deep learning neural networks, there are multiple layers of neurons making up the system. The first layer receives the external input, whilst the last layer delivers the prediction – in this case, a number from zero to nine. The set of layers wedged in between the first and last layers is where the calibration and filtering processes happen, and these layers are often called the hidden layers. The activation of various neurons in these hidden layers will determine the final prediction in the last layer of neurons.
The example of image recognition for handwritten text is fairly complex, since the data being fed into the system is not in a neat numerical format – yet this is where neural networks have an advantage over other machine learning processes. In the case of recognising a number, for example, the image would typically be fed in as a grid, where each block of the grid represents a pixel. In a 28 by 28 grid, there would then be 784 blocks, and each block would be represented by a neuron in the first layer of the network. The first layer would then be 784 neurons long, each capturing the grayscale value of its corresponding pixel, often as a value from zero to one, where zero is pure white and one is pure black, for example.
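As a minimal sketch of this conversion, the snippet below flattens a 28 by 28 grid into a list of 784 input values. It assumes the image arrives as 8-bit grayscale values in which 255 is white and 0 is black, so the values are inverted to match the convention above of zero for pure white and one for pure black. The function name image_to_input_layer is hypothetical.

```python
ROWS, COLS = 28, 28


def image_to_input_layer(pixels, max_value=255):
    """Flatten a 2-D grid of grayscale values into one list of floats in [0, 1].

    Assumes max_value means white, so the result is 0.0 for white, 1.0 for black.
    """
    return [1.0 - value / max_value for row in pixels for value in row]


# A blank white image with a single black pixel in row 14, column 10.
image = [[255] * COLS for _ in range(ROWS)]
image[14][10] = 0

input_layer = image_to_input_layer(image)
print(len(input_layer))             # one value per pixel
print(input_layer[14 * COLS + 10])  # the black pixel
```

Each entry of input_layer corresponds to the activation of one neuron in the first layer.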
Now that the system has converted an image into numerical data, it can begin the process of trying to recognise which number is being depicted in the image. In different implementations of neural networks, this step will vary greatly, but in this case, the hidden layers within the net will usually try to identify various shapes in the image by analysing the hard edges of the picture. By analysing the grid in a way that distinguishes between the black markings and white spaces, various regions of the grid can be given scores that may correspond to a specific shape – a curve or a straight line, for example. Across the several hidden layers of neurons, the various shapes recognised will trigger different combinations of neurons, eventually signalling to the last layer of neurons which number it is most likely to be. Which neurons are triggered depends on the weights attached to the connections between them. These weights determine to what extent a given input is relevant to a certain neuron. Since each neuron receives multiple inputs, the weights serve as the filter for these inputs, to let the neuron know what factors should be regarded as most important – much in the same way that dendrites in the biological neural network filter the multitude of stimuli attempting to make their way to the processing centre contained in the cell body.
Thus far, however, the actual analysis and learning process has not yet begun, since the manner in which the system decides on a score for each region of the grid, or any other input for that matter, is based on equally or randomly weighting each neuron input at each level in the deep net. The system will not be successful in recognising handwritten numbers unless it optimises its recognition capability by re-weighting each of the neuron input weights throughout the network. In order to do so, a process of reverse engineering takes place on a continuous basis in an advanced form of trial and error. This process involves adjusting the weight attached to each input, based on the success of predicting the output in a particular run. The specific mathematical calculations involve relatively simple calculus techniques, specifically the calculation of the partial derivative of the prediction error with respect to each weight. Through a process called backpropagation, the system is able to repeatedly re-weight the inputs into each and every neuron at each level in the network (in a backward fashion from the last to the first layer), in order to achieve what is now commonly known as deep learning.
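The backward re-weighting can be sketched on the smallest possible deep net: one input, one hidden neuron and one output neuron. This is an illustration under stated assumptions rather than a full implementation: it assumes a sigmoid inside each neuron and a squared-error measure of how wrong the prediction is, and all names and starting values are hypothetical. The partial derivatives are computed from the last layer backwards using the chain rule, and then checked against a brute-force numerical estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Run the input through the hidden neuron and then the output neuron."""
    w1, b1, w2, b2 = params
    hidden = sigmoid(w1 * x + b1)
    output = sigmoid(w2 * hidden + b2)
    return hidden, output


def error(params, x, target):
    """Squared difference between the prediction and the labelled answer."""
    _, output = forward(params, x)
    return (output - target) ** 2


def backprop(params, x, target):
    """Partial derivatives of the error with respect to w1, b1, w2, b2."""
    w1, b1, w2, b2 = params
    hidden, output = forward(params, x)
    # Start at the last layer: how the error changes with the output neuron's sum.
    delta_out = 2.0 * (output - target) * output * (1.0 - output)
    # Pass that signal backwards through w2 to the hidden neuron.
    delta_hidden = delta_out * w2 * hidden * (1.0 - hidden)
    return [delta_hidden * x, delta_hidden, delta_out * hidden, delta_out]


def numerical_gradient(params, x, target, eps=1e-6):
    """Estimate the same derivatives by nudging each parameter in turn."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((error(up, x, target) - error(down, x, target)) / (2 * eps))
    return grads


params = [0.5, -0.3, 0.8, 0.1]  # arbitrary starting weights and biases
x, target = 1.0, 1.0

grads = backprop(params, x, target)
updated = [p - 0.5 * g for p, g in zip(params, grads)]  # one re-weighting step

print("error before:", error(params, x, target))
print("error after: ", error(updated, x, target))
```

A real network repeats this step over many labelled examples and many more weights, but the backward flow from the last layer to the first is the same.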
Integral to this process of refining the weightings and biases between neurons to improve their predictions is the activation function. Think of an activation function as being at the heart of what the neuron does to transform the inputs it receives into an output. The signals that the neuron receives are first converted into a single value that is the weighted sum of all the inputs received from neurons in the previous layer, plus a bias factor. This number is then ready to be processed by the activation function. The function itself could be simplistically linear in nature, a hyperbolic tangent function, a threshold function, or, as in many classic networks, a sigmoid function. What is important is that it is a function that converts a linear weighted sum value into a new value, which then becomes an input for the next layer of neurons. The input to the next neuron is itself then taken through this process again, until finally the neural network’s last layer produces its output, which will then be compared with the expected result.
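The two steps inside a neuron, the weighted sum plus bias followed by the activation function, can be sketched as follows. The sigmoid is used here only because it is one of the options named above; the function names and the example numbers are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)


def layer_output(inputs, weight_rows, biases):
    """Every neuron in a layer sees the same inputs but has its own weights."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two inputs feeding a layer of two neurons.
previous_layer = [0.0, 1.0]
activations = layer_output(
    previous_layer,
    weight_rows=[[0.5, -0.5], [2.0, 2.0]],
    biases=[0.5, 0.0],
)
print(activations)
```

The list returned by layer_output is what the next layer would receive as its inputs.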
When first training a network, the weightings and biases that are meant to distinguish important information from less important information are set at random. And naturally, because these weightings are random at first, the network will initially be very bad at predicting correct outcomes. To improve these predictions, the network must be trained through backpropagation, often using well-established mathematical optimisation techniques, such as “gradient descent”, which make use of a cost function to evaluate the outcomes of the network and to steer it in the right direction as it refines its weightings. This cost function, in simple terms, determines how far off the network is with its predictions.
Let us think back to the example of the image recognition network for handwritten numbers. Initially, when using random weightings, the network may light up or activate totally incorrect neurons in the last layer which is meant to represent a number from zero to nine. With random weightings, when fed the handwritten number “3”, the network may at first light up the corresponding neurons for “8”, “6”, “5” and “3”, for example. To train a network using supervised learning, as is the case here, the cost function will penalise the incorrect outputs using training data that is labelled with the correct output.
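The penalty can be made concrete with a small sketch. The article does not name a specific cost function, so this assumes a simple sum of squared errors between the ten output activations and a label in which only the correct digit is set to one; the names and activation values are hypothetical.

```python
def one_hot(digit, classes=10):
    """Label vector: 1.0 at the position of the correct digit, 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(classes)]


def squared_error_cost(outputs, correct_digit):
    """Sum of squared differences between the outputs and the label."""
    label = one_hot(correct_digit, len(outputs))
    return sum((o - t) ** 2 for o, t in zip(outputs, label))


# Untrained network shown a "3": it lights up 3, 5, 6 and 8 equally.
confused = [0.0] * 10
for digit in (3, 5, 6, 8):
    confused[digit] = 0.9

# Trained network shown the same "3": only the neuron for 3 lights up.
confident = [0.0] * 10
confident[3] = 0.9

print("cost when confused: ", squared_error_cost(confused, 3))
print("cost when confident:", squared_error_cost(confident, 3))
```

The confused network is penalised for each wrongly lit neuron, so its cost is far higher than that of the confident one.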
To improve the network’s prediction accuracy, this cost function must be minimised. This is where the “gradient descent” methodology is actioned. The best way to visualise this method is to imagine standing in a valley (in the shape of a “U”). Your goal is to find the lowest point of the valley, representing the local minimum of the cost function. To do this, you must calculate the slope of your current position, in order to determine in which direction you must travel to find the bottom of the valley. Using your random input (representing your random or unknown location on the hill of the valley), you can calculate the slope of your current position on the cost function. If this slope is negative, then you know you are on the left hill of the valley and need to step to the right to get closer to the bottom. Conversely, if the slope is positive, you know that you are on the right hill of the valley and need to step to the left to reach the local minimum. Depending on the steepness of the slope, you know how close you are to reaching the bottom, as the slope flattens out near the bottom.
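The valley analogy can be sketched in a few lines. The cost function below is a hypothetical U-shaped curve with its lowest point at x = 3, chosen only for illustration; a real network’s cost depends on thousands of weights rather than one position.

```python
def cost(x):
    """A U-shaped valley whose lowest point is at x = 3."""
    return (x - 3.0) ** 2


def slope(x):
    """Derivative of the cost: negative on the left hill, positive on the right."""
    return 2.0 * (x - 3.0)


def gradient_descent(start, learning_rate=0.1, steps=100):
    x = start
    for _ in range(steps):
        # Negative slope moves x right, positive slope moves x left;
        # the steps shrink as the slope flattens out near the bottom.
        x -= learning_rate * slope(x)
    return x


from_left = gradient_descent(start=-4.0)
from_right = gradient_descent(start=10.0)
print(from_left, from_right)
```

Both starting points end up at the bottom of the valley. The learning rate sets the stride length: too large and each step overshoots the bottom, too small and the descent takes far longer.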
It is this autonomous iterative process that makes modern day neural networks so powerful. It is worth noting, however, that these mathematical techniques, whilst not in and of themselves particularly complex, could not until recently be performed effectively to the extent that real progress was made in simulating intelligence. This is mainly owing to two important factors. Firstly, the enormous data sets required to effectively train these systems did not exist before the explosion of the internet and social media, and secondly, the processing capability required to perform the many, many rounds of backpropagation required across these enormous data sets was not yet available to AI researchers. Thus, it was only when these two requirements were met that artificial intelligence was kick-started to the point where it could have a significant impact on society. This is especially true of more complex, unsupervised neural networks that do not make use of user-defined training sets as a guide, but rather rely on large volumes of data to refine their own training sets and outcomes through continuous refinement of predictions without the guidance of an external source. The benefit of these more data-intensive unsupervised models is that the neural net can identify previously unrecognised paths or tactics to a desired outcome far better than a human, and can be used for more than one strictly-designed task because of its open and more generally applicable methodologies.
Even though the earliest roots of artificial intelligence can be found as far back as the 1930s with Alan Turing’s various research papers concerning the ideas and proofs for an intelligent machine, it would not be until the late 1990s and early 2000s that the mainstream media and big industry players would take AI seriously. Whilst there existed some useful applications in the technology sector before this time, especially in image and voice recognition, it was very much behind the scenes and out of the eye of the public. But despite the less-than-enthusiastic attitude of big corporations and government towards funding artificial intelligence projects, especially given the underwhelming results it had provided in the twentieth century, there were always a few isolated believers in AI who truly understood the potential of deep learning.
One such important group of researchers, who would struggle through the fallow times in artificial intelligence research and whose steady belief in these methodologies would ultimately be justified, is known to the AI community as the Canadian Mafia. This tightly-knit group of artificial intelligence evangelists – including such luminaries as Geoffrey Hinton, Yoshua Bengio and Yann LeCun – are today considered to be the rockstars of the AI field. They, for example, were the researchers that would make great strides in developing the critically important backpropagation method that would significantly advance the learning capabilities of neural networks.
Geoffrey Hinton, as an example – an English-Canadian cognitive psychologist and computer scientist – is regarded by many as the godfather of neural networks. And whilst his research has today been recognised as fundamental to the success of AI in recent times, this was not always the case. For many years, Hinton and those who worked alongside him, including LeCun and Bengio, were considered academics in a dying field of study. As the funding freeze of the AI Winter set in and most other researchers set their sights on what were considered more promising areas of speciality, Hinton’s group pressed on regardless with their research into mathematical methodologies to improve neural networks.
Their continued research, however, was ultimately to pay off when, in 1997, a massive turning point came for AI, especially in the mind of the public. This was the year that IBM’s Deep Blue beat the world chess champion, Garry Kasparov, in a televised event that captured the imaginations of millions of onlookers worldwide. It was the first time that many people realised the potential of machines to mimic intelligence and this sparked mass interest in the field of artificial intelligence. The Deep Blue event, which attracted more than 70 million viewers, was also indicative of the progress hardware had made in significantly shortening processing times, allowing for speeds never seen before, with IBM’s chess-playing machine being able to evaluate a reported 200 million positions per second.
This excitement, combined with the progress made in raw computing power, led to many significant milestones for artificial intelligence in the coming years. Such successes included the driving of a completely autonomous vehicle for 131 miles – on a route previously unknown to the vehicle – in the DARPA challenge, won by a team from Stanford University in 2005. Then in 2011, in another highly publicised event, IBM’s Watson beat two former champions at the quiz show game of Jeopardy! by a significant margin, demonstrating that artificial intelligence had moved beyond simple brute force number-crunching and could now process written language within complex contexts. And possibly most impressively to date, in 2016 DeepMind’s AlphaGo, a system built on deep neural networks, beat the world champion Go player, Lee Sedol, at an ancient strategy game that has so many possible combinations that a computer cannot run through every possible board position by brute force, but instead has to learn how to improve its own gameplay in an autonomous fashion.
Importantly, these are just some of the most publicly exposed examples of artificial intelligence. Many of the more practical and industry-important applications go very much unnoticed by the common user, as the AI is often hidden in technologies and software that we use every day, such as our laptops and smartphones, as well as within the apps and social media platforms that consume so much of our attention. We simply need to think of recent developments in facial recognition that allow us to unlock our phones, or that recognise and tag our friends in our uploaded pictures, and it becomes apparent that we unwittingly use deeply complex artificial intelligence technologies almost every day. In fact, avoiding interaction with artificial intelligence has become near-impossible in today’s connected world, especially given the omnipresence of so-called Big Tech. These behemoth firms, namely Facebook, Apple, Google and Amazon, have so utterly pervaded our daily lives that trust has become a default setting for the users of their services. And in allowing them free access to our lives, we provide them with one of the key components to success in the development of even more powerful and intrusive AI capabilities – our data.
Whilst the artificial intelligence revolution promises to change the world in many positive ways, there is the risk that this exciting field may be wholly absorbed by Big Tech and subsequently used in any way that a handful of powerful executives wish. Unfortunately, as recent history has shown, these mega-corporations’ agendas are not aligned to the best interests of their users, but rather to the maximisation of profits at all costs. What is even more disturbing is that the individuals who once held the torch as AI purists – academics who had always looked at the bigger picture and wanted to use artificial intelligence to solve real pressing problems in the world – have now been lured into the research labs of Big Tech. This includes even the die-hard Canadian researchers who brought AI from the backrooms of academia to the forefront of modern technology, with Hinton working for Google and LeCun for Facebook. And whilst the third and youngest member of the Canadian Mafia, Yoshua Bengio, has managed to resist the extravagant salaries given to AI experts by Big Tech, it does seem as if he is fighting against the tide. One can only hope that such an important field of research will not continue to be overly dominated by a few large corporations, bringing to mind a Terminator-esque future controlled by the likes of an all-powerful Skynet. And in this sense, the story of artificial intelligence has just begun.
This article is from the Monocle Quarterly Journal, Deep Learning. Visit our "Journals" section to read the full issue.