How neural networks work, what deep neural networks look like, and why they require so much memory

This guide is intended for anyone who is interested in machine learning but doesn't know where to start. The content is aimed at a wide audience and stays fairly high-level, but that is no bad thing: the more people who become interested in machine learning, the better.

Object recognition using deep learning

You may have already seen this famous xkcd comic. The joke is that any 3-year-old can recognize a photo of a bird, but getting a computer to do it took the best computer scientists over 50 years. In the last few years, we've finally found a good approach to object recognition using deep convolutional neural networks. This sounds like a bunch of made-up words from a William Gibson science fiction novel, but it will make sense once we take them one by one. So let's do it - write a program that recognizes birds!

Let's start simple

Before we learn how to recognize pictures of birds, let's learn how to recognize something much simpler - the handwritten number "8".

There is a lot of talk and writing about artificial neural networks today, both in the context of big data and machine learning and outside it. In this article we will recall what the concept means, once again outline its scope of application, and discuss an important approach associated with neural networks: deep learning. We will describe its core idea as well as its advantages and disadvantages in specific use cases.

What is a neural network?

As you know, the concept of a neural network (NN) comes from biology and is a somewhat simplified model of the structure of the human brain. But let's not wander into the wilds of natural science: the easiest way is to picture a neuron (including an artificial one) as a kind of black box with many inputs and one output.

Mathematically, an artificial neuron transforms a vector of input signals (stimuli) X into an output vector Y using a function called the activation function. Within an artificial neural network (ANN), three types of neurons operate: input neurons (receiving information from the outside world, i.e. the values of the variables we are interested in), output neurons (returning the desired variables, for example predictions or control signals), and intermediate neurons that perform certain internal ("hidden") functions. A classical ANN therefore consists of three or more layers of neurons, and in the second and subsequent layers ("hidden" and output), each element is connected to every element of the previous layer.
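To make the black-box picture concrete, here is a minimal sketch in plain Python with NumPy (not tied to any particular framework) of a single artificial neuron: a weighted sum of the inputs passed through an activation function. The specific weights, bias and choice of sigmoid are illustrative assumptions.

```python
# One artificial neuron: y = f(w.x + b), where f is the activation function.
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # input signals X
w = np.array([0.8, 0.1, -0.4])   # connection weights (free parameters)
b = 0.2                          # bias term
print(neuron(x, w, b))           # output signal Y
```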

It is also important to remember the concept of feedback, which determines the type of ANN structure: feedforward networks (signals pass sequentially from the input layer through the hidden layers and reach the output layer) and recurrent networks, where the network contains connections going back from more distant neurons to nearer ones. All these concepts are the necessary minimum for moving to the next level of understanding ANNs: training a neural network, classifying the training methods, and understanding how each of them works.

Neural network training

We should not forget why such categories are used in the first place, otherwise there is a risk of getting bogged down in abstract mathematics. In practice, artificial neural networks are a class of methods for solving practical problems, chief among them pattern recognition, decision making, approximation and data compression, as well as the problems most interesting to us here: cluster analysis and forecasting.

Without going to the other extreme, and without diving into the details of how ANN methods work in each specific case, let us remind ourselves that in any circumstances it is the ability of a neural network to learn (with a teacher or "on its own") that is the key to using it for practical problems.

In the general case, training an ANN proceeds as follows:

  1. the input neurons receive variables ("stimuli") from the external environment;
  2. in accordance with the information received, the free parameters of the network change (this is where the intermediate layers of neurons do their work);
  3. as a result of these changes, the network "reacts" to information in a different way.

This is the general algorithm for training a neural network (think of Pavlov's dog - yes, that is exactly the internal mechanism behind the formation of a conditioned reflex - and then put the thought aside: our context, after all, deals with technical concepts and examples).

It is clear that a universal learning algorithm does not exist and most likely cannot exist; conceptually, approaches to learning are divided into supervised and unsupervised learning. The first assumes that for each input ("training") vector there is a required value of the output ("target") vector; together these two values form a training pair, and the set of all such pairs is the training set. In unsupervised learning, the training set consists only of input vectors, a situation that more closely resembles real life.
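As a hedged illustration of the supervised case, here is a minimal training loop in plain NumPy: each input vector is paired with a target, and the free parameters are adjusted by gradient descent so that the network "reacts" differently to the same inputs. The data, model and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # training inputs ("stimuli")
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                           # targets: each (input, target) is a training pair

w = np.zeros(3)                          # free parameters of the model
lr = 0.1                                 # learning rate
for epoch in range(200):
    pred = X @ w                         # current network response
    grad = 2 * X.T @ (pred - y) / len(X) # gradient of the mean squared error
    w -= lr * grad                       # adjust the free parameters

print(w)   # should approach true_w
```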

Deep learning

The concept of deep learning belongs to a different classification and denotes an approach to training so-called deep structures, which include multi-layer neural networks. A simple example from image recognition: the machine must be taught to identify increasingly abstract features in terms of other abstract features, that is, to determine the relationship between the expression of the whole face, the eyes and the mouth, and, ultimately, mathematically between clusters of colored pixels. Thus, in a deep neural network each level of features has its own layer; it is clear that training such a "colossus" requires both appropriate research experience and an adequate level of hardware. Conditions favorable to deep learning took shape only by 2006, and eight years later we can already speak of the revolution this approach has produced in machine learning.

So, first of all, in the context of our article, it is worth noting the following: deep learning in most cases is not supervised by a person; that is, this approach involves training a neural network without a teacher. This is the main advantage of the "deep" approach: supervised machine learning, especially for deep structures, requires enormous time and labor costs. Deep learning, on the other hand, is an approach that models (or at least attempts to approximate) human abstract thinking rather than relying on it.

The idea, as usual, is wonderful, but quite natural problems stand in the approach's way, rooted first of all in its claims to universality. In fact, while deep learning approaches have achieved significant success in image recognition, natural language processing still raises far more questions than answers. It is obvious that in the next n years it is unlikely that anyone will create an "artificial Leonardo da Vinci" or even - at the very least - an "artificial homo sapiens".

Nevertheless, AI researchers are already facing questions of ethics: the fears expressed in every self-respecting science fiction film, from "Terminator" to "Transformers", no longer seem funny (modern sophisticated neural networks can already be considered a plausible model of how an insect brain works!), but for now they are clearly premature.

The ideal technological future appears to us as an era when humans will be able to delegate most of their routine work to machines, or at least allow machines to take over a significant part of their intellectual labor. The concept of deep learning is one step towards this dream. The road ahead is long, but it is already clear that neural networks and the approaches developing around them are capable, in time, of realizing the aspirations of science fiction writers.

"(Manning Publications).

The article is intended for people who already have significant experience with deep learning (for example, those who have already read chapters 1-8 of this book). Assumes availability large quantities knowledge.

Deep Learning: Geometric View

The most amazing thing about deep learning is how simple it is. Ten years ago, no one could have imagined the amazing results we would achieve in machine perception problems using simple parametric models trained with gradient descent. Now it turns out that all we need is sufficiently large parametric models trained on a sufficiently large number of samples. As Feynman once said about the Universe: "It's not complicated, there's just a lot of it."

In deep learning, everything is a vector, that is, a point in a geometric space. The model's input data (text, images, and so on) and its targets are first "vectorized", that is, translated into an initial input vector space and a target output vector space. Each layer in a deep learning model performs one simple geometric transformation on the data passing through it. Together, the chain of layers forms one very complex geometric transformation, broken down into a series of simple ones. This complex transformation attempts to map the input data space onto the target space, point by point. The transformation's parameters are the layer weights, which are continually updated based on how well the model is currently performing. The key property of this geometric transformation is that it must be differentiable, so that we can learn its parameters via gradient descent. Intuitively, this means the geometric morphing must be smooth and continuous, an important constraint.
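As a hedged sketch of the "chain of simple geometric transformations" idea, here is a small stack of layers defined with Keras (assuming TensorFlow is installed; the layer sizes and input dimension are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                      # points in the input vector space
    tf.keras.layers.Dense(64, activation="relu"),      # one simple geometric transformation
    tf.keras.layers.Dense(64, activation="relu"),      # another one, chained on top
    tf.keras.layers.Dense(10, activation="softmax"),   # map into the target space
])

# The layer weights parameterize the overall transformation; compiling with a loss
# and an optimizer sets up gradient descent to adjust them, which is why every
# layer must be differentiable.
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```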

The entire process of applying this complex geometric transformation to the input data can be visualized in 3D by imagining a person trying to flatten a crumpled paper ball: the crumpled ball is the manifold of input data the model starts with. Each movement of the person's hands is like one simple geometric transformation performed by a single layer. The complete sequence of un-crumpling gestures is the complex transformation of the whole model. Deep learning models are mathematical machines for untangling intricate manifolds of high-dimensional data.

That's the magic of deep learning: turning meaning into vectors, into geometric spaces, and then gradually learning complex geometric transformations that map one space onto another. All that is needed is a space of sufficiently high dimension to capture the full range of relationships found in the original data.

Limitations of Deep Learning

The range of problems that can be solved using this simple strategy is almost endless. And yet many of them are still beyond the reach of current deep learning techniques, even with huge amounts of manually annotated data available. Suppose, for example, that you could collect a dataset of hundreds of thousands, even millions, of English-language descriptions of software features written by product managers, together with the corresponding source code developed by teams of engineers to meet those requirements. Even with this data, you could not train a deep learning model to simply read a product description and generate the corresponding codebase. This is just one of many examples. In general, anything that requires reasoning - like programming or applying the scientific method, long-term planning, algorithmic-style data manipulation - is beyond the capabilities of deep learning models, no matter how much data you throw at them. Even training a neural network to perform a sorting algorithm is an incredibly difficult task.

The reason is that a deep learning model is "only" a chain of simple, continuous geometric transformations mapping one vector space onto another. All it can do is map one set of data X onto another set Y, provided there exists a learnable continuous transformation from X to Y and a dense set of X:Y samples is available as training data. So while a deep learning model can be considered a kind of program, most programs cannot be expressed as deep learning models: for most problems, either there is no deep neural network of practical size that solves the problem, or if one exists, it may not be learnable - the corresponding geometric transformation may be too complex, or there may be no suitable data to train it.

Scaling up existing deep learning techniques—adding more layers and using more training data—can only superficially mitigate some of these problems. It will not solve the more fundamental problem that deep learning models are very limited in what they can represent, and that most programs cannot be expressed as a continuous geometric morphing of data manifolds.

The Risk of Anthropomorphizing Machine Learning Models

One of the very real risks of modern AI is misinterpreting how deep learning models work and exaggerating their capabilities. A fundamental feature of the human mind is our "theory of mind", the tendency to project goals, beliefs and knowledge onto the things around us. A drawing of a smiling face on a stone suddenly makes the stone "happy" in our minds. Applied to deep learning, this means that if we can more or less successfully train a model to generate text descriptions of pictures, we tend to think the model "understands" the content of the images as well as the descriptions it generates. We are then greatly surprised when a small deviation from the kinds of images present in the training data causes the model to generate completely absurd descriptions.

This is most evident in "adversarial examples": samples of a deep learning network's input data specifically crafted to be misclassified. You already know that you can perform gradient ascent in the input space to generate samples that maximize the activation of, say, a particular convolutional filter - this is the basis of the visualization technique covered in Chapter 5 (note: of the book "Deep Learning with Python"), as well as the Deep Dream algorithm from Chapter 8. In a similar way, through gradient ascent you can slightly modify an image to maximize the predicted probability of a given class. If we take a photo of a panda and add a "gibbon" gradient, we can force the neural network to classify that panda as a gibbon. This demonstrates both the fragility of these models and the profound difference between the input-to-output mapping they perform and our own human perception.
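Below is a hedged sketch of this kind of gradient ascent on the input, written with TensorFlow. It is not the exact procedure from the book: the pretrained model, the random stand-in image, the step size and the target-class index are all illustrative assumptions.

```python
import tensorflow as tf

# Assumption: a pretrained ImageNet classifier is available for download.
model = tf.keras.applications.MobileNetV2(weights="imagenet")
image = tf.random.uniform((1, 224, 224, 3))   # stand-in for a real photo of a panda
target_class = 368                            # hypothetical index of the target class ("gibbon")

x = tf.Variable(image)
for _ in range(10):
    with tf.GradientTape() as tape:
        preds = model(x, training=False)
        score = preds[0, target_class]        # predicted probability of the target class
    grad = tape.gradient(score, x)            # gradient ascent direction in input space
    x.assign_add(0.01 * tf.sign(grad))        # small step that pushes the prediction toward the target
    x.assign(tf.clip_by_value(x, 0.0, 1.0))   # keep the image in a valid range
```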

In general, deep learning models have no understanding of their input data, at least not in any human sense. Our own understanding of images, sounds and language is grounded in our sensorimotor experience as humans, as embodied, earthly beings. Machine learning models have no access to such experience and therefore cannot "understand" their inputs in any human-like way. By annotating large numbers of examples for our models to train on, we force them to learn a geometric transformation that reduces the data to human concepts for that specific set of examples, but this transformation is only a simplified sketch of the original model in our minds, developed from our experience as embodied agents - like a faint reflection in a mirror.

As a machine learning practitioner, always keep this in mind, and never fall into the trap of believing that neural networks understand the task they are performing. They don't, at least not in a way that would make sense to us. They were trained on a different, much narrower task than the one we had in mind: simply mapping training inputs to training targets, point by point. Show them anything that deviates from the training data and they will break in the most absurd ways.

Local generalization versus extreme generalization

There seem to be fundamental differences between the direct geometric morphing from input to output that deep learning models perform and the way humans think and learn. It is not just that humans learn from their bodily experience rather than by processing a set of training samples. Beyond the differences in learning processes, there are fundamental differences in the nature of the underlying representations.

Humans are capable of much more than mapping an immediate stimulus to an immediate response, as a neural network, or perhaps an insect, does. People maintain complex, abstract models of their current situation, of themselves and of other people, and can use these models to predict different possible futures and perform long-term planning. They can combine known concepts to imagine things they have never experienced - drawing a horse in jeans, for example, or picturing what they would do if they won the lottery. This ability to think hypothetically, to extend our mental model space far beyond what we have directly experienced - that is, to perform abstraction and reasoning - is arguably the defining characteristic of human cognition. I call it "extreme generalization": the ability to adapt to new, never-before-experienced situations using little or no data.

This is in stark contrast to what deep learning networks do, which I would call "local generalization": the mapping from inputs to outputs quickly stops making sense if the new inputs differ even slightly from what the network saw during training. Consider, for example, the problem of learning the appropriate launch parameters for a rocket that should land on the Moon. If you used a neural network for this task, trained with supervised learning or reinforcement learning, you would need to feed it thousands or even millions of launch trajectories; that is, you would need a dense sampling of the input space in order to learn a reliable mapping from the input space to the output space. In contrast, humans can use the power of abstraction to build physical models - rocket science - and derive an exact solution that will take the rocket to the Moon in just a few attempts. In the same way, if you developed a neural network to control a human body and wanted it to learn to walk safely through a city without being hit by a car, the network would have to die many thousands of times in various situations before concluding that cars are dangerous and developing appropriate avoidance behavior. Moved to a new city, the network would have to relearn most of what it knew. Humans, on the other hand, are able to learn safe behavior without ever dying - again, thanks to the power of abstract modeling of hypothetical situations.

So, despite our progress in machine perception, we are still very far from human-level AI: our models can only perform local generalization, adapting to new situations that must be very close to past data, while the human mind is capable of extreme generalization, quickly adapting to completely new situations or planning far into the future.

Conclusions

Here's what you need to remember: the only real success of deep learning so far is the ability to map space X to space Y using a continuous geometric transformation, given large amounts of human-annotated data. Doing this well is a revolutionary advance for an entire industry, but human-level AI is still a long way off.

To remove some of these limitations and begin to compete with the human brain, we need to move away from direct input-to-output mapping and towards reasoning and abstraction. Computer programs may be a suitable substrate for abstractly modeling various situations and concepts. We have said before (note: in "Deep Learning with Python") that machine learning models can be defined as "programs that learn"; at the moment we can only train a narrow and specific subset of all possible programs. But what if we could learn any program, in a modular and iterative way? Let's see how we might get there.

The Future of Deep Learning

Given what we know about deep learning networks, their limitations, and the current state of research, can we predict what will happen in the medium term? Here are some of my personal thoughts on the matter. Keep in mind that I don't have a crystal ball for predictions, so much of what I expect may not come to fruition. This is complete speculation. I share these predictions not because I expect them to be fully realized in the future, but because they are interesting and applicable to the present.

At a high level, here are the main areas that I consider promising:

  • Models will come closer to general-purpose computer programs built on top of much richer primitives than our current differentiable layers; this is how we will get reasoning and abstraction, whose absence is a fundamental weakness of current models.
  • New forms of learning will emerge that make this possible, allowing models to move beyond merely differentiable transformations.
  • Models will require less developer input; it shouldn't be your job to endlessly turn knobs.
  • There will be greater, more systematic reuse of previously learned features and architectures: meta-learning systems built on reusable, modular routines.
Additionally, note that these considerations are not specific to supervised learning, which is still the bread and butter of machine learning - they apply to any form of machine learning, including unsupervised, self-supervised, and reinforcement learning. It does not fundamentally matter where your labels come from or what your training loop looks like; these different branches of machine learning are simply different facets of the same construct.

So, let's dive in.

Models as programs

As we noted earlier, a necessary transformational development to expect in machine learning is a move away from models that perform pure pattern recognition and are capable only of local generalization, towards models capable of abstraction and reasoning that can achieve extreme generalization. Today's AI programs with basic reasoning abilities are all hard-coded by human programmers: for example, programs that rely on search algorithms, graph manipulation, or formal logic. In DeepMind's AlphaGo, for example, much of the "intelligence" on display is designed and hard-coded by expert programmers (such as the Monte Carlo tree search); learning from new data happens only in specialized submodules - the value networks and policy networks. But in the future, such AI systems could be trained entirely without human involvement.

How do we get there? Consider a well-known type of network: the RNN. Importantly, RNNs have slightly fewer limitations than feedforward networks. This is because RNNs are a little more than mere geometric transformations: they are geometric transformations applied repeatedly inside a for loop. The temporal for loop itself is hard-coded by the developer: it is a built-in assumption of the network. Naturally, RNNs are still very limited in what they can represent, mainly because each step they perform is still a differentiable geometric transformation, and because they carry information from step to step through points in a continuous geometric space (state vectors). Now imagine neural networks "augmented" with programming primitives in a similar way - not just a single hard-coded for loop with hard-coded geometric memory, but a large set of programming primitives that the model could freely access to expand its processing capabilities: if branches, while statements, variable creation, disk storage for long-term memory, sorting operators, advanced data structures like lists, graphs and hash tables, and much more. The space of programs such a network could represent would be far broader than what existing deep learning models can express, and some of these programs could achieve superior generalization power.
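A minimal sketch of the point that an RNN is a geometric transformation applied repeatedly inside a for loop (plain NumPy; the dimensions and random weights are illustrative assumptions):

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """inputs: array of shape (timesteps, input_dim). Returns the final state vector."""
    state = np.zeros(W_h.shape[0])                        # a point in a continuous geometric space
    for x_t in inputs:                                    # the hard-coded temporal for loop
        state = np.tanh(W_x @ x_t + W_h @ state + b)      # one differentiable transformation, reused each step
    return state

timesteps, input_dim, state_dim = 5, 3, 4
rng = np.random.default_rng(1)
final_state = simple_rnn(
    rng.normal(size=(timesteps, input_dim)),
    rng.normal(size=(state_dim, input_dim)),
    rng.normal(size=(state_dim, state_dim)),
    np.zeros(state_dim),
)
print(final_state)
```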

In short, we will move away from having, on the one hand, "hard-coded algorithmic intelligence" (hand-written software) and, on the other, "learned geometric intelligence" (deep learning). Instead, we will end up with a blend of formal algorithmic modules providing reasoning and abstraction capabilities, and geometric modules providing informal intuition and pattern-recognition capabilities. The whole system will be trained with little or no human involvement.

A related area of AI that I think may soon make big strides is program synthesis, in particular neural program synthesis. Program synthesis consists of automatically generating simple programs using a search algorithm (perhaps genetic search, as in genetic programming) to explore a large space of possible programs. The search stops when a program is found that meets the required specification, often provided as a set of input-output pairs. As you can see, this is very similar to machine learning: "training data" is provided as input-output pairs, and we find a "program" that maps inputs to outputs and is capable of generalizing to new inputs. The difference is that instead of learning parameter values in a hard-coded program (a neural network), we generate source code through a discrete search process.
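To make the idea of discrete search over programs concrete, here is a toy sketch that enumerates short programs built from a handful of primitives until one matches all input-output pairs. The primitive set and the example specification are illustrative assumptions, not a real synthesis system.

```python
from itertools import product

PRIMITIVES = {
    "add1": lambda v: v + 1,
    "double": lambda v: v * 2,
    "square": lambda v: v * v,
}

def synthesize(examples, max_length=3):
    """examples: list of (input, output) pairs. Returns a program matching the spec, or None."""
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):    # discrete search over program space
            def run(v, prog=program):
                for op in prog:
                    v = PRIMITIVES[op](v)
                return v
            if all(run(x) == y for x, y in examples):         # program meets the specification
                return program
    return None

# "Training data" given as input-output pairs, just like in machine learning.
print(synthesize([(1, 4), (2, 9), (3, 16)]))   # finds ('add1', 'square')
```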

I definitely expect a renewed wave of interest in this area over the next few years. In particular, I expect cross-fertilization between the related fields of deep learning and program synthesis, where we will not just generate programs in general-purpose languages, but generate neural networks (geometric data-processing flows) augmented with a rich set of algorithmic primitives, such as for loops - and many others. This should be much more convenient and useful than generating source code directly, and it will dramatically widen the range of problems that can be solved with machine learning - the space of programs we can generate automatically given appropriate training data. A blend of symbolic AI and geometric AI. Modern RNNs can be seen as the historical ancestor of such hybrid algorithmic-geometric models.


Figure: The learned program relies simultaneously on geometric primitives (pattern recognition, intuition) and algorithmic primitives (reasoning, search, memory).

Beyond backpropagation and differentiable layers

If machine learning models become more like programs, they will mostly no longer be differentiable - these programs will certainly still use continuous geometric layers as subroutines, and those will remain differentiable, but the model as a whole will not be. As a result, using backpropagation to adjust the weights of a fixed, hard-coded network may not remain the preferred training method in the future - at least, it cannot be the whole story. We need to figure out how to train non-differentiable systems efficiently. Current approaches include genetic algorithms, "evolution strategies", certain reinforcement learning methods, and ADMM (the alternating direction method of multipliers). Naturally, gradient descent is not going anywhere - gradient information will always be useful for optimizing differentiable parametric functions. But our models will certainly become more ambitious than mere differentiable parametric functions, and so their automated development (the "learning" in "machine learning") will require more than backpropagation.
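As a hedged illustration of gradient-free training, here is a minimal evolution-strategies-style update in NumPy: the objective is treated as a black box, and a search direction is estimated from random parameter perturbations instead of backpropagated gradients. The objective, noise scale and step size are illustrative assumptions.

```python
import numpy as np

def objective(theta):
    """A black-box reward: we only query its value, never its gradient."""
    return -float(np.sum(np.abs(theta - 3.0)))

rng = np.random.default_rng(0)
theta = np.zeros(5)                       # parameters of the model / "program"
sigma, lr, population = 0.1, 0.02, 50     # noise scale, step size, population size

for step in range(300):
    noise = rng.normal(size=(population, theta.size))
    rewards = np.array([objective(theta + sigma * n) for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize rewards
    theta += lr / (population * sigma) * noise.T @ rewards          # ES parameter update

print(theta)   # should end up close to 3.0 in every coordinate
```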

Additionally, backpropagation is end-to-end, which is well suited to learning good chained transformations but is computationally inefficient because it does not fully exploit the modularity of deep networks. To make anything more efficient there is one universal recipe: introduce modularity and hierarchy. So we can make backpropagation itself more efficient by introducing decoupled training modules with some synchronization mechanism between them, organized hierarchically. This strategy is partly reflected in DeepMind's recent work on "synthetic gradients". I expect much, much more work in this direction in the near future.

One can imagine a future in which globally non-differentiable models (with differentiable parts) are trained - grown - using an efficient search process that does not use gradients, while the differentiable parts are trained even faster using gradients and some more efficient version of backpropagation.

Automated Machine Learning

In the future, model architectures will be created by learning rather than written by hand by engineers, and the learned architectures will naturally be paired with richer sets of primitives and with program-like machine learning models.

Nowadays, a deep learning developer spends most of their time munging data with Python scripts and then tuning the architecture and hyperparameters of a deep network at length to get a working model - or even an outstanding model, if the developer is that ambitious. Needless to say, this is far from ideal. But AI can help here too. Unfortunately, the data processing and preparation part is hard to automate, because it often requires domain knowledge as well as a clear, high-level understanding of what the developer wants to achieve. Hyperparameter tuning, however, is a simple search procedure, and in this case we already know what the developer wants to achieve: it is defined by the loss function of the network being tuned. It has already become common practice to set up basic AutoML systems that take care of most of the model-tuning knobs. I set one up myself to win a Kaggle competition.

At the most basic level, such a system would simply tune the number of layers in the stack, their order, and the number of units or filters in each layer. This is usually done with libraries like Hyperopt, which we discussed in Chapter 7 (note: of the book "Deep Learning with Python"). But you can go much further and try to learn the appropriate architecture from scratch, with a minimal set of restrictions. This is possible using reinforcement learning, for example, or genetic algorithms.
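Here is a hedged sketch of hyperparameter tuning as a search procedure using the Hyperopt library mentioned above. The objective function is a stand-in: in a real setup it would build and train a network with the given settings and return its validation loss. The search space and the toy scoring rule are illustrative assumptions.

```python
from hyperopt import fmin, tpe, hp

# Search space: number of layers, units per layer, and learning rate.
space = {
    "layers": hp.choice("layers", [1, 2, 3]),
    "units": hp.choice("units", [32, 64, 128]),
    "lr": hp.loguniform("lr", -7, -2),
}

def objective(params):
    # Placeholder score standing in for "train a model and return validation loss".
    return params["lr"] * 100 / params["units"] / params["layers"]

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)   # indices/values of the best configuration found by the search
```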

Another important direction for AutoML is learning the model architecture jointly with the model weights. Training a model from scratch every time we try a slightly different architecture is extremely inefficient, so a truly powerful AutoML system would evolve architectures while the model's weights are being tuned via backpropagation on the training data, eliminating the redundant computation. As I write these lines, such approaches have already begun to appear.

When all this starts to happen, machine learning developers will not be left without work - they will move higher up the value chain. They will begin to put much more effort into crafting loss functions that genuinely reflect business goals, and into deeply understanding how their models affect the digital ecosystems in which they operate (for example, the customers who consume the model's predictions and generate its training data) - questions that, for now, only the largest companies can afford to consider.

Lifelong learning and reuse of modular routines

If models become more complex and are built on richer algorithmic primitives, this increased complexity will demand more intensive reuse across tasks, rather than training a model from scratch every time we have a new task or a new dataset. After all, many datasets do not contain enough information to develop a new complex model from scratch, and it will become necessary to draw on information from previously seen datasets. You don't relearn English every time you open a new book - that would be impossible. Moreover, training models from scratch on every new problem is very inefficient because of the significant overlap between the current problems and those encountered before.

In addition, a remarkable observation has been made repeatedly in recent years: training the same model on several weakly related tasks improves its results on each of them. For example, training the same neural network to translate from English to German and from French to Italian yields a model that is better at each of these language pairs. Training an image classification model jointly with an image segmentation model, sharing a single convolutional base, yields a model that is better at both tasks. And so on. This is quite intuitive: there is always some information that overlaps between these seemingly different tasks, so the joint model has access to more information about each individual task than a model trained on that task alone.
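A hedged sketch of the shared-base, multi-task idea using the Keras functional API (assuming TensorFlow is installed; the layer sizes, input shape and task heads are illustrative assumptions, not a reference architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 3))

# Shared convolutional base: information learned here benefits both tasks.
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)

# Task 1: image classification head.
cls = layers.GlobalAveragePooling2D()(x)
cls = layers.Dense(10, activation="softmax", name="classification")(cls)

# Task 2: per-pixel segmentation head, built on the same base.
seg = layers.Conv2D(1, 1, activation="sigmoid", name="segmentation")(x)

model = tf.keras.Model(inputs, [cls, seg])
model.compile(
    optimizer="adam",
    loss={"classification": "categorical_crossentropy",
          "segmentation": "binary_crossentropy"},
)
model.summary()
```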

What we actually do when we reuse a model on different tasks is use pre-trained weights for models that perform common functions, like visual feature extraction. You saw this in practice in Chapter 5. I expect that a more general version of this technique will be commonly used in the future: we will not only use previously learned features (submodel weights), but also model architectures and training procedures. As models become more program-like, we will begin to reuse subroutines, like functions and classes in regular programming languages.

Think about what the software development process looks like today: once an engineer solves a certain problem (HTTP requests in Python, for example), they package it up as an abstract, reusable library. Engineers who face a similar problem later simply search for existing libraries, download them, and use them in their own projects. Likewise, in the future, meta-learning systems will be able to assemble new programs by sifting through a global library of high-level reusable blocks. If the system finds itself developing similar routines for several different tasks, it will produce an "abstract" reusable version of the routine and store it in the global library. This process opens the door to abstraction, a necessary component for achieving "extreme generalization": a routine that proves useful across many tasks and domains can be said to "abstract" some aspect of problem solving. This definition of "abstraction" is close to the notion of abstraction in software engineering. These routines may be either geometric (deep learning modules with pre-trained representations) or algorithmic (closer to the libraries modern programmers work with).

Figure: A meta-learning system that can quickly develop task-specific models using reusable primitives (algorithmic and geometric), thereby achieving "extreme generalization".

The bottom line: a long-term vision

In short, here is my long-term vision for machine learning:
  • Models will become more like programs and will have capabilities that extend far beyond the continuous geometric transformations of source data that we work with now. Perhaps these programs will be much closer to the abstract mental models that people hold about their environment and themselves, and they will be capable of stronger generalization due to their algorithmic nature.
  • In particular, the models will mix algorithmic modules with formal reasoning, search, abstraction abilities - and geometric modules with informal intuition and pattern recognition. AlphaGo (a system that required intensive manual programming and architecture) provides an early example of what the merging of symbolic and geometric AI might look like.
  • They will grow automatically (rather than being written by hand by human programmers), using modular parts from a global library of reusable routines - a library that has evolved by assimilating high-performing models from thousands of previous problems and datasets. Once the meta-learning system has identified common problem-solving patterns, they are converted into reusable routines - much like functions and classes in modern programming - and added to the global library. This is how the ability to perform abstraction is achieved.
  • The global library and the associated model-growing system will be capable of some form of human-like "extreme generalization": faced with a new task or a new situation, the system will be able to assemble a working model for it using very little data, thanks to (1) rich program-like primitives that generalize well and (2) extensive experience with similar tasks - in the same way that people can quickly learn a complex new video game because they have experience with many other games, and because the models built from that experience are abstract and program-like rather than simple mappings from stimulus to action.
  • Essentially, this continuously learning, model-growing system can be interpreted as Strong Artificial Intelligence. But don't expect some singular robot apocalypse: that is pure fantasy, born of a long list of deep misunderstandings of both intelligence and technology. Such a critique, however, has no place here.

Today, graphs are one of the most natural ways to describe the models created by a machine learning system. These computational graphs are made up of neuron vertices connected by synapse edges that describe the connections between the vertices.

Unlike a scalar CPU or a vector GPU, the IPU - a new type of processor designed for machine learning - is built to work with such graphs. A computer designed to manipulate graphs is an ideal machine for computing the graph models created by machine learning.

One of the easiest ways to describe how machine intelligence works is to visualize it. The Graphcore development team has created a collection of such images of graphs mapped onto the IPU. The work is based on the Poplar software, which visualizes the operation of artificial intelligence. The company's researchers have also looked into why deep networks require so much memory, and what can be done about it.

Poplar includes a graph compiler built from the ground up to translate standard machine learning operations into highly optimized IPU application code. It allows these graphs to be assembled together on the same principle by which POPNN libraries are assembled; the library contains a set of different vertex types for generalized primitives.

Graphs are the paradigm on which all of the Poplar software is based. In Poplar, graphs let you define a computational process in which vertices perform operations and edges describe the relationships between them. For example, if you want to add two numbers, you can define a vertex with two inputs (the numbers you want to add), some computation (a function that adds two numbers), and an output (the result).
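To illustrate the vertex-and-edge idea in general terms, here is a small sketch in plain Python. It is not the Poplar API: the class and structure are hypothetical, intended only to show a vertex with inputs, a piece of computation, and an output, with edges as connections between vertices.

```python
class Vertex:
    def __init__(self, fn, inputs):
        self.fn = fn              # the computation this vertex performs (its "codelet")
        self.inputs = inputs      # incoming edges: constants or other vertices

    def evaluate(self):
        values = [v.evaluate() if isinstance(v, Vertex) else v for v in self.inputs]
        return self.fn(*values)

# A vertex with two inputs and an "add" computation, as in the example above.
add = Vertex(lambda a, b: a + b, [2, 3])
scale = Vertex(lambda a: a * 10, [add])   # an edge from `add` to `scale`
print(scale.evaluate())                   # 50
```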

Typically, the operations performed by vertices are much more complex than in the example above. They are often defined by small programs called codelets. The graph abstraction is attractive because it makes no assumptions about the structure of the computation and breaks the computation down into components that the IPU can use to operate.

Poplar uses this simple abstraction to build very large graphs, which are represented as images. Generating the graph in software means we can tailor it to the specific calculations needed, ensuring the most efficient use of IPU resources.

The compiler translates the standard operations used in machine learning systems into highly optimized IPU application code. The graph compiler creates an intermediate image of the computational graph, which is deployed to one or more IPU devices. The compiler can display this computational graph, so an application written at the neural-network-framework level produces an image of the computational graph that runs on the IPU.


Graph of the full AlexNet training cycle in forward and backward directions

The Poplar graph compiler turned the AlexNet description into a computational graph of 18.7 million vertices and 115.8 million edges. The clearly visible clustering is the result of strong communication between processes within each layer of the network, with lighter communication between layers.

Another example is a simple fully connected network trained on MNIST, a simple computer vision dataset and a kind of "Hello, world" of machine learning. A simple network for this dataset helps in understanding the graphs driven by Poplar applications. By integrating its graph libraries with frameworks such as TensorFlow, the company provides one of the simplest ways to use IPUs in machine learning applications.

After the graph has been constructed using the compiler, it needs to be executed. This is possible using the Graph Engine. The example of ResNet-50 demonstrates its operation.


ResNet-50 graph

The ResNet-50 architecture allows deep networks to be built from repeating sections. The processor only has to define these sections once and can then call them repeatedly. For example, the conv4 level cluster is executed six times but mapped onto the graph only once. The image also shows the variety of shapes of the convolutional layers, since each one has a graph built according to the natural form of its computation.

The engine creates and manages the execution of a machine learning model using a graph generated by the compiler. Once deployed, the Graph Engine monitors and responds to the IPUs, or devices, used by applications.

The ResNet-50 image shows the entire model. At this level it is difficult to identify connections between individual vertices, so it is worth looking at enlarged images. Below are some examples of sections within neural network layers.

Why do deep networks need so much memory?

A large memory footprint is one of the biggest challenges of deep neural networks. Researchers are trying to work around the limited bandwidth of the DRAM devices that modern systems must use to store the huge numbers of weights and activations in a deep neural network.

These architectures were designed around processor chips intended for sequential processing and DRAM optimized for high-density memory. The interface between the two is a bottleneck that limits bandwidth and adds significant overhead in power consumption.

Although we do not yet have a complete understanding of the human brain and how it works, it is generally understood that there is not a large separate memory store. The function of long-term and short-term memory in the human brain is believed to be embedded in the structure of neurons + synapses. Even simple organisms like worms, with a neural brain structure of just over 300 neurons, have some memory function.

Building memory into conventional processors is one way around the memory bottleneck, unlocking enormous bandwidth at much lower power consumption. However, on-chip memory is expensive, and it does not scale to the truly large amounts of memory attached to the CPUs and GPUs currently used to train and deploy deep neural networks.

So it is useful to look at how memory is used today in CPU- and GPU-based deep learning systems and ask: why do they need such large memory storage when the human brain works just fine without it?

Neural networks need memory to store input data, weights, and activations as the input propagates through the network. During training, the activations from the forward pass must be kept until they can be used to compute the error gradients on the backward pass.

For example, a 50-layer ResNet has about 26 million weight parameters and computes about 16 million forward activations. If you use a 32-bit float to store each weight and activation, this requires about 168 MB of space. By using lower precision to store these weights and activations, we could halve or even quarter this storage requirement.
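The arithmetic behind these figures can be reproduced directly (the weight and activation counts are taken from the paragraph above; the half-precision line is a sketch of the "lower precision" point):

```python
weights = 26e6          # ~26 million weight parameters in ResNet-50
activations = 16e6      # ~16 million forward activations
bytes_fp32 = 4          # 32-bit float
bytes_fp16 = 2          # 16-bit (half-precision) float

full = (weights + activations) * bytes_fp32 / 1e6
half = (weights + activations) * bytes_fp16 / 1e6
print(f"fp32: ~{full:.0f} MB, fp16: ~{half:.0f} MB")   # ~168 MB vs ~84 MB
```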

A major memory problem arises from the fact that GPUs rely on data laid out as dense vectors, so they can use single instruction, multiple data (SIMD) execution to achieve high compute density. CPUs use similar vector units for high-performance computing.

GPUs have a vector width of 1,024 bits, so when using 32-bit floating-point data they often split the work into a parallel mini-batch of 32 samples to fill those 1,024-bit vectors. This approach to vector parallelism increases the number of stored activations by a factor of 32, and the local storage requirement to more than 2 GB.

GPUs and other machines designed for matrix algebra are also subject to memory load from the weights and activations of the neural network. GPUs cannot efficiently perform the small convolutions used in deep neural networks directly, so a transformation known as "lowering" is used to convert these convolutions into matrix-matrix multiplications (GEMMs), which GPUs can handle efficiently.

Additional memory is also needed to store input data, temporary values, and program instructions. Measuring memory usage while training ResNet-50 on a high-performance GPU showed that it requires more than 7.5 GB of local DRAM.

Some might think that lower computational precision would reduce the amount of memory required, but that is not the case. By switching the weights and activations to half precision, you only fill half the SIMD vector width, wasting half of the available compute resources. To compensate, when you switch from full precision to half precision on a GPU, you then have to double the mini-batch size to force enough data parallelism to use all the available compute. So moving to lower-precision weights and activations on a GPU still requires more than 7.5 GB of dynamic random-access memory.

With so much data to store, it is simply impossible to fit it all on the GPU. Each convolutional layer has to save its state out to external DRAM, the next layer of the network has to be loaded, and then the data has to be brought back in. As a result, the already bandwidth- and latency-limited external memory interface suffers the additional burden of constantly reloading weights and saving and retrieving activations. This significantly slows down training and significantly increases power consumption.

There are several ways to attack this problem. First, operations such as activation functions can be performed "in place", allowing the input data to be overwritten directly by the output, so existing memory is reused. Second, opportunities for memory reuse can be found by analyzing the data dependencies between operations in the network and assigning the same memory to operations that do not need it at the same time.

The second approach is especially effective when the entire neural network can be analyzed at compile time to create a fixed memory allocation, since memory-management overhead then drops to almost zero. It turns out that combining these methods can reduce a neural network's memory use by a factor of two to three.
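A minimal sketch of these two ideas with NumPy (the array sizes and the "layers" are illustrative assumptions): (1) an activation applied in place, so the input buffer is overwritten with the output, and (2) one scratch buffer reused by operations whose temporaries are never needed at the same time.

```python
import numpy as np

x = np.random.randn(1024, 1024).astype(np.float32)

# (1) In-place ReLU: no second buffer is allocated for the result.
np.maximum(x, 0.0, out=x)

# (2) Buffer reuse: "layer A"'s temporary is dead by the time "layer B" runs,
# so both can share the same pre-allocated scratch memory.
scratch = np.empty_like(x)
np.multiply(x, 2.0, out=scratch)     # layer A writes its temporary here
np.add(scratch, 1.0, out=x)          # consume it ...
np.square(x, out=scratch)            # ... then layer B reuses the same buffer
```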
A third significant approach was recently demonstrated by the Baidu Deep Speech team. They applied various memory-saving techniques to achieve a 16-fold reduction in the memory consumed by activations, allowing them to train networks with 100 layers where, previously, the same amount of memory allowed networks with only nine layers.

Combining memory and processing resources into a single device has significant potential to improve the performance and efficiency of convolutional neural networks, as well as other forms of machine learning. Trade-offs can be made between memory and compute resources to achieve a balance of features and performance in the system.

Neural networks, and the knowledge models of other machine learning methods, can be thought of as mathematical graphs. These graphs contain an enormous amount of parallelism. A parallel processor designed to exploit the parallelism in graphs does not have to rely on mini-batches and can dramatically reduce the amount of local storage required.

Current research results have shown that all these methods can significantly improve the performance of neural networks. Modern GPUs and CPUs have very limited on-chip memory, just a few megabytes in total. New processor architectures designed specifically for machine learning balance memory and compute on the chip, delivering significant performance and efficiency improvements over today's CPUs and GPUs.