Neural networks and deep learning. What is deep learning and why is everyone talking about it?

More than 20 years have passed since the term “deep learning” was coined, but people started talking about it only recently. We briefly explain why this happened, what deep learning is, how it differs from machine learning, and why you need to know about it.

  • What is it?

    Deep learning is a branch of machine learning that uses a model inspired by how the brain works - how neurons interact.

    The term itself appeared in the 1980s, but until 2012 there was not enough computing power to implement the technology, and almost no one paid attention to it. After a series of articles by well-known scientists and publications in scientific journals, the technology quickly became popular and attracted the attention of major media - The New York Times was the first world outlet to write about it. One occasion for that article was the scientific work of University of Toronto researchers Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. They described and analyzed the results of the ImageNet image recognition competition, where their neural network trained with deep learning won by a wide margin - the system correctly identified 85% of the objects. Since then, only deep neural networks have won the competition.

  • Wait, what is machine learning?

    This is a subfield of artificial intelligence, a term describing methods for constructing algorithms that learn from experience rather than from an explicitly written program. That is, a person does not need to explain to the machine how to solve a problem; it finds the answer itself from the data it is given. For example, if we want an algorithm to identify faces, we show it ten thousand different faces, mark where exactly the face is, and the program then learns to identify faces on its own.

    A machine can learn with a teacher (supervised learning), where a human labels the correct answers for it, or without one. Results are usually better when learning with a teacher. Each time the system processes more data, it becomes more accurate.
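    A minimal sketch of learning "with a teacher" (supervised learning), assuming scikit-learn is available; the bundled digits dataset stands in for the ten thousand labeled faces:

```python
# A minimal supervised-learning sketch: labeled examples in, a trained model out.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)            # images (as pixel vectors) and their labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)      # a simple learner, not a deep network
model.fit(X_train, y_train)                    # the "teacher" supplies the correct answers
print("accuracy on unseen examples:", model.score(X_test, y_test))
```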

  • How is deep learning different from machine learning?

    It imitates abstract human thinking and is able to generalize. For example, a neural network trained with conventional machine learning does not recognize handwritten letters well: to keep it from getting confused by different writing styles, every variant has to be fed into it.

    Deep learning works with multilayer artificial neural networks and can cope with this task.

    “There are three terms that are often used almost interchangeably lately: artificial intelligence, machine learning and deep learning. However, these are actually “nested” terms: artificial intelligence is anything that can help a computer perform human tasks; machine learning is a branch of AI in which programs do not just solve problems, but learn based on the experience they have, and deep learning is a branch of machine learning that studies deep neural networks.

    Simply put: (1) if you wrote a program that plays chess, that is artificial intelligence; (2) if it learns from grandmaster games or by playing against itself, that is machine learning; (3) and if what does the learning is a deep neural network, that is deep learning.”

  • How does deep learning work?

    Let's take a simple example: we show the neural network photographs of a boy and a girl. In the first layer, neurons respond to simple visual patterns, such as changes in brightness. In the second, to more complex ones: corners and circles. By the third layer, neurons can respond to inscriptions and human faces. With each subsequent layer, the recognized patterns become more complex. The neural network itself determines which visual elements are relevant to the problem and ranks them by importance so that it can later better understand what is shown in a photograph.
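    A schematic Keras model of this layer-by-layer idea; the layer sizes and input shape are illustrative assumptions, not a recommended architecture:

```python
# Each convolutional layer builds on the previous one: edges -> corners/circles -> faces.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),
    layers.Conv2D(16, 3, activation="relu"),   # layer 1: simple patterns such as brightness changes
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),   # layer 2: more complex patterns: corners, circles
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),   # layer 3: larger parts, e.g. face fragments
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),     # final decision: "boy" vs "girl"
])
model.summary()
```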

  • And what has already been built with it?

    Most deep learning projects involve image or audio recognition and disease diagnosis. For example, it is already used in Google's image translation: deep learning detects whether there are letters in a picture and then translates them. Another project that works with photos is the facial recognition system DeepFace. It can recognize human faces with 97.25% accuracy - roughly the same accuracy as a human.

    In 2016, Google released WaveNet, a system that can simulate human speech. To do this, the company loaded millions of minutes of recorded voice queries from the OK Google project into the system, and after training, the neural network was able to compose sentences with correct stress and emphasis and without illogical pauses.

    Deep learning can also semantically segment an image or video - that is, not just indicate that there is an object in the picture, but precisely outline its contours. This technology is used in self-driving cars, which detect road obstacles and lane markings and read road signs to avoid accidents. Neural networks are also used in medicine - for example, to detect diabetic retinopathy from photographs of patients' eyes. The US Department of Health has already authorized the use of this technology in government clinics.

  • Why didn’t they start implementing deep learning earlier?

    Previously, this was expensive, difficult and time-consuming: you needed powerful graphics processors and a lot of memory. The deep learning boom is tied precisely to the widespread availability of GPUs, which speed up and reduce the cost of computation, to virtually unlimited data storage, and to the development of "big data" technology.

  • This is a breakthrough technology, will it change everything?

    It is difficult to say for sure; opinions vary. On the one hand, Google, Facebook and other large companies have already invested billions of dollars and are optimistic. In their view, neural networks with deep learning are capable of changing the technological structure of the world. One of the leading experts in machine learning, Andrew Ng, says: "If a person can perform a task mentally in a second, most likely that task will be automated in the near future." Ng calls machine learning "the new electricity": it is a technological revolution, and companies that ignore it will quickly find themselves hopelessly behind the competition.

    On the other hand, there are skeptics: they believe that deep learning is a buzzword or a rebranding of neural networks. For example, Sergei Bartunov, a senior lecturer at the Faculty of Computer Science at the Higher School of Economics, believes that this algorithm is just one of the options (and not the best) for training a neural network, which was quickly picked up by mass publications and which everyone now knows about.

    Sergey Nikolenko, co-author of the book “Deep Learning”: “The history of artificial intelligence has already known two “winters,” when a wave of hype and high expectations was followed by disappointment. Both times, by the way, it was connected with neural networks. First, in the late 1950s, it was decided that Rosenblatt's perceptron would immediately lead to machine translation and self-aware computers; but, of course, it didn’t work out due to limited hardware, data and lack of suitable models.

    And in the late 1980s, the same mistake was made when they figured out how to train any neural network architectures. It seemed that here it was, a golden key that could open any door. This was no longer such a naive conclusion: indeed, if you take a neural network from the late 1980s, mechanically make it larger (increase the number of neurons) and train it on modern data sets and modern hardware, it will work very well! But there was not enough data or hardware at that time, and the deep learning revolution had to be postponed until the end of the 2000s.

    We are now living in the third wave of artificial intelligence hype. Whether it will end in a third “winter” or the creation of strong AI, only time will tell.”

  • This guide is intended for anyone who is interested in machine learning but doesn't know where to start. The articles are aimed at a wide audience and will be fairly superficial - but who really cares? The more people who become interested in machine learning, the better.

    Object recognition using deep learning

    You may have already seen this famous xkcd comic. The joke is that any 3-year-old can recognize a photo of a bird, but getting a computer to do it took the best computer scientists over 50 years. In the last few years, we've finally found a good approach to object recognition using deep convolutional neural networks. This sounds like a bunch of made-up words from a William Gibson science fiction novel, but it will make sense once we take them one by one. So let's do it - write a program that recognizes birds!

    Let's start simple

    Before we learn how to recognize pictures of birds, let's learn how to recognize something much simpler - the handwritten number "8".
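    A compact sketch of the "recognize a handwritten 8" task, using the MNIST digits bundled with Keras; this is an illustration, not the pipeline from the guide itself:

```python
# Turn the ten-digit MNIST problem into a binary "is this an 8?" classifier.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
y_train_is8 = (y_train == 8).astype("float32")   # 1 if the image is an "8", else 0
y_test_is8 = (y_test == 8).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train_is8, epochs=2, batch_size=128, verbose=0)
print(model.evaluate(x_test, y_test_is8, verbose=0))
```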

    "(Manning Publications).

    The article is intended for people who already have significant experience with deep learning (for example, those who have already read chapters 1-8 of this book). A large amount of knowledge is assumed.

    Deep Learning: Geometric View

    The most amazing thing about deep learning is how simple it is. Ten years ago, no one could have imagined the astonishing results we would achieve on machine perception problems using simple parametric models trained with gradient descent. Now it turns out that all we need are sufficiently large parametric models trained on a sufficiently large number of samples. As Feynman once said about the Universe: "It's not complicated, there's just a lot of it."

    In deep learning, everything is a vector, i.e. a point in a geometric space. The model's input data (text, images, etc.) and its targets are first "vectorized", that is, translated into an initial input vector space and a target vector space. Each layer in a deep learning model performs one simple geometric transformation on the data that passes through it. Together, the chain of layers forms one very complex geometric transformation, broken down into a series of simple ones. This complex transformation attempts to map the input space onto the target space, point by point. The transformation parameters are determined by the layer weights, which are continually updated based on how well the model is currently performing. A key characteristic of this geometric transformation is that it must be differentiable, so that we can learn its parameters through gradient descent. Intuitively, this means the geometric morphing must be smooth and continuous - an important constraint.
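    To make the "chain of simple transformations tuned by gradient descent" concrete, here is a bare-bones numpy sketch; the data and shapes are invented for illustration, not taken from any real model:

```python
# Two chained geometric transformations whose parameters are nudged by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                 # input vectors (points in the input space)
true_W = rng.normal(size=(4, 1))
y = np.tanh(X @ true_W)                       # target vectors (points in the target space)

W1 = rng.normal(size=(4, 8)) * 0.1            # parameters of transformation 1
W2 = rng.normal(size=(8, 1)) * 0.1            # parameters of transformation 2
lr = 0.1
for step in range(500):
    h = np.tanh(X @ W1)                       # simple geometric transformation no. 1
    pred = h @ W2                             # simple geometric transformation no. 2
    err = pred - y
    grad_pred = 2 * err / len(X)              # gradient of the mean squared error
    grad_W2 = h.T @ grad_pred
    grad_W1 = X.T @ ((grad_pred @ W2.T) * (1 - h ** 2))
    W2 -= lr * grad_W2                        # update weights from current performance
    W1 -= lr * grad_W1
print("final loss:", float((err ** 2).mean()))
```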

    The entire process of applying this complex geometric transformation to the input data can be visualized in 3D as a person trying to unfold a crumpled paper ball: the crumpled ball is the manifold of input data that the model starts with. Each movement of the person's hands is like one simple geometric transformation performed by a single layer. The complete sequence of unfolding gestures is the complex transformation of the entire model. Deep learning models are mathematical machines for untangling the intricate manifolds of high-dimensional data.

    That is the magic of deep learning: turning meaning into vectors, into geometric spaces, and then gradually learning complex geometric transformations that map one space onto another. All that is needed is a space of sufficiently high dimension to capture the full range of relationships found in the original data.

    Limitations of Deep Learning

    The range of problems that can be solved with this simple strategy is almost endless. And yet many of them are still beyond the reach of current deep learning techniques, even with huge amounts of manually annotated data. Suppose, for example, that you could collect a dataset of hundreds of thousands - even millions - of English-language descriptions of software features written by product managers, together with the corresponding source code developed by teams of engineers to meet those requirements. Even with this data, you could not train a deep learning model to simply read a product description and generate the corresponding codebase. This is just one of many examples. In general, anything that requires reasoning - like programming or applying the scientific method - as well as long-term planning and algorithmic-style data manipulation is beyond the capabilities of deep learning models, no matter how much data you throw at them. Even training a neural network to perform a sorting algorithm is incredibly difficult.

    The reason is that a deep learning model is "only" a chain of simple, continuous geometric transformations mapping one vector space into another. All it can do is map one data manifold X onto another manifold Y, provided that a learnable continuous transformation from X to Y exists and that a dense sampling of X:Y pairs is available as training data. So while a deep learning model can be considered a kind of program, most programs cannot be expressed as deep learning models - for most problems, either there is no deep neural network of practical size that solves the problem, or, if there is one, it may be unlearnable: the corresponding geometric transformation may be too complex, or there may be no suitable data to train it.

    Scaling up existing deep learning techniques—adding more layers and using more training data—can only superficially mitigate some of these problems. It will not solve the more fundamental problem that deep learning models are very limited in what they can represent, and that most programs cannot be expressed as a continuous geometric morphing of data manifolds.

    The Risk of Anthropomorphizing Machine Learning Models

    One of the very real risks of modern AI is misinterpreting how deep learning models work and exaggerating their capabilities. A fundamental feature of the human mind is our "theory of mind", our tendency to project goals, beliefs and knowledge onto the things around us. A drawing of a smiling face on a rock suddenly makes it "happy" in our minds. Applied to deep learning, this means, for example, that when we can more or less successfully train a model to generate text descriptions of pictures, we tend to think that the model "understands" the content of the images as well as the descriptions it generates. We are then greatly surprised when a small deviation from the set of images presented in the training data makes the model generate completely absurd descriptions.

    In particular, this is most evident in "adversarial examples": input samples for a deep learning network that are specifically crafted to be misclassified. You already know that you can perform gradient ascent in the input space to generate samples that maximize the activation of, say, a particular convolutional filter - this is the basis of the visualization technique covered in Chapter 5 (of "Deep Learning with Python"), as well as of the Deep Dream algorithm from Chapter 8. In a similar way, through gradient ascent, you can slightly modify an image to maximize the class prediction for a given class. If we take a photo of a panda and add a "gibbon" gradient, we can force the neural network to classify the panda as a gibbon. This demonstrates both the fragility of these models and the profound difference between the input-to-output mapping they perform and our own human perception.
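    As an illustration of the gradient-ascent trick described above, here is a hedged TensorFlow sketch of an FGSM-style perturbation; the model, the preprocessed image tensor and the target class index are assumed to be supplied by the reader:

```python
# Nudge an image in the direction that increases the score of a chosen target class.
import tensorflow as tf

def adversarial_image(model, image, target_class, eps=0.01):
    """Return `image` perturbed by the sign of the gradient of the target-class score."""
    image = tf.convert_to_tensor(image[None, ...])      # add a batch dimension
    with tf.GradientTape() as tape:
        tape.watch(image)
        probs = model(image)
        target_score = probs[0, target_class]           # e.g. the "gibbon" class
    grad = tape.gradient(target_score, image)           # gradient ascent direction
    return tf.clip_by_value(image + eps * tf.sign(grad), 0.0, 1.0)[0]
```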

    In general, deep learning models have no understanding of their input data, at least not in any human sense. Our own understanding of images, sounds and language is grounded in our sensorimotor experience as humans - as embodied, earthly beings. Machine learning models have no access to such experience and therefore cannot "understand" their inputs in any human-like way. By annotating large numbers of training examples for our models, we get them to learn a geometric transformation that maps the data onto human concepts for that specific set of examples, but this transformation is only a simplistic sketch of the original model in our minds, the one developed from our experience as embodied agents - like a faint reflection in a mirror.

    As a machine learning practitioner, always keep this in mind, and never fall into the trap of believing that neural networks understand the task they perform - they don't, at least not in any way that would make sense to us. They were trained on a different, far narrower task than the one we wanted to teach them: merely mapping training inputs to training targets, point by point. Show them anything that deviates from the training data and they will break in the most absurd ways.

    Local generalization versus extreme generalization

    There seem to be fundamental differences between the direct geometric morphing from input to output that deep learning models perform and the way humans think and learn. It is not just that people learn from their embodied experience rather than by processing a set of training samples. Beyond the difference in learning processes, there is a fundamental difference in the nature of the underlying representations.

    Humans are capable of much more than mapping an immediate stimulus to an immediate response, as a neural network, or perhaps an insect, does. People maintain complex, abstract models of the current situation, of themselves and of other people, and can use these models to predict various possible futures and carry out long-term planning. They can combine known concepts to imagine something they have never encountered before - drawing a horse in jeans, for example, or picturing what they would do if they won the lottery. This ability to think hypothetically, to extend our mental model far beyond anything we have directly experienced - that is, the ability to perform abstraction and reasoning - is arguably the defining characteristic of human cognition. I call it "extreme generalization": the ability to adapt to new, never-before-experienced situations using little or no data.

    This is in stark contrast to what deep learning networks do, which I would call "local generalization": the mapping from inputs to outputs quickly stops making sense if the new inputs differ even slightly from what the network saw during training. Consider, for example, the problem of learning the appropriate launch parameters for a rocket that is supposed to land on the Moon. If you used a neural network for this task, trained with supervision or reinforcement, you would need to feed it thousands or millions of launch trajectories - that is, a dense sampling of the input space - for it to learn a reliable mapping from the input space to the output space. In contrast, humans can use the power of abstraction to build physical models - rocket science - and derive an exact solution that will get the rocket to the Moon in just a few attempts. Likewise, if you developed a neural network to control a human body and wanted it to learn to walk safely through a city without being hit by cars, the network would have to die many thousands of times in different situations before it could infer that cars are dangerous and develop appropriate avoidance behavior. Dropped into a new city, the network would have to relearn most of what it knew. Humans, on the other hand, are able to learn safe behavior without ever dying - again, thanks to the power of abstract modeling of hypothetical situations.

    So, despite our progress in machine perception, we are still very far from human-level AI: our models can only perform local generalization, adapting to new situations that must be very close to past data, while the human mind is capable of extreme generalization, quickly adapting to completely new situations or planning far into the future.

    Conclusions

    Here's what you need to remember: the only real success of deep learning so far is the ability to map space X onto space Y using a continuous geometric transformation, given large amounts of human-annotated data. Doing this well represents a revolutionary advance for entire industries, but human-level AI is still a long way off.

    To lift some of these limitations and begin to compete with the human brain, we need to move away from direct input-to-output mapping and move on to reasoning and abstraction. Computer programs may be a suitable substrate for abstractly modeling various situations and concepts. We said before (in "Deep Learning with Python") that machine learning models can be defined as "programs that learn"; at the moment we can train only a narrow and specific subset of all possible programs. But what if we could learn any program, modularly and iteratively? Let's see how we might get there.

    The Future of Deep Learning

    Given what we know about deep learning networks, their limitations, and the current state of research, can we predict where things are heading in the medium term? Here are some of my personal thoughts on the matter. Keep in mind that I don't have a crystal ball, so much of what I anticipate may never come to pass. This is pure speculation. I share these predictions not because I expect them to be fully realized, but because they are interesting and applicable to the present.

    At a high level, here are the main areas that I consider promising:

    • Models will become closer to general-purpose computer programs built on top of much richer primitives than our current differentiable layers - this is how we will get to reasoning and abstraction, the absence of which is the fundamental weakness of current models.
    • New forms of learning will emerge to make this possible, allowing models to move beyond merely differentiable transformations.
    • Models will require less developer input - it shouldn't be your job to endlessly turn knobs.
    • There will be greater, more systematic reuse of previously learned features and architectures; meta-learning systems built on reusable and modular subroutines.
    Additionally, note that these considerations are not specific to supervised learning, which is still the bread and butter of machine learning - they apply to any form of machine learning, including unsupervised, self-supervised and reinforcement learning. It doesn't fundamentally matter where your labels come from or what your learning loop looks like; these different branches of machine learning are simply different facets of the same construct.

    So, here we go.

    Models as programs

    As we noted earlier, one necessary transformational development we can expect in machine learning is a move away from models that perform pure pattern recognition and are capable only of local generalization, toward models capable of abstraction and reasoning that can achieve extreme generalization. Current AI programs with basic reasoning abilities are all hard-coded by human programmers: for example, software that relies on search algorithms, graph manipulation or formal logic. In DeepMind's AlphaGo, for example, much of the "intelligence" on display is designed and hard-coded by expert programmers (such as Monte Carlo tree search); learning from data happens only in specialized submodules - the value network and the policy network. But in the future, such AI systems could be trained entirely without human involvement.

    How do we get there? Consider a well-known type of network: the RNN. Importantly, RNNs have slightly fewer limitations than feedforward networks. That is because RNNs are a bit more than mere geometric transformations: they are geometric transformations applied repeatedly inside a for loop. The temporal for loop itself is hard-coded by the developer: it is a built-in assumption of the network. Naturally, RNNs are still severely limited in what they can represent, mainly because each step they perform is still a differentiable geometric transformation, and because they carry information from step to step via points in a continuous geometric space (state vectors). Now imagine neural networks "augmented" with programming primitives in a similar way - not just a single hard-coded for loop with hard-coded geometric memory, but a large set of programming primitives the model could freely draw on to extend its processing capabilities: if branches, while statements, variable creation, disk storage for long-term memory, sorting operators, advanced data structures such as lists, graphs and hash tables, and much more. The space of programs such a network could represent would be far broader than what current deep learning models can express, and some of these programs could achieve superior generalization power.
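    The "geometric transformation applied repeatedly inside a for loop" can be written out directly; a minimal numpy sketch, with made-up shapes:

```python
# The hard-coded temporal for loop of a simple RNN: the same transformation at every step.
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """inputs: array of shape (timesteps, input_dim); returns the final state vector."""
    state = np.zeros(W_h.shape[0])
    for x_t in inputs:                                   # the built-in for loop
        state = np.tanh(W_x @ x_t + W_h @ state + b)     # one geometric transformation per step
    return state

rng = np.random.default_rng(0)
h = simple_rnn(rng.normal(size=(20, 8)),                 # 20 time steps of 8-dimensional input
               rng.normal(size=(16, 8)),
               rng.normal(size=(16, 16)),
               rng.normal(size=16))
print(h.shape)   # (16,) - the state vector carried through the loop
```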

    In short, we will move away from having "hard-coded algorithmic intelligence" (hand-written software) on one side and "learned geometric intelligence" (deep learning) on the other. Instead, we will have a blend of formal algorithmic modules that provide reasoning and abstraction capabilities, and geometric modules that provide informal intuition and pattern recognition. The whole system will be trained with little or no human involvement.

    A related area of AI that I think is poised for major progress is program synthesis, in particular neural program synthesis. Program synthesis consists of automatically generating simple programs using a search algorithm (perhaps genetic search, as in genetic programming) to explore a large space of possible programs. The search stops when a program is found that meets the required specification, often provided as a set of input-output pairs. As you can see, this is very similar to machine learning: "training data" is provided as input-output pairs, and we find a "program" that matches the mapping from inputs to outputs and is capable of generalizing to new inputs. The difference is that instead of learning parameter values in a hard-coded program (a neural network), we generate source code through a discrete search process.
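    To make the idea of program synthesis as discrete search concrete, here is a toy sketch; the primitive set and the brute-force search are invented purely for illustration - real systems search vastly larger program spaces:

```python
# Enumerate tiny programs (sequences of primitives) until one matches the given examples.
import itertools

PRIMITIVES = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}

def synthesize(examples, max_len=3):
    """examples: list of (input, output) pairs; returns a matching program or None."""
    for length in range(1, max_len + 1):
        for program in itertools.product(PRIMITIVES, repeat=length):
            def run(x, prog=program):
                for op in prog:
                    x = PRIMITIVES[op](x)
                return x
            if all(run(i) == o for i, o in examples):
                return program                     # first program consistent with the spec
    return None

print(synthesize([(1, 4), (3, 8)]))   # finds, e.g., ('inc', 'double')
```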

    I definitely expect a resurgence of interest in this area over the next few years. In particular, I expect cross-fertilization between the related fields of deep learning and program synthesis, where we will not simply generate programs in general-purpose languages, but generate neural networks (geometric data-processing flows) augmented with a rich set of algorithmic primitives, such as for loops and many others. This should be much more tractable and useful than direct source code generation, and it will dramatically broaden the range of problems that can be solved with machine learning - the space of programs we can generate automatically, given appropriate training data. A blend of symbolic AI and geometric AI. Modern RNNs can be seen as the historical ancestors of such hybrid algorithmic-geometric models.


    Figure: A learned program simultaneously relies on geometric primitives (pattern recognition, intuition) and algorithmic primitives (reasoning, search, memory).

    Beyond backpropagation and differentiable layers

    If machine learning models become more like programs, they will mostly no longer be differentiable - these programs will still use continuous geometric layers as subroutines, and those will remain differentiable, but the model as a whole will not be. As a result, using backpropagation to adjust the weight values in a fixed, hard-coded network may not remain the preferred training method in the future - at the very least, it cannot be the whole story. We need to figure out how to train non-differentiable systems efficiently. Current approaches include genetic algorithms, "evolution strategies", certain reinforcement learning methods, and ADMM (the alternating direction method of multipliers). Naturally, gradient descent is here to stay - gradient information will always be useful for optimizing differentiable parametric functions. But our models will certainly become more ambitious than mere differentiable parametric functions, and so their automated development (the "learning" in "machine learning") will require more than backpropagation.
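    As one concrete example of the gradient-free methods listed above, here is a minimal evolution-strategies sketch; the objective function is a stand-in, not a real model:

```python
# Evolution strategies: perturb the parameters, score each perturbation, move toward the better ones.
import numpy as np

def evolve(objective, dim=10, pop=50, sigma=0.1, lr=0.02, steps=200):
    rng = np.random.default_rng(0)
    theta = rng.normal(size=dim)                         # parameters of a non-differentiable system
    for _ in range(steps):
        noise = rng.normal(size=(pop, dim))              # a population of random perturbations
        rewards = np.array([objective(theta + sigma * n) for n in noise])
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        theta += lr / (pop * sigma) * noise.T @ rewards  # estimated ascent direction, no gradients used
    return theta

best = evolve(lambda p: -np.sum(np.abs(p)))              # maximize a non-smooth objective
print(np.round(best, 2))                                 # parameters driven toward zero
```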

    Additionally, backpropagation is end-to-end, which is great for learning good chained transformations but is computationally inefficient, because it does not fully exploit the modularity of deep networks. To make anything more efficient, there is one universal recipe: introduce modularity and hierarchy. So we can make backpropagation itself more efficient by introducing decoupled training modules with some synchronization mechanism between them, organized hierarchically. This strategy is partly reflected in DeepMind's recent work on "synthetic gradients". I expect much, much more work in this direction in the near future.

    One can imagine a future in which globally non-differentiable models (with differentiable parts) are trained - grown - using an efficient search process that does not rely on gradients, while the differentiable parts learn even faster, using gradients and some more efficient version of backpropagation.

    Automated Machine Learning

    In the future, model architectures will be created by learning rather than written by hand by engineers. Learned architectures will naturally be paired with richer sets of primitives and with program-like machine learning models.

    Nowadays, most of a deep learning developer's time is spent endlessly munging data with Python scripts and then lengthily tuning the architecture and hyperparameters of a deep network to get a working model - or even an outstanding one, if the developer is ambitious. Needless to say, this is not the best state of affairs. But AI can help here too. Unfortunately, the data processing and preparation part is hard to automate, because it often requires domain knowledge as well as a clear, high-level understanding of what the developer wants to achieve. Hyperparameter tuning, however, is a simple search procedure, and in that case we already know what the developer wants to achieve: it is defined by the loss function of the network being tuned. It is already common practice to set up basic AutoML systems that take care of most of the model knob-tweaking. I set one up myself to win Kaggle competitions.

    At the most basic level, such a system would simply tune the number of layers in the stack, their order, and the number of units or filters in each layer. This is usually done with libraries like Hyperopt, discussed in Chapter 7 (of "Deep Learning with Python"). But you can go much further and try to learn an appropriate architecture from scratch, with as few constraints as possible - for example, via reinforcement learning or genetic algorithms.
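    A stripped-down illustration of such a search procedure, using plain random search rather than Hyperopt; the search space and the `train_and_evaluate` callback are placeholders you would supply yourself:

```python
# Random search over hyperparameters, scoring each candidate with a user-supplied training function.
import random

SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "num_layers": [2, 3, 4],
    "units_per_layer": [64, 128, 256],
}

def random_search(train_and_evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = train_and_evaluate(config)        # e.g. validation accuracy for this configuration
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```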

    Another important direction for AutoML is learning the model architecture jointly with the model weights. Training a model from scratch every time we try a slightly different architecture is extremely inefficient, so a truly powerful AutoML system would evolve architectures while the model's features are tuned via backpropagation on the training data, eliminating all that computational redundancy. As I write these lines, such approaches are already beginning to appear.

    When all this starts to happen, machine learning system developers will not be left without work - they will move to a higher level in the value chain. They will begin to put much more effort into creating complex loss functions that truly reflect business problems, and will develop a deep understanding of how their models impact the digital ecosystems in which they operate (for example, customers who use model predictions and generate data for its training) - problems that only the largest companies can now afford to consider.

    Lifelong learning and reuse of modular routines

    If models become more complex and are built on richer algorithmic primitives, this increased complexity will demand more intensive reuse across tasks, rather than training a model from scratch every time we have a new task or a new dataset. After all, many datasets do not contain enough information to develop a new complex model from scratch, and it will become necessary to draw on information from previously seen datasets. You don't relearn English every time you open a new book - that would be impossible. Besides, training models from scratch on every new problem is very inefficient, because of the significant overlap between the current problems and those encountered before.

    Moreover, a remarkable observation made repeatedly in recent years is that training the same model on several loosely related tasks improves its performance on each of them. For example, training one neural network to translate both from English to German and from French to Italian yields a model that is better at each of these language pairs. Training an image classification model jointly with an image segmentation model, sharing a single convolutional base, produces a model that is better at both tasks. And so on. This is fairly intuitive: there is always some information overlap between seemingly unrelated tasks, so the joint model has access to more information about each individual task than a model trained only on that specific task.
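    A sketch of the shared-base idea in Keras (the functional API), with sizes and task heads chosen purely for illustration:

```python
# One shared convolutional base feeding a classification head and a segmentation-style head.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(128, 128, 3))
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)   # shared convolutional base
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)

class_head = layers.GlobalAveragePooling2D()(x)
class_out = layers.Dense(10, activation="softmax", name="classification")(class_head)
seg_out = layers.Conv2D(1, 1, activation="sigmoid", name="segmentation")(x)

model = keras.Model(inputs, [class_out, seg_out])
model.compile(optimizer="adam",
              loss={"classification": "sparse_categorical_crossentropy",
                    "segmentation": "binary_crossentropy"})
```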

    What we actually do when we reuse a model on different tasks is use pre-trained weights for models that perform common functions, like visual feature extraction. You saw this in practice in Chapter 5. I expect that a more general version of this technique will be commonly used in the future: we will not only use previously learned features (submodel weights), but also model architectures and training procedures. As models become more program-like, we will begin to reuse subroutines, like functions and classes in regular programming languages.
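    A minimal sketch of reusing pre-trained weights as a generic visual feature extractor, assuming VGG16 ImageNet weights as the reused "subroutine" (this is an illustration, not the book's Chapter 5 example):

```python
# Freeze a pre-trained base and attach a new task-specific head on top.
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(160, 160, 3))
base.trainable = False                      # keep the previously learned features frozen

model = keras.Sequential([
    base,                                   # reused feature-extraction "subroutine"
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # new head for the new task
])
```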

    Think about what the software development process looks like today: once an engineer solves a certain problem (HTTP requests in Python, for example), they package the solution as an abstract, reusable library. Engineers who face a similar problem later simply search for existing libraries, download one, and use it in their own project. Likewise, in the future, meta-learning systems will be able to assemble new programs by sifting through a global library of high-level reusable blocks. When the system finds itself developing similar routines for several different tasks, it will release an "abstract", reusable version of the routine and store it in the global library. This process opens the door to abstraction, a necessary ingredient of "extreme generalization": a routine that proves useful across many tasks and domains can be said to "abstract" some aspect of problem solving. This definition of "abstraction" is similar to the concept of abstraction in software engineering. These routines can be either geometric (deep learning modules with pre-trained representations) or algorithmic (closer to the libraries modern programmers work with).

    Figure: A meta-learning system that can quickly develop task-specific models using reusable primitives (algorithmic and geometric), thereby achieving "extreme generalization".

    The bottom line: a long-term vision

    In short, here is my long-term vision for machine learning:
    • Models will become more like programs and will have capabilities that extend far beyond the continuous geometric transformations of source data that we work with now. Perhaps these programs will be much closer to the abstract mental models that people hold about their environment and themselves, and they will be capable of stronger generalization due to their algorithmic nature.
    • In particular, models will blend algorithmic modules providing formal reasoning, search and abstraction capabilities with geometric modules providing informal intuition and pattern recognition. AlphaGo (a system that required intensive manual programming and architectural design) is an early example of what the merger of symbolic and geometric AI might look like.
    • Models will be grown automatically (rather than written by hand by human programmers), using modular parts from a global library of reusable routines - a library that has evolved by assimilating high-performing models from thousands of previous problems and datasets. Once the meta-learning system identifies common problem-solving patterns, they are turned into reusable routines - much like functions and classes in modern programming - and added to the global library. This is how the ability to perform abstraction is achieved.
    • The global library and the associated model-growing system will be able to achieve some form of human-like "extreme generalization": faced with a new task or a new situation, the system will be able to assemble a new working model for that task using very little data, thanks to (1) rich program-like primitives that generalize well and (2) extensive experience with similar problems. In the same way, people can quickly learn a complex new video game because they have prior experience with many other games, and because the models drawn from that experience are abstract and program-like rather than simple mappings from stimulus to action.
    • Essentially, this continuously learning model-growing system can be interpreted as strong artificial intelligence. But don't expect a singularity-style robot apocalypse: that is pure fantasy, born of a long series of profound misunderstandings of both intelligence and technology. Such a critique, however, does not belong here.

    Today, a graph is one of the most acceptable ways to describe models created in a machine learning system. These computational graphs are composed of neuron vertices connected by synapse edges that describe the connections between the vertices.

    Unlike a scalar CPU or a vector GPU, the IPU - a new type of processor designed for machine learning - is built to work with such graphs. A computer designed to manipulate graphs is an ideal machine for computing the graph models created by machine learning.

    One of the easiest ways to describe the process of machine intelligence is to visualize it. The Graphcore development team has created a collection of such images that are displayed on the IPU. It is based on Poplar software, which visualizes the work of artificial intelligence. Researchers from this company also found out why deep networks require so much memory, and what solutions exist to solve the problem.

    Poplar includes a graph compiler built from the ground up to translate standard machine learning operations into highly optimized IPU application code. It allows these graphs to be assembled together on the same principle as POPNNs are assembled. The library contains a set of different vertex types for generalized primitives.

    Graphs are the paradigm on which all software is based. In Poplar, graphs allow you to define a computation process, where vertices perform operations and edges describe the relationship between them. For example, if you want to add two numbers together, you can define a vertex with two inputs (the numbers you would like to add), some calculations (a function to add two numbers), and an output (the result).
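    This is not the Poplar API - just a plain-Python sketch of the vertex/edge idea the paragraph describes, with the add-two-numbers example spelled out:

```python
# A vertex holds an operation; its inputs (edges) are either constants or other vertices.
class Vertex:
    def __init__(self, op, inputs):
        self.op = op                # the computation performed at this vertex
        self.inputs = inputs        # edges: where this vertex's data comes from

    def evaluate(self):
        values = [v.evaluate() if isinstance(v, Vertex) else v for v in self.inputs]
        return self.op(*values)

add = Vertex(lambda a, b: a + b, [2, 3])   # two inputs, one addition, one output
print(add.evaluate())                      # -> 5
```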

    Typically, the operations at the vertices are much more complex than in the example above. They are often defined by small programs called codelets. The graph abstraction is attractive because it makes no assumptions about the structure of the computation and breaks the computation into components that the IPU can exploit.

    Poplar uses this simple abstraction to build very large graphs that are represented as images. Software generation of the graph means we can tailor it to the specific calculations needed to ensure the most efficient use of IPU resources.

    The compiler translates standard operations used in machine learning systems into highly optimized application code for the IPU. The graph compiler creates an intermediate image of the computational graph, which is deployed on one or more IPU devices. The compiler can display this computational graph, so an application written at the neural network framework level displays an image of the computational graph that is running on the IPU.


    Graph of the full AlexNet training cycle in forward and backward directions

    The Poplar graph compiler turned the AlexNet description into a computational graph of 18.7 million vertices and 115.8 million edges. The clearly visible clustering is the result of heavy communication between processes within each layer of the network, with lighter communication between layers.

    Another example is a simple fully connected network trained on MNIST, a simple computer vision dataset, a kind of “Hello, world” in machine learning. A simple network to explore this dataset helps to understand the graphs driven by Poplar applications. By integrating graph libraries with frameworks such as TensorFlow, the company provides one of the simplest ways to use IPUs in machine learning applications.

    After the graph has been constructed using the compiler, it needs to be executed. This is possible using the Graph Engine. The example of ResNet-50 demonstrates its operation.


    ResNet-50 graph

    The ResNet-50 architecture builds a deep network out of repeating sections. The processor only needs to define these sections once and can then call them repeatedly. For example, the conv4 level cluster is executed six times but mapped onto the graph only once. The image also shows the variety of shapes of the convolutional layers, since each one has a graph built according to the natural form of its computation.

    The engine creates and manages the execution of a machine learning model using a graph generated by the compiler. Once deployed, the Graph Engine monitors and responds to the IPUs, or devices, used by applications.

    The ResNet-50 image shows the entire model. At this level it is difficult to identify connections between individual vertices, so it is worth looking at enlarged images. Below are some examples of sections within neural network layers.

    Why do deep networks need so much memory?

    Large memory footprints are one of the biggest challenges of deep neural networks. Researchers are trying to combat the limited bandwidth of DRAM devices, which modern systems must use to store huge numbers of weights and activations in a deep neural network.

    These architectures were designed around processor chips intended for sequential processing and DRAM optimized for high-density storage. The interface between the two devices is a bottleneck that limits bandwidth and adds significant overhead in power consumption.

    Although we do not yet have a complete understanding of the human brain and how it works, it is generally understood that there is not a large separate memory store. The function of long-term and short-term memory in the human brain is believed to be embedded in the structure of neurons + synapses. Even simple organisms like worms, with a neural brain structure of just over 300 neurons, have some memory function.

    Building memory into conventional processors is one way to circumvent the memory bottleneck problem, unlocking enormous bandwidth while consuming much less power. However, on-chip memory is expensive and is not designed for the truly large amounts of memory that are attached to the CPUs and GPUs currently used to train and deploy deep neural networks.

    So it is useful to look at how memory is used today in CPU- and GPU-based deep learning systems and to ask: why do they require such large memory stores when the human brain appears to work just fine without them?

    Neural networks need memory to store input data, weights, and activations as an input propagates through the network. During training, the activations from the forward pass must be retained until they can be used to compute the error gradients in the backward pass.

    For example, a 50-layer ResNet has about 26 million weight parameters and computes about 16 million forward activations. If you use a 32-bit float to store each weight and activation, this requires about 168 MB of space. By using a lower-precision format to store these weights and activations, we could halve or even quarter this storage requirement.
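    The arithmetic behind these figures, spelled out with the numbers quoted in the text:

```python
# Rough storage estimate for ResNet-50 weights plus forward activations (single sample).
weights = 26_000_000          # weight parameters
activations = 16_000_000      # forward activations
bytes_per_value = 4           # 32-bit float

total_mb = (weights + activations) * bytes_per_value / 1e6
print(f"{total_mb:.0f} MB at 32-bit precision")        # ~168 MB
print(f"{total_mb / 2:.0f} MB at 16-bit precision")    # halved by lower precision
```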

    A major memory problem arises from the fact that GPUs rely on data laid out as dense vectors. This lets them use single instruction, multiple data (SIMD) execution to achieve high compute density. CPUs use similar vector units for high-performance computing.

    GPUs have a SIMD width of 1024 bits and work with 32-bit floating-point data, so they typically split the work into a mini-batch of 32 samples processed in parallel to fill 1024-bit vectors. This approach to vector parallelism multiplies the number of live activations by 32 and raises the local storage requirement to more than 2 GB.

    GPUs and other machines designed for matrix algebra are also subject to memory load from the weights and activations of the neural network. GPUs cannot efficiently perform the small convolutions used in deep neural networks directly, so a transformation known as "lowering" is used to convert these convolutions into matrix-matrix multiplications (GEMMs), which GPUs can handle efficiently.

    Additional memory is also required to store input data, temporary values, and program instructions. Measuring memory usage when training ResNet-50 on a high-end GPU showed that it required more than 7.5 GB of local DRAM.

    Some might think that lower computational precision would reduce the amount of memory required, but that is not the case. If you switch the data values to half precision for weights and activations, you only fill half of the SIMD vector width, wasting half of the available compute resources. To compensate, when you switch from full precision to half precision on a GPU, you then have to double the mini-batch size to force enough data parallelism to use all the available compute. So moving to lower-precision weights and activations on a GPU still requires more than 7.5 GB of DRAM.
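    A back-of-the-envelope check of this claim, using the per-sample activation count quoted earlier (weights and other overheads are ignored):

```python
# Halving precision but doubling the mini-batch leaves activation storage roughly unchanged.
activations_per_sample = 16_000_000        # ResNet-50 forward activations, from the text

def activation_bytes(batch_size, bytes_per_value):
    return activations_per_sample * batch_size * bytes_per_value

print(activation_bytes(32, 4) / 1e9, "GB at fp32, batch 32")   # ~2.0 GB
print(activation_bytes(64, 2) / 1e9, "GB at fp16, batch 64")   # ~2.0 GB - no saving
```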

    With so much data to store, it is simply impossible to fit it all in the GPU. Each layer of the convolutional neural network needs to save its state to external DRAM, load the next network layer, and then reload the data into the system. As a result, the already bandwidth-limited external memory interface carries the additional burden of constantly reloading weights and saving and retrieving activations. This significantly slows down training and considerably increases power consumption.

    There are several ways to address this problem. First, operations such as activation functions can be performed "in place", allowing the input data to be overwritten directly by the output, so existing memory is reused. Second, memory can be reused by analyzing the data dependencies between operations in the network and allocating the same memory to operations that do not need it at the same time.
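    The first technique is easy to illustrate in numpy: an activation such as ReLU can overwrite its input buffer instead of allocating a new one.

```python
# In-place ReLU: the result overwrites the input buffer, so no extra memory is allocated.
import numpy as np

x = np.random.randn(1024, 1024).astype(np.float32)
np.maximum(x, 0.0, out=x)   # activation applied in place
```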

    The second approach is especially effective when the entire neural network can be analyzed at compile time to create a fixed memory allocation, since memory management overhead then drops to almost zero. It turns out that combining these methods can reduce a neural network's memory use by a factor of two to three.

    A third significant approach was recently discovered by the Baidu Deep Speech team. They applied various memory-saving techniques to achieve a 16-fold reduction in the memory consumed by activations, allowing them to train networks with 100 layers where previously, with the same amount of memory, they could train networks with only nine layers.

    Combining memory and processing resources into a single device has significant potential to improve the performance and efficiency of convolutional neural networks, as well as other forms of machine learning. Trade-offs can be made between memory and compute resources to achieve a balance of features and performance in the system.

    Neural networks and the knowledge models used in other machine learning methods can be thought of as mathematical graphs, and these graphs contain an enormous amount of parallelism. A parallel processor designed to exploit parallelism in graphs does not rely on mini-batches and can significantly reduce the amount of local storage required.

    Current research results have shown that all these methods can significantly improve the performance of neural networks. Modern GPUs and CPUs have very limited onboard memory, only a few megabytes in total. New processor architectures specifically designed for machine learning balance memory and on-chip compute, delivering significant performance and efficiency improvements over today's CPUs and GPUs.