An engineering view of things. Introduction to Deep Learning

Pattern recognition technology is increasingly becoming part of our everyday life. Companies and institutions use it to solve a wide range of tasks, from security to customer satisfaction research. Investment in products built on this capability is projected to grow to $39 billion by 2021. Here are just a few examples of how pattern recognition is used in different fields.

He told me what artificial neural networks are, how they differ from traditional computer programs, and why this trend will stay with us for a long time.

What is deep learning?

The first successes of deep learning were heard of in 2012, and three years later everyone was talking about it. The same thing happened with the Internet during the era of the investment bubble. And since considerable investments are now being made in neural networks, we can safely talk about a new bubble.

The Internet was easy to demonstrate: first there was fast (compared to paper) email, then colorful websites accessible from any Internet-connected computer. With deep learning everything is different: there is attention, but nothing specific to show. Indeed, what do speech recognition programs, automatic translation programs, programs that detect faults in oil and gas equipment, and programs that synthesize text descriptions of photographs have in common?



This diversity is not accidental: if the Internet is just a form of communication, deep neural networks (DNNs) are essentially a new type of program, as versatile as traditional computer programs. This universality has been proven theoretically: in theory, a neural network can approximate any function of many variables to arbitrary accuracy, and it can also carry out computations equivalent to those of a Turing machine.
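As a small illustration of the approximation claim, here is a minimal sketch in plain NumPy that fits a one-hidden-layer network to y = sin(x). The layer width, learning rate, and step count are arbitrary illustrative choices, not values from the article.

```python
import numpy as np

# Toy universal-approximation demo: fit y = sin(x) with one hidden layer.
rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)
y = np.sin(x)

hidden = 32                       # hypothetical hidden-layer width
W1 = rng.normal(0, 1.0, (1, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 1.0, (hidden, 1))
b2 = np.zeros(1)

lr = 0.01
for step in range(5000):
    # Forward pass: one tanh hidden layer, linear output.
    h = np.tanh(x @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y                # gradient of the mean-squared error (up to a constant)
    # Backward pass: hand-derived gradients for this tiny network.
    grad_W2 = h.T @ err / len(x)
    grad_b2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)
    grad_W1 = x.T @ dh / len(x)
    grad_b1 = dh.mean(axis=0)
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

print("final MSE:", float(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2)))
```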

Networks need to be trained

The Internet is built on the idea that information is transmitted in a very uniform way, in standardized packets. But information can be generated and consumed in very different ways, and the computer programs that do this are very different. It is the same with neural networks: they provide the same variety of processing.

Describing today what neural networks are is like describing in the late fifties what traditional computer programs are (FORTRAN was released in 1957): if you had said back then that computers would control the ignition in every car and show porn movies on phone screens, you would have been laughed at.

If I tell you now that you will be talking to a neural network in your tablet, and that a neural network will drive a car without a driver, you will not believe it either, and you will be wrong.

By the way, "porn pictures" on social networks are no longer spotted by people but by the networks themselves. Yet 100 thousand people around the world used to do this work, looking through terabytes upon terabytes of photos and videos. With the advent of deep learning, the world of data processing suddenly began to change, and rapidly.

Unlike traditional computer programs, neural networks do not need to be "written"; they need to be "taught." And they can be taught things that are extremely difficult (if not impossible) to implement with traditional software engineering. For example, neural networks have already learned to recognize audio and video at a human level, and even better. Or, the other way around, to create audio and video: if an understanding of the images of certain objects is embodied in a trained deep neural network, that same understanding can be used to create images of those objects. Synthesis of voice, text, and images has not yet hit the market, but experiments are already showing successes previously unattainable in this area. Moreover, neural networks can not only analyze data but also issue commands. They have learned to play Atari 2600 games, many of them better than a human, and they did not have to be specially programmed for that.

How did this become possible only today? Why were such results not achieved long ago, even before the advent of the Internet? After all, discussions about the capabilities of neural networks have been going on since the 1950s!

First, it became clear how to train deep neural networks and what mathematics works there. A deep neural network is one more than two layers deep; with fewer layers we speak of shallow learning, and with more than ten layers of very deep learning, which is still rare. Previously, people tried to train neural networks by trial and error (the "poke and see" method), and that way only small networks could be trained. Over time came an understanding of the mathematics of multilayer neural networks: it became possible to design them, to create new types of networks, and to ensure that they can learn.
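In today's frameworks the difference in depth is literally a few extra lines. A minimal sketch (PyTorch is used here purely for illustration, and the layer widths are hypothetical):

```python
import torch.nn as nn

# "Shallow" network: a single hidden layer between input and output.
shallow = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# "Deep" network: more than two layers of learned transformations.
deep = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
```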

Second, a neural network runs quickly but learns very slowly, and learning requires huge amounts of data: big data. And the more layers a network has, the more computing power its training demands. In fact, until recently neural networks could only be trained on a supercomputer.



Today the situation has changed: graphics cards have been put to work on neural networks, and this has sped up their training roughly tenfold. But even such accelerated training often means many hours, days, or sometimes weeks of computation. The only consolation is that with traditional programming, solving the same problems would take not weeks but years of programmers' work.

But once a deep neural network is trained, it typically runs hundreds to thousands of times faster than traditional algorithms, the program takes hundreds of times less RAM, and the quality of the results is better.

"Neural Network Masters"

The unusual properties of these networks have led to deep neural networks winning almost all international data analysis competitions. And if you have some data analysis task, and a great deal of that data, there is a good chance that deep neural networks will win there too.

The profession of those who work with neural networks does not even have a name yet. If at the dawn of the Internet the concept of "webmaster" appeared (and lasted five or six years), there is no analogous "neural network master" profession yet. In big data, such specialists call themselves "data scientists," but their work has the same engineering character as the work of programmers. Engineers measure, analyze, design, build, and tune systems, as well as the tools for that engineering. Software engineering is different from computer science. It is the same with neural networks: there is no name for the profession yet, but there are already engineers who will help you create, train, and use them. Fortunately, over the last year an infrastructure for the new profession has grown up: university courses, dozens of tutorials, books, competitions and training grounds, and a great amount of free software. In the Russian-speaking deep learning community on VKontakte alone, today…

What is deep learning? March 3rd, 2016

Nowadays people talk about fashionable deep learning technologies as if they were manna from heaven. But do the speakers understand what it really is? The concept has no formal definition, and it bundles together a whole stack of technologies. In this post I want to explain, as accessibly and concretely as possible, what is behind this term, why it is so popular, and what these technologies give us.


In short, this newfangled term (deep learning) is about assembling a more complex and deeper abstraction (representation) out of simple abstractions, with the proviso that even the simplest abstractions must be assembled by the computer itself, not by a person. That is, it is no longer just about learning, but about meta-learning: figuratively speaking, the computer must itself learn how best to learn. And that, in fact, is exactly what the term "deep" implies. The term is almost always applied to artificial neural networks with more than one hidden layer, so formally "deep" also means a deeper neural network architecture.

The slide on the evolution of the field clearly shows how deep learning differs from ordinary machine learning. To repeat, what is unique about deep learning is that the machine finds the features itself (the key characteristics by which it is easiest to separate one class of objects from another) and structures these features hierarchically: simpler ones are combined into more complex ones. Below we will look at this with an example.

Consider an image recognition task: it used to be that an enormous picture (1024×768, about 800,000 numerical values) was fed into an ordinary neural network with one layer, and you could watch the computer slowly die, suffocating from lack of memory and from its inability to understand which pixels matter for recognition and which do not. Not to mention the efficiency of this approach. Here is the architecture of such an ordinary (shallow) neural network.
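To see where the memory goes, a rough back-of-the-envelope calculation (a sketch assuming a fully connected hidden layer of 1,000 neurons, a size picked only for illustration):

```python
# Rough parameter count for a fully connected layer on a 1024x768 image.
pixels = 1024 * 768          # 786,432 input values per image
hidden = 1000                # hypothetical hidden-layer size
weights = pixels * hidden    # every pixel connects to every hidden neuron
print(weights)               # 786,432,000 weights
print(weights * 4 / 1e9)     # ~3.1 GB just for this one layer as 32-bit floats
```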

Then researchers looked at how the brain picks out features, which it does in a strictly hierarchical way, and decided to extract a hierarchical structure from images as well. To do this, more hidden layers (layers between the input and the output; roughly speaking, stages of information transformation) had to be added to the network. Although this idea came up almost as soon as artificial neurons were invented, at the time only networks with a single hidden layer could be trained successfully. That is, in principle deep networks have been around about as long as ordinary ones; we simply could not train them. What has changed?

In 2006, several researchers independently solved this problem at once (and by then hardware had developed far enough: quite powerful graphics cards had appeared). These researchers were Geoffrey Hinton (with his colleague Ruslan Salakhutdinov), who pre-trained each layer of a neural network with a restricted Boltzmann machine (forgive the terminology...), Yann LeCun with convolutional neural networks, and Yoshua Bengio with stacked autoencoders. The first two were immediately recruited by Google and Facebook, respectively. Here are two lectures, one by Hinton and one by LeCun, in which they explain what deep learning is; no one can tell you about it better than they can. There is also a great lecture by Schmidhuber, another of the pillars of this field, on the development of deep learning. And Hinton has an excellent course on neural networks.

What can deep neural networks do now? They can recognize and describe objects; one might say they "understand" what they see. This is about recognizing meaning.

Just watch this video of real-time recognition of what the camera sees.

As I already said, deep learning is a whole group of technologies and solutions. I listed several of them in the paragraph above; another example is recurrent networks, which are used in the video above to describe what the network sees. But the most popular representative of this class of technologies is still LeCun's convolutional neural networks. They are built by analogy with how the visual cortex of the cat brain works, in which so-called simple cells were discovered, which respond to straight lines at different angles, and complex cells, whose response is tied to the activation of a certain set of simple cells. Although, to be honest, LeCun himself was not focused on biology; he was solving a specific problem (watch his lectures), and it simply happened to coincide.

To put it very simply, convolutional networks are networks in which the basic structural element of learning is a group (combination) of neurons (usually a 3×3 or 10×10 square, and so on), rather than a single neuron. At each level of the network, dozens of such groups are trained. The network finds the combinations of neurons that maximize the information about the image. At the first level, the network extracts the most basic, structurally simple elements of the picture, its building blocks, one might say: boundaries, strokes, segments, contrasts. Higher up come stable combinations of first-level elements, and so on up the chain. I want to stress once more the main feature of deep learning: the networks form these elements themselves and decide which of them matter more and which matter less. This is important because in machine learning, feature engineering is key, and we are now moving to the stage where the computer itself learns to create and select features. The machine itself discovers a hierarchy of informative features.
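A minimal sketch of the core operation, a small kernel sliding over an image (plain NumPy, with a hypothetical 3×3 edge-detecting kernel; real networks learn the kernel values rather than hard-coding them):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel over the image and record its response at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 3x3 kernel that responds to vertical edges.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

image = np.random.rand(28, 28)          # stand-in for a grayscale picture
feature_map = convolve2d(image, vertical_edge)
print(feature_map.shape)                # (26, 26)
```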

So, during training (viewing hundreds of pictures), a convolutional network forms a hierarchy of features at various levels of depth. At the first level it might pick out, for example, elements like these (reflecting contrast, angles, borders, and so on).


At the second level these will already be elements built from first-level elements; at the third, from second-level ones. Keep in mind that this picture is just a demonstration: networks in industrial use now have from 10 to 30 layers (levels).

Once such a network has been trained, we can use it for classification. Given an image as input, the groups of neurons in the first layer run across the picture and activate wherever it contains the element each group responds to. That is, the network parses the picture into parts: first lines, strokes, and angles, then more complex parts, and in the end it concludes that a picture composed of this particular combination of basic elements is, say, a face.

More about convolutional networks -

This guide, in its separate parts, is intended for anyone who is interested in machine learning but does not know where to start. The content is aimed at a wide audience and will be quite superficial. But does anyone really care? The more people who become interested in machine learning, the better.

Object recognition using deep learning

You may have already seen this famous xkcd comic. The joke is that any 3-year-old can recognize a photo of a bird, but getting a computer to do it took the best computer scientists over 50 years. In the last few years, we've finally found a good approach to object recognition using deep convolutional neural networks. This sounds like a bunch of made-up words from a William Gibson science fiction novel, but it will make sense once we take them one by one. So let's do it - write a program that recognizes birds!

Let's start simple

Before learning how to recognize images of birds, let's learn how to recognize something much simpler: the handwritten digit "8".
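As a minimal sketch of such a recognizer, here is a tiny "is this an 8?" classifier using scikit-learn's bundled digits dataset and a small multilayer perceptron; the dataset, model, and hyperparameters are illustrative choices of mine, not the ones used in the original article.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 grayscale digit images; label each one as "is it an 8?" (True/False).
digits = load_digits()
X, y = digits.data, (digits.target == 8)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One small hidden layer is enough for this toy problem.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
clf.fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
```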

Today, a graph is one of the most natural ways to describe the models created in a machine learning system. Such computational graphs are made up of vertices (neurons) connected by edges (synapses) that describe the connections between the vertices.

Unlike a scalar CPU or a vector GPU, the IPU, a new type of processor designed for machine learning, lets you build such graphs. A computer designed to work with graphs is an ideal machine for the computational graph models created in machine learning.

One of the simplest ways to describe how machine intelligence works is to visualize it. The Graphcore development team has created a collection of such images of what runs on the IPU. The images are based on the Poplar software, which visualizes the work of artificial intelligence. The company's researchers have also looked into why deep networks need so much memory and what can be done about it.

Poplar includes a graph compiler built from the ground up to translate standard machine learning operations into highly optimized IPU application code. It lets you assemble these graphs on the same principle by which POPNN is assembled. The library contains a set of vertex types of various kinds for generalized primitives.

Graphs are the paradigm on which all of the software is built. In Poplar, graphs let you define a computational process in which vertices perform operations and edges describe the relationships between them. For example, if you want to add two numbers, you can define a vertex with two inputs (the numbers you want to add), a computation (a function that adds two numbers), and an output (the result).
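To make the idea concrete, here is a tiny, framework-free sketch of such a graph in Python. This is a generic illustration of the vertex/edge model, not the actual Poplar API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Vertex:
    """A node that applies a function to the values flowing in along its input edges."""
    op: Callable[..., float]
    inputs: List["Vertex"] = field(default_factory=list)
    value: float = 0.0

    def evaluate(self) -> float:
        args = [v.evaluate() for v in self.inputs]
        # Leaf vertices (no inputs) just return their stored value.
        self.value = self.op(*args) if args else self.value
        return self.value

# Two constant vertices and one "add" vertex with two input edges.
a = Vertex(op=lambda: 2.0, value=2.0)
b = Vertex(op=lambda: 3.0, value=3.0)
add = Vertex(op=lambda x, y: x + y, inputs=[a, b])

print(add.evaluate())   # 5.0
```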

Typically, vertex operations are much more complex than in the example above. They are often defined by small programs called codelets. The graph abstraction is attractive because it makes no assumptions about the structure of the computation and breaks the computation down into components that the IPU can use to do its work.

Poplar uses this simple abstraction to build very large graphs, which are represented as images. Generating the graph in software means we can tailor it to the specific calculations required, making the most efficient use of the IPU's resources.

The compiler translates the standard operations used in machine learning systems into highly optimized IPU application code. The graph compiler creates an intermediate image of the computational graph, which is deployed to one or more IPU devices. The compiler can display this computational graph, so an application written at the level of a neural network framework can show an image of the computational graph that runs on the IPU.


Graph of the full AlexNet training cycle in forward and backward directions

The Poplar graph compiler turned the AlexNet description into a computational graph of 18.7 million vertices and 115.8 million edges. The clearly visible clustering is the result of strong communication between processes within each layer of the network, with lighter communication between layers.

Another example is a simple fully connected network trained on MNIST, a simple computer vision dataset, a kind of "Hello, world" of machine learning. A simple network for this dataset helps in understanding the graphs driven by Poplar applications. By integrating its graph libraries with frameworks such as TensorFlow, the company offers one of the simplest ways to use IPUs in machine learning applications.

Once the graph has been constructed by the compiler, it needs to be executed. This is done with the Graph Engine. The ResNet-50 example demonstrates how it works.


ResNet-50 graph

The ResNet-50 architecture makes it possible to build deep networks out of repeating sections. The processor only has to define these sections once and then call them again. For example, the conv4-level cluster is executed six times but is mapped onto the graph only once. The image also shows the variety of shapes of the convolutional layers, since each has a graph built according to the natural form of its computation.

The engine creates and manages the execution of a machine learning model using the graph generated by the compiler. Once deployed, the Graph Engine monitors and responds to the IPU devices used by the applications.

The ResNet-50 image shows the entire model. At this level it is hard to make out the connections between individual vertices, so it is worth looking at enlarged images. Below are a few examples of sections within the neural network's layers.

Why do deep networks need so much memory?

A large memory footprint is one of the biggest problems of deep neural networks. Researchers are trying to work around the limited bandwidth of the DRAM devices that modern systems have to use to store the huge numbers of weights and activations in a deep neural network.

These architectures were designed around processor chips intended for sequential processing and DRAM optimized for high-density memory. The interface between the two is a bottleneck that limits bandwidth and adds significant overhead in power consumption.

Although we do not yet fully understand the human brain and how it works, it is generally clear that there is no large separate memory store. Long-term and short-term memory in the human brain is believed to be embedded in the structure of the neurons and synapses themselves. Even simple organisms such as worms, whose neural brain structure has just over 300 neurons, have some memory function.

Building memory into conventional processors is one way around the memory bottleneck, unlocking enormous bandwidth at much lower power consumption. However, on-chip memory is expensive, and it does not scale to the truly large amounts of memory attached to the CPUs and GPUs currently used to train and deploy deep neural networks.

So it is useful to look at how memory is used today in CPU- and GPU-based deep learning systems and to ask: why do they need such large memory stores when the human brain works just fine without them?

Neural networks need memory to store input data, weights, and activations as the input propagates through the network. During training, the activations from the forward pass must be retained until they can be used to compute the error gradients on the backward pass.

For example, a 50-layer ResNet has about 26 million weight parameters and computes about 16 million activations in the forward pass. If each weight and activation is stored as a 32-bit float, this takes about 168 MB of space. By storing these weights and activations at lower precision, we could halve or even quarter this storage requirement.
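The arithmetic behind that figure, as a quick sanity check (the parameter and activation counts are the approximate ones quoted above):

```python
weights = 26_000_000        # ResNet-50 weight parameters (approximate)
activations = 16_000_000    # forward-pass activations (approximate)
bytes_per_value = 4         # 32-bit float

total_mb = (weights + activations) * bytes_per_value / 1e6
print(total_mb)             # ~168 MB at 32-bit precision
print(total_mb / 2)         # ~84 MB if everything were stored at 16-bit precision
```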

A major memory problem arises from the fact that GPUs rely on data laid out as dense vectors. This lets them use single instruction, multiple data (SIMD) execution to achieve high compute density. CPUs use similar vector units for high-performance computing.

GPUs have a SIMD width of 1024 bits, and since they use 32-bit floating-point data, they often split the work into a parallel mini-batch of 32 samples so as to fill 1024-bit vectors. This approach to vector parallelism multiplies the number of live activations by 32 and, with it, the need for local storage to a capacity of more than 2 GB.

GPUs and other machines designed for matrix algebra are also subject to memory pressure from the weights and activations of the neural network. GPUs cannot efficiently perform the small convolutions used in deep neural networks directly. Therefore, a transformation called "lowering" (commonly implemented as im2col) is used to convert these convolutions into matrix-matrix multiplications (GEMMs), which GPUs handle very efficiently.
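A minimal sketch of that lowering step (a generic im2col in NumPy for a single-channel image with stride 1 and no padding; real implementations also handle channels, strides, padding, and batching):

```python
import numpy as np

def im2col(image, kh, kw):
    """Unfold every kh x kw patch of the image into one row of a matrix."""
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = image[i:i + kh, j:j + kw].ravel()
    return cols

image = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)

# The convolution becomes a single matrix product over the unfolded patches.
patches = im2col(image, 3, 3)               # shape (16, 9)
feature_map = patches @ kernel.ravel()       # shape (16,)
print(feature_map.reshape(4, 4).shape)       # back to the 4x4 output layout
```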

Additional memory is also needed to store input data, temporary values, and program instructions. Measuring memory usage while training ResNet-50 on a high-performance GPU showed that it requires more than 7.5 GB of local DRAM.

Some might think that computing at lower precision would reduce the amount of memory required, but this is not the case. By switching the weights and activations to half precision, you fill only half of the SIMD vector width, wasting half of the available compute resources. To compensate, when you switch from full precision to half precision on a GPU, you then have to double the mini-batch size to force enough data parallelism to use all the available compute. So moving to lower-precision weights and activations on a GPU still requires more than 7.5 GB of DRAM.

With so much data to store, it is simply impossible to fit it all in the GPU. Each convolutional layer has to save its state to external DRAM, load the next layer of the network, and then bring the data back into the system. As a result, the external memory interface, already limited by bandwidth and latency, carries the extra burden of constantly reloading weights and of storing and retrieving activations. This significantly slows down training and significantly increases energy consumption.

There are several ways to attack this problem. First, operations such as activation functions can be performed "in place," overwriting the input data directly with the output, so existing memory can be reused. Second, memory can be reused by analyzing the data dependencies between the operations in the network and allocating the same memory to operations that do not need it at the same moment.
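A tiny sketch of the first trick, an in-place ReLU in NumPy (illustrative only; frameworks decide automatically when rewriting a buffer in place is safe):

```python
import numpy as np

def relu_inplace(x):
    """Apply ReLU by overwriting the input buffer instead of allocating a new one."""
    np.maximum(x, 0.0, out=x)   # the result is written back into x
    return x

activations = np.random.randn(4, 4).astype(np.float32)
buffer_before = activations.ctypes.data           # address of the underlying buffer
relu_inplace(activations)
assert activations.ctypes.data == buffer_before   # no new memory was allocated
```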

The second approach is especially effective when the entire neural network can be analyzed at compile time to create a fixed memory allocation, since the overhead of memory management drops to almost zero. It turns out that the combination of these methods can reduce a neural network's memory use by a factor of two to three.
A third significant approach was recently found by the Baidu Deep Speech team. They applied various memory-saving techniques to achieve a 16x reduction in the memory consumed by activations, which allowed them to train networks with 100 layers. Previously, with the same amount of memory, they could train networks of only nine layers.

Combining memory and processing resources into a single device has significant potential to improve the performance and efficiency of convolutional neural networks, as well as other forms of machine learning. Trade-offs can be made between memory and compute resources to achieve a balance of features and performance in the system.

Neural networks, and the knowledge models in other machine learning methods, can be thought of as mathematical graphs. A huge amount of parallelism is concentrated in these graphs. A parallel processor designed to exploit the parallelism in graphs does not rely on mini-batches and can significantly reduce the amount of local storage required.

Current research results show that all these methods can significantly improve the performance of neural networks. Modern GPUs and CPUs have very limited on-chip memory, only a few megabytes in total. New processor architectures designed specifically for machine learning balance memory and compute on the chip, delivering significant performance and efficiency gains over today's CPUs and GPUs.