Deep learning neural networks. What is a neural network? Videos and lectures

More than 20 years have passed since the term “deep learning” was coined, but people started talking about it only recently. We briefly explain why this happened, what deep learning is, how it differs from machine learning, and why you need to know about it.

  • What is it?

    Deep learning is a branch of machine learning that uses a model inspired by how the brain works - how neurons interact.

    The term itself appeared in the 1980s, but until 2012 there was not enough computing power to implement the technology, and almost no one paid attention to it. After a series of articles by well-known scientists and publications in scientific journals, the technology quickly became popular and attracted the attention of major media - The New York Times was the first major outlet to write about it. One reason for that coverage was a paper by University of Toronto researchers Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. They described and analyzed the results of the ImageNet image recognition competition, where their neural network, trained with deep learning, won by a wide margin - the system correctly identified 85% of the objects. Since then, only deep neural networks have won the competition.

  • Wait, what is machine learning?

    This is a subfield of artificial intelligence and a term for methods of constructing algorithms that learn from experience, without a specially written program. That is, a person does not need to explain to the machine how to solve a problem; it finds the answer itself from the data provided to it. For example, if we want an algorithm to identify faces, we must show it ten thousand different faces, mark where exactly the face is, and then the program will learn to identify faces on its own.

    The machine can learn either with a teacher, who marks the correct answers for it, or without one, but the results are better with supervised learning. Each time data is processed, the system becomes more accurate.

  • How does deep learning work?

    It imitates human abstract thinking and is able to generalize. For example, a conventionally trained neural network does not recognize handwritten letters well - to keep it from getting confused by the different ways of writing them, every variant has to be loaded into it.

    Deep learning, which works with multilayer artificial neural networks, can cope with this task.

    “There are three terms that are often used almost interchangeably lately: artificial intelligence, machine learning and deep learning. However, these are actually “nested” terms: artificial intelligence is anything that can help a computer perform human tasks; machine learning is a branch of AI in which programs do not just solve problems, but learn based on the experience they have, and deep learning is a branch of machine learning that studies deep neural networks.

    Simply put: 1. if you wrote a program that plays chess, that's artificial intelligence; 2. if it learns from grandmaster games or by playing against itself, that's machine learning; 3. and if what does the learning is not just anything but a deep neural network, that's deep learning.”

  • How does deep learning work?

    Let's take a simple example: we show the neural network photographs of a boy and a girl. In the first layer, neurons respond to simple visual features, such as changes in brightness. In the second, to more complex ones: corners and circles. By the third layer, neurons are able to respond to lettering and human faces. With each subsequent layer, the detected features become more complex. The neural network itself determines which visual elements matter for the task and ranks them by importance, so that it can later better understand what is shown in a photograph.

  • And what have you already developed with it?

    Most deep learning projects are used for photo or audio recognition and disease diagnosis. For example, it is already used in Google's translation from images: deep learning lets the system determine whether there are letters in the picture and then translate them. Another project that works with photos is the facial recognition system DeepFace. It can recognize human faces with 97.25% accuracy - approximately the same as a human.

    In 2016, Google released WaveNet, a system that can simulate human speech. To do this, the company fed the system millions of minutes of recorded voice requests from the OK Google project, and after training, the neural network was able to compose sentences with the correct stress and emphasis and without illogical pauses.

    At the same time, deep learning can semantically segment an image or video - that is, not just indicate that there is an object in the picture, but also precisely outline its contours. This technology is used in self-driving cars, which detect road obstacles and lane markings and read traffic signs to avoid accidents. Neural networks are also used in medicine - for example, to detect diabetic retinopathy from photographs of patients' eyes. The US Department of Health has already authorized the use of this technology in government clinics.

  • Why didn’t they start implementing deep learning earlier?

    Previously, this was expensive, difficult and time-consuming - you needed powerful graphics processors, video cards and memory. The boom in deep learning is directly related to the widespread availability of GPUs, which speed up and reduce the cost of computing, to virtually unlimited data storage, and to the development of "big data" technology.

  • This is a breakthrough technology, will it change everything?

    It's difficult to say for sure; opinions vary. On the one hand, Google, Facebook and other large companies have already invested billions of dollars and are optimistic. In their opinion, neural networks with deep learning can change the technological structure of the world. One of the leading experts in machine learning, Andrew Ng, says: "If a person can perform a task mentally in a second, most likely that task will be automated in the near future." Ng calls machine learning "the new electricity": it is a technological revolution, and companies that ignore it will quickly find themselves hopelessly behind the competition.

    On the other hand, there are skeptics: they believe that deep learning is a buzzword or a rebranding of neural networks. For example, Sergei Bartunov, a senior lecturer at the HSE Faculty of Computer Science, believes that this algorithm is just one option (and not the best one) for training a neural network, which was quickly picked up by mass publications and which everyone now knows about.

    Sergey Nikolenko, co-author of the book "Deep Learning": "The history of artificial intelligence has already seen two 'winters,' when a wave of hype and high expectations was followed by disappointment. Both times, incidentally, it was connected with neural networks. First, in the late 1950s, it was decided that Rosenblatt's perceptron would immediately lead to machine translation and self-aware computers; of course, it didn't work out, due to limited hardware, limited data and the lack of suitable models.

    And in the late 1980s the same mistake was made, when researchers figured out how to train arbitrary neural network architectures. It seemed that here it was, a golden key that could open any door. This was no longer such a naive conclusion: indeed, if you take a neural network from the late 1980s, mechanically make it larger (increase the number of neurons) and train it on modern data sets and modern hardware, it works very well! But there was not enough data or hardware at the time, and the deep learning revolution had to wait until the end of the 2000s.

    We are now living in the third wave of artificial intelligence hype. Whether it will end in a third “winter” or the creation of strong AI, only time will tell.”

  • There is a lot of talk and writing about artificial neural networks today, both in the context of big data and machine learning and outside it. In this article, we will recall the meaning of this concept, once again outline the scope of its application, and also talk about an important approach associated with neural networks - deep learning: we will describe its concept, as well as its advantages and disadvantages in specific use cases.

    What is a neural network?

    As you know, the concept of a neural network (NN) comes from biology and is a somewhat simplified model of the structure of the human brain. But let’s not delve into the wilds of natural science - the easiest way is to imagine a neuron (including an artificial one) as a kind of black box with many input holes and one output.

    Mathematically, an artificial neuron converts a vector of input signals (impacts) X into a vector of output signals Y using a function called the activation function. Within an artificial neural network (ANN), three types of neurons operate: input neurons (receiving information from the outside world - the values of the variables we are interested in), output neurons (returning the desired variables - for example, forecasts or control signals), and intermediate neurons, which perform certain internal ("hidden") functions. A classical ANN thus consists of three or more layers of neurons, and in the second and subsequent layers ("hidden" and output) each element is connected to all elements of the previous layer.
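
    As a rough illustration, here is a minimal sketch in Python (NumPy) of the artificial neuron just described; the input values, weights and the choice of tanh as the activation function are illustrative assumptions, not part of the original text.

    ```python
    import numpy as np

    def neuron(x, w, b, activation=np.tanh):
        """A single artificial neuron: a weighted sum of the input signals X,
        passed through an activation function to produce the output signal Y."""
        return activation(np.dot(w, x) + b)

    x = np.array([0.5, -1.2, 3.0])   # input signals (the "black box" inputs)
    w = np.array([0.8, 0.1, -0.4])   # connection weights
    b = 0.2                          # bias term
    print(neuron(x, w, b))           # output signal
    ```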

    It is also important to remember the concept of feedback, which determines the type of ANN structure: feedforward (signals go sequentially from the input layer through the hidden layers and arrive at the output layer) and recurrent, when the network contains connections going back, from more distant to nearer neurons. All these concepts make up the minimum information required to move to the next level of understanding ANNs - training a neural network, classifying its methods and understanding how each of them works.

    Neural network training

    We should not forget why such categories are used at all - otherwise there is a risk of getting bogged down in abstract mathematics. In practice, artificial neural networks are a class of methods for solving certain practical problems, chief among them pattern recognition, decision making, approximation and data compression, as well as the problems most interesting to us: cluster analysis and forecasting.

    Without going to the other extreme and without going into the details of how ANN methods work in each specific case, let us remind ourselves that under any circumstances it is the ability of a neural network to learn (with a teacher or "on its own") that is the key to using it for solving practical problems.

    In general, training an ANN is as follows:

    1. input neurons receive variables ("stimuli") from the external environment;
    2. in accordance with the information received, the free parameters of the neural network change (the intermediate layers of neurons do their work);
    3. as a result of these changes in the structure of the neural network, the network "reacts" to the information in a different way.

    This is the general algorithm for training a neural network (remember Pavlov's dog - yes, that is exactly the internal mechanism of forming a conditioned reflex - and now forget it: our context, after all, involves technical concepts and examples).

    It is clear that a universal learning algorithm does not exist and most likely cannot exist; conceptually, approaches to learning are divided into supervised learning and unsupervised learning. The first assumes that for each input ("training") vector there is a required value of the output ("target") vector - together these two values form a training pair, and the entire set of such pairs is the training set. In unsupervised learning, the training set consists only of input vectors - a situation that is more plausible from the point of view of real life.
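
    To make the distinction concrete, here is a minimal NumPy sketch (an illustrative assumption, not from the original text) of a supervised training set of input-target pairs, an unsupervised set of inputs only, and one pass of a simple delta-rule update for a single linear neuron.

    ```python
    import numpy as np

    # Supervised learning: each training pair combines an input vector and a target value.
    training_pairs = [
        (np.array([0.0, 1.0]), 1.0),
        (np.array([1.0, 0.0]), 0.0),
        (np.array([1.0, 1.0]), 1.0),
    ]
    # Unsupervised learning: the training set consists of input vectors only.
    unlabeled_inputs = [x for x, _ in training_pairs]

    # One pass over the training set with a simple delta-rule update.
    w, b, lr = np.zeros(2), 0.0, 0.1
    for x, target in training_pairs:
        y = w @ x + b            # the network "reacts" to the stimulus
        error = target - y
        w += lr * error * x      # the free parameters change in response to the data
        b += lr * error
    print(w, b)
    ```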

    Deep learning

    The concept of deep learning belongs to a different classification and denotes an approach to training so-called deep structures, which include multi-level neural networks. A simple example from image recognition: the machine must learn to identify increasingly abstract features in terms of other abstract features, that is, to determine the relationships between the expression of the whole face, the eyes and the mouth, and, ultimately, clusters of colored pixels described mathematically. In a deep neural network, each level of features thus has its own layer; it is clear that training such a "colossus" requires appropriate research experience and an appropriate level of hardware. Conditions favorable to deep neural learning emerged only by 2006 - and eight years later we can talk about the revolution this approach has produced in machine learning.

    So, first of all, in the context of our article it is worth noting the following: deep learning in most cases is not supervised by a person. That is, this approach involves training a neural network without a teacher. This is the main advantage of the "deep" approach: supervised machine learning, especially for deep structures, requires enormous time and labor costs. Deep learning, on the other hand, is an approach that models (or at least attempts to approximate) human abstract thinking, rather than merely using it.

    The idea, as usual, is wonderful, but quite natural problems stand in the approach's way - first of all, rooted in its claims to universality. In fact, while deep learning approaches have achieved significant success in image recognition, natural language processing still raises many more questions than answers. It is obvious that in the next n years it is unlikely that anyone will manage to create an "artificial Leonardo da Vinci" or even - at the very least! - an "artificial homo sapiens".

    However, artificial intelligence researchers are already facing questions of ethics: the fears expressed in every self-respecting science fiction film, from "Terminator" to "Transformers," no longer seem funny (modern sophisticated neural networks can already be considered a plausible model of an insect's brain!), but for now they are clearly unnecessary.

    The ideal technological future appears to us as an era when a person will be able to delegate most of his powers to a machine - or at least be able to allow it to facilitate a significant part of his intellectual work. The concept of deep learning is one step towards this dream. The road ahead is long, but it is already clear that neural networks and the ever-evolving approaches associated with them are capable of realizing the aspirations of science fiction writers over time.

    Deep learning is changing the paradigm of working with texts, but it is met with skepticism among computational linguists and data scientists. Neural networks are a powerful but fairly trivial machine learning tool.

    03.05.2017 Dmitry Ilvovsky, Ekaterina Chernyak

    Neural networks make it possible to find hidden connections and patterns in texts, but these connections cannot be presented explicitly. Neural networks, powerful though they are, remain a fairly trivial tool, which causes skepticism both among companies developing industrial data analysis solutions and among leading computational linguists.

    The general fascination with neural network technologies and deep learning has not bypassed computational linguistics - the automatic processing of natural language texts. At recent conferences of the Association for Computational Linguistics (ACL), the main scientific forum in this field, the vast majority of papers were devoted to the use of neural networks, both for solving known problems and for exploring new ones that have not been solved with standard machine learning methods. The increased attention of linguists to neural networks has several reasons. The use of neural networks, firstly, significantly improves the quality of solving some standard text and sequence classification problems; secondly, it reduces the labor intensity of working directly with texts; and thirdly, it allows new problems to be solved (for example, creating chat bots). At the same time, neural networks cannot be considered a completely independent mechanism for solving linguistic problems.

    The first work on deep learning dates back to the middle of the 20th century. In the early 1940s, Warren McCulloch and Walter Pitts proposed a formal model of the human brain - the artificial neural network - and a little later Frank Rosenblatt generalized their work and created a neural network model on a computer. The first work on training neural networks using the backpropagation algorithm dates back to the 1960s (the algorithm calculates the prediction error and minimizes it using stochastic optimization methods). However, it turned out that, despite the beauty and elegance of the idea of simulating the brain, training "traditional" neural networks takes a lot of time, and the classification results on small data sets are comparable to those obtained by simpler methods, such as support vector machines (SVM). As a result, neural networks were forgotten for 40 years, but today they are in demand again for working with large volumes of unstructured data, images and texts.

    From a formal point of view, a neural network is a directed graph of a given architecture whose vertices, or nodes, are called neurons. The first level of the graph contains the input nodes and the last level the output nodes, whose number depends on the task: for classification into two classes, one or two neurons can be placed in the output layer; for classification into k classes, k neurons. All other levels in the neural network graph are usually called hidden layers. All neurons at one level are connected by edges to all neurons of the next level, and each edge has a weight. Each neuron is assigned an activation function that models the behavior of biological neurons: they are "silent" when the input signal is weak, and when its value exceeds a certain threshold they fire and pass the value further along the network. The task of training a neural network from examples (that is, from object-correct answer pairs) is to find the edge weights that best predict the correct answers. It is clear that the architecture - the topology of the neural network graph - is its most important parameter. Although there is no formal definition of "deep networks" yet, it is generally accepted that they include all neural networks that consist of a large number of layers or have "non-standard" layers (for example, layers containing only selected connections or reusing other layers recursively).
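
    As an illustration of the graph just described, here is a small NumPy sketch of a forward pass through a network with one hidden layer and k = 3 output neurons; the sizes, random edge weights and tanh/softmax activations are assumptions for the example only.

    ```python
    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    # Input layer (4 nodes) -> hidden layer (5 neurons) -> output layer (k = 3 neurons).
    W1, b1 = rng.normal(scale=0.5, size=(4, 5)), np.zeros(5)   # edge weights, input -> hidden
    W2, b2 = rng.normal(scale=0.5, size=(5, 3)), np.zeros(3)   # edge weights, hidden -> output

    x = np.array([0.2, -0.7, 1.5, 0.0])   # values placed on the input nodes
    h = np.tanh(x @ W1 + b1)              # each hidden neuron fires on its weighted input
    y = softmax(h @ W2 + b2)              # scores for the 3 classes
    print(y)                              # training would adjust W1, W2 to predict correct answers
    ```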

    So far the most successful application of neural networks has been image analysis, but neural network technologies have also radically changed work with text data. Whereas previously every element of a text (a letter, word or sentence) had to be described with many features of different kinds (morphological, syntactic, semantic and so on), in many tasks the need for complex descriptions now disappears. Theorists and practitioners of neural network technologies often talk about "representation learning": in raw text, broken down only into words and sentences, a neural network is able to find dependencies and patterns and compose a feature space on its own. Unfortunately, in such a space a person will not understand anything - during training, the neural network assigns each element of the text a dense vector of numbers representing the discovered "deep" relationships. The emphasis in working with text thus shifts from constructing feature subsets and searching external knowledge bases to selecting data sources and annotating texts for subsequent neural network training, which requires significantly more data than standard methods. It is precisely because of the need for large amounts of data, and because of poor interpretability and unpredictability, that neural networks are not in demand in real industrial-scale applications, unlike other well-established learning algorithms such as random forests and support vector machines. Nevertheless, neural networks are used in a number of automatic text processing tasks (Fig. 1).

    One of the most popular applications of neural networks is the construction of word vectors, an area related to distributional semantics: it is believed that the meaning of a word can be understood from the meaning of its context, from the surrounding words. Indeed, if we are unfamiliar with some word in a text in a language we know, in most cases we can guess its meaning. The mathematical model of word meaning is the word vector: a row in a large "word-context" matrix built from a sufficiently large corpus of texts. Neighboring words, words occurring in the same syntactic or semantic construction as the given word, and so on can act as "contexts" for a particular word. The cells of such a matrix can record frequencies (how many times the word occurs in a given context), but more often they contain the positive pointwise mutual information (PPMI) coefficient, which shows how non-random the appearance of a word in a particular context was. Such matrices can be used quite successfully for clustering words or for finding words close in meaning to a query word.
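
    A minimal sketch of the PPMI weighting described above, assuming a word-by-context co-occurrence count matrix has already been built from a corpus (the toy matrix below is an assumption for illustration):

    ```python
    import numpy as np

    def ppmi(counts):
        """Turn a word-by-context co-occurrence count matrix into PPMI weights:
        PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P(c))))."""
        total = counts.sum()
        p_wc = counts / total
        p_w = p_wc.sum(axis=1, keepdims=True)   # marginal probability of each word
        p_c = p_wc.sum(axis=0, keepdims=True)   # marginal probability of each context
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
        pmi[~np.isfinite(pmi)] = 0.0            # unseen word-context pairs contribute nothing
        return np.maximum(pmi, 0.0)

    # Toy counts: 3 words x 4 contexts.
    counts = np.array([[4, 0, 1, 2],
                       [0, 3, 0, 1],
                       [2, 1, 5, 0]], dtype=float)
    print(ppmi(counts))
    ```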

    In 2013, Tomas Mikolov published a paper proposing the use of neural networks to train word vectors of smaller dimension: a neural network of the simplest architecture was trained on (word, context) tuples, and at the output each word was assigned a vector of 300 elements. It turned out that such vectors better convey the semantic proximity of words. For example, one can define arithmetic operations of adding and subtracting meanings on them and obtain equations like "Paris - France + Russia = Moscow" or "king - man + woman = queen", or find the odd word out in the series "apple, pear, cherry, kitten". The paper presented two architectures, skip-gram and CBOW (Continuous Bag of Words), under the common name word2vec. As was later shown, word2vec is nothing more than a factorization of a word-context matrix with PPMI weights. It is now customary to classify word2vec as distributional semantics rather than deep learning, but the initial impetus for creating this model was the use of a neural network. In addition, it turned out that word2vec vectors serve as a convenient representation of word meaning and can be fed as input to the deep neural networks used for text classification.
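
    The "king - man + woman" style of arithmetic reduces to vector addition and nearest-neighbour search by cosine similarity. A small sketch, assuming `vectors` would hold pretrained word2vec vectors (the random vectors below are placeholders, so the toy output is not meaningful):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    # Placeholder for pretrained 300-dimensional word2vec vectors.
    vocabulary = ["king", "queen", "man", "woman", "paris", "france", "russia", "moscow"]
    vectors = {word: rng.normal(size=300) for word in vocabulary}

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def analogy(a, b, c, topn=1):
        """Solve 'a - b + c = ?' by finding the nearest remaining vector to the result."""
        target = vectors[a] - vectors[b] + vectors[c]
        candidates = [(w, cosine(target, v)) for w, v in vectors.items() if w not in {a, b, c}]
        return sorted(candidates, key=lambda wc: -wc[1])[:topn]

    # With real word2vec vectors this returns "queen"; with the random placeholders it will not.
    print(analogy("king", "man", "woman"))
    ```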

    Text classification is one of the most pressing tasks for marketers, especially when it comes to analyzing consumer opinions or attitudes toward a product or service, so researchers are constantly working to improve the quality of its solution. However, opinion analysis is really a task of classifying sentences rather than whole texts: in a positive review the user may write one or two negative sentences, and it is important to be able to identify and analyze them as well. A well-known difficulty in classifying sentences is the variable length of the input - since sentences in texts can be of arbitrary length, it is not clear how to feed them into a neural network. One approach is borrowed from image analysis and uses convolutional neural networks (CNNs) (Fig. 2).

    The input of a convolutional neural network is a sentence in which each word is already represented by a vector (a vector of vectors). Typically, pre-trained word2vec models are used to represent words as vectors. A convolutional neural network consists of two layers: a "deep" convolution layer and an ordinary hidden layer. The convolution layer, in turn, consists of filters and a "subsampling" (pooling) layer. A filter is a neuron whose input is formed by windows that move through the text and select a certain number of words in sequence (for example, a window of length three will select the first three words, then words two to four, then three to five, and so on). At the output of the filter, a single vector is formed that aggregates all the word vectors entering it. The subsampling layer then produces a single vector for the whole sentence, computed as the component-wise maximum of all the filter output vectors. Convolutional neural networks are easy to train and implement. They are trained with the standard backpropagation algorithm, and because the filter weights are shared across window positions (the same filter is applied at every position in the text), the number of parameters in a convolutional neural network is small. From the point of view of computational linguistics, convolutional neural networks are a powerful classification tool that, however, has no linguistic intuition behind it, which significantly complicates the analysis of the algorithm's errors.
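
    A minimal Keras sketch of the convolutional sentence classifier described above (the vocabulary size, embedding dimension, filter count and window width are illustrative assumptions; in practice the embedding layer would typically be initialised with word2vec vectors):

    ```python
    from tensorflow.keras import layers, models

    vocab_size, embed_dim = 20000, 300   # assumed vocabulary and word-vector size

    model = models.Sequential([
        layers.Input(shape=(None,), dtype="int32"),             # a sentence as a sequence of word indices
        layers.Embedding(vocab_size, embed_dim),                # each word becomes a vector
        layers.Conv1D(128, kernel_size=3, activation="relu"),   # filters slide over windows of 3 consecutive words
        layers.GlobalMaxPooling1D(),                            # "subsampling": component-wise maximum over all windows
        layers.Dense(64, activation="relu"),                    # the ordinary hidden layer
        layers.Dense(1, activation="sigmoid"),                  # e.g. positive vs negative sentence
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    ```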

    Sequence classification is the task of assigning a label to each word: morphological analysis (each word is assigned a part of speech), named entity extraction (determining whether each word is part of a person's name, a geographical name, etc.), and so on. When classifying sequences, methods are used that take the word's context into account: if the previous word is part of a person's name, the current one may also be part of that name, but is unlikely to be part of the name of an organization. Recurrent neural networks, which extend the idea of the language models proposed at the end of the last century, help implement this requirement in practice. A classical language model predicts the probability that word i will occur after word i-1. Language models can also be used to predict the next word: which word is most likely to appear after the current one?

    Training language models requires large corpora - the larger the training corpus, the more word pairs the model "knows". Using neural networks for language modeling reduces the amount of data that has to be stored. Imagine a simple network architecture in which words i-2 and i-1 are fed as input and the network predicts word i at the output. Depending on the number of hidden layers and the number of neurons in them, the trained network can be stored as a few dense matrices of relatively small dimension; in other words, instead of a training corpus and all its word pairs, only a few matrices and a list of unique words need to be stored. However, such a neural language model cannot take long-range connections between words into account. This problem is solved by recurrent neural networks (Fig. 3), in which the internal state of the hidden layer is not only updated when a new word arrives at the input, but is also passed on to the next step. Thus, the hidden layer of a recurrent network receives two types of input: the state of the hidden layer at the previous step and the new word. If a recurrent neural network processes a sentence, the hidden states allow long-range connections in the sentence to be remembered and passed along. It has been verified experimentally many times that recurrent neural networks remember the gender of the subject in a sentence and choose the correct pronouns (she - her, he - his) when generating a sentence, but no one has yet managed to show explicitly how exactly this kind of information is stored in the neural network or how it is used.
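
    A bare-bones NumPy sketch of the recurrent update described above: the hidden state depends both on the current word vector and on the state carried over from the previous step (all dimensions and random weights here are assumptions for illustration):

    ```python
    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        """One step of a simple recurrent cell: the new hidden state is computed
        from the current word vector and the hidden state of the previous step."""
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    dim_x, dim_h = 300, 50                       # word-vector size and hidden-state size
    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(dim_x, dim_h))
    W_hh = rng.normal(scale=0.1, size=(dim_h, dim_h))
    b_h = np.zeros(dim_h)

    sentence = rng.normal(size=(7, dim_x))       # 7 word vectors standing in for a sentence
    h = np.zeros(dim_h)
    for x_t in sentence:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # the state is passed on to the next step
    # `h` now summarises the whole sentence; a softmax layer on top could predict the next word or a label.
    print(h.shape)
    ```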

    Recurrent neural networks are also used for text classification. In this case, the outputs of the intermediate steps are not used, and the last output of the network returns the predicted class. Today, bidirectional recurrent networks (which pass the hidden state not only "to the right" but also "to the left") with a few dozen neurons in the hidden layer have become a standard tool for text and sequence classification, as well as text generation, and have essentially replaced other algorithms.
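
    A minimal Keras sketch of such a bidirectional recurrent classifier (the vocabulary size, vector size and the use of LSTM cells are assumptions for the example):

    ```python
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(None,), dtype="int32"),
        layers.Embedding(20000, 300),               # assumed vocabulary and word-vector size
        layers.Bidirectional(layers.LSTM(64)),      # reads the sequence both left-to-right and right-to-left
        layers.Dense(1, activation="sigmoid"),      # only the final output is used: the predicted class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    ```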

    A further development of recurrent neural networks are seq2seq architectures, consisting of two connected recurrent networks, one of which is responsible for representing and analyzing the input (for example, a question or a sentence in one language), and the other for generating the output (an answer or a sentence in another language). Seq2seq networks underlie modern question-answering systems, chat bots and machine translation systems.
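
    A compact Keras sketch of the seq2seq idea, following the standard encoder-decoder pattern: one recurrent network reads the input sequence and hands its final state to a second recurrent network that generates the output sequence (token counts and the latent dimension are assumptions):

    ```python
    from tensorflow.keras import layers, models

    num_encoder_tokens, num_decoder_tokens, latent_dim = 80, 90, 256  # assumed sizes

    # Encoder: reads the input sequence and keeps only its final internal state.
    encoder_inputs = layers.Input(shape=(None, num_encoder_tokens))
    _, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

    # Decoder: generates the output sequence, starting from the encoder's state.
    decoder_inputs = layers.Input(shape=(None, num_decoder_tokens))
    decoder_outputs, _, _ = layers.LSTM(latent_dim, return_sequences=True, return_state=True)(
        decoder_inputs, initial_state=[state_h, state_c])
    decoder_outputs = layers.Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

    model = models.Model([encoder_inputs, decoder_inputs], decoder_outputs)
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
    ```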

    In addition to convolutional neural networks, so-called autoencoders are used for text analysis; they are used, for example, to create effects on images in Photoshop or Instagram, and in linguistics they have found application in dimensionality reduction (finding the projection of a vector representing a text onto a space of lower dimension). Projection onto a two-dimensional space makes it possible to represent a text as a point on a plane and to visually depict a collection of texts as a set of points, that is, it serves as a means of preliminary analysis before clustering or classifying texts. Unlike the classification task, the dimensionality reduction task has no clear quality criteria, but the images obtained with autoencoders look quite "convincing". From a mathematical point of view, an autoencoder is a neural network, trained without a teacher, that learns the identity function f(x) = x; it consists of two parts, an encoder and a decoder. The encoder is a network with several hidden layers with a decreasing number of neurons; the decoder is a similar network with an increasing number of neurons. They are connected by a hidden layer that has as many neurons as there should be dimensions in the new lower-dimensional space, and it is this layer that is responsible for dimensionality reduction. Like convolutional neural networks, an autoencoder has no linguistic interpretation, so it can be considered an engineering tool rather than an analytical one.
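
    A minimal Keras sketch of such an autoencoder, with a shrinking encoder, a mirror-image decoder and a two-neuron bottleneck used to plot texts on a plane (the input dimension and layer sizes are assumptions):

    ```python
    from tensorflow.keras import layers, models

    input_dim, bottleneck_dim = 300, 2          # e.g. a 300-dimensional text vector projected onto a plane

    inputs = layers.Input(shape=(input_dim,))
    # Encoder: hidden layers with a decreasing number of neurons.
    h = layers.Dense(128, activation="relu")(inputs)
    h = layers.Dense(32, activation="relu")(h)
    code = layers.Dense(bottleneck_dim, name="bottleneck")(h)   # the low-dimensional representation
    # Decoder: a mirror-image network with an increasing number of neurons.
    h = layers.Dense(32, activation="relu")(code)
    h = layers.Dense(128, activation="relu")(h)
    outputs = layers.Dense(input_dim)(h)

    autoencoder = models.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")   # trained to reproduce its own input: f(x) ≈ x
    encoder = models.Model(inputs, code)                # used on its own to project texts into 2D
    ```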

    Despite the impressive results, a neural network cannot be considered an independent tool for text analysis (searching for patterns in language), much less for understanding text. Yes, neural networks make it possible to find hidden connections between words and discover patterns in texts, but until these connections are presented in an interpretable form, neural networks will remain fairly trivial machine learning tools. Moreover, deep learning is not yet in demand in industrial analytical solutions, since it requires unreasonable costs for data preparation and its results are unpredictable. Even in the research community, attempts to make neural networks a universal tool draw criticism. In 2015, Chris Manning, head of the computational linguistics group at Stanford and president of the ACL, clearly outlined the scope of applicability of neural networks, limiting it to text classification, sequence classification and dimensionality reduction. Nevertheless, thanks to the marketing and popularization of deep learning, attention to computational linguistics itself and its new applications has increased.

    Literature

    1. Mikolov T. et al. Efficient Estimation of Word Representations in Vector Space. arXiv preprint. URL: http://arxiv.org/pdf/1301.3781.pdf
    2. Levy O., Goldberg Y., Dagan I. Improving Distributional Similarity with Lessons Learned from Word Embeddings // Transactions of the Association for Computational Linguistics. 2015. Vol. 3. P. 211-225. URL: https://www.transacl.org/ojs/index.php/tacl/article/view/570/124 (accessed: 05/18/2017).
    3. Velikhov P. Machine Learning for Natural Language Understanding // Open Systems.DBMS. 2016. No. 1. P. 18-21. (accessed: 05/18/2017).
    4. Manning C. Computational Linguistics and Deep Learning // Computational Linguistics. 2016. URL: http://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00239#.WQH8MBhh2qA (accessed: 05/18/2017).

    Dmitry Ilvovsky ([email protected]) is a researcher at the International Laboratory for Intelligent Systems and Structural Analysis; Ekaterina Chernyak ([email protected]) is a lecturer at the Center for Continuing Education, Faculty of Computer Science, National Research University Higher School of Economics (Moscow). The work was carried out within the framework of the Basic Research Program of the National Research University Higher School of Economics.



    From the article you will learn what deep learning is. The article also contains many resources that you can use to master this area.

    In the modern world, deep learning is used everywhere, from healthcare to manufacturing. Companies are turning to this technology to solve complex problems such as speech and object recognition, machine translation, and so on.

    One of the most impressive achievements this year was AlphaGo beating the world's best Go player. In addition to Go, machines have beaten people in other games: checkers, chess, reversi, and Jeopardy.

    A victory in a board game may seem irrelevant to solving real problems, but this is not so. Go was designed to be unbeatable by artificial intelligence: to win, a machine would need to learn something essential to this game - human intuition. Now, with the help of this development, it is possible to solve many problems that were previously inaccessible to a computer.

    Obviously, deep learning is still far from perfect, but it is already close to being commercially useful. Take self-driving cars: well-known companies like Google, Tesla and Uber are already trying to introduce autonomous cars onto city streets.

    Ford predicts a significant increase in the share of self-driving vehicles by 2021. The US government has also managed to develop a set of safety rules for them.

    What is deep learning?

    To answer this question, you need to understand how it relates to machine learning, neural networks and artificial intelligence. To do this, we will use a visualization of concentric circles:

    The outer circle is artificial intelligence in general (for example, computers). A little further in is machine learning, and right at the center are deep learning and artificial neural networks.

    Roughly speaking, deep learning is simply a more convenient name for artificial neural networks. "Deep" in this phrase refers to the degree of complexity (depth) of the neural network, which can otherwise often be quite shallow.

    The creators of the first neural networks were inspired by the structure of the cerebral cortex. The network's basic element, the perceptron, is essentially the mathematical analogue of a biological neuron. And, as in the brain, interconnected perceptrons can appear in a neural network.

    The first layer of the neural network is called the input layer. Each node in this layer receives some information as input and transmits it to subsequent nodes in other layers. Most often, there are no connections between the nodes of one layer, and the last node of the chain outputs the result of the neural network.

    The nodes in the middle are called hidden because, unlike the input and output nodes, they have no connections to the outside world. They are invoked only when the previous layers are activated.

    Deep learning is essentially a neural network training technique that uses many layers to solve complex problems (like speech recognition) using patterns. In the eighties, most neural networks were single-layer due to high cost and limited data capabilities.

    If we consider machine learning as a branch or variant of artificial intelligence, then deep learning is a specialized type of such branch.

    Machine learning uses computer intelligence that does not provide the answer immediately. Instead, the code runs on test data and, based on the correctness of its results, adjusts its course. The success of this process usually relies on a variety of techniques, special software and computer science concepts describing statistical methods and linear algebra.

    Deep learning methods

    Deep learning methods are divided into two main types:

    • Supervised learning
    • Unsupervised learning

    The first method uses specially selected data to achieve the desired result. It requires quite a lot of human intervention, because the data has to be selected manually. However, it is useful for classification and regression.

    Imagine that you are the owner of a company and you want to determine the impact of bonuses on the duration of contracts with your subordinates. With pre-collected data, a supervised learning method would be indispensable and very effective.

    The second method does not require pre-prepared answers or ready-made algorithms. It aims to identify hidden patterns in data and is typically used for clustering and association tasks, such as grouping customers by behavior. Amazon's "customers also bought" recommendations are a variant of the association task.

    While supervised learning is often quite convenient, the more difficult option is frequently still better: deep learning has proven itself as a neural network approach that does not require human supervision.
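
    To make the two families concrete, here is a small scikit-learn sketch (the data is synthetic and the choice of logistic regression and k-means is an illustrative assumption): classification learns from hand-provided answers, while clustering looks for hidden groups on its own.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                  # e.g. bonus size and contract length, as in the example above

    # Supervised: labelled answers are provided (here they are synthetic).
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    classifier = LogisticRegression().fit(X, y)
    print(classifier.predict(X[:5]))

    # Unsupervised: no answers, only the data; the algorithm looks for hidden structure.
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
    print(clusters[:5])
    ```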

    The Importance of Deep Learning

    Computers have long used technology to recognize certain features in images, but the results were far from successful. Deep learning has had an incredible impact on computer vision; it is these two techniques together that currently solve all recognition problems.

    In particular, Facebook has succeeded in recognizing faces in photographs using deep learning. This is not a simple improvement in technology, but a turning point that changes all earlier beliefs: “A person can determine with 97.53% probability whether the same person is shown in two different photographs. The program developed by the Facebook team can do this with a 97.25% probability, regardless of the lighting or whether the person is looking directly at the camera or turned sideways towards it.”

    Speech recognition has also undergone significant changes. The team at Baidu, one of China's leading search engines, has developed a speech recognition system that has managed to outpace humans in the speed and accuracy of typing text on mobile devices - in both English and Mandarin.

    What is especially interesting is that writing a common neural network for two completely different languages did not require much work: "Historically, people saw Chinese and English as two completely different languages, so each of them required a different approach," says Andrew Ng, head of the Baidu research center. "Learning algorithms are now so generalized that you can just learn."

    Google uses deep learning to manage energy in the company's data centers. They were able to reduce cooling resource costs by 40%. That's about a 15% improvement in energy efficiency and millions of dollars in savings.

    Deep learning microservices

    Here's a quick overview of deep learning-related services.

    Illustration Tagger. Powered by Illustration2Vec, this service lets you tag images as "protected", "questionable", "dangerous", "copyright" or "general" in order to understand the content of an image in advance.

    • An add-on to Google's Theano
    • Written in Python and NumPy
    • Often used for a specific range of problems
    • Not general purpose; focused on machine vision
    • Written in C++
    • Has a Python interface

    Online courses on deep learning

    Google and Udacity have teamed up to create a free deep learning course, part of the Udacity Machine Learning Course. It is taught by experienced developers who want to advance the field of machine learning and, in particular, deep learning.

    Another popular option is the machine learning course from Andrew Ng, supported by Coursera and Stanford.

    1. Machine Learning - Stanford by Andrew Ng on Coursera (2010-2014)
    2. Machine Learning - Caltech by Yaser Abu-Mostafa (2012-2014)
    3. Machine Learning - Carnegie Mellon by Tom Mitchell (Spring 2011)
    4. Neural networks for machine learning – Geoffrey Hinton on Coursera (2012)
    5. Neural Networks Class - Hugo Larochelle, Université de Sherbrooke (2013)

    Books on deep learning

    While the resources in the previous section draw on a fairly extensive knowledge base, Grokking Deep Learning, on the contrary, is aimed at beginners. As the authors say: “If you have completed 11th grade and have a rough understanding of how to write Python, we will teach you deep learning.”

    A popular alternative to this book is a book with the self-explanatory title Deep Learning Book. It's especially good because it covers all the math you'll need to get into this area.

    1. "Deep Learning" by Yoshua Bengio, Ian Goodfellow and Aaron Courville (2015)
    2. “Neural networks and deep learning” by Michael Nielsen (2014)
    3. "Deep Learning" from Microsoft Research (2013)
    4. “Deep Learning Tutorials” from LISA Laboratory, University of Montreal (2015)
    5. “neuraltalk” by Andrej Karpathy
    6. "Introduction to Genetic Algorithms"
    7. "Modern approach to artificial intelligence"
    8. "Overview of deep learning and neural networks"

    Videos and lectures

    Deep Learning Simplified is a wonderful YouTube channel. Here's their first video:

    "(Manning Publications).

    This article is intended for people who already have significant experience with deep learning (for example, those who have already read chapters 1-8 of that book). It assumes a large amount of prior knowledge.

    Deep Learning: Geometric View

    The most amazing thing about deep learning is how simple it is. Ten years ago, no one could have imagined the amazing results we would achieve in machine perception problems using simple parametric models trained with gradient descent. Now it turns out that all we need are sufficiently large parametric models trained on a sufficiently large number of samples. As Feynman once said about the Universe: "It's not complicated, there's just a lot of it."

    In deep learning, everything is a vector, that is, a point in a geometric space. The input data of the model (text, images and so on) and its targets are first "vectorized", that is, translated into some initial vector space as input and a target vector space as output. Each layer in a deep learning model performs one simple geometric transformation on the data that passes through it. Together, the chain of layers forms one very complex geometric transformation broken down into a series of simple ones. This complex transformation attempts to map the input data space onto the target space, point by point. The transformation's parameters are determined by the layer weights, which are constantly updated based on how well the model is currently performing. A key feature of this geometric transformation is that it must be differentiable, so that we can learn its parameters through gradient descent. Intuitively, this means that the geometric morphing must be smooth and continuous - an important constraint.
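
    In Keras terms, such a chain of simple differentiable transformations can be written down directly; each Dense layer below is an affine map followed by a smooth non-linearity, and because every step is differentiable the whole chain can be adjusted by gradient descent (the dimensions and layer sizes are assumptions for illustration):

    ```python
    from tensorflow.keras import layers, models, optimizers

    model = models.Sequential([
        layers.Input(shape=(100,)),                  # the "vectorised" input space
        layers.Dense(64, activation="tanh"),         # one simple geometric transformation
        layers.Dense(64, activation="tanh"),         # another one, chained after the first
        layers.Dense(10, activation="softmax"),      # mapping into the target space
    ])
    # Gradient descent incrementally adjusts the layer weights (the transformation's parameters).
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01), loss="categorical_crossentropy")
    ```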

    The entire process of applying this complex geometric transformation to the input data can be visualized in 3D by imagining a person trying to flatten out a crumpled paper ball: the crumpled ball is the manifold of input data that the model starts with. Each movement the person makes with the paper is like a simple geometric transformation performed by a single layer. The complete sequence of unfolding gestures is the complex transformation of the entire model. Deep learning models are mathematical machines for unraveling intricate manifolds of multidimensional data.

    That is the magic of deep learning: turning meaning into vectors, into geometric spaces, and then gradually learning complex geometric transformations that map one space onto another. All that is needed is a space of sufficiently high dimension to capture the full range of relationships found in the original data.

    Limitations of Deep Learning

    The range of problems that can be solved with this simple strategy is almost endless. And yet many of them are still beyond the reach of current deep learning techniques, even with huge amounts of manually annotated data. Suppose, for example, that you could assemble a data set of hundreds of thousands, even millions, of English-language descriptions of software features written by product managers, together with the corresponding source code developed by engineering teams to meet those requirements. Even with this data, you could not train a deep learning model to simply read a product description and generate the corresponding code base. This is just one example among many. In general, anything that requires reasoning - like programming or applying the scientific method, long-term planning, algorithmic-style data manipulation - is beyond the capabilities of deep learning models, no matter how much data you throw at them. Even training a neural network to perform a sorting algorithm is incredibly difficult.

    The reason is that a deep learning model is "only" a chain of simple, continuous geometric transformations mapping one vector space into another. All it can do is map one data set X onto another set Y, provided that a learnable continuous transformation from X to Y exists and that a dense sampling of X:Y pairs is available as training data. So while a deep learning model can be considered a kind of program, most programs cannot be expressed as deep learning models: for most tasks either there is no deep neural network of practical size that solves the problem, or, if one exists, it may not be learnable - the corresponding geometric transformation may be too complex, or there may be no suitable data to train it.

    Scaling up existing deep learning techniques—adding more layers and using more training data—can only superficially mitigate some of these problems. It will not solve the more fundamental problem that deep learning models are very limited in what they can represent, and that most programs cannot be expressed as a continuous geometric morphing of data manifolds.

    The Risk of Anthropomorphizing Machine Learning Models

    One of the very real risks of modern AI is misinterpreting how deep learning models work and exaggerating their capabilities. A fundamental feature of the human mind is our "theory of mind", the tendency to project goals, beliefs and knowledge onto the things around us. A drawing of a smiling face on a stone suddenly makes the stone "happy" - in our minds. Applied to deep learning, this means, for example, that if we can more or less successfully train a model to generate text descriptions of pictures, we tend to think that the model "understands" the content of the images as well as the descriptions it generates. We are then greatly surprised when a small deviation from the set of images in the training data causes the model to generate absolutely absurd descriptions.

    This is most evident in "adversarial examples": samples of input data for a deep learning network that are specifically crafted to be misclassified. You already know that you can perform gradient ascent in the input space to generate samples that maximize the activation of, say, a particular convolutional filter - this is the basis of the visualization technique covered in Chapter 5 (of Deep Learning with Python), as well as the Deep Dream algorithm from Chapter 8. In a similar way, through gradient ascent, you can slightly modify an image to maximize the prediction for a given class. If we take a photo of a panda and add a "gibbon" gradient, we can force the neural network to classify the panda as a gibbon. This demonstrates both the fragility of these models and the deep difference between the input-to-output mapping they perform and our own human perception.
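
    A minimal sketch of that gradient-ascent-on-the-input trick in TensorFlow, assuming `model` is a pretrained Keras image classifier and `image` is a correctly classified picture (the step size `eps` is an illustrative assumption):

    ```python
    import tensorflow as tf

    def adversarial_example(model, image, target_class, eps=0.01):
        """Nudge `image` by a small step that increases the score of `target_class`
        (gradient ascent on the input, not on the weights)."""
        x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)   # add a batch dimension
        with tf.GradientTape() as tape:
            tape.watch(x)
            score = model(x)[0, target_class]       # e.g. the "gibbon" class score for a panda photo
        grad = tape.gradient(score, x)
        return x + eps * tf.sign(grad)              # looks unchanged to a human, fools the network
    ```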

    In general, deep learning models have no understanding of their input, at least not in any human sense. Our own understanding of images, sounds and language is grounded in our sensorimotor experience as humans - as embodied earthly creatures. Machine learning models have no access to such experience and therefore cannot "understand" their input in any human-like way. By annotating a large number of training examples for our models, we force them to learn a geometric transformation that reduces the data to human concepts for that specific set of examples, but this transformation is only a simplified sketch of the original model in our minds, developed from our experience as bodily agents - like a faint reflection in a mirror.

    As a machine learning practitioner, always keep this in mind and never fall into the trap of believing that neural networks understand the task they are performing - they don't, at least not in any way that makes sense to us. They were trained on a different, much narrower task than the one we want them to perform: simply mapping training inputs to training targets, point by point. Show them anything that differs from the training data and they will break in the most absurd ways.

    Local generalization versus extreme generalization

    There seem to be fundamental differences between the direct geometric morphing from input to output that deep learning models perform and the way humans think and learn. It is not just that people learn from their own embodied experience rather than by processing a set of training samples. Beyond the difference in learning processes, there is a fundamental difference in the nature of the underlying representations.

    Humans are capable of much more than mapping an immediate stimulus to an immediate response, as a neural network, or perhaps an insect, does. People maintain complex, abstract models of their current situation, of themselves and of other people, and can use these models to predict various possible futures and carry out long-term planning. They can combine known concepts to imagine something they have never experienced - drawing a horse in jeans, for example, or picturing what they would do if they won the lottery. This ability to think hypothetically, to extend our mental model space far beyond what we have directly experienced - that is, the ability to perform abstraction and reasoning - is perhaps the defining characteristic of human cognition. I call it "extreme generalization": the ability to adapt to new, never-before-experienced situations using little or no data.

    This stands in stark contrast to what deep learning networks do, which I would call "local generalization": the mapping from input to output quickly stops making sense as soon as new input differs even slightly from what the network saw during training. Consider, for example, the problem of learning the appropriate launch parameters for a rocket meant to land on the Moon. If you used a neural network for this task, trained with supervision or reinforcement, you would need to feed it thousands or even millions of launch trajectories - that is, you would need a dense sampling of the input space in order to learn a reliable mapping from input space to output space. In contrast, humans can use the power of abstraction to create physical models - rocket science - and derive an exact solution that will get a rocket to the Moon after just a few trials. Likewise, if you developed a neural network to control a human body and wanted it to learn to walk safely through a city without being hit by cars, the network would have to die many thousands of times in different situations before it concluded that cars are dangerous and developed appropriate avoidance behavior. Moved to a new city, it would have to relearn most of what it knew. Humans, on the other hand, are able to learn safe behavior without ever dying - again, thanks to the power of abstract modeling of hypothetical situations.

    So, despite our progress in machine perception, we are still very far from human-level AI: our models can only perform local generalization, adapting to new situations that must be very close to past data, while the human mind is capable of extreme generalization, quickly adapting to completely new situations or planning far into the future.

    Conclusions

    Here's what you need to remember: the only real success of deep learning so far is the ability to map space X to space Y using a continuous geometric transformation, given a large amount of human-annotated data. Doing this well is a revolutionary advance for an entire industry, but human-level AI is still a long way off.

    To remove some of these limitations and begin competing with the human brain, we need to move away from direct input-to-output mapping and toward reasoning and abstraction. Computer programs may be a suitable substrate for abstractly modeling various situations and concepts. We have said before (in Deep Learning with Python) that machine learning models can be defined as "programs that learn"; at the moment we can only train a narrow and specific subset of all possible programs. But what if we could train any program, modularly and iteratively? Let's see how we might get there.

    The Future of Deep Learning

    Given what we know about deep learning networks, their limitations and the current state of research, can we predict what will happen in the medium term? Here are some of my personal thoughts on the matter. Keep in mind that I don't have a crystal ball, so much of what I expect may never come to pass. This is pure speculation. I share these predictions not because I expect them to be fully realized, but because they are interesting and applicable to the present.

    At a high level, here are the main areas that I consider promising:

    • Models will approach general-purpose computer programs, built on top of much richer primitives than our current differentiable layers - this is how we will get reasoning and abstraction, whose absence is the fundamental weakness of current models.
    • New forms of learning will emerge that make this possible, allowing models to move beyond merely differentiable transformations.
    • Models will require less input from developers - it should not be your job to endlessly turn knobs.
    • There will be greater, systematic reuse of previously learned features and architectures: meta-learning systems built on reusable and modular subroutines.
    Additionally, note that these considerations do not apply specifically to supervised learning, which is still the bread and butter of machine learning - they apply to any form of machine learning, including unsupervised, self-supervised and reinforcement learning. It does not fundamentally matter where your labels come from or what your learning loop looks like; these different branches of machine learning are simply different facets of the same construct.

    So, go ahead.

    Models as programs

    As we noted earlier, a necessary transformational development we can expect in machine learning is a move away from models that perform pure pattern recognition and are capable only of local generalization, toward models capable of abstraction and reasoning that can achieve extreme generalization. Current AI programs with basic reasoning abilities are all hard-coded by human programmers: for example, programs that rely on search algorithms, graph manipulation or formal logic. In DeepMind's AlphaGo, for example, much of the "intelligence" on display is designed and hard-coded by expert programmers (such as the Monte Carlo tree search); learning from new data occurs only in specialized submodules - the value networks and policy networks. In the future, such AI systems could be trained entirely without human involvement.

    How can this be achieved? Consider a well-known type of network: the RNN. Importantly, RNNs have slightly fewer limitations than feedforward networks, because RNNs are a little more than mere geometric transformations: they are geometric transformations applied repeatedly inside a for loop. The for loop itself is hard-coded by the developer: it is a built-in assumption of the network. Naturally, RNNs are still very limited in what they can represent, mainly because each step they perform is still a differentiable geometric transformation and because they pass information from step to step through points in a continuous geometric space (state vectors). Now imagine neural networks "augmented" with programming primitives in the same way as with for loops - but not just a single hard-coded for loop with hard-wired geometric memory, rather a large set of programming primitives that the model could freely use to expand its processing capabilities: if branches, while statements, variable creation, disk storage for long-term memory, sorting operators, advanced data structures such as lists, graphs and hash tables, and much more. The space of programs such a network could represent would be far broader than what existing deep learning networks can express, and some of these programs could achieve superior generalization power.

    In short, we will move away from having "hard-coded algorithmic intelligence" (hand-written software) on the one hand and "learned geometric intelligence" (deep learning) on the other. Instead, we will end up with a blend of formal algorithmic modules that provide reasoning and abstraction capabilities, and geometric modules that provide informal intuition and pattern-recognition capabilities. The whole system will be trained with little or no human involvement.

    A related area of AI that I think could soon make big strides is program synthesis, in particular neural program synthesis. Program synthesis consists of automatically generating simple programs by using a search algorithm (perhaps a genetic search, as in genetic programming) to explore a large space of possible programs. The search stops when a program is found that meets the required specification, often provided as a set of input-output pairs. As you can see, this is very similar to machine learning: the "training data" is provided as input-output pairs, and we find a "program" that maps inputs to outputs and is capable of generalizing to new inputs. The difference is that instead of learning parameter values in a hard-coded program (a neural network), we generate source code through a discrete search process.
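
    To make this concrete, here is a toy sketch of program synthesis as discrete search: it enumerates short compositions of a few hand-picked primitive functions and stops at the first program that reproduces a given set of input-output pairs. The primitives and the target specification are illustrative assumptions, chosen only to keep the example tiny.

```python
from itertools import product

# Hand-picked primitive operations (illustrative, not a standard primitive library).
primitives = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}

# The specification, given as input-output pairs; the hidden target here is (x + 1) * 2.
examples = [(0, 2), (1, 4), (3, 8)]

def synthesize(max_len=3):
    """Enumerate compositions of primitives, shortest first, until one fits the spec."""
    for length in range(1, max_len + 1):
        for names in product(primitives, repeat=length):
            def run(x, names=names):
                for name in names:
                    x = primitives[name](x)
                return x
            if all(run(i) == o for i, o in examples):
                return names  # first program consistent with all examples
    return None

print(synthesize())  # ('inc', 'double')
```

    Real program synthesis systems search far larger spaces with smarter strategies, but the structure - a specification, a space of programs, and a discrete search - is the same.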

    I definitely expect a resurgence of interest in this area over the next few years. In particular, I expect cross-fertilization between the adjacent fields of deep learning and program synthesis, where we will not simply generate programs in general-purpose languages, but will generate neural networks (geometric data-processing flows) augmented with a rich set of algorithmic primitives, such as for loops - and many others. This should be far more convenient and useful than generating source code directly, and it will dramatically broaden the range of problems that can be solved with machine learning - the space of programs we can generate automatically, given appropriate training data. A blend of symbolic AI and geometric AI. Modern RNNs can be seen as a historical ancestor of such hybrid algorithmic-geometric models.


    Figure: A learned program that relies simultaneously on geometric primitives (pattern recognition, intuition) and algorithmic primitives (reasoning, search, memory).

    Beyond backpropagation and differentiable layers

    If machine learning models become more like programs, they will mostly no longer be differentiable - these programs will certainly still use continuous geometric layers as subroutines, and those layers will remain differentiable, but the model as a whole will not be. As a result, using backpropagation to adjust weight values in a fixed, hard-coded network may not remain the preferred method for training models in the future - at the very least, it cannot be the only method. We need to figure out how to train non-differentiable systems efficiently. Current approaches include genetic algorithms, "evolution strategies", certain reinforcement learning methods, and ADMM (the alternating direction method of multipliers). Naturally, gradient descent is here to stay - gradient information will always be useful for optimizing differentiable parametric functions. But our models will certainly become more ambitious than mere differentiable parametric functions, so their automated development ("training" in "machine learning") will require more than backpropagation.
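
    As one concrete illustration of gradient-free training, here is a minimal sketch in the spirit of evolution strategies: parameters are perturbed with random noise, each perturbation is scored, and the parameters are nudged toward the better-scoring directions, without ever computing a gradient of the objective. The quadratic objective is a stand-in for the score of a non-differentiable model.

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(theta):
    # Stand-in score to maximize; a real use case would run a non-differentiable model here.
    return -np.sum((theta - 3.0) ** 2)

theta = np.zeros(5)
sigma, lr, population = 0.1, 0.02, 50      # noise scale, step size, perturbations per step

for step in range(300):
    noise = rng.normal(size=(population, theta.size))
    scores = np.array([objective(theta + sigma * n) for n in noise])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # normalize scores for stability
    theta += lr / (population * sigma) * noise.T @ scores      # move toward better-scoring perturbations

print(theta.round(2))  # approaches [3. 3. 3. 3. 3.] without using any gradients
```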

    Additionally, backpropagation is end-to-end, which is well suited to learning good chained transformations but is fairly computationally inefficient, because it does not fully exploit the modularity of deep networks. To make anything more efficient, there is one universal recipe: introduce modularity and hierarchy. So we can make backpropagation itself more efficient by introducing decoupled training modules with some synchronization mechanism between them, organized hierarchically. This strategy is partly reflected in DeepMind's recent work on "synthetic gradients". I expect much, much more work in this direction in the near future.
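
    Below is a toy NumPy sketch loosely inspired by the synthetic-gradients idea - a simplified illustration, not DeepMind's method: the first layer of a two-layer regression network updates immediately using a locally predicted gradient, while a small linear module is trained on the side so that its prediction matches the true gradient once that gradient becomes available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: fit y = sin(x) with a decoupled two-layer network.
X = rng.uniform(-3, 3, size=(256, 1))
Y = np.sin(X)

W1 = rng.normal(0, 0.5, (1, 16)); b1 = np.zeros(16)   # layer 1: x -> h (tanh)
W2 = rng.normal(0, 0.5, (16, 1)); b2 = np.zeros(1)    # layer 2: h -> y_hat (linear)

# Synthetic-gradient module: a crude linear predictor of dL/dh from h alone,
# so layer 1 can update without waiting for layer 2's backward pass.
M = np.zeros((16, 16)); c = np.zeros(16)

lr, sg_lr = 0.05, 0.01
for step in range(2000):
    idx = rng.integers(0, len(X), 32)
    x, y = X[idx], Y[idx]

    h = np.tanh(x @ W1 + b1)          # forward, layer 1
    y_hat = h @ W2 + b2               # forward, layer 2

    # Layer 1 updates right away with the *predicted* gradient w.r.t. h.
    g_hat = h @ M + c
    dh_pre = g_hat * (1 - h ** 2)                 # chain rule through tanh
    W1 -= lr * x.T @ dh_pre / len(x)
    b1 -= lr * dh_pre.mean(axis=0)

    # Layer 2 computes the true MSE gradient and its own update.
    err = y_hat - y                               # dL/dy_hat (up to a constant)
    g_true = err @ W2.T                           # true dL/dh
    W2 -= lr * h.T @ err / len(x)
    b2 -= lr * err.mean(axis=0)

    # Train the synthetic-gradient module to match the true gradient.
    diff = g_hat - g_true
    M -= sg_lr * h.T @ diff / len(x)
    c -= sg_lr * diff.mean(axis=0)

print("final MSE:", float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2)))
```

    In the real work the decoupled modules run asynchronously and the gradient predictors are richer models; the point of the sketch is only the decoupling itself.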

    One can imagine a future in which globally non-differentiable models (with differentiable parts) are trained - grown - using an efficient search process that does not rely on gradients, while the differentiable parts are trained even faster using gradients and some more efficient version of backpropagation.

    Automated Machine Learning

    In the future, model architectures themselves will be learned rather than hand-crafted by engineers. Learning architectures goes hand in hand with the use of richer sets of primitives and program-like machine learning models.

    Nowadays, a deep learning developer spends most of their time endlessly massaging data with Python scripts and then tuning the architecture and hyperparameters of a deep network at length to get a working model - or even an outstanding model, if the developer is that ambitious. Needless to say, this is not the best state of affairs. But AI can help here too. Unfortunately, the data processing and preparation part is hard to automate, because it often requires domain knowledge as well as a clear, high-level understanding of what the developer wants to achieve. Hyperparameter tuning, however, is a simple search procedure, and in that case we already know what the developer wants to achieve: it is defined by the loss function of the network being tuned. It has already become common practice to set up basic AutoML systems that take care of most of the model knob-tweaking. I set one up myself to win a Kaggle competition.

    At the most basic level, such a system would simply tune the number of layers in the stack, their order, and the number of units or filters in each layer. This is usually done with libraries like Hyperopt, which we discussed in Chapter 7 (note: of the book "Deep Learning with Python"). But we can go much further and try to learn an appropriate architecture from scratch, with as few constraints as possible. This is possible, for example, with reinforcement learning or with genetic algorithms.
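
    As a rough, illustrative sketch of what such basic hyperparameter tuning looks like with Hyperopt: the search space (depth, width, L2 penalty) and the toy scikit-learn objective below are my own assumptions, not a recipe from the book.

```python
from hyperopt import fmin, tpe, hp
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for a real problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Search space: number of layers in the stack, units per layer, L2 penalty.
space = {
    "n_layers": hp.choice("n_layers", [1, 2, 3]),
    "n_units": hp.choice("n_units", [16, 32, 64, 128]),
    "alpha": hp.loguniform("alpha", -8, 0),
}

def objective(params):
    model = MLPClassifier(
        hidden_layer_sizes=(params["n_units"],) * params["n_layers"],
        alpha=params["alpha"],
        max_iter=300,
        random_state=0,
    )
    # Hyperopt minimizes, so return 1 - cross-validated accuracy.
    return 1.0 - cross_val_score(model, X, y, cv=3).mean()

best = fmin(objective, space, algo=tpe.suggest, max_evals=25)
print(best)  # note: for hp.choice parameters, Hyperopt reports the index of the chosen option
```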

    Another important direction in which AutoML could develop is learning the model architecture jointly with the model weights. Training a model from scratch every time we try a slightly different architecture is extremely inefficient, so a truly powerful AutoML system would evolve architectures while the model's features are being tuned via backpropagation on the training data, thereby eliminating the redundant computation. As I write these lines, such approaches have already begun to appear.

    When all of this starts to happen, developers of machine learning systems will not be left without work - they will move higher up the value chain. They will begin to put much more effort into crafting complex loss functions that genuinely reflect business objectives, and into deeply understanding how their models affect the digital ecosystems in which they are deployed (for example, the users who consume the model's predictions and generate the data it is trained on) - problems that only the largest companies can currently afford to consider.

    Lifelong learning and reuse of modular routines

    If models become more complex and are built on richer algorithmic primitives, this increased complexity will demand more intensive reuse across tasks, rather than training a model from scratch every time we face a new task or a new dataset. After all, many datasets do not contain enough information to develop a new complex model from scratch, and it will become necessary to draw on information from previously seen datasets. You do not re-learn English every time you open a new book - that would be impossible. Moreover, training models from scratch on every new problem is very inefficient, given the significant overlap between current problems and those encountered before.

    In addition, a remarkable observation has been made repeatedly in recent years: training the same model to perform several weakly related tasks at once improves its results on each of those tasks. For example, training the same neural network to translate from English to German and from French to Italian produces a model that is better at each of those language pairs. Training an image classification model jointly with an image segmentation model, on a single shared convolutional base, produces a model that is better at both tasks. And so on. This is quite intuitive: there is always some information that overlaps between seemingly unrelated tasks, so the joint model has access to more information about each individual task than a model trained on that specific task alone.
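
    A minimal sketch of the second setup - one shared convolutional base feeding both a classification head and a coarse segmentation head - might look like this in Keras; the input size, layer widths, and losses are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(128, 128, 3))

# Shared convolutional base: both tasks reuse these features.
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
base = layers.MaxPooling2D(2)(x)

# Head 1: image classification over 10 classes.
c = layers.GlobalAveragePooling2D()(base)
class_out = layers.Dense(10, activation="softmax", name="cls")(c)

# Head 2: coarse, downscaled segmentation mask.
seg_out = layers.Conv2D(1, 1, activation="sigmoid", name="seg")(base)

model = keras.Model(inputs, [class_out, seg_out])
model.compile(
    optimizer="adam",
    loss={"cls": "sparse_categorical_crossentropy", "seg": "binary_crossentropy"},
)
model.summary()  # one base, two task-specific heads trained jointly
```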

    What we actually do when we reuse a model across tasks is rely on pre-trained weights for models that perform common functions, such as visual feature extraction. You saw this in practice in Chapter 5. I expect a more general version of this technique to become commonplace in the future: we will reuse not only previously learned features (sub-model weights), but also model architectures and training procedures. As models become more like programs, we will begin to reuse subroutines, just as we reuse functions and classes in ordinary programming languages.
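
    Reusing pre-trained weights for visual feature extraction, the Chapter 5 pattern mentioned above, typically looks something like this in Keras; the 150 x 150 input size and the small dense head here are illustrative choices.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Pre-trained convolutional base: generic visual features learned on ImageNet are reused as-is.
conv_base = keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(150, 150, 3)
)
conv_base.trainable = False  # freeze the reused weights

# New, task-specific classifier head trained on top of the frozen features.
model = keras.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```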

    Think about what the software development process looks like today: once an engineer solves a particular problem (HTTP requests in Python, for example), they package it as an abstract library for reuse. Engineers who face a similar problem later simply search for existing libraries, download one, and use it in their own project. Likewise, in the future, meta-learning systems will be able to assemble new programs by sifting through a global library of high-level reusable blocks. If the system finds itself developing similar subroutines for several different tasks, it will release an "abstract", reusable version of that subroutine and store it in the global library. This process opens the door to abstraction, a necessary component for achieving "ultimate generalization": a subroutine that proves useful across many tasks and domains can be said to "abstract" some aspect of problem solving. This definition of "abstraction" is close to the notion of abstraction in software engineering. These subroutines can be either geometric (deep learning modules with pre-trained representations) or algorithmic (closer to the libraries that modern programmers work with).

    Figure: A meta-learning system that can quickly develop task-specific models using reusable primitives (algorithmic and geometric), thereby achieving "ultimate generalization".

    The result: a long-term vision

    In short, here is my long-term vision for machine learning:
    • Models will become more like programs and will have capabilities that extend far beyond the continuous geometric transformations of input data that we work with today. These programs will arguably be much closer to the abstract mental models that people maintain about their environment and themselves, and they will be capable of stronger generalization thanks to their algorithmic nature.
    • In particular, models will blend algorithmic modules providing formal reasoning, search, and abstraction capabilities with geometric modules providing informal intuition and pattern recognition. AlphaGo (a system that required intensive manual programming and architectural decisions) is an early example of what such a merger of symbolic and geometric AI might look like.
    • Models will be grown automatically (rather than written by hand by human programmers), using modular parts from a global library of reusable subroutines - a library that has evolved by assimilating high-performing models from thousands of previous problems and datasets. Once the meta-learning system identifies recurring problem-solving patterns, they are turned into reusable subroutines - much like functions and classes in modern programming languages - and added to the global library. This is how the capacity for abstraction is achieved.
    • The global library and the associated model-growing system will be able to achieve some form of human-like "ultimate generalization": when faced with a new task or a new situation, the system will be able to assemble a new working model for that task using very little data, thanks to 1) rich program-like primitives that generalize well and 2) extensive experience with similar problems. In the same way, people can quickly learn a new, complex video game because they have prior experience with many other games, and because the models built from that prior experience are abstract and program-like, rather than simple mappings from stimulus to action.
    • Essentially, this continuously learning, model-growing system can be interpreted as strong artificial intelligence. But do not expect some kind of singularitarian robot apocalypse: that is pure fantasy, born of a long list of deep misunderstandings of both intelligence and technology. Such criticism, however, has no place here.