Yandex open-sources its CatBoost machine learning technology. Comparing deep learning libraries on the handwritten digit classification problem

  • Python
  • Machine learning
  • Search technologies
    Today Yandex released its CatBoost library as open source, developed on the basis of the company's many years of experience in machine learning. It can be used to effectively train models on heterogeneous data, including data that is hard to represent as numbers (for example, cloud types or product categories). The source code, documentation, benchmarks, and the necessary tools have already been published on GitHub under the Apache 2.0 license.

    CatBoost is a new machine learning method based on gradient boosting. It is being rolled out at Yandex for ranking, prediction, and recommendation tasks. It is also already used in our collaboration with the European Organization for Nuclear Research (CERN) and by industrial clients of Yandex Data Factory. So how does CatBoost differ from other open-source analogues? Why boosting rather than neural networks? How is this technology related to the well-known Matrixnet? And what do cats have to do with it? Today we will answer all these questions.

    The term "machine learning" appeared back in the 1950s. It refers to the attempt to teach a computer to solve problems that are easy for humans but hard to formalize. As a result of machine learning, a computer can exhibit behavior that was not explicitly programmed into it. In the modern world we encounter the fruits of machine learning many times a day, often without realizing it. It is used to build feeds in social networks, to compile lists of "similar products" in online stores, to issue bank loans, and to set insurance prices. Machine learning technologies are used to find faces in photographs and in numerous photo filters. For the latter, by the way, neural networks are usually used, and they are written about so often that one might mistakenly conclude they are a "silver bullet" for problems of any complexity. But that is not the case.

    Neural networks or gradient boosting

    In fact, machine learning is a very diverse field: there are many different methods, and neural networks are just one of them. This is well illustrated by competitions on the Kaggle platform, where different methods win different competitions, and gradient boosting wins many of them.

    Neural networks are excellent at certain problems, for example those involving homogeneous data such as images, sound, or text. At Yandex they help us better understand search queries, find similar pictures on the Internet, recognize your voice in Navigator, and much more. But these are not the only tasks for machine learning. There is a whole layer of serious challenges that cannot be solved by neural networks alone; they need gradient boosting. This method is indispensable where there is a lot of data and its structure is heterogeneous.

    For example, if you need an accurate weather forecast that takes into account a huge number of factors (temperature, humidity, radar data, user observations, and many others). Or if you need high-quality ranking of search results, which is what prompted Yandex to develop its own machine learning method.

    Matrixnet

    The first search engines were not as complex as they are now. In fact, at first it was just keyword search: there were so few sites that there was little competition between them. Then there were more pages, and it became necessary to rank them. Ranking grew more sophisticated: word frequency and tf-idf began to be taken into account. Then there were too many pages on any given topic, and the first important breakthrough occurred: search engines began to take links into account.

    Soon the Internet became commercially important, and many scammers appeared trying to fool the simple algorithms of the time. Then a second important breakthrough occurred: search engines began to use their knowledge of user behavior to understand which pages were good and which were not.

    About ten years ago, the human mind was no longer sufficient to work out how to rank documents. You have probably noticed that the number of results for almost any query is huge: hundreds of thousands, often millions. Most of them are uninteresting, useless, mention the query words only incidentally, or are simply spam. To answer your query, the top ten must be selected instantly from all the results found. Writing a program that does this with acceptable quality became beyond the power of a human programmer. The next transition occurred: search engines began to actively use machine learning.

    Back in 2009, Yandex introduced its own Matrixnet method, based on gradient boosting. One could say that ranking is helped by the collective intelligence of users, the "wisdom of the crowd". Information about sites and user behavior is converted into many factors, each of which Matrixnet uses to build a ranking formula. In effect, the ranking formula is now written by a machine. By the way, we also use the outputs of neural networks as individual factors (for example, this is how the Palekh algorithm works, which we wrote about last year).

    An important feature of Matrixnet is that it is resistant to overfitting. This makes it possible to take a large number of ranking factors into account while learning from a relatively small amount of data, without fear that the machine will find non-existent patterns. Other machine learning methods either build simpler formulas with fewer factors or require a larger training sample.

    Another important feature of Matrixnet is that the ranking formula can be tuned separately for fairly narrow classes of queries, for example, to improve search quality only for queries about music, without degrading ranking for other classes of queries.

    It was Matrixnet and its advantages that formed the basis of CatBoost. But why did we need to invent something new at all?

    Almost every modern method based on gradient boosting works with numbers. Even if your input consists of music genres, cloud types, or colors, this data still has to be described in the language of numbers. That distorts its nature and can reduce the accuracy of the model.

    Let's demonstrate this with a primitive example: a store's product catalog. The products have little relation to each other, and there is no pattern between them that would allow them to be ordered and assigned a meaningful number. Therefore, in practice each product is simply assigned a serial id (for example, according to the store's accounting program). The order of these ids means nothing, yet the algorithm will use that order and draw false conclusions from it.
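
    To make this concrete, here is a small hedged sketch (hypothetical data, using pandas): a naive serial encoding assigns the categories integer ids whose order means nothing, yet any split on that column treats the order as if it were meaningful.

```python
import pandas as pd

# Hypothetical product catalog: the categories have no natural order
products = pd.DataFrame({
    "category": ["toys", "groceries", "electronics", "toys", "books"],
    "price":    [12.5, 3.2, 199.0, 7.9, 15.0],
})

# Naive label encoding: each category gets an arbitrary integer id
products["category_id"] = products["category"].astype("category").cat.codes
print(products)
# A tree split such as "category_id < 2" now groups categories purely by
# the accident of id assignment, which is exactly the false pattern
# described above.
```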

    An experienced machine learning specialist can come up with a smarter way to turn categorical features into numbers, but such pre-processing loses part of the information and degrades the quality of the final solution.

    That is why it was important to teach the machine to work not only with numbers but also with categories directly, identifying the patterns between them on its own, without our manual "help". So we designed CatBoost to work equally well out of the box with both numeric and categorical features. Thanks to this, it delivers higher training quality on heterogeneous data than alternative solutions. It can be used in very different areas, from banking to industry.
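
    Here is a minimal sketch of how this looks with the Python package (the data and column names are made up for illustration; `cat_features` is how CatBoost is told which columns are categorical, so no manual encoding is needed):

```python
import pandas as pd
from catboost import CatBoostClassifier

# Hypothetical training data: a mix of categorical and numeric features
X_train = pd.DataFrame({
    "category": ["toys", "groceries", "electronics", "books"],
    "price":    [12.5, 3.2, 199.0, 15.0],
    "country":  ["US", "DE", "US", "FR"],
})
y_train = [0, 1, 0, 1]

# Declare the categorical columns; CatBoost handles them natively
model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(X_train, y_train, cat_features=["category", "country"])

X_new = pd.DataFrame({"category": ["toys"], "price": [9.9], "country": ["DE"]})
print(model.predict(X_new))
```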

    By the way, the name of the technology comes from Categorical Boosting. And not a single cat was harmed during development.

    Benchmarks

    We could talk at length about the theoretical differences of the library, but it is better to show it in practice. For clarity, we compared CatBoost with the open-source analogues XGBoost, LightGBM, and H2O on a set of public datasets. Here are the results (lower is better): https://catboost.yandex/#benchmark

    We do not want to make unsubstantiated claims, so along with the library we have published a description of the comparison methodology, the code for running the comparison, and a container with the versions of all the libraries used. Any user can repeat the experiment at home or on their own data.

    CatBoost in practice

    The new method has already been tested on Yandex services. It was used to improve search results, to rank the Yandex.Zen recommendation feed, and to compute the weather forecast in the Meteum technology, and in all cases it performed better than Matrixnet. In the future, CatBoost will be rolled out to other services. But we won't stop there; it is better to tell you right away about the Large Hadron Collider (LHC).

    CatBoost has also found application in our collaboration with the European Organization for Nuclear Research. The LHC operates the LHCb detector, which studies the asymmetry of matter and antimatter in interactions of heavy beauty quarks. To accurately track the various particles registered in an experiment, the detector has several specialized parts, each of which determines particular properties of the particles. The most challenging task is combining the information from the different parts of the detector into the most accurate, aggregated knowledge about a particle. This is where machine learning comes to the rescue. Using CatBoost to combine the data, scientists managed to improve the quality characteristics of the final solution; CatBoost's results were better than those obtained with other methods.

    How to start using CatBoost?

    To start working with CatBoost, just install it on your computer. The library supports Linux, Windows, and macOS and is available for the Python and R programming languages. Yandex has also developed a visualization program.
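
    A quick hedged start with the Python package looks roughly like this (toy data, just to verify that the installation works):

```python
# Installation from PyPI:  pip install catboost
# (an R package is also published; this sketch uses Python)
from catboost import CatBoostRegressor

# Tiny illustrative dataset, only to check that training runs
X = [[1, 4], [2, 5], [3, 6], [4, 7]]
y = [10, 20, 30, 40]

model = CatBoostRegressor(iterations=50, verbose=False)
model.fit(X, y)
print(model.predict([[5, 8]]))
```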

    Hi all!

    In this article I will talk about a new, convenient way of programming in Python.

    It is less like programming and more like creating articles (reports, demonstrations, research, examples): you can insert ordinary explanatory text between blocks of Python code. The result of running the code is not only numbers and text (as with the Python console), but also graphs, diagrams, and pictures...
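
    For example, a single code cell like the sketch below (assuming a Jupyter-style notebook with numpy and matplotlib installed) renders its plot directly under the cell, next to the explanatory text:

```python
import matplotlib.pyplot as plt
import numpy as np

# A code cell: compute and plot a sine wave.
# In a notebook, the figure appears directly below the cell,
# alongside any explanatory Markdown text.
x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x))
plt.title("sin(x)")
plt.show()
```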

    Examples of documents you can create:

    Looks cool? Do you want to create the same documents? Then this article is for you!

    Neural networks are created and trained mainly in Python, so it is very important to have a basic understanding of how to write programs in it. In this article I will briefly and clearly cover the basic concepts of the language: variables, functions, classes, and modules.

    The material is intended for people unfamiliar with programming languages.
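
    As a taste of what is covered, here is a tiny sketch touching each of those concepts; all the names are made up for illustration:

```python
import math  # a module from the standard library

greeting = "Hello"   # a variable holding a string
radius = 2.5         # a variable holding a number

def circle_area(r):  # a function: takes a radius, returns an area
    return math.pi * r ** 2

class Circle:        # a class: bundles data and behavior together
    def __init__(self, r):
        self.r = r

    def area(self):
        return circle_area(self.r)

print(greeting, circle_area(radius), Circle(1.0).area())
```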

    First you need to install Python. Then you need to install a convenient environment for writing Python programs. These two steps are covered on the portal.

    If everything is installed and configured, you can start.

    Neural networks have to be written in some programming language. There are a great many of them, but I recommend (and use in the textbook and articles) Python. Why?

    1. It is very easy to learn
    2. There are a large number of ready-made libraries
    3. When you look at a program, you immediately see the algorithm it implements
    4. Most machine learning specialists use Python, and most libraries are created specifically for this language

    In the previous part, we learned how to calculate how a signal changes as it passes through a neural network. We got acquainted with matrices and their products and derived simple formulas for the calculations.

    In Part 6 of the translation I am publishing 4 sections of the book at once. All of them are devoted to one of the most important topics in neural networks: the backpropagation method. You will learn to calculate the error of every neuron in a network using only the final network error and the connection weights.
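
    A minimal hedged sketch of the idea (my own numpy example, not the book's code): the output error is shared backwards along the connection weights, which amounts to multiplying by the transposed weight matrix.

```python
import numpy as np

# Hypothetical weights from a 3-neuron hidden layer to a 2-neuron output layer
W_hidden_output = np.array([[0.9, 0.3, 0.4],
                            [0.2, 0.8, 0.5]])

# Error observed at the two output neurons
output_errors = np.array([0.8, 0.5])

# Backpropagation step: share the output errors back along the weights
hidden_errors = W_hidden_output.T @ output_errors
print(hidden_errors)
```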

    The material is complex, so feel free to ask your questions on the forum.

    You can download the translation.

    Enjoy reading!

    In Part 5 of the translation I present 3 sections related in meaning.

    First, we will calculate the outputs of a two-layer neural network by hand. Then we will get acquainted with matrices and their products. Using this knowledge, we will derive simple formulas for calculating the signal transformation in a neural network. And in the last section we will check the formulas in practice by calculating the outputs of a three-layer neural network.
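
    A hedged numpy sketch of that kind of calculation (illustrative weights, sigmoid activation): each layer is a matrix product followed by the activation function.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weights of a three-layer network (input -> hidden -> output)
W_input_hidden = np.array([[0.9, 0.3, 0.4],
                           [0.2, 0.8, 0.2],
                           [0.1, 0.5, 0.6]])
W_hidden_output = np.array([[0.3, 0.7, 0.5],
                            [0.6, 0.5, 0.2],
                            [0.8, 0.1, 0.9]])

inputs = np.array([0.9, 0.1, 0.8])

# Each layer: multiply the weights by the incoming signals, then apply sigmoid
hidden_outputs = sigmoid(W_input_hidden @ inputs)
final_outputs = sigmoid(W_hidden_output @ hidden_outputs)
print(final_outputs)
```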

    You can download the translation.

    Enjoy reading!

    Part 4 of the translation is ready!

    Let's stop beating around the bush and move directly to the topic of the book - neural networks.

    In this part of the translation, we will look at biological neural networks and compare them with traditional computers. Then we will build a model of an artificial neuron and finally move on to artificial neural networks.

    You can download the translation.

    Enjoy reading!

    The third part of the translation!

    The article is not very long. It covers only one section of the book. The goal is to show that every method has its limitations. The article discusses the limitations of the linear classifier and introduces the concepts of logical functions and the XOR problem.
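
    A quick hedged illustration of that limitation (my own sketch with scikit-learn, not code from the book): a linear classifier cannot separate the four XOR points, however long it is trained.

```python
from sklearn.linear_model import Perceptron

# The four XOR points and their labels
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = Perceptron(max_iter=1000, tol=None)
clf.fit(X, y)

# A single linear boundary cannot classify all four points correctly,
# so the training accuracy stays below 1.0
print(clf.score(X, y))
```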

    You can download the translation.

    Enjoy reading!

    In this article I will talk about an interesting music generator based on neural networks. The generator is called Amper. With its help, anyone, even someone far from composing music, can independently create a unique melody and use it for their own purposes.

    Here, for example, is what the neural network developed for me.

    Historically, over their more than half-century history, artificial neural networks have gone through both periods of rapid growth and heightened public attention, and the periods of skepticism and indifference that followed. In good times, scientists and engineers feel they have finally found a universal technology that can replace humans in any cognitive task. Like mushrooms after rain, new neural network models keep appearing, and their authors, professional mathematicians, argue intensely about whether the models they propose are more or less biologically plausible. Professional biologists watch these discussions from the sidelines, periodically breaking down and exclaiming "That does not happen in real nature!", to little effect, since neural network mathematicians listen to biologists, as a rule, only when the biologists' facts agree with their own theories. However, over time a pool of tasks gradually accumulates on which neural networks perform frankly poorly, and people's enthusiasm cools.

    These days, neural networks are back at the zenith of their fame thanks to the invention of unsupervised pre-training based on Restricted Boltzmann Machines (RBM), which makes it possible to train deep neural networks (that is, networks with an extremely large number of neurons, on the order of tens of thousands), and thanks to the success of deep networks in practical speech and image recognition problems. For example, speech recognition in Android is implemented using deep neural networks. How long this will last and how well deep neural networks will live up to the expectations placed on them is unknown.
    Meanwhile, in parallel with all the scientific disputes, currents, and trends, a community of neural network users clearly stands out: practicing software engineers who are interested in the applied side of neural networks, in their ability to learn from collected data and solve recognition problems. Many practical classification and prediction tasks are handled well by well-designed, relatively small Multilayer Perceptron (MLP) and Radial Basis Function (RBF) networks. These networks have been described many times; I would recommend the following books, in order of my personal sympathy for them: Osovsky, Bishop, Haykin; there are also good courses on Coursera and similar resources.

    However, the general approach to using neural networks in practice differs radically from the usual deterministic development mindset of "programmed it, it works, so it always works". Neural networks are by nature probabilistic models, and they must be approached quite differently. Unfortunately, many programmers new to machine learning in general and to neural networks in particular make systematic errors when working with them, become disappointed, and abandon the matter. The idea of writing this treatise for Habr arose after talking with such disappointed users of neural networks: excellent, experienced, self-confident programmers.

    Here is my list of rules and typical mistakes in using neural networks.

    1. If it is possible not to use neural networks, do not use them.
    Neural networks let you solve a problem when it is impossible to propose an algorithm even after inspecting the data with your own eyes repeatedly (or very many times): for example, when there is a lot of data and it is nonlinear, noisy, and/or high-dimensional.

    2. The complexity of neural networks must be adequate to the complexity of the task.
    Modern personal computers (for example, a Core i5 with 8 GB of RAM) allow you to train neural networks in a reasonable time on samples of tens of thousands of examples with input dimensionality up to a few hundred. Larger samples are a task for the deep neural networks mentioned above, which are trained on multi-processor GPUs. These models are very interesting but lie beyond the focus of this article.

    3. Training data must be representative.
    The training sample should fully and comprehensively represent the phenomenon being modeled and cover all kinds of possible situations. Having a lot of data is good, but by itself it does not always help. There is a joke widespread in narrow circles: a geologist comes to a recognition specialist, puts a piece of mineral in front of him, and asks him to build a system that recognizes this substance. "Could I have more example data?" asks the specialist. "Certainly!" replies the geologist, takes out a pick, and splits his piece of mineral into several more pieces. As you understand, such an operation is useless: the enlarged sample carries no new information.

    4. Shuffle the sample.
    After the input and output vectors have been collected, and provided the measurements are independent of one another, shuffle the order of the vectors randomly. This is critical for a correct split into Train/Test/Validation and for all sample-by-sample training methods.
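
    A hedged numpy sketch (hypothetical data): shuffle inputs and targets with the same permutation so each input stays paired with its target.

```python
import numpy as np

# Hypothetical input vectors X and target vectors Y, one row per example
X = np.random.rand(1000, 20)
Y = np.random.rand(1000, 3)

# Apply one random permutation to both arrays, keeping the pairing intact
perm = np.random.permutation(len(X))
X, Y = X[perm], Y[perm]
```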

    5. Normalize and center the data.
    For multilayer perceptrons, as for many other models, input values must lie in the range [-1; 1]. Before feeding data to the neural network, subtract the mean from it and divide all values by the maximum absolute value.
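
    A hedged numpy sketch of this preprocessing (hypothetical data):

```python
import numpy as np

X = np.random.rand(1000, 20) * 50.0   # hypothetical raw features

# Center each column, then scale so every value falls into [-1, 1]
X_centered = X - X.mean(axis=0)
X_scaled = X_centered / np.abs(X_centered).max(axis=0)
```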

    6. Divide the sample into Train, Test and Validation.
    The main beginner mistake is to drive the network's error on the training sample to a minimum, overfitting it hellishly in the process, and then to expect the same good quality on new real data. This is especially easy to do when data is scarce (or it all comes from one source). The result can be very disappointing: the network adapts to the sample as much as it can and loses its usefulness on real data. To keep the generalization ability of your model under control, split all the data into three samples in a 70 : 20 : 10 ratio. Train on Train, periodically checking the quality of the model on Test. For the final unbiased assessment, use Validation.
    The cross-validation technique, in which Train and Test are generated several times at random from the same data, can be insidious and give a false impression of good system quality, for example when the data come from different sources and that is critical. Use a proper Validation set!
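
    A hedged sketch of such a 70 : 20 : 10 split using scikit-learn (hypothetical data; two calls to train_test_split, the second taking 2/9 of the remaining 90%, which is 20% of the whole):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)            # hypothetical inputs
y = np.random.randint(0, 2, 1000)       # hypothetical labels

# First carve off 10% for Validation, then split the rest into Train/Test
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=2 / 9, random_state=42)  # 20% of the full set
```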

    7. Apply regularization.
    Regularization is a technique that helps avoid overfitting a neural network during training even when there is little data. If you find a checkbox with this word, be sure to check it. A sign of an overfitted network is large weight values, on the order of hundreds or thousands; such a network will not work properly on new, previously unseen data.
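
    The most common form is an L2 penalty on the weights (weight decay). As a hedged illustration, scikit-learn's MLPClassifier exposes it through the alpha parameter; this is only an example, not a claim about any package mentioned above:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(500, 20)             # hypothetical training data
y = np.random.randint(0, 2, 500)

# alpha is the L2 penalty: larger values shrink the weights
# and reduce the risk of overfitting on small samples
clf = MLPClassifier(hidden_layer_sizes=(30,), alpha=1e-2, max_iter=500)
clf.fit(X, y)
```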

    8. There is no need to retrain the neural network online.
    The idea of permanently retraining the network on newly arriving data is correct in itself; in real biological systems that is exactly what happens: we learn every day and rarely go mad. However, for conventional artificial neural networks at the current stage of technology this practice is risky: the network may overfit or adapt to the most recent data and lose its generalization ability. For a system to be usable in practice you need to: 1) train the network, 2) test its quality on the test and validation samples, 3) choose a successful variant and fix its weights, and 4) use the trained network in practice without changing its weights during use.

    9. Use new learning algorithms: Levenberg-Marquardt, BFGS, Conjugate Gradients, etc.
    I am deeply convinced that implementing backpropagation is the sacred duty of everyone who works with neural networks. The method is the simplest, is relatively easy to program, and lets you study the training process of neural networks thoroughly. Meanwhile, backpropagation was invented in the early 1970s and became popular in the mid-1980s; since then, more advanced methods have appeared that can significantly improve the quality of training. Better to use them.
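
    For instance, scikit-learn's MLPClassifier can be switched from stochastic gradient descent to the quasi-Newton L-BFGS solver with one parameter (a hedged illustration, not the only option):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(300, 10)             # hypothetical training data
y = np.random.randint(0, 2, 300)

# solver="lbfgs" selects a quasi-Newton method from the BFGS family;
# on small and medium samples it often trains faster and more stably
# than plain backpropagation with stochastic gradient descent
clf = MLPClassifier(hidden_layer_sizes=(30,), solver="lbfgs", max_iter=1000)
clf.fit(X, y)
```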

    10. Train neural networks in MATLAB and similar user-friendly environments.
    If you are not a scientist developing new training methods but a practicing programmer, I would not recommend coding the neural network training procedure yourself. There are many software packages, mainly for MATLAB and Python, that let you train neural networks while controlling the training and testing process with convenient visualization and debugging tools. Enjoy the heritage of humanity! I personally like the approach of "train in MATLAB with a good library, implement the trained model by hand"; it is quite powerful and flexible. An exception is the STATISTICA package, which contains advanced methods for training neural networks and can export a trained network as C code, convenient for deployment.

    In the next article I plan to describe in detail the full industrial cycle of preparing a neural network, built on the principles described above, for recognition tasks in a commercial software product.

    Good luck!

    Literature

    Hinton G., Deng L., Yu D., Dahl G., Mohamed A., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T., and Kingsbury B. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, Vol. 29, No. 6, 2012, pp. 82–97.
    Ciresan D., Meier U., Masci J., and Schmidhuber J. Multi-column Deep Neural Network for Traffic Sign Classification. Neural Networks, Vol. 34, August 2012, pp. 333–338.
    Osovsky S. Neural Networks for Information Processing. Translated from Polish. Moscow: Finance and Statistics, 2002. 344 p.
    Bishop C.M. Pattern Recognition and Machine Learning. Springer, 2006. 738 p.
    Haykin S. Neural Networks: A Complete Course. Williams, 2006.