Password hash what. Data hashing algorithms. "Hash functions" based on multiplication

One of the keywords that newbies hear when learning about blockchain is the concepts of hash and hashing algorithm, which seem to be common for security. Running a decentralized network and consensus, such as the Bitcoin or Ethereum network with tens of thousands of nodes connected via p2p, requires both “reliability” and verification efficiency. That is, these systems need ways to encode information in a compact format that allows for secure and quick verification by its participants

To bookmarks

The main primitive processed by both Bitcoin and Ethereum is the concept block, which is a data structure that includes transactions, a timestamp, and other important metadata. A critical part of their security includes the ability to compress large chunks of global network state information into a short message standard that can be efficiently audited if necessary, known as hash.

Even changing one character in the input will result in a completely different hash.

Cryptographic hashes are used in everything from password storage to file verification systems. The basic idea is to use a deterministic algorithm (an algorithmic process that produces a unique and predetermined output for a given input) that takes one input and produces a string of a fixed length each time. That is, using the same input always produces the same result. Determinism is important not only for hashes, but also for a single bit that changes in the input data, creating a completely different hash. The problem with hashing algorithms is the inevitability of collisions. That is, the fact that hashes are a fixed-length string means that for every input we can imagine, there are other possible inputs that will result in the same hash. Collision is bad. This means that if an attacker can create collisions, he can transmit malicious files or data as having a correct and incorrect hash and hide under the correct hash. The goal of a good hash function is to make it extremely difficult for attackers to find ways to generate inputs that hash to the same value. The hash calculation should not be too simple, as this makes it easier for attackers to artificially calculate collisions. Hash algorithms must be resistant to “preimage attacks.” That is, given a hash, it would be extremely difficult to calculate the reverse deterministic steps taken to reproduce the value that created the hash (i.e., finding the preimage).

Given S = hash(x), finding X should be nearly impossible.

Recall that “good” hashing algorithms have the following properties:

  • Changing one bit in the input should have the effect of changing the entire hash;
  • Hash calculations should not be too simple, the difficulty of finding the preimage is high;
  • Must have a very low probability of collision;

Hacking hashes

One of the first hashing algorithm standards was MD5 hash, which was widely used to verify file integrity (checksums) and store hashed passwords in web application databases. Its functionality is quite simple, as it outputs a fixed 128-bit string for each input and uses trivial one-way operations over multiple rounds to compute a deterministic result. Its short output length and ease of operation made the MD5 very easy to hack and susceptible to a birthday attack.

What is the Birthday Attack?

Have you ever heard that if you put 23 people in a room, there is a 50% chance that two of them will have the same birthday? Getting the number to 70 people in the room gives you a 99.9% chance. If the pigeons are seated in boxes, and the number of pigeons is greater than the number of boxes, then at least one of the cages contains more than one pigeon. That is, fixed output constraints mean that there is a fixed degree of permutations on which a collision can be found.

At least one compartment will have 2 pigeons inside.

In fact, MD5 is so weak in collision resistance that a simple household 2.4 GHz Pentium processor can calculate artificial hash collisions within seconds. Additionally, its widespread use in the earlier days of the current web has created tons of leaked MD5 pre-images on the internet that can be found with a simple Google search of their hash.

Differences and development of hashing algorithms Start: SHA1 and SHA2

The NSA (National Security Agency) has long been a pioneer of hashing algorithm standards, with their initial proposal of the Secure Hashing Algorithm, or SHA1, creating 160-bit fixed-length outputs. Unfortunately, SHA1 simply improves on MD5 by increasing the output length, the number of one-way operations, and the complexity of those one-way operations, but does not provide any fundamental improvements against more powerful machines attempting various attacks. So how can we do things better?

Using SHA3

In 2006, the National Institute of Standards and Technology (NIST) launched a competition to find an alternative to SHA2 that would be fundamentally different in its architecture in order to become a standard. Thus, SHA3 emerged as part of a larger hashing algorithm scheme known as KECCAK (pronounced Ketch-Ak). Despite the name, SHA3 is very different due to its internal mechanism known as "sponge design", which uses random permutations to "Sink" and "Squeeze" data, working as a source of randomness for future inputs that go into the hashing algorithm.

Hashing and proof-of-work

When it came to integrating the hashing algorithm into blockchain protocols, Bitcoin used SHA256, while Ethereum used a modified SHA3 (KECCAK256) for its PoW. However, an important quality of choosing a hash function for a proof-of-work blockchain is the computational efficiency of said hash. Bitcoin's SHA256 hashing algorithm can be calculated quite simply using specialized hardware known as application-specific integrated circuits (or ASICs). Much has been written about the use of ASICs in a mining pool and how they make the protocol aimed at centralizing computation. That is, proof of work incentivizes groups of computationally efficient machines to pool together and increase what we call “hash power,” or a measure of the number of hashes a machine can compute in an interval of time. Ethereum chose a modified SHA3 known as KECCAK 256. Additionally, Ethereum's PoW algorithm, Dagger-Hashimoto, was supposed to be hard to compute on hardware.

Why does Bitcoin use double SHA256 encryption?

Bitcoin has an interesting way of hashing data using SHA256 as it runs two iterations of the algorithm in its protocol. Note that this is not a countermeasure for birthday attacks, since it is clear that if hash(x) = hash(y) then hash(hash(x)) = hash(hash(y)). Instead, double SHA256 is used to mitigate "Message elongation attack - a type of hash function attack that involves adding new information to the end of the original message." The attack is dangerous because it is possible to change the request and, accordingly, carry out what this request is responsible for (for example, transferring money)

SHA3 ​​wasn't the only breakthrough to come out of the 2006 NIST hashing competition. Although SHA3 won, the algorithm known as BLAKE came in second place. To implement sharding, Ethereum 2.0 uses a more efficient one. The BLAKE2b hashing algorithm, which is a highly advanced version of BLAKE from competitors, is being intensively studied for its fantastic efficiency compared to KECCAK256 while maintaining a high degree of security. The calculation of BLAKE2b is actually 3 times faster than KECCAK on a modern processor.

The Future of Hash Algorithms

It seems like no matter what we do, we just either (1) increase the complexity of internal hash operations, or (2) we increase the length of the hash output, hoping that the attackers' computers will not be fast enough to effectively calculate its collision. We rely on the ambiguity of one-way transaction precursors to keep our networks secure. That is, the security goal of a hashing algorithm is to make it as difficult as possible for anyone trying to find two values ​​that hash to the same output, even though there are an infinite number of possible collisions. “What about the future of quantum computers? Will hashing algorithms be secure? The short answer and current understanding is that yes, hashing algorithms will stand the test of time against quantum computing. What quantum computing will be able to break are those problems that have a rigorous mathematical structure based on neat tricks and theory, such as RSA encryption. On the other hand, hashing algorithms have a less formal structure in their internal constructions. Quantum computers do offer increased speed in computing unstructured problems such as hashing, but at the end of the day, they will still brute force the same way a computer would try to do today. Regardless of what algorithms we choose for our protocols, it is clear that we are moving towards a computationally efficient future, and we must use our best judgment to select the right tools for the job and ones that will hopefully stand the test of time.

_____________________________________________________________________________

______________________________________________________________________________

Often, when downloading torrents or the files themselves, the description says something like “ad33e486d0578a892b8vbd8b19e28754” (for example, in ex.ua), often with the prefix “md5”. This is the hash code - the result that the hash function produces after processing the incoming data. Translated from English, hash means confusion, marijuana, weed, or a dish of finely chopped meat and vegetables. very, very difficult, one might say almost impossible. Then the question arises: “Why is all this needed at all? They give out incomprehensible gobbledygook, which is also not decipherable?” This is what will be discussed in this article.

What is a hash function and how does it work?

This function is designed to convert incoming data of arbitrarily large size into a fixed-length result. The process of such a conversion is called hashing, and the result is a hash or hash code. Sometimes the words “fingerprint” or “message digest” are also used, but in practice they are much less common. There are a lot of different algorithms for how you can turn any data array into a certain sequence of characters of a certain length. The most widely used algorithm is called md5, which was developed back in 1991. Despite the fact that today md5 is somewhat outdated and is not recommended for use, it is still in use and often instead of the word “hash code”, sites simply write md5 and indicate the code itself.

Why is a hash function needed?

Knowing the result, it is almost impossible to determine the input data, but the same input data gives the same result. Therefore, the hash function (also called the convolution function) is often used to store very important information such as password, login, ID number and other personal information. Instead of comparing the information the user enters with what is stored in the database, their hashes are compared. This ensures that in the event of an accidental information leak, no one will be able to use important data for their own purposes. By comparing the hash code it is also convenient to check that files are being downloaded correctly from the Internet, especially if there were connection interruptions during the download.

Hash functions: what are they? T

Depending on its purpose, a hash function can be one of three types:

1. Function for checking the integrity of information

When happening over the network, a hash of the packet is calculated, and this result is also transmitted along with the file. Upon reception, the hash code is again calculated and compared with the value received over the network. If the code does not match, then this indicates errors, and the damaged packet will be transmitted again. This function has fast calculation speed, but a small number of hash values ​​and poor stability. An example of this type: CRC32, which has only 232 different values.

2. Cryptographic function

Used for protection against (ND). They allow you to check whether data corruption has occurred as a result of an accident during file transfer over the network. The true hash in this case is publicly available, and the hash of the resulting file can be calculated using many different programs. Such functions have a long and stable lifespan, and searching for collisions (possible coincidences of results from different source data) is very difficult. These are the functions that are used to store passwords (SH1, SH2, MD5) and other valuable information in the database.

3. A function designed to create an efficient data structure

Its goal is a compact and fairly orderly organization of information in a special structure called a hash table. Such a table allows you to add new information, delete information, and search for the necessary data at very high speed.

hashing when solving problems in C++.

The process of searching for data in large volumes of information is time-consuming due to the need to view and compare a significant number of elements with the search key. Reducing the search can be done by localizing the viewing area. For example, sort the data by search key, split it into non-overlapping blocks according to some group characteristic, or match the real data with a certain code that will simplify the search procedure.

Currently, a widely used method for providing quick access to information stored in external memory is hashing.

Hashing(or hashing, English hashing) is a transformation of an input data array of a certain type and arbitrary length into an output bit string of a fixed length. Such transformations are also called hash functions or convolution functions, and their results are called hash, hash code, hash table or digest messages (English) message digest).

Hash table- This data structure, which implements the associative array interface, that is, it allows you to store key-value pairs and perform three operations: the operation of adding a new pair, the search operation and the operation of deleting a pair by key. A hash table is an array formed in a specific order by a hash function.

  • the function must be simple from a computational point of view;
  • the function should distribute the keys in the hash table as evenly as possible;
  • the function must not map any relationship between key values ​​into a relationship between address values;
  • the function must minimize the number of collisions - that is, situations when different keys correspond to the same hash value (the keys in this case are called synonyms).

In this case, the first property of a good hash function depends on the characteristics of the computer, and the second - on the data values.

If all data were random, then hash functions would be very simple (for example, a few key bits). However, in practice, random data is rare enough that you have to create a function that depends on the entire key. If a hash function distributes a population possible keys uniformly across multiple indexes, hashing effectively splits multiple keys. The worst case is when all keys are hashed into one index.

When collisions occur, it is necessary to find a new location to store the keys that claim the same hash table cell. Moreover, if collisions are allowed, then their number must be minimized. In some special cases, it is possible to avoid collisions altogether. For example, if all element keys are known in advance (or change very rarely), then some injective hash function can be found for them that will distribute them across the cells of the hash table without collisions. Hash tables that use similar hash functions do not need a collision resolution mechanism, and are called hash tables with direct addressing.

Hash tables must match the following properties.

  • Performing an operation on a hash table begins by calculating the hash function of the key. The resulting hash value is an index into the original array.
  • The number of stored array elements divided by the number of possible hash values ​​is called hash table fill factor (load factor) and is an important parameter on which the average execution time of operations depends.
  • Search, insert, and delete operations should be completed in O(1) time on average. However, this estimate does not take into account the possible hardware costs of rebuilding the hash table index associated with increasing the array size value and adding a new pair to the hash table.
  • The collision resolution mechanism is an important component of any hash table.

Hashing is useful when a wide range of possible values ​​must be stored in a small amount of memory and a way of fast, near-random access is needed. Hash tables are often used in databases, and especially in language processors such as compilers and assemblers, where they increase the speed of processing the identifier table. Examples of the use of hashing in everyday life include the distribution of books in a library into thematic catalogs, ordering in dictionaries by the first letters of words, encryption of specialties in universities, etc.

Collision resolution methods

Collisions complicate the use of hash tables because they break the unique correspondence between hash codes and data. However, there are ways to overcome the difficulties that arise:

  • chaining method (external or open hashing);
  • open addressing method (closed hashing).

Chain method. The technology of adhesion of elements is that elements of the set, which correspond to the same hash value, are linked into a chain-list. Position number i stores a pointer to the head of the list of those elements whose key hash value is equal to i; if there are no such elements in the set, NULL is written at position i. In Fig. Figure 38.1 demonstrates the implementation of the chaining method for resolving collisions. The key 002 is claimed by two values, which are organized into a linear list.


Rice. 38.1.

Each array cell is a pointer to a linked list (chain) of key-value pairs corresponding to the same hash value of the key. Collisions simply result in chains longer than one element.

Search or delete operations require viewing all elements of the corresponding chain in order to find an element with a given key in it. To add data, you need to add an element to the end or beginning of the corresponding list, and, if the fill factor becomes too large, increase the size of the array and rebuild the table.

Under the assumption that each element can end up in any position in the table with equal probability and regardless of where any other element ends up,

Hash tables

Hash table(shuffled table, table with calculated addresses) is dynamic set supporting operations adding, searching and deleting an element and using special methods addressing.

The main difference between tables and other dynamic sets is element address calculation by key value.

The idea of ​​a hash implementation is that working with one large array is reduced to working with a number of small sets.

For example, a notebook. The pages of the book are marked with letters. A page marked with a certain letter contains surnames starting with that letter. A large set of surnames is divided into 28 subsets. When searching, the book immediately opens to the desired letter and the search speeds up.

In programming hash table- This structure data that stores pairs (key or index + value) and with which three operations are performed: adding a new pair, searching and deleting a pair by key.

Searching hash tables carried out in two stages:

first step - computing a hash function that converts key search in tabular address:

second step – the process of resolving conflicts when processing such keys.

If different meanings table keys hash function generates the same addresses, then they say that it arises collision(conflict, clash).

Hash functions

The main purpose of a hash function is to match different keys if possible different not negative whole numbers.

Topic hash function better, how less identical values ​​it generates.

The hash function must be selected in such a way that the following properties are satisfied:

    hash function is defined on the elements of the set and takes non-negative integers values;

    hash function easy to calculate;

    hash function can accept various values ​​approximately from equal probability(minimizing collisions);

    on loved ones argument values hash function takes distant meanings from each other.

To build a good hash function you need to know the key distribution. If the key distribution is known, then, ideally, the key distribution density and the hash function density distribution density should be identical.

Let p ( key ) - distribution density of key requests. Then, in the ideal case, the query distribution density of the table inputs g ( H ( key )) d. be such that on average the number of elements, cat. you have to go through the chains of twins, it was minimal.

Example.

Let there be a set keys

{0, 1, 4, 5, 6, 7, 8, 9, 15, 20, 30, 40}

and let the table admit 4 entrance.

You can build a hash function:

h(key) = key % 4 .

Then you get the following addresses for inputs

{0, 1, 2, 3} tables:

h(key)

Entry number

Maximum chain length

% of hits

3 0.5+1.5 0.25+0.5 0.08+1 0.17 ≈ 2.1 list element.

Example with a different hash function.

h(key)

Entry number

% of hits

On average it will take 4 1.5 0.25 = 1.5 list element.

If it is an information retrieval system, then its search performance will increase by about 25%.

Methods for constructing hash functions

Modular hashing

A simple, effective and commonly used hashing method.

The table size is selected in the form simple numbers m and the hash function is calculated as remainder of the division:

h(key) = key % m

key– integer numeric value of the key,

m- number of hash values ​​(hash table entries).

This function is called modular and varies from 0 before ( m - 1 ).

Modular hash function in C++:

typedefintHashIndexType;

HashIndexTypeHash(intKey)

{ returnKey % m; }

Example

key = {1, 3, 56, 4, 32, 40, 23, 7, 41,13, 6,7}

Let m = 5

h(key) = {1, 3, 1, 4, 2, 0, 3, 2, 1, 3, 1, 2}

Choice matters m.To get a random distribution of keys you need to take simple number.

Multiplicative method

Hash function:

h(key) =

0 < A < 1 – constant.

12 mod5 = 2 (remainder of division of 12 by 5).

5,04 mod1= 0,04 (stands out fraction)

Example

key = 123456

m = 10000

A = 0,6180339887499 = 0,618…

h(key) = =

Additive method

Is used for lines variable length (table size m equals 256).

{ HashIndexType h = 0;

while (*str)

h += (*str)++;

returnh;

Disadvantage of the additive method: similar words and anagrams are not distinguished, i.e. h(XY ) = h(YX )

Additive method, in which the key is a character string. In a hash function, a string is converted to an integer by summing all the characters and returning the remainder after division by m (usually table size m = 256).int h(char *key, int m) (int s = 0;while(*key)s += *key++;return s % m;)Collisions occur in strings consisting of the same set of characters, for example, abc And cab.This method can be slightly modified, obtaining the result by summing only the first and last characters of the key string.int h(char *key, int m) (int len ​​= strlen(key), s = 0;if(len< 2) // Если длина ключа равна 0 или 1,s = key; // возвратить keyelse s = key + key;return s % m;}В этом случае коллизии будут возникать только в строках, например, abc And amc.

the hash function takes the key and calculates the address in the table from it (the address can be an index in the array to which the chains are attached), that is, for example, it can get the number 3 from the string “abcd”, and from the string “efgh” it can get the number 7 and then the first structure of the chain is taken through hash, or through hash the search continues along the chain until “abcd” is found in the chain of structures from hash, or “efgh” is found in the chain of structures from hash when the structure with “abcd” " is found, the rest of its data is taken and returned, or all of it is returned (its address), so that the remaining data can be taken from it, and a chain of structures is created because many different keys have the same address in the table, that is , for example, the hash function for "abcd" can produce 3 and for "zxf9" can also produce 3, thus they are linked into a chain that hangs on the third index of the array......

The array H stores the key-value pairs themselves. The element insertion algorithm checks the cells of the array H in some order until the first free cell is found, into which the new element will be written.

The search algorithm searches the hash table cells in the same order as when inserting until it finds either an element with the searched key or a free cell (meaning the element is not in the hash table).

Exclusive OR

Used for variable length strings. The method is similar to the additive method, but distinguishes between similar words. It consists in applying the “exclusive OR” operation to the elements of the string sequentially.

typedef unsigned char HashIndexType;

unsigned char Rand8;

HashIndexType Hash(char *str)

( unsigned char h = 0;

while (*str) h = Rand8;

returnh; }

Here rand8 – a table of 256 eight-bit random numbers.

table size<= 65536

typedef unsigned short int HashIndexType;

unsigned char Rand8;

HashIndexType Hash(char *str)

( HashIndexType h; unsigned char h1, h2;

if (*str == 0) return 0;

h1 = *str; h2 = *str + 1; str++;

while (*str)

( h1 = Rand8; h2 = Rand8;

str++; )

h = ((HashIndexType)h1<< 8) | (HashIndexType)h2;

return h % HashTableSize )

Universal hashing

Implies random choosing a hash function from a certain set during execution programs.

If in the multiplicative method we use it as A subsequence random values ​​instead of a fixed number, you get a universal hash function.

However, it will take too long to generate random numbers big.

Can be used pseudorandom numbers.

// pseudorandom number generator

typedefintHashIndexType;

HashIndexTypeHash(char *v, int m)

( int h, a = 31415, b = 27183;

for(h = 0;*v != 0; v++, a = a*b % (m - l))

h = (a*h + *v) % m;

return(h< 0) ? (h + m) : h;