What are NoSQL databases and how do they work?

 
NoSQL databases have become very popular. They're used by large organizations to store hundreds of petabytes of data and to process millions of queries per second.

But what's a NoSQL database? How does it work and why does it scale so much better than traditional relational databases?  

Let's start with the problem with relational databases like MySQL, MariaDB, and SQL Server.


These are built to store relational data as efficiently as possible. You can have tables for customers, orders, and products, and link them together logically.

Customers place orders and orders contain products. This tight organization is great for managing your data, but it comes at a cost. Relational databases have a hard time scaling. They have to maintain these relationships, and that is an intensive process that takes a lot of memory and compute power.

So, for a while, you can keep upgrading your database server, but at some point, it will not be able to handle the load. In technical terms, we say that relational databases can scale vertically but not horizontally, whereas NoSQL databases can scale both vertically and horizontally.

You can compare this to a building. Vertical scaling means adding more floors to an existing building, while horizontal scaling means adding more buildings.
You intuitively understand that vertical scaling is only possible to a certain extent, while horizontal scaling is much more powerful.

Now, why do NoSQL databases scale so well? First of all, they do away with these costly relationships. In NoSQL, every item in the database stands alone. This simple modification means that they are essentially key-value stores. Each item in the database only has two fields: a unique key and a value. For example, when you want to store product information, you can use the product barcode as the key and the product name as the value. This seems restrictive, but the value can be something like a JSON document containing more data, like the price and description.
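To make the key-value idea concrete, here is a minimal sketch in Python that uses a plain dictionary as a stand-in for the database. The barcode and the product fields are made up for illustration, and a real NoSQL database would have its own client API.

```python
import json

# A toy in-memory key-value store: every item is just a key and a value.
store = {}

# Key: a product barcode (made up). Value: a JSON document with more data.
store["4006381333931"] = json.dumps({
    "name": "Ballpoint pen",
    "price": 1.99,
    "description": "Blue ink, medium tip",
})

# Items are looked up by key; the JSON value is decoded afterwards.
product = json.loads(store["4006381333931"])
print(product["name"], product["price"])
```

Notice that the store itself knows nothing about prices or descriptions. It only sees a key and an opaque value.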

This simpler design is why NoSQL databases scale better. If a single database server is not enough to store all your data or handle all the queries, you can split the workload across two or more servers. 

Each server will then be responsible for only a part of your database. To give an example, Apple runs a NoSQL database that consists of 75,000 servers.

In NoSQL terms, these parts of your database are called partitions. That brings up a question: if your database is split across potentially thousands of partitions, how do you know where an item is stored?

That is where the primary key comes in. Remember, NoSQL databases are key-value stores, and the key determines which partition an item will be stored on.

Behind the scenes, NoSQL databases use a hash function to convert each primary key of an item into a number that falls into a fixed range, say between zero and 100.
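As a rough sketch of that step, the function below turns a primary key into a number from 0 to 100. The name key_hash and the use of SHA-256 are my own choices for the illustration; actual databases use their own hash functions.

```python
import hashlib

KEYSPACE_SIZE = 101  # hash values fall in 0..100, like the example range

def key_hash(primary_key):
    """Convert a primary key into a number between 0 and 100 (inclusive)."""
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    # Read the first 8 bytes as an integer, then squeeze it into the range.
    return int.from_bytes(digest[:8], "big") % KEYSPACE_SIZE

for key in ["order-1", "order-2", "customer-42"]:
    print(key, "->", key_hash(key))
```

The important property is that the same key always produces the same number, so the hash can be recomputed whenever the item needs to be found again.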

This hash value and the range are then used to determine where to store an item. If your database is small enough or does not get many requests, you can put everything on a single server. That server will then be responsible for the entire range.

If that server becomes overloaded, you can add a second server, which means that the range will be split in half.

Server one will be responsible for all items with a hash between 0 and 50, while server two will store everything between 50 and 100.
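Here is a small sketch of that split. It works directly on hash values and keeps a table of which server owns which part of the range. The server names and the split and owner helpers are hypothetical, and I'm assuming that a hash of exactly 50 goes to server two, since the text leaves that boundary open.

```python
# Each server owns a half-open range [start, end) of hash values.
# To begin with, one server owns the whole 0..100 range.
ranges = {"server-1": (0, 101)}

def split(ranges, old_server, new_server):
    """Hand the upper half of old_server's range over to new_server."""
    start, end = ranges[old_server]
    middle = (start + end) // 2
    ranges[old_server] = (start, middle)
    ranges[new_server] = (middle, end)

def owner(ranges, hash_value):
    """Return the server responsible for a given hash value."""
    for server, (start, end) in ranges.items():
        if start <= hash_value < end:
            return server
    raise ValueError("hash value is outside the range")

split(ranges, "server-1", "server-2")
print(ranges)
print(owner(ranges, 12), owner(ranges, 87))
```

In a real database the items in the upper half would also have to be moved to the new server, which this sketch leaves out.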

Theoretically, you've now doubled your database capacity, both in terms of storage and in the number of queries you can execute.

This range is also called a keyspace. It's a simple system that solves two problems: where to store new items and where to find existing ones.

All you have to do is calculate the hash of an item's key and keep track of which server is responsible for which part of the keyspace.

Now, in this example, the range of zero to 100 is a bit small. It would only allow you to split your database into 100 pieces at most. So, NoSQL databases have much bigger keyspaces, allowing them to scale nearly without any restrictions.
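To get a feel for a bigger keyspace, the sketch below assumes a 64-bit keyspace cut into evenly sized partitions. The size, the SHA-256 hash, and the helper names are assumptions for the illustration, not the layout of any specific database.

```python
import bisect
import hashlib

KEYSPACE = 2**64  # assumed keyspace size: hash values run from 0 to 2**64 - 1

def key_hash(key):
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # 8 bytes give a 64-bit number

def make_boundaries(num_partitions):
    """Exclusive upper bound of each partition, evenly spaced."""
    return [KEYSPACE * (i + 1) // num_partitions for i in range(num_partitions)]

def partition_for(key, boundaries):
    # bisect_right returns the index of the first boundary above the hash,
    # which is the partition whose range contains it.
    return bisect.bisect_right(boundaries, key_hash(key))

boundaries = make_boundaries(1000)
print(partition_for("order-42", boundaries))
```

Because the boundaries are sorted, finding the right partition is a binary search, so the lookup stays cheap even with thousands of partitions.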

Besides great scalability, NoSQL is schemaless, which means that items in the database do not need to have the same structure. Each one can be completely different.

In a relational database, you must first define the structure of your table, and then each item must conform to it. Changing this structure is not straightforward and could even lead to data loss. Not having a schema can be a big advantage if your application and data structure are constantly evolving.
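As a quick illustration of what schemaless means in practice, the two items below live in the same store but have different fields. The keys, the field names, and the email_of helper are invented for the example.

```python
# Two items in the same store with different structures.
items = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Linus", "phone": "555-0100", "tags": ["admin"]},
}

def email_of(item):
    """The database does not enforce fields, so the code handles gaps itself."""
    return item.get("email", "<no email>")

for key, item in items.items():
    print(key, email_of(item))
```

The flip side of that freedom is visible in email_of: with no schema to rely on, the application has to deal with missing fields.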

 Now at this point, it's clear that NoSQL databases have certain advantages over relational ones, but that is not to say that relational databases are obsolete.

Far from it. NoSQL is more limited in the way you can retrieve your data, only allowing you to retrieve items by their primary key. Finding orders by ID is no problem, but finding all orders above a certain amount would be very inefficient. Relational databases, on the other hand, have no trouble with this.
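The difference between the two access patterns can be sketched with a plain dictionary standing in for the database. The order data and function names are made up for the example.

```python
orders = {
    "order-1": {"amount": 20},
    "order-2": {"amount": 250},
    "order-3": {"amount": 120},
}

def get_order(order_id):
    """Lookup by primary key: goes straight to the item."""
    return orders[order_id]

def orders_above(min_amount):
    """No key to go on, so every single item has to be inspected."""
    return [oid for oid, order in orders.items() if order["amount"] > min_amount]

print(get_order("order-2"))
print(orders_above(100))
```

With three orders the scan is harmless, but in a database split across many partitions it would mean asking every partition to go through all of its items.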

 Now there are workarounds for this issue, but only if you know how you are going to access your data and that might not always be the case.

 Another downside is that NoSQL databases are eventually consistent. When you write a new item to the database and try to read it back straight away, it might not be returned. As I have mentioned, NoSQL splits your database into partitions, but each partition is also mirrored across multiple servers. That way, a server can go down without much impact.

When you write a new item to the database, one of these mirrors will store the new item and then copy it to the others in the background. This process might take a little bit of time. When you read that item, the NoSQL database might try to read it from a mirror that does not have it yet. This is not a big issue in practice, because data is replicated in just a few milliseconds. And if you want consistency, most NoSQL databases do have that option.
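Here is a toy simulation of that behavior. The Partition class, its three mirrors, and the explicit replicate step are assumptions made to keep the example deterministic; a real database copies the data in the background on its own.

```python
class Partition:
    """One partition whose data is mirrored on several servers."""

    def __init__(self, num_mirrors=3):
        self.mirrors = [{} for _ in range(num_mirrors)]
        self.pending = []  # writes not yet copied to the other mirrors

    def write(self, key, value):
        # The first mirror stores the item right away.
        self.mirrors[0][key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Stand-in for the background copy to the remaining mirrors.
        for key, value in self.pending:
            for mirror in self.mirrors[1:]:
                mirror[key] = value
        self.pending.clear()

    def read(self, key, mirror_index):
        return self.mirrors[mirror_index].get(key)

partition = Partition()
partition.write("order-1", {"amount": 20})
stale = partition.read("order-1", mirror_index=1)  # mirror 1 has no copy yet
partition.replicate()
fresh = partition.read("order-1", mirror_index=1)  # now the copy has arrived
print(stale, fresh)
```

Reading from mirror 0 right after the write would have returned the item immediately, which is roughly the idea behind the consistent read option: the read is sent to a mirror that is known to have the latest data.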

Let’s look at real-world examples. 


Cloud providers heavily promote NoSQL because they can scale it more easily. AWS has DynamoDB, Google Cloud has Bigtable, and Azure has Cosmos DB. As another example of their scalability, during Amazon Prime Day in 2019, Amazon's NoSQL database peaked at 45 million requests per second. But you can also run NoSQL databases yourself with software like Cassandra, Scylla, CouchDB, MongoDB, and more.


 



