Key data structures that are used in modern database

Lokendra jain
8 min readFeb 8, 2023

--

Save the blog for the future and CRACK any interview in the WORLD.

SkipList

A Skip List is a probabilistic data structure that is used to implement data structures such as ordered lists and dictionaries. It’s a type of data structure used to store ordered lists of elements, similar to a linked list, but with faster search times.

A Skip List is essentially a linked list, where each node has a set of forward pointers to other nodes at different levels. These pointers are chosen randomly, and their goal is to “skip” as many elements as possible when searching for an element in the list. The number of levels is determined by the height of the node, which is also chosen randomly. The higher the level, the more forward pointers the node will have.

Skip Lists have an average time complexity of O(log n) for search, insertion and deletion operations, which is faster than the average O(n) time complexity of a linked list.

In terms of real modern database usage, Skip Lists are used as an index data structure to improve the speed of search operations. For example, in databases that support multi-dimensional indexes, a Skip List can be used to store the indices, providing faster search times compared to a traditional B-Tree index. Additionally, some NoSQL databases, such as Riak, use Skip Lists for their ordered data structures, such as ordered dictionaries.

Hash Index

A Hash Index is a data structure used in databases to quickly search for values within a collection of records. It works by taking a value being searched for and using a hash function to calculate a unique identifier, known as a hash code, for that value. This hash code is then used as an index into an array or table of data, where the actual value can be found.

One of the key benefits of using a hash index is its ability to perform search operations in constant time, meaning that the time taken to perform a search does not increase as the size of the database grows. This makes hash indices an ideal choice for databases that need to perform frequent and fast searches, such as online shopping websites or financial trading systems.

In real modern databases, hash indices are widely used as a method of indexing data. For example, they are commonly used in NoSQL databases such as MongoDB or Cassandra to provide fast and efficient lookups of documents in a collection. They are also used in relational databases like MySQL or PostgreSQL to improve the performance of SELECT statements that use equality conditions on columns. Additionally, hash indices are often used as secondary indices in databases that use other data structures like B-Trees or R-Trees as their primary indexing mechanism.

To ensure that hash indices are efficient and scalable, modern databases will often use techniques like load balancing and collision resolution to ensure that the indices remain well-structured and can quickly retrieve data even as the database grows. Additionally, databases will often use various strategies to manage hash indices, such as choosing the right hash function, resizing the index when needed, or using multiple indices to support different types of searches.

SST Table

The SSTable (Sorted String Table) is a data structure used to store the mapping of keys to values in a database. It is an important building block in modern databases because of its ability to efficiently store, retrieve and search for key-value pairs. SSTables are used in many popular databases, including Google’s Bigtable, Apache Cassandra and Hbase.

SSTables are designed to be memory-efficient, disk-friendly and to support fast range searches. They are implemented as a collection of binary search trees or B-Trees, where each tree is stored in a separate file on disk. The keys in the SSTable are stored in lexicographic order, and they are indexed to allow fast searching and retrieval of values.

SSTables are used in database systems to provide fast, efficient and scalable storage for large amounts of data. For example, in Apache Cassandra, each node in the cluster maintains an SSTable for a specific range of keys. This allows for fast and efficient data retrieval for specific keys, as well as efficient searching for a range of keys.

SSTables are used in databases for the following purposes:

  • Key-value storage: The keys in an SSTable are stored in lexicographic order and are used as an index to efficiently store, retrieve, and search for values.
  • Range queries: SSTables support fast range searches, allowing users to efficiently find all values associated with keys within a specific range.
  • Data compaction: Over time, as data is added to the database, it is periodically compacted into a new SSTable to remove any redundant or outdated data, resulting in more efficient storage and faster searches.

In summary, SSTables are an important component in modern databases that provide efficient and scalable storage, retrieval and search for large amounts of data.

LSM-Tree

LSM-Tree (Log-Structured Merge-Tree) is a data structure used in modern databases to store key-value pairs. It’s a combination of log-structured storage and a B-tree. The data is first written to a log-structured memory (LSM) before it’s later merged into a B-tree. The goal of the LSM-Tree is to reduce write amplification, which is a problem in B-tree-based databases, by batching multiple writes into a single write operation. This results in more efficient disk usage and improved write performance.

In modern databases, the LSM-Tree data structure is often used in NoSQL databases such as Cassandra, RocksDB, and LevelDB. These databases use the LSM-Tree to store large amounts of data in a highly scalable manner. The LSM-Tree provides high write performance and is also good for handling read-intensive workloads by reducing the number of disk seeks required to retrieve data. Additionally, it’s good for data that is frequently updated because the LSM-Tree can batch multiple updates into a single write operation.

B-Tree

B-Tree is a data structure used for indexing and searching large datasets efficiently. It is widely used in modern databases and is often the primary data structure for storing indices.

B-Tree is a balanced tree structure that ensures that the keys are stored in a sorted order, and all nodes in the tree have nearly equal number of keys. This makes B-Tree well-suited for implementing indices, as it provides logarithmic time complexity for searching, inserting and deleting operations, which is important for databases where read and write operations are performed frequently.

B-Tree is a variation of the traditional binary search tree, with the difference being that in a B-Tree, a node can have multiple children, and not just two as in a binary search tree. This allows for larger keys to be stored in a single node, reducing the height of the tree and thereby improving the search performance. The number of children a node can have is referred to as its order.

In modern databases, B-Trees are often used to implement indices on the primary data stored in the database. The indices are used to speed up search operations by narrowing down the search space. For instance, if a user wants to search for all the customers with a certain name, the database will first search the index to find the exact location of the data and then retrieve the relevant data from the primary store.

Another common usage of B-Trees in modern databases is for implementing on-disk data structures, where the keys and values are stored on a hard disk, as opposed to being kept in memory. This allows databases to handle large datasets that are too big to fit into memory, and is essential for databases that need to persist data even after the process has terminated.

In summary, the B-Tree data structure is an important component of modern databases, providing efficient indexing and searching capabilities for large datasets.

Inverted Index

An Inverted Index is a data structure that maps terms or words to their corresponding document IDs in a database. It’s a key data structure used in modern databases to enable efficient full-text search capabilities.

The basic idea behind an inverted index is to store a mapping of terms to their occurrences in a set of documents. For each term, the inverted index lists all of the documents that contain the term along with the frequency of the term in each document. This information allows for fast searching of documents based on the presence of specific terms.

One of the most common use cases of inverted indexes is in full-text search engines, where users submit search queries consisting of one or more terms and the search engine returns a list of documents that match the query terms. Inverted indexes make it possible for search engines to quickly identify the documents that contain the terms being searched for and return them in the order of relevance.

Inverted indexes are also used in other areas such as information retrieval, spell correction, and text classification. They are commonly used in modern databases such as Apache Solr, Elasticsearch, and Lucene.

In terms of implementation, an inverted index can be represented as a hash table or a tree-based data structure, such as a B-tree, to store the mapping of terms to their occurrences. This allows for fast retrieval of the information associated with a term. The specific choice of data structure depends on the specific requirements of the database and the use case.

Suffix Tree

A Suffix Tree is a data structure used to represent the set of all suffixes of a given string or a set of strings. It is a compressed trie where each leaf node represents a suffix of the given text, and the paths from root to leaf nodes correspond to the suffixes of the text.

In real modern databases, Suffix Trees are used to solve a variety of problems related to string matching, pattern matching, and text searching. For example:

  1. Text Searching: Suffix Trees can be used to find substrings, patterns, or regular expressions in a text efficiently.
  2. Longest Common Substring: Suffix Trees can be used to find the longest common substring between two or more strings.
  3. String Matching: Suffix Trees can be used to match a string with a set of patterns, and find all occurrences of the pattern in the text.
  4. Sequence Alignment: Suffix Trees can be used to align sequences in bioinformatics.
  5. Natural Language Processing: Suffix Trees can be used to extract information from large volumes of unstructured text, such as named entity recognition and sentiment analysis.

Suffix Trees have been used in various real-world applications, including search engines, text editors, and bioinformatics tools. However, despite their many advantages, Suffix Trees have a high memory overhead and can be slow to construct. This makes them unsuitable for some applications, especially those that require real-time performance.

R-Tree

R-Tree is a tree data structure that is commonly used for indexing multi-dimensional data, such as geographical coordinates or rectangles in 2D space. It provides an efficient way to search for objects that intersect with a specific region, making it ideal for queries such as “find all objects within a certain area”.

In an R-Tree, each node represents a rectangular area, and the children of the node are other rectangles that are contained within that area. The leaves of the tree contain the actual data objects. When a new data object is inserted, the R-Tree is updated to maintain a balance between the number of objects in each node, so that the tree remains efficient for search operations.

R-Trees are commonly used in modern databases for spatial indexing, such as in geographic information systems (GIS) and computer-aided design (CAD) systems. They are also used in databases that handle large amounts of two-dimensional data, such as image databases or document management systems.

Some popular databases that use R-Trees include PostgreSQL, MySQL, SQLite, and Microsoft SQL Server. The use of R-Trees can significantly improve the performance of complex spatial queries, making them a valuable tool for many modern databases.

🌎 Let’s Connect!
LinkedIn

--

--

Lokendra jain
Lokendra jain

Written by Lokendra jain

Sr. Engineer@Google. Strong Interest in Distributed Systems, and System Design. https://www.linkedin.com/in/lokendrajain/

No responses yet