Database Types

Databases types and how to choose them?

When we start building a new product or application, choosing database becomes one the most debatable aspect. In early days this was very simple - RDBMS. However, now choosing the right database has become a bit more complicated and a critical step, because once real data lands into the database, it is a lot of effort to migrate it from one database to another database.

In this post we are not going to discuss about databases like MySQL or MongoDB but rather we will discuss about the types of database like RDBMS, Graph DB or Document DB. We will go through most popular types of databases and try to understand which type of database to use in any given situation. Any system which have decent scale and lots of moving parts use more than one type of databases.

Tip for Interview: If you are not sure which database fits in a given solution, then avoid talking about the actual database like Postgresql or MongoDB, until you know all the nuances around it. Instead make use of terms like RDBMS or DocumentDB or GraphDB that refer to database types.

Three factors that help choose the right database

Structure of the data: How will the data be organized? Is it a fixed schema or flexible schema etc.
Query pattern: How is the data going to be accessed, what types of operations will need to be performed, and what will be the frequency of these operations?
Scale: What will be the volume of data and expected growth? This impacts performance and storage requirement.

Database types and their usage pattern

Flat File Databases This is the oldest and most basic form of data store, where data is stored in plain text files.
- Usage pattern: This is useful for small applications where a file can contain simple and non-relational data.
- Use case: Think of simple application that wants to store few settings information, or a simple spreadsheet.
- Examples: Data stored in .txt text files or .csv comma separated files
Relational Databases (RDBMS) These are the most widely used databases. They store data in tables of rows and columns, where relationships between the tables are established using foreign keys. Data can be queried using SQL (Structured Query Language). Only area where these struggle are in efficiently handling large scale and unstructured data.
- Uses Pattern: Applications requiring fixed schema, high consistency and honoring ACID properties. These are optimized for storage and capable of handling complex queries.
- Use case: Widely used most of the cases, from building e-commerce websites to banking systems.
- Example: MySQL, MS SQL Server, Oracle, and PostgresSQL
NoSQL Database These are designed handle unstructured or semi-structured data, providing more flexibility in how data is stored. These databases often sacrifice strict consistency for availability and partition tolerance (refer to the CAP theorem topic). There are different types of NoSQL Databases.
- Document Stores Here data is stored in documents form. Usually these are in JSON format.
  - Usage Pattern: These are ideal for applications requiring flexible schema where documents can be different in structure, semi structured, nested and typically in JSON or XML formats.
  - Use case: Content management systems, e-commerce applications, logging and analytics, and user data storage.
  - Examples: MongoDB, Apache CouchDB, Couchbase, and Amazon DocumentDB
- Key-Value Stores Here data is stored as key-value pairs. These are optimized for very fast lookup. This is the simplest type of NoSQL Database.
  - Usage Pattern: These are ideal for caching and session management. These are optimized for quickly retrieving data based on a specific key.
  - Use case: Web session management, storing user profiles, gaming leader-boards, and real-time recommendation engines.
  - Examples: Redis, and Amazon DynamoDB
- Column Stores Here data is stored in columns instead of rows. These are optimized for large scale analytics.
  - Usage Pattern: These are ideal for analytics and big data systems where aggregating large scale data across columns is common. Here data can be queried based on column family.
  - Use case: Useful in Data warehouses, OLAP, and real-time analytics. Applications that need to handle wide tables with large number of attributes like telemetry or logs.
  - Examples: Apache Cassandra used for large scale distributed systems and Apache HBase built on top of Hadoop.
- Graph Databases Here data is stored as nodes (entities) and relationships (edges). These are optimized for analyzing networks of relationships. These excel in managing highly connected data such as social networks.
  - Usage Pattern: These are useful for scenarios where the relationships between data points are as important as the data itself, and queries often involve exploring connections
  - Use case: Ideal for Social networks, recommendation engines, fraud detection, knowledge graphs, and any scenario where exploring relationships between data entities is crucial.
  - Examples: Neo4j, Amazon Neptune, Apache TinkerPop (with various implementations like Gremlin, Datastax Stargate)
Specialized Databases These are databases and storage systems that are designed for specialized use cases. They help us tackle diverse set of modern data storage needs.
- In-Memory Databases These store data directly in the main memory (RAM) instead of storing data on the disks. This makes retrieval of data extremely fast.
  - Usage Pattern: These are used in applications requiring low-latency data access, such as caching, real-time analytics, leader boards, or high-speed transaction processing.
  - Use case: These are ideal for gaming leader boards, trading platforms, caching to reduce load on persistent databases, or any application which require high throughput and low latency
  - Examples: Redis, Memcached, etcd etc.
- Time Series Databases These are designed to store, manage, and query time-stamped data. These are structured to handle data points indexed by time, involve append only writes and support large sequential reads.
  - Usage Pattern: These are used in applications that track time-based data such as data from sensors, stock prices or logs.
  - Use case: Monitoring performance metrics, Historical stock prices, and IoT sensor data
  - Examples: InfluxDB, Prometheus, and TimeScaleDB
- Search Databases These are optimized for full-text search operations, allowing fast, complex queries over large datasets, especially unstructured or semi-structured data.
  - Usage Pattern: These are useful for applications requiring indexing of documents or data records in a way that allows for fast searches, including keyword matching, filtering, fuzzy searches, and ranking results by relevance.
  - Use case: Ideal for website wide searches, search engines, product searches on e-commerce sites, and document searches in a content management system.
  - Examples: Elasticsearch, Solr
- Object or Blob(Binary Large Object) Storage These are used for storing unstructured or sem-structured data and are highly scalable and offer robust metadata management for the stored files. Each object includes the data itself, metadata, and a unique identifier.
  - Usage Pattern: These are useful for application requiring to store large amounts of unstructured or semi-structured data, usually in the cloud.
  - Use case: Ideal for storing files, videos, images, backups, logs and access these files vis REST API
  - Examples: AWS S3, and Google Cloud Storage
- Block Storage These break data into fixed-sized blocks and store them across a storage array. Each block is assigned a unique address, and applications can directly manage the blocks for more granular control over data.
  - Usage Pattern: These are useful for applications requiring high-performance (high IOPS and low latency) storage and direct access to raw data.
  - Use case: High performing computing, database storage, and storing data for virtual machines in cloud.
  - Example: Amazon EBS, Azure Manage disk, SSD or NVMe devices, iSCSI and SANs (Storage Area Networks)
- Data Warehouse These are a specialized type of database designed for analytical processing. These store large volumes of historical data from various sources and is optimized for complex queries, reporting, and analysis rather than transaction processing. Data is cleaned, transformed, and stored in the data warehouse.
  - Usage Pattern: These are useful for application requiring to do aggregations and rollup reportings on large scale structured historical data that comes from various systems, like ERP (Enterprise Resource Planning) systems, CRMs (Customer Relationship Management), and other business software.
  - Use cases: Business Intelligence (BI), Data Analytics, Trend forecasting, and Decision making based on roll-ups and aggregates.
  - Examples: AWS Redshift, Google BigQuery, and Snowflake
- Data Lake This is a storage system designed to hold vast amounts of raw data in its native format, whether structured, semi-structured, or unstructured. Data lakes are often used in big data environments and can handle everything from text and images to IoT sensor data. Here data is in its raw form and not cleaned or transformed
  - Usage Pattern: These are useful for applications that work with analyzing and querying raw data to run advanced analytics or to do big data processing.
  - Use cases: Storing data for future analysis, storing data for running machine learning workflows and big data analytics.
  - Examples: AWS S3, Azure Data Lake, Google Cloud Storage.
- Vector Databases These are designed to store, index, and search across high-dimensional vector representations of data. These vectors often come from machine learning models that transform text, images, audio, or other types of data into numerical vectors, enabling similarity searches. These are essential for RAG based AI use cases.
  - Usage Pattern: These are useful for application that need similarity searches, where the goal is to find data points (e.g., documents, images) that are "close" to each other in a multi-dimensional vector space.
  - Use cases: These are used for Semantic Searches, Recommendation engines, RAG based AI Queries.
  - Examples: Pinecone managed vector database, Weaviate an open source vector search engine, FAISS from Facebook AI Research.
Legacy Databases These are the oldest types of databases. They are still used in some legacy systems. You would not propose these database types in current day. Mentioning them here for sake of awareness.
- Hierarchical Databases These databases organize data in a tree-like structure. Each data item is a node in a tree-like structure, where a parent node can have multiple children, but each child has only one parent. These database are inflexible and cannot handle complex relationships.
  - Usage pattern: This is useful for applications where the data has clear or strict parent-child relationships and needs quick retrieval without the need for complex queries.
  - Use case: Applications wanting to store hierarchical charts, network topologies or file system.
  - Example: IBM's Information Management System (IMS) is the most popular and used even today in large mainframe systems.
- Network Databases These where improvements over Hierarchical Databases, in that these allowed more complex relationships to be defined. Here each child node can also have multiple parent nodes.
  - Usage pattern: Useful where application requiring flexibility of complex relationships.
  - Use case: These are still in use but in legacy telecommunication systems.
  - Example: CODASYL (Conference on Data Systems Languages)

How does the structure of data influences database choice?

The structure of data refers to how the data is organized, represented, and related to one another in a system. This structure significantly influences the choice of the database because different databases are optimized for different types of data structures, storage formats, and querying patterns.

Data can be:
- Structured - Employee record, Drug Master, Doctor Appointment, etc
- Semi Structured - Json and XML for Product Catalogs
- Un-structured - Images, audio, video files for media platforms or document management systems.
- Graph-like - Social Networks
- High Dimensional Vector - Data represented as multi-dimensional vectors like Text embeddings or Image embeddings.
Structure of data influence database choice through requirements of
- Schema flexibility - RDBMS vs NoSQL
- Querying pattern - RDBMS vs Graph vs In-Memory
- Data size - RDBMS vs NoSQL Databases vs Blob Storage

How does the query pattern influence database choice?

A query pattern refers to the typical structure and type of operations your application will perform on the database, such as how you retrieve, insert, update, or delete data. It represents the most common ways your system will interact with the data and is a key factor in choosing the right database.

Simple Lookups: - Searching for a specific item based on a unique identifier (e.g., user ID, product ID). Key-Value Stores like Redis or DynamoDB are optimized for fast lookups using unique keys.
Range Queries: - Fetching data that falls within a specific range, such as a date range or numerical range. Time-Series Databases (e.g., InfluxDB) or Relational Databases (e.g., MySQL) work well for range queries
Relational (Join) Queries: - Combining data from multiple tables or collections based on relationships between them, such as foreign keys. Relational Databases (e.g., PostgreSQL, MySQL) excel at handling joins and complex relationships between structured data tables.
Full-Text Search: - Searching through large amounts of text data for matches to specific keywords or phrases. Search Databases like Elasticsearch are designed for full-text search, offering fast and accurate text-based query results.
Graph Traversal Queries: - Querying the relationships between data points, such as connections between people or entities. Graph Databases like Neo4j are optimized for traversing relationships between nodes and efficiently answering queries about connections.
Aggregation Queries: - Summarizing or aggregating data (e.g., counting, averaging, summing) over large datasets. Columnar Databases (e.g., Apache Cassandra, HBase) or Data Warehouses (e.g., Amazon Redshift, Google BigQuery) are built for handling large-scale aggregations efficiently.
Geospatial Queries: - Querying data based on geographic locations or spatial relationships. Geospatial Databases or databases with geospatial extensions like PostgreSQL with PostGIS or MongoDB are best for geospatial queries.
Similarity Search: - Finding data points that are similar to a given input based on vectors or embeddings. Vector Databases (e.g., Pinecone, Weaviate) are optimized for similarity search in high-dimensional vector space, often used in machine learning and AI applications.

Lets look at different scenarios and identify the database types to use

Microservices architecture

In a microservices architecture, every micro service may have its own database. Take a example of any e-commerce like amazon.com. At a very high level e-commerce platform will need to have features like Product Catalog, Product Search, Shopping Cart and Checkout, and View Orders. If e-commerce platform is small, then RDBMS can be used for everything, however, things change when dealing with a large scale microservices platform. In such a case every feature might be a separate micro service with its own choice of database backing its persistence needs.

Product Catalog: Document DB, because each product can have different set of features. This also need to query based on attributes
Product Search: Elastic Search
Cart/Checkout: RDBMS, because transition is required
View Orders: Recent orders can be stored in RDBMS, historical orders can be store in Columnar DB like casandra DB

Real-time gaming leader-board

What type of database is recommended for a real-time gaming leader-board and why?

Recommended database: In-Memory database, key-value data store like Redis
Reasoning: In a real-time gaming leader-board data changes frequently, and both updates/inserts and reads need to be fast
- Performance: In-memory databases provide extremely low latency for read and write, which is crucial for real-time applications.
- Scalability: Key-value stores like Redis can handle high concurrent access, which is important for popular games.
- Data structure support: Redis, for example, supports sorted sets, which are perfect for maintaining leader-boards efficiently.

Product catalogs in a e-commerce system

What type of database is recommended in an e-commerce platform for storing product catalogs and why?

Recommended database: Document DB like MongoDB
Reasoning: Document DBs are perfect for product catalogs because each product can have different attributes. Document is schema-less or support flexible schema. For example, shirts have size and color, while food items have entirely different attributes. If we tried to store this in an RDBMS, it would result in a very wide table with many null values.
- Performance: For read-heavy operations, which are common in product catalog use cases, document DBs can offer good performance, especially when properly indexed
- Query support: Document DBs support querying on these varied attributes efficiently.

Order in a large scale e-commerce system

In a large-scale e-commerce system, what types of databases is recommended for recent orders versus historical orders?

Recommended databases: RDBMS for recent orders and Columnar Datastore for historical orders.
Reasoning: RDBMS provides ACID properties, which are crucial for recent orders where consistency and transactions are important while Columnar databases for historical data provides stability advantages for handling large amounts of historical data.
- Performance: RDBMS faster querying with caching give better performance on RDBMS while Columnar database provides better support for running analytics on historical data.

Product search in a large scale e-commerce system

In a large-scale e-commerce system, what types of databases is recommended for supporting product search?

Recommended databases: Elasticsearch for indexing products.
Reasoning: Elasticsearch support fast searches of products through efficient search indexes.
- Performance: Faster search through indexing on keywords.
- Scalability: Designed to handle large volumes of data.
- Advanced search support: Elasticsearch supports full-text search with stemming, synonyms, and fuzzy matching to help with incorrect spellings in search terms. It also supports search by relevance to ensure most appropriate results appear at the top.

IoT Sensor data

What type of database would you recommend for storing and analyzing large volumes of IoT sensor data, and why?

Recommended databases: Time Series databases like InfluxDB or TimescaleDB.
Reasoning: Time Series database provide efficient storage option for large volumes of timestamped data.
- Performance: Time Series database provide very high write throughput which is the when dealing with IoT sensors.
- Query support: These provide optimized time based querying capabilities.
- Storage support: They provide built-in feature for data retention and down sampling (rolls up documents within a fixed time interval into a single summary document).

Real-time analytics vs long term data archiving with batch analytics

How does the choice of database differ for a system handling real-time analytics versus the system focused on long-term data archiving and occasional batch analytics?

Recommended database: Time series database for for real-time analytics
Reasoning:
- designed to ingest high volumes of time-stamped data
- offer efficient querying and aggregation over time ranges
- often provide built-in functions for time-based analysis
- support real-time querying on incoming data
- examples include InfluxDB or Prometheus
Recommended database: Columnar database for long-term data archiving and occasional batch analytics
Reasoning:
- designed for efficient storage of large amounts of data.
- Often provide high compression ratios, reducing storage costs for large datasets
- excel at analytical queries that involve large scans over specific columns
- optimized for read-heavy workloads typical in batch analytics.

Financial transactions data vs social media network's user interactions

How does the choice of database differ for a system handling financial transactions versus a system managing social media network's user interactions?

Recommended database: RDBMS for financial transactions
Reasoning:
- Financial transactions require ACID properties which is what RDBMS excels at.
- RDBMS will ensure data integrity and consistency
- RDBMS will support complex queries which involve complex joins, filter, grouping and nesting.
Recommended database: Graph Database like Neo4j or AWS Neptune for managing social media network's user interactions.
Reasoning:
- Social media platform need to prioritize high availability and partition tolerance over strong consistency.
- Graph database provide support for complex relationships and optimize relationship traversing.
- Social media systems need real-time responses for queries like "What posts did my friends like recently?" Graph databases excel in querying this kind of interconnected data quickly and efficiently.
- Posts, comments, likes, and shares can be modeled as nodes, with edges representing interactions between users and content.

Write heavy workloads vs read heavy workloads

How does the choice of database differ for a system designed primarily for write heavy workloads versus a system designed for read heavy workloads?

Recommended database: NoSQL datastore like Cassandra or MongoDB for Write heavy workload
Reasoning:
- Scalability: nodes need to efficiently sync data across cluster and often require sophisticated conflict resolution
- Consistency model - Eventual consistency to maintain performance
- Storage engine - Databases optimized for writes often use storage engine like LSM trees (long structure merge trees)
- Tread off: sacrifice some read performance and strong consistency
Recommended database: In-Memory store like Redis or Search datastore like Elasticsearch or RDBMS for Read heavy workload
- Scalability: Multiple read replica
- Consistency model: RDBMS can support strong consistency
- Indexing: RDBMS or Search datastore can support optimized indexing
- Trade offs: Lower write throughput and storage cost due to many read replicas

Try few questions

<-- Back

Take quiz -->