Data is a core asset of a company. It becomes the moat of the product when AI and Machine Learning technologies expose the hidden value of customer behavior data. Data always arrives massive and messy – a single blog website (just like this one) can generate millions of data points every day, yet on their own they are too isolated and incomplete to paint a whole picture of the site's audience and customers. The data has value, but very little value if we don't analyze it. In this article, I will share the basic concepts of big data processing and storage. As a tech executive write-up, it won't resolve your problem immediately, but it provides a mental model to think with.
Data Storage Systems.
We hear multiple data storage terms like Database, Data Warehouse, and Data Lake. Here is a quick comparison of them.
- Database. It is used for real-time transactional data interaction to support applications that interface with customers or record the data. There are two types of databases: a) Relational databases like Oracle and MySQL, which use tables to store structured data and support rich join operations; b) NoSQL databases like DynamoDB and MongoDB, which store data as key-value pairs, providing fast reads and writes but fewer complex query functions (supported through additional applications). A small code sketch contrasting the two styles follows this list.
- Data Warehouse. It is used for business or statistical analysis on data aggregated from multiple sources such as databases and log systems. Besides the raw data, it provides timely snapshots and statistical data (processed raw data) to accelerate comprehensive analysis. Typical data warehouses are Snowflake and Redshift.
- Data Lake. It stores the raw data from the data sources or backup snapshots. Typical systems are S3 and other object storage services.
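To make the database comparison above concrete, here is a minimal, illustrative Python sketch. It is not a production setup: SQLite stands in for a relational database like MySQL, and a plain in-process dictionary stands in for a key-value store like DynamoDB; the table names and keys are hypothetical.

```python
import sqlite3

# --- Relational style: structured tables plus joins (SQLite stands in for MySQL/Oracle) ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")
# A join query answers "total order amount per user" directly inside the database.
rows = conn.execute(
    "SELECT u.name, SUM(o.amount) FROM users u JOIN orders o ON o.user_id = u.id GROUP BY u.name"
).fetchall()
print(rows)  # [('alice', 25.0)]

# --- Key-value style: fast reads/writes by key, no joins (a dict stands in for DynamoDB/MongoDB) ---
kv_store = {}
kv_store["user#1"] = {"name": "alice", "orders": [{"id": 10, "amount": 25.0}]}
print(kv_store["user#1"]["orders"])  # one fast lookup by key; richer queries need extra application code
```

The relational side answers cross-entity questions with a join inside the database, while the key-value side gives one fast lookup per key and pushes more complex queries up into the application layer.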
ETL – Extract, Transform, and Load. It’s a typical big data processing model.
- Extract. It is an ingestion process that extracts raw data from the data sources. The raw data could live in a database, a system log, a mobile/embedded device acting as a real-time source, or any other place where data is generated. The extraction can happen in real time or in overnight batches. Streaming extraction typically leverages Kafka, an event-driven messaging system, to pull the data in.
- Transform. It consumes the data extracted from the source and uses streaming or batch methods to clean, aggregate, and convert it into the desired format. Transform has evolved from batch to streaming, with three typical open-source solutions – Hadoop, Spark, and Flink. We will compare them in the following sections.
- Load. It saves the processed data from the Transform layer into a permanent data storage system, typically a data warehouse. A minimal end-to-end sketch of the three stages follows this list.
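Here is a minimal, self-contained Python sketch of the three stages. It is illustrative only – the in-memory log lines stand in for a Kafka topic or application log, and SQLite stands in for a data warehouse like Snowflake or Redshift; the event schema is hypothetical.

```python
import json
import sqlite3
from collections import Counter

# --- Extract: ingest raw events; newline-delimited JSON stands in for a Kafka topic or app log.
raw_lines = [
    '{"user": "alice", "page": "/pricing", "ms": 120}',
    '{"user": "bob", "page": "/pricing", "ms": 340}',
    'not-json-garbage',  # messy data is the norm
    '{"user": "alice", "page": "/blog", "ms": 95}',
]

# --- Transform: clean (drop malformed rows) and aggregate (page-view counts).
events = []
for line in raw_lines:
    try:
        events.append(json.loads(line))
    except json.JSONDecodeError:
        continue  # in production this would go to a dead-letter queue instead of being dropped silently
page_views = Counter(event["page"] for event in events)

# --- Load: write the processed snapshot to permanent storage; SQLite stands in for a data warehouse.
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("CREATE TABLE IF NOT EXISTS page_views (page TEXT PRIMARY KEY, views INTEGER)")
warehouse.executemany("INSERT OR REPLACE INTO page_views VALUES (?, ?)", page_views.items())
warehouse.commit()
print(dict(page_views))  # {'/pricing': 2, '/blog': 1}
```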
Beyond ETL, two more areas deserve attention – data visualization (e.g., Looker) to make the processed data human-friendly, and machine learning to mine the data more deeply. They are out of the scope of this blog.
Within ETL, Extract is very dynamic and binds tightly to the system architecture; Load is a downstream function of Transform and usually depends on it. I will discuss the typical technologies in the Transform layer across multiple dimensions of comparison such as architecture, reliability, cost, and fault tolerance.
- Hadoop. Hadoop leverages cheap commodity hardware with the Map-Reduce architecture to process complete datasets in a batch-oriented way. It uses disk as intermediate data storage to handle massive data with the highest scalability, low cost, and high fault tolerance. It doesn't support iteration or caching and has relatively low performance compared to the other solutions due to disk-based data exchange. It has an active open-source community and plenty of best practices to follow and consult. (A word-count sketch contrasting the Map-Reduce and in-memory styles follows this list.)
- Spark. Spark leverages memory-focused hardware with a micro-batch architecture to process streaming data. It depends on memory for data exchange to handle streaming data with high scalability, high cost, and high fault tolerance. It supports iteration and caching via RDDs, with performance between Hadoop and Flink. It has an active open-source community and best practices to follow and consult. Thanks to the micro-batch computational model, it can balance between batch and streaming.
- Flink. Flink leverages memory-focused hardware with a continuous operator-based model to process streaming data. It depends on memory for data exchange to handle streaming data with high scalability, high cost, and high fault tolerance. It supports iteration and caching and has the highest performance compared to Hadoop and Spark. Its open-source community is less active, with fewer best practices to follow and consult.
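To make the difference between Hadoop's disk-based Map-Reduce and Spark's in-memory RDD model concrete, here is the classic word count in both styles. Both snippets are illustrative sketches: the first follows the Hadoop Streaming convention (separate mapper and reducer processes reading stdin and writing stdout), and the second assumes a local pyspark installation; the HDFS input path and app name are hypothetical placeholders.

```python
# Word count in the Hadoop Streaming convention: the mapper and reducer run as separate
# processes, and Hadoop shuffles the intermediate (word, 1) pairs through disk between them.
import sys

def run_mapper(stdin=sys.stdin):
    for line in stdin:
        for word in line.split():
            sys.stdout.write(f"{word}\t1\n")  # emit (word, 1)

def run_reducer(stdin=sys.stdin):
    # Hadoop delivers the mapper output sorted by key, so equal words arrive together.
    current, count = None, 0
    for line in stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current and current is not None:
            sys.stdout.write(f"{current}\t{count}\n")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        sys.stdout.write(f"{current}\t{count}\n")

if __name__ == "__main__":
    # In a real job this script is wired into the Hadoop Streaming jar as -mapper and -reducer.
    if sys.argv[1:] == ["map"]:
        run_mapper()
    else:
        run_reducer()
```

```python
# The same word count on Spark: the data lives in an in-memory RDD, which can be cached
# and reused across iterations instead of being re-read from disk at every stage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").master("local[*]").getOrCreate()
counts = (
    spark.sparkContext.textFile("hdfs:///logs/access.log")  # hypothetical input path
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .cache()  # keep the result in memory for further queries or iterative algorithms
)
print(counts.take(5))
spark.stop()
```

The architectural difference shows up in the middle of the job: Hadoop shuffles the intermediate pairs through disk between the two processes, while Spark keeps the RDD in memory, so caching and iteration are cheap.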
All three solutions support machine learning during processing through an additional machine-learning library. Today, Hadoop is popular for offline batch processing where latency is not a concern, while Spark is used for online streaming with second-level latency.
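As a closing illustration of Spark's micro-batch model and its second-level latency, here is a minimal PySpark Structured Streaming sketch. The socket source, host, port, and one-second trigger are illustrative placeholders – real pipelines would more commonly read from Kafka – and it assumes a local pyspark installation.

```python
# Minimal sketch of Spark's micro-batch streaming: events are grouped into small batches
# (triggered roughly every second) rather than processed one by one or overnight.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pageviews-stream").master("local[*]").getOrCreate()

# One line per event; for a quick local demo the socket is usually fed with `nc -lk 9999`.
events = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

page_views = events.groupBy(F.col("value").alias("page")).count()

query = (
    page_views.writeStream.outputMode("complete")
    .format("console")
    .trigger(processingTime="1 second")  # micro-batches at second-level latency
    .start()
)
query.awaitTermination()
```

Each trigger closes a small batch of events and refreshes the aggregation, which is what keeps the latency at the second level rather than at the scale of an overnight batch job.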