Big Data
Big Data ecosystem
- Apache Hadoop ecosystem
- Apache Spark ecosystem
- Hive
- Spark Cluster Management
- Hive on Spark vs Spark with Hive (Metastore)
- Hadoop distributions
- Hadoop file formats
- NoSQL database systems
- Types of NoSQL databases
-
Apache Hadoop ecosystem
- Apache Hadoop: A platform for distributed storage and distributed processing of large data sets.
- HDFS (Hadoop Distributed File System): Data storage system (storage layer).
- YARN (Yet Another Resource Negotiator): Resource management and job scheduling/monitoring.
- MapReduce: A YARN-based system for parallel processing of large data sets (execution/processing engine).
- Apache Tez: A framework for YARN-based data processing applications in Hadoop (execution/processing engine).
- Apache Spark: A unified analytics engine for large-scale data processing (execution/processing engine).
- Apache Hive: Analytical SQL on Hadoop (Hive allows executing SQL-Like queries on data stored in HDFS or HBase).
- HiveQL (HQL): Hive query language (SQL-Like language).
- Apache Impala: Analytical SQL on Hadoop (Impala allows executing SQL-Like queries on Hadoop data stored in HDFS or HBase).
- Apache HBase: Column-oriented distributed database system (NoSQL database).
- Apache Mahout: A scalable machine learning and data mining library.
- Apache Flume: (unstructured/semi-structured data) Collecting, aggregating, and moving large amounts of log data (log collector).
- Apache Sqoop: (structured data) Transferring bulk data between Apache Hadoop and structured datastores (e.g., relational databases).
- Apache Oozie: Workflow Scheduler for Hadoop (scheduling).
- Apache Pig: A platform for analyzing large data sets using a high-level scripting language (Pig Latin).
- R Connectors: Using R with Hadoop to apply algorithms (statistical computations) on data sets stored in HDFS.
- Apache Ambari: Hadoop cluster management console (provisioning, managing, and monitoring Hadoop clusters).
- Apache ZooKeeper: A high-performance coordination service for distributed applications (configuration, coordination, ...).
- Apache Kafka: A distributed streaming platform.
- Apache Solr: An enterprise search platform built on Apache Lucene.
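MapReduce (listed above) processes data in three conceptual phases: map, shuffle/sort, and reduce. A minimal pure-Python sketch of the classic word-count job can make the flow concrete; the function names here are illustrative only, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle/sort: group all emitted values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In a real Hadoop job, the map and reduce functions run in parallel across the cluster, and the shuffle moves data between nodes; the data flow, however, is exactly the one sketched here.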
-
Apache Spark ecosystem
- Apache Spark: A unified analytics engine for large-scale data processing.
- Spark Core API (supported languages: Java, Scala, Python, R).
- Spark Core Engine: The execution/processing engine that provides in-memory computing capabilities (vs MapReduce, Tez).
- Spark SQL (+ DataFrames): A module for structured data processing using interactive SQL queries (vs Apache Hive).
- Spark MLlib: A scalable machine learning library (vs Apache Mahout).
- Spark Streaming: Streaming analytics across both streaming and historical data (vs Apache Storm).
- Spark GraphX: Graph computation engine.
- SparkR (R on Spark): R package that provides a light-weight frontend to use Apache Spark from R.
- Spark Shell: spark-shell (Scala), pyspark (Python), and sparkR (R).
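A defining trait of the Spark Core API is that transformations (map, filter, ...) are lazy and only run when an action (collect, count, ...) is called. A toy single-machine analogue in pure Python (the `MiniRDD` class is hypothetical, not part of Spark) illustrates the idea:

```python
class MiniRDD:
    """A toy, single-machine analogue of Spark's RDD API (hypothetical class):
    transformations are lazy; actions force evaluation of the whole chain."""

    def __init__(self, data):
        self._data = data  # any iterable; nothing is computed yet

    def map(self, fn):
        # Lazy transformation: wraps a generator, computes nothing yet.
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces evaluation of the accumulated pipeline.
        return list(self._data)

rdd = MiniRDD(range(10))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

In real Spark the same chain would be distributed across executors and the lazy plan would be optimized before running; the programming model, building a pipeline and then triggering it with an action, is the same.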
-
Hive
Hive Architecture
:
- Hive is a data warehouse service built on top of Apache Hadoop.
- Hive simplifies the definition of structure for unstructured data.
- Hive allows the creation of queries with an SQL-like language (HQL).
- Hive uses a relational database to hold schema information (metastore).
- Hive interacts with HDFS, which stores the unstructured data.
- Hive uses MapReduce, Tez, or Spark (execution/processing engine) to transform unstructured data (using HQL queries) into structured data (conforms to the defined schema).
- Metastore: stores data schema information (provides a structure for the unstructured data).
- Hive QL: query definition and processing.
- Hive QL: uses MapReduce, Tez, or Spark (execution/processing engine) to interact with the data storage (HDFS or HBase).
- Hive CLI: a command-line tool for running queries against Hive Server.
- Hive Web UI: provides access to Hive configuration settings, local logs, metrics, and information about active sessions and queries.
Hive QL can be extended with
:
- User-Defined Functions (UDF).
- User-Defined Aggregate Functions (UDAF).
- User-Defined Table-Generating Functions (UDTF).
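Besides Java UDFs, Hive can also be extended with external scripts via `SELECT TRANSFORM ... USING 'script'`: the script reads tab-separated rows on stdin and writes rows to stdout. A sketch of such a script (the transformation itself, emitting each name with its length, is an arbitrary example):

```python
import sys

def transform(lines):
    # Read tab-separated input rows; emit "name<TAB>length" output rows.
    out = []
    for line in lines:
        name = line.rstrip("\n")
        out.append(f"{name}\t{len(name)}")
    return out

if __name__ == "__main__":
    for row in transform(sys.stdin):
        print(row)
```

On the Hive side this would be invoked with something like `SELECT TRANSFORM(name) USING 'python script.py' AS (name, name_len) FROM t;` (table and column names here are hypothetical).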
Hive tables
:
- Schema: metadata (tables definition, ...) stored in a relational database.
- Data: files stored in HDFS.
-
Spark Cluster Management
Spark Local Mode
: used to run Spark applications on a local machine using a single JVM.
Spark Standalone Mode
: used to run Spark applications on a Spark cluster using Spark's built-in resource manager.
YARN
: Resource management and job scheduling/monitoring (Hadoop).
Apache Mesos
: Resource management and scheduling across entire datacenter and cloud environments (Hadoop, Spark, Kafka, Elasticsearch, ...).
-
Hive on Spark vs Spark with Hive (Metastore)
Hive on Spark
: Hive uses Spark as its execution engine instead of MapReduce or Tez (hive.execution.engine=spark); queries are written in HiveQL and submitted through Hive.
Spark with Hive (Metastore)
: Spark SQL connects to the Hive Metastore to read and write Hive tables; queries are written and executed with Spark (Spark SQL / DataFrames).
-
Hadoop distributions
- Hortonworks Hadoop Distribution (HDP: Hortonworks Data Platform).
- Cloudera CDH Hadoop Distribution.
- MapR Hadoop Distribution.
- Amazon Web Services Elastic MapReduce Hadoop Distribution (AWS EMR).
-
Hadoop file formats
- Text (JSON, CSV formats): stores data in lines (each line is a record). Lines are terminated by a newline character "\n".
- Apache Avro: stores data in a row-based format.
- Apache Parquet: stores data in a column-based format.
- Apache ORC (Optimized Row Columnar): stores data in a column-based format.
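The row-based vs column-based distinction above is the key trade-off between these formats. A small pure-Python sketch of the two layouts (the dataset is made up for illustration) shows why columnar formats like Parquet and ORC suit analytics: a query over one column scans only that column's contiguous values.

```python
# Hypothetical mini-dataset: three records with two fields each.
records = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": 25},
    {"name": "carol", "age": 35},
]

# Row-based layout (Avro-like): whole records stored one after another.
row_layout = [(r["name"], r["age"]) for r in records]

# Column-based layout (Parquet/ORC-like): each column stored contiguously.
col_layout = {
    "name": [r["name"] for r in records],
    "age": [r["age"] for r in records],
}

# An analytical query over one column (e.g. average age) only needs to
# scan a single contiguous list in the columnar layout.
avg_age = sum(col_layout["age"]) / len(col_layout["age"])
print(avg_age)  # 30.0
```

Row-based formats, conversely, are a better fit for write-heavy workloads and for reading whole records at a time.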
-
NoSQL database systems
Key-value databases
:
- Redis: an in-memory data structure store, used as a database, cache, and message broker.
- MemcacheDB: a persistent key-value store that speaks the memcached protocol (backed by Berkeley DB).
Column-oriented databases
:
- Apache HBase: a distributed, scalable, big data store (modeled after Google Bigtable).
- Apache Cassandra: a distributed, column store, NoSQL database management system.
Document-oriented databases
:
- MongoDB: a distributed, document store, NoSQL database management system.
- Apache CouchDB: a document store, NoSQL database management system.
Graph databases
:
- Neo4j: a graph database management system.
- OrientDB: a distributed multi-model NoSQL database with a graph database engine.
-
Types of NoSQL databases
Key-value database
: stores key-value pairs (similar to a hash table data structure). Useful for caching or queueing.
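The hash-table semantics behind key-value stores (and the caching use case) can be sketched in a few lines of Python; the `KVCache` class below is a hypothetical toy, not an API of Redis or memcached:

```python
import time

class KVCache:
    """A toy in-memory key-value store with optional per-key expiry,
    illustrating hash-table-based caching (hypothetical class)."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        # ttl is a lifetime in seconds; None means the key never expires.
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._store[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._store[key]  # lazy expiry on read, as many caches do
            return default
        return value

cache = KVCache()
cache.put("session:42", {"user": "alice"})
print(cache.get("session:42"))  # {'user': 'alice'}
```

Lookups and inserts are O(1) on average, which is why this model works so well for caching and queueing workloads.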
Column-oriented database
: stores data tables by column rather than by row, so all values of one column (col1-row1, col1-row2, col1-row3, ...) are stored contiguously.
Document-oriented database
: stores data in "schema-less" documents (JSON, XML). Documents store key-value pairs (a value can be a simple type, an array, an object, ...). Useful for nested information.
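A short example of why documents suit nested information: one JSON document can hold a customer and their order items together, so a query walks the structure directly instead of joining tables (the document below is made up for illustration):

```python
import json

# A "schema-less" document: nested objects and arrays, no fixed table layout.
doc = json.loads("""
{
  "id": "order-1001",
  "customer": {"name": "alice", "email": "alice@example.com"},
  "items": [
    {"sku": "A1", "qty": 2, "price": 9.5},
    {"sku": "B2", "qty": 1, "price": 40.0}
  ]
}
""")

# Nested information is read by walking the structure, with no joins.
total = sum(item["qty"] * item["price"] for item in doc["items"])
print(doc["customer"]["name"], total)  # alice 59.0
```

A relational design would split this across customer, order, and order-item tables; a document store keeps each aggregate in one place.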
Graph database
: stores data as nodes, relationships (edges), and properties. Useful for modeling classification and handling relational information.
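The nodes/relationships/properties model can be sketched with an adjacency structure in Python; the data and the `neighbors` helper below are hypothetical, and real graph databases (e.g. Neo4j with Cypher) add indexing and a query language on top of exactly this kind of traversal:

```python
from collections import defaultdict

# A toy property graph: nodes with properties and typed, directed edges.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "neo4j": {"kind": "database"},
}
edges = defaultdict(list)  # node -> [(relationship, neighbor)]
edges["alice"].append(("KNOWS", "bob"))
edges["alice"].append(("USES", "neo4j"))
edges["bob"].append(("USES", "neo4j"))

def neighbors(node, rel):
    # Traverse edges of one relationship type outgoing from a node.
    return [dst for r, dst in edges[node] if r == rel]

print(neighbors("alice", "KNOWS"))  # ['bob']
print(neighbors("alice", "USES"))   # ['neo4j']
```

Because relationships are stored explicitly, multi-hop queries ("who do alice's friends use?") are traversals rather than repeated table joins.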