Big Data
Big Data ecosystem
- Apache Hadoop ecosystem
- Apache Spark ecosystem
- Hive
- Spark Cluster Management
- Hive on Spark vs Spark with Hive (Metastore)
- Hadoop distributions
- Hadoop file formats
- NoSQL database systems
- Types of NoSQL databases
-
Apache Hadoop ecosystem
- Apache Hadoop: A platform for distributed storage and distributed processing of large data sets.
- HDFS (Hadoop Distributed File System): Data storage system (storage layer).
- YARN (Yet Another Resource Negotiator): Resource management and job scheduling/monitoring.
- MapReduce: A YARN-based system for parallel processing of large data sets (execution/processing engine).
- Apache Tez: A framework for YARN-based data processing applications in Hadoop (execution/processing engine).
- Apache Spark: A unified analytics engine for large-scale data processing (execution/processing engine).
- Apache Hive: Analytical SQL on Hadoop (Hive allows executing SQL-Like queries on data stored in HDFS or HBase).
- HiveQL (HQL): Hive query language (SQL-Like language).
- Apache Impala: Analytical SQL on Hadoop (Impala allows executing SQL-Like queries on Hadoop data stored in HDFS or HBase).
- Apache HBase: Column-oriented distributed database system (NoSQL database).
- Apache Mahout: A scalable machine learning and data mining library.
- Apache Flume: (unstructured/semi-structured data) Collecting, aggregating, and moving large amounts of log data (log collector).
- Apache Sqoop: (structured data) Transferring bulk data between Apache Hadoop and structured datastores (e.g., relational databases).
- Apache Oozie: Workflow Scheduler for Hadoop (scheduling).
- Apache Pig: A platform for analyzing large data sets using a high-level scripting language (Pig Latin).
- R Connectors: Using R with Hadoop to apply algorithms (statistical computations) on data sets stored in HDFS.
- Apache Ambari: Hadoop cluster management console (provisioning, managing, and monitoring Hadoop clusters).
- Apache ZooKeeper: A high-performance coordination service for distributed applications (configuration, coordination, ...).
- Apache Kafka: A distributed streaming platform.
- Apache Solr: An enterprise search platform built on Apache Lucene.
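MapReduce (listed above) processes data in three conceptual phases: map, shuffle/sort, and reduce. A minimal pure-Python sketch of the classic word-count job can make the flow concrete; the function names here are illustrative only, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle/sort: group all emitted values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

In a real Hadoop job, the map and reduce functions run in parallel across the cluster, and the shuffle moves data between nodes; the data flow, however, is exactly the one sketched here.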
-
Apache Spark ecosystem
- Apache Spark: A unified analytics engine for large-scale data processing.
- Spark Core API (supported languages: Java, Scala, Python, R).
- Spark Core Engine: The execution/processing engine that provides in-memory computing capabilities (vs MapReduce, Tez).
- Spark SQL (+ DataFrames): A module for structured data processing using interactive SQL queries (vs Apache Hive).
- Spark MLlib: A scalable machine learning library (vs Apache Mahout).
- Spark Streaming: Streaming analytics across both streaming and historical data (vs Apache Storm).
- Spark GraphX: Graph computation engine.
- SparkR (R on Spark): R package that provides a light-weight frontend to use Apache Spark from R.
- Spark Shell: spark-shell (Scala), pyspark (Python), and sparkR (R).
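A defining trait of the Spark Core API is that transformations (map, filter, ...) are lazy and only run when an action (collect, count, ...) is called. A toy single-machine analogue in pure Python (the `MiniRDD` class is hypothetical, not part of Spark) illustrates the idea:

```python
class MiniRDD:
    """A toy, single-machine analogue of Spark's RDD API (hypothetical class):
    transformations are lazy; actions force evaluation of the whole chain."""

    def __init__(self, data):
        self._data = data  # any iterable; nothing is computed yet

    def map(self, fn):
        # Lazy transformation: wraps a generator, computes nothing yet.
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces evaluation of the accumulated pipeline.
        return list(self._data)

rdd = MiniRDD(range(10))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

In real Spark the same chain would be distributed across executors and the lazy plan would be optimized before running; the programming model, building a pipeline and then triggering it with an action, is the same.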
-
Hive
Hive Architecture
:
- Hive is a data warehouse service built on top of Apache Hadoop.
- Hive simplifies the definition of structure for unstructured data.
- Hive allows the creation of queries with an SQL-like language (HQL).
- Hive uses a relational database to hold schema information (metastore).
- Hive interacts with HDFS, which stores the unstructured data.
- Hive uses MapReduce, Tez, or Spark (execution/processing engine) to transform unstructured data (using HQL queries) into structured data (conforms to the defined schema).
- Metastore: stores data schema information (provides a structure for the unstructured data).
- Hive QL: query definition and processing.
- Hive QL: uses MapReduce, Tez, or Spark (execution/processing engine) to interact with the data storage (HDFS or HBase).
- Hive CLI: a command-line tool for running queries against Hive Server.
- Hive Web UI: provides access to Hive configuration settings, local logs, metrics, and information about active sessions and queries.
Hive QL can be extended with
:
- User-Defined Functions (UDF).
- User-Defined Aggregate Functions (UDAF).
- User-Defined Table-Generating Functions (UDTF).
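Besides Java UDFs, Hive can also be extended with external scripts via `SELECT TRANSFORM ... USING 'script'`: the script reads tab-separated rows on stdin and writes rows to stdout. A sketch of such a script (the transformation itself, emitting each name with its length, is an arbitrary example):

```python
import sys

def transform(lines):
    # Read tab-separated input rows; emit "name<TAB>length" output rows.
    out = []
    for line in lines:
        name = line.rstrip("\n")
        out.append(f"{name}\t{len(name)}")
    return out

if __name__ == "__main__":
    for row in transform(sys.stdin):
        print(row)
```

On the Hive side this would be invoked with something like `SELECT TRANSFORM(name) USING 'python script.py' AS (name, name_len) FROM t;` (table and column names here are hypothetical).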
Hive tables
:
- Schema: metadata (tables definition, ...) stored in a relational database.
- Data: files stored in HDFS.
-
Spark Cluster Management
Spark Local Mode
: used to run Spark applications on a local machine using a single JVM.
Spark Standalone Mode
: used to run Spark applications on a Spark cluster using Spark's built-in resource manager.
YARN
: Resource management and job scheduling/monitoring (Hadoop).
Apache Mesos
: Resource management and scheduling across entire datacenter and cloud environments (Hadoop, Spark, Kafka, Elasticsearch, ...).
-
Hive on Spark vs Spark with Hive (Metastore)
Hive on Spark
: Hive uses Spark as its execution engine instead of MapReduce or Tez (hive.execution.engine=spark); queries are written in HiveQL and submitted through Hive.
Spark with Hive (Metastore)
: Spark SQL connects to the Hive Metastore to read and write Hive tables; queries are written and executed with Spark (Spark SQL / DataFrames).
-
Hadoop distributions
- Hortonworks Hadoop Distribution (HDP: Hortonworks Data Platform).
- Cloudera CDH Hadoop Distribution.
- MapR Hadoop Distribution.
- Amazon Web Services Elastic MapReduce Hadoop Distribution (AWS EMR).
-
Hadoop file formats
- Text (JSON, CSV formats): stores data in lines (each line is a record). Lines are terminated by a newline character "\n".
- Apache Avro: stores data in a row-based format.
- Apache Parquet: stores data in a column-based format.
- Apache ORC (Optimized Row Columnar): stores data in a column-based format.
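The row-based vs column-based distinction above is the key trade-off between these formats. A small pure-Python sketch of the two layouts (the dataset is made up for illustration) shows why columnar formats like Parquet and ORC suit analytics: a query over one column scans only that column's contiguous values.

```python
# Hypothetical mini-dataset: three records with two fields each.
records = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": 25},
    {"name": "carol", "age": 35},
]

# Row-based layout (Avro-like): whole records stored one after another.
row_layout = [(r["name"], r["age"]) for r in records]

# Column-based layout (Parquet/ORC-like): each column stored contiguously.
col_layout = {
    "name": [r["name"] for r in records],
    "age": [r["age"] for r in records],
}

# An analytical query over one column (e.g. average age) only needs to
# scan a single contiguous list in the columnar layout.
avg_age = sum(col_layout["age"]) / len(col_layout["age"])
print(avg_age)  # 30.0
```

Row-based formats, conversely, are a better fit for write-heavy workloads and for reading whole records at a time.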
-
NoSQL database systems
Key-value databases
:
- Redis: an in-memory data structure store, used as a database, cache, and message broker.
- MemcacheDB: a persistent key-value store that speaks the memcached protocol (backed by Berkeley DB).
Column-oriented databases
:
- Apache HBase: a distributed, scalable, big data store (modeled after Google Bigtable).
- Apache Cassandra: a distributed, column store, NoSQL database management system.
Document-oriented databases
:
- MongoDB: a distributed, document store, NoSQL database management system.
- Apache CouchDB: a document store, NoSQL database management system.
Graph databases
:
- Neo4j: a graph database management system.
- OrientDB: a distributed multi-model NoSQL database with a graph database engine.
-
Types of NoSQL databases
Key-value database
: stores key-value pairs (similar to a hash table data structure). Useful for caching or queueing.
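The hash-table semantics behind key-value stores (and the caching use case) can be sketched in a few lines of Python; the `KVCache` class below is a hypothetical toy, not an API of Redis or memcached:

```python
import time

class KVCache:
    """A toy in-memory key-value store with optional per-key expiry,
    illustrating hash-table-based caching (hypothetical class)."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        # ttl is a lifetime in seconds; None means the key never expires.
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._store[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._store[key]  # lazy expiry on read, as many caches do
            return default
        return value

cache = KVCache()
cache.put("session:42", {"user": "alice"})
print(cache.get("session:42"))  # {'user': 'alice'}
```

Lookups and inserts are O(1) on average, which is why this model works so well for caching and queueing workloads.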
Column-oriented database
: stores data tables by column rather than by row, so all values of one column (col1-row1, col1-row2, col1-row3, ...) are stored contiguously.
Document-oriented database
: stores data in "schema-less" documents (JSON, XML). Documents store key-value pairs (a value can be a simple type, an array, an object, ...). Useful for nested information.
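A short example of why documents suit nested information: one JSON document can hold a customer and their order items together, so a query walks the structure directly instead of joining tables (the document below is made up for illustration):

```python
import json

# A "schema-less" document: nested objects and arrays, no fixed table layout.
doc = json.loads("""
{
  "id": "order-1001",
  "customer": {"name": "alice", "email": "alice@example.com"},
  "items": [
    {"sku": "A1", "qty": 2, "price": 9.5},
    {"sku": "B2", "qty": 1, "price": 40.0}
  ]
}
""")

# Nested information is read by walking the structure, with no joins.
total = sum(item["qty"] * item["price"] for item in doc["items"])
print(doc["customer"]["name"], total)  # alice 59.0
```

A relational design would split this across customer, order, and order-item tables; a document store keeps each aggregate in one place.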
Graph database
: stores data as nodes, relationships (edges), and properties. Useful for modeling classification and handling relational information.
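The nodes/relationships/properties model can be sketched with an adjacency structure in Python; the data and the `neighbors` helper below are hypothetical, and real graph databases (e.g. Neo4j with Cypher) add indexing and a query language on top of exactly this kind of traversal:

```python
from collections import defaultdict

# A toy property graph: nodes with properties and typed, directed edges.
nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "neo4j": {"kind": "database"},
}
edges = defaultdict(list)  # node -> [(relationship, neighbor)]
edges["alice"].append(("KNOWS", "bob"))
edges["alice"].append(("USES", "neo4j"))
edges["bob"].append(("USES", "neo4j"))

def neighbors(node, rel):
    # Traverse edges of one relationship type outgoing from a node.
    return [dst for r, dst in edges[node] if r == rel]

print(neighbors("alice", "KNOWS"))  # ['bob']
print(neighbors("alice", "USES"))   # ['neo4j']
```

Because relationships are stored explicitly, multi-hop queries ("who do alice's friends use?") are traversals rather than repeated table joins.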