November 28, 2017

Big Data Review

Big data refers to data that would typically be too expensive to store, manage, and analyze using traditional database systems.

Usually, traditional systems are cost-inefficient for storing unstructured data (such as images, text, and video), accommodating “high-velocity” (real-time) data, or scaling to support very large (petabyte-scale) data volumes.

e.g., real-time Twitter data, purchase records


The problems with large data sets in R:

  • R reads the entire data set into RAM at once.
  • R objects live entirely in memory.
  • A 32-bit OS and system architecture can address at most 2^32 bytes = 4 GB of memory, and in practice R typically throws an error around 2 GB.
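The memory limit above is why big-data tools stream or chunk records instead of loading a whole data set at once. A minimal sketch of the streaming idea in Python (the column name and data are hypothetical; only one row is held in memory at a time):

```python
import csv
import io

def streaming_sum(lines, column):
    """Accumulate a column sum row by row, never holding the file in RAM."""
    total = 0.0
    reader = csv.DictReader(lines)
    for row in reader:          # each iteration reads one row from the stream
        total += float(row[column])
    return total

# Simulated file; in practice this would be an open file handle.
data = io.StringIO("amount\n10\n20\n30\n")
print(streaming_sum(data, "amount"))  # 60.0
```

The same pattern, scaled across many machines, is what distributed frameworks such as Hadoop automate.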

Big Data Solution

  • Distributed systems:
    • Hadoop
    • Spark (in-memory data processing)
    • R integrated with Hadoop (RHadoop)
    • Amazon services (Amazon S3 + Amazon EMR)

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Summary: Hadoop is an open-source software framework written in Java for distributed computing (e.g., MapReduce) and distributed storage (HDFS).

Components of Hadoop

  • Hadoop Common: common utilities that support the other modules
  • Hadoop Distributed File System (HDFS): distributed storage for large data sets
  • Hadoop YARN: job scheduling and cluster resource management
  • Hadoop MapReduce: a framework for processing large data sets in a distributed system
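MapReduce splits a job into a map phase (emit key-value pairs from each input record), a shuffle (group pairs by key), and a reduce phase (aggregate each key's values). A toy single-machine word count in Python, illustrating the model rather than the Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values into a final count.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data beats ideas"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 2, 'data': 2, 'ideas': 2, 'beats': 1}
```

In Hadoop, the map and reduce functions run in parallel on many worker nodes over HDFS blocks, but the programming model is exactly this pair of functions.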

Hadoop Workflow

  • To start a Hadoop project, users must define and set up a cluster.

  • Hadoop uses a master-slave model to track jobs.