Hadoop The Definitive Guide [Book] - Study Notes

Chap-1- Meet Hadoop

  • The requirements that led to Hadoop, and its adoption at Yahoo.
  • A framework that can scale to the web.
  • Map and Reduce activities, and features like data locality.
  • Can be applied with a variety of algorithms
  • Processing more data can beat having a better algorithm


Chap-2 - MapReduce

  • The Mapper Java class and the Reducer Java class
  • The Job Java class
  • Jobtracker and tasktracker
  • Hadoop divides the input into fixed-size pieces called input splits, or just splits
  • Map tasks write their intermediate output to local disk rather than HDFS, because it can be discarded once the job is complete.
  • Outputs of reduce tasks are stored in HDFS
  • A combiner function can be run on the map output; the combiner function's output then forms the input to the reduce function
  • Hadoop Streaming lets you write map and reduce functions in languages other than Java, using standard input and output as the interface
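
The flow in the bullets above (splits, map, combiner, reduce) can be sketched in plain Java. This is a minimal in-memory simulation, not the Hadoop API: the class MiniMapReduce, its method names, and the "year,temperature" record format are all hypothetical, chosen only for illustration. It uses "maximum" as the reduce function because a maximum can safely be applied in stages, which is what makes it usable as a combiner.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of map -> combine -> shuffle -> reduce.
// NOT the Hadoop API; every name here is hypothetical.
class MiniMapReduce {

    // Map: turn one record "year,temperature" into a (year, temperature) pair.
    static Map.Entry<String, Integer> map(String record) {
        String[] parts = record.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Reduce: maximum of all values for one key. Also used as the combiner.
    static int reduce(List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Group (key, value) pairs by key, then apply reduce to each group.
    static Map<String, Integer> groupAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    // One map task: map every record of a split, then combine locally.
    static Map<String, Integer> runMapTask(List<String> split) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String record : split) {
            mapped.add(map(record));
        }
        return groupAndReduce(mapped); // combiner step
    }

    // Whole job: one map task per split, then shuffle and reduce.
    static Map<String, Integer> runJob(List<List<String>> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            shuffled.addAll(runMapTask(split).entrySet());
        }
        return groupAndReduce(shuffled);
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("1950,0", "1950,20", "1950,10"),
            Arrays.asList("1950,25", "1950,15", "1949,111"));
        System.out.println(runJob(splits)); // prints {1949=111, 1950=25}
    }
}
```

With the combiner, the first split sends only (1950, 20) to the reducer instead of three pairs. The same trick would give wrong answers for a function such as the mean, so a combiner is only safe when applying the function in stages does not change the result.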


Chap-3 - The Hadoop Distributed Filesystem

  • Fault-tolerant solution: the same data is written (replicated) to multiple places.
  • Filesystems that manage the storage across a network of machines are called distributed filesystems.
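  • Blocks - the block size is the minimum amount of data that can be read or written (for HDFS it is 64 MB by default; 128 MB from Hadoop 2 onward)
  • Namenodes and Datanodes - An HDFS cluster follows a master-worker pattern: one namenode (master) and a number of datanodes (workers). The namenode holds all the filesystem metadata and the datanodes store the blocks. The namenode also knows which datanodes hold each block, but it does not store these locations persistently: they are reconstructed from the datanodes at start time.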
  • HDFS federation
  • HDFS High-availability
  • On large clusters the time it takes for a namenode to start from cold can be 30 minutes or more
  • Fencing and failover - When the active namenode fails, an entity called the 'failover controller' switches over to the standby namenode. The default implementation uses ZooKeeper to ensure that only one namenode is active.
  • Graceful failover - triggered by an admin, for example for routine maintenance
  • Ungraceful failover - in this case it is impossible to be sure that the failed namenode has completely stopped running, so a mechanism called fencing is used to stop it from doing damage. In the worst case it does 'shoot the other node in the head' (STONITH): a forced shutdown of the host.
  • File Operations in HDFS
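  • There are Java APIs for all operations such as create, delete, and sync
  • Use Flume and Sqoop to move data into HDFS
  • Copy in parallel with distcp
  • Hadoop archives (HAR files) pack many files into HDFS blocks more efficiently, which reduces namenode memory usage. They are not compressed, and they can be used as input to MapReduce.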
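
The block and replication notes above lend themselves to a little arithmetic. The sketch below is plain Java with hypothetical names (BlockMath and its methods are not part of Hadoop). It assumes the 64 MB block size mentioned in these notes and a replication factor of 3, which is the usual HDFS default but is an assumption here because the notes do not state it.

```java
// Back-of-the-envelope block arithmetic for a file stored in HDFS.
// Hypothetical helper class, not part of Hadoop.
class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks = ceil(fileBytes / blockBytes); an empty file has none.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // The last block only holds the remainder, so it can be smaller than blockBytes.
    static long lastBlockBytes(long fileBytes, long blockBytes) {
        if (fileBytes == 0) {
            return 0;
        }
        long remainder = fileBytes % blockBytes;
        return remainder == 0 ? blockBytes : remainder;
    }

    // Raw disk usage across the cluster when every block is replicated.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 200 * MB;   // example file size (made up)
        long block = 64 * MB;   // block size from these notes
        int replication = 3;    // assumed replication factor

        System.out.println(blockCount(file, block));           // 4
        System.out.println(lastBlockBytes(file, block) / MB);  // 8
        System.out.println(rawBytes(file, replication) / MB);  // 600
    }
}
```

A 200 MB file therefore needs four blocks, the last of which holds only 8 MB; a file smaller than one block does not occupy a full block's worth of disk.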


Chap-4 - Hadoop I/O


  • Compression
  • Reading compressed data
  • Serialization is natively implemented in Hadoop (the Writable format) for better performance
  • Apache Avro is a project to do this in an improved way and support multiple languages. It differs from Google Protocol Buffers and Thrift, for example in making code generation optional.
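
To make the serialization bullet concrete, here is a minimal sketch of the pattern Hadoop's native format follows: an object writes its fields to a DataOutput and reads them back from a DataInput. The interface MiniWritable is a local stand-in that mirrors the two methods of Hadoop's Writable interface so the example runs with only the JDK; YearTemperature and WritableDemo are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in with the same two methods as Hadoop's Writable interface.
interface MiniWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical record type: two ints, serialized as 8 bytes with no field names.
class YearTemperature implements MiniWritable {
    int year;
    int temperature;

    YearTemperature() { }

    YearTemperature(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        year = in.readInt();
        temperature = in.readInt();
    }
}

class WritableDemo {
    // Serialize any MiniWritable into a byte array.
    static byte[] serialize(MiniWritable writable) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writable.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Fill an existing object from bytes, so the object can be reused.
    static void deserialize(MiniWritable writable, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            writable.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new YearTemperature(1950, 22));
        System.out.println(data.length); // 8: two 4-byte ints

        YearTemperature copy = new YearTemperature();
        deserialize(copy, data);
        System.out.println(copy.year + " " + copy.temperature); // 1950 22
    }
}
```

The compactness comes from writing only raw field values: nothing in the bytes describes the structure, so reader and writer have to agree on it in code. Avro takes a different route by describing the data with a schema.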


Chapter - 5 - Developing a MapReduce Application

Setting up the Environment
- The Configuration API reads properties from XML resource files
- Writing unit tests with MRUnit
- Running locally on a small dataset
- Using the Tool interface, write a driver (Java file) to run our MapReduce job
- Testing the driver
- Running on a cluster
- Packaging the job as a JAR
- Launching a job by running the driver
- Debugging a job
- Running multiple jobs in a particular flow
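
As a rough illustration of the first bullet, the sketch below reads Hadoop-style XML property files and layers them: a resource added later overrides one added earlier, except for properties marked final. MiniConfiguration is a hypothetical plain-JDK stand-in, not Hadoop's Configuration class. Its get and getInt methods are named after the real ones, but addResourceXml is invented here and takes the XML text directly, and the property names and values are made up.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for a layered XML configuration; not Hadoop's class.
class MiniConfiguration {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> finalNames = new HashSet<>();

    // Parse one <configuration> document and merge its properties.
    void addResourceXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = childText(property, "name");
                if (name == null || finalNames.contains(name)) {
                    continue; // final properties cannot be overridden later
                }
                values.put(name, childText(property, "value"));
                if ("true".equals(childText(property, "final"))) {
                    finalNames.add(name);
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
    }

    // Text of the first child element with the given tag, or null if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    String get(String name) {
        return values.get(name);
    }

    int getInt(String name, int defaultValue) {
        String value = values.get(name);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        String defaults = "<configuration>"
            + "<property><name>size</name><value>10</value></property>"
            + "<property><name>weight</name><value>heavy</value><final>true</final></property>"
            + "</configuration>";
        String overrides = "<configuration>"
            + "<property><name>size</name><value>12</value></property>"
            + "<property><name>weight</name><value>light</value></property>"
            + "</configuration>";

        MiniConfiguration conf = new MiniConfiguration();
        conf.addResourceXml(defaults);
        conf.addResourceXml(overrides);
        System.out.println(conf.getInt("size", 0)); // 12: later resource wins
        System.out.println(conf.get("weight"));     // heavy: marked final
    }
}
```

Layering like this is what lets a driver start from cluster-wide defaults and override only the few settings a particular job needs.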

Chapter 6 - How MapReduce Works

Chapter 9 - Chapter 15

Setting Up Hadoop Cluster 
- Manually 
- Using a CDH distribution (See Appendix)

Hadoop Tools : 
  • Pig: Aims to provide richer data structures and transformations than plain map and reduce can offer
  • Hive: Made to run queries for people who are strong in SQL but weaker in Java
  • HBase: Distributed, column-oriented database built on top of HDFS. It is built to scale.
  • ZooKeeper: A distributed coordination service that provides tools for safely handling partial failures in communication between nodes.
  • Sqoop: Transfers data between external structured data stores (such as relational databases) and Hadoop. This is focused on data movement.
