Chap-1 - Meet Hadoop
- The requirements that led to Hadoop, and its adoption at Yahoo!
- A framework that can scale to the web.
- The map and reduce phases, and features such as data locality.
- Can be applied to a wide variety of algorithms.
- Processing huge amounts of data with simple algorithms can beat better algorithms on less data.
Chap-2 - MapReduce
- The Mapper and Reducer Java classes
- The Job Java class, used to configure and submit a job (see the word-count sketch after this list)
- JobTracker and TaskTracker
- Hadoop divides the input into fixed-size pieces called input splits, or just splits
- Map tasks write their intermediate output to local disk, not to HDFS, because it is transient and can be discarded once the job completes
- The output of reduce tasks is stored in HDFS
- A combiner function can be run on the map output, and the combiner's output then forms the input to the reduce function
- Hadoop Streaming lets you write the map and reduce functions in languages other than Java, using standard input and output
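To make the Mapper/Reducer/Job bullets above concrete, here is a minimal word-count sketch using the org.apache.hadoop.mapreduce API. Class names such as WordCount and WordCountMapper are my own for illustration, not from the book; the reducer is also registered as the combiner.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: called once per input record; emits a (word, 1) pair per token.
  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: called once per key with all its values; sums the counts.
  public static class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configures the Job and waits for it to finish.
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // combiner runs on map output
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}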
Chap-3 - The Hadoop Distributed Filesystem
- Fault tolerant by design: the same data is replicated on multiple nodes.
- Filesystems that manage storage across a network of machines are called distributed filesystems.
- Blocks - the block size is the minimum amount of data a filesystem can read or write in one unit (64 MB by default in HDFS)
- Namenodes and datanodes - an HDFS cluster has a master-worker pattern: one namenode (master) and a number of datanodes (workers). The namenode holds all the filesystem metadata; the datanodes hold the blocks. The block-to-datanode mapping is not persisted by the namenode; it is reconstructed from datanode reports at startup.
- HDFS federation
- HDFS high availability
- On large clusters the time it takes for a namenode to start from cold can be up to 30 minutes
- Fencing and failover - when the active namenode fails, an entity called the failover controller switches to the standby namenode. ZooKeeper is used to ensure that only one namenode is active at a time.
- Graceful failover - triggered by an admin
- Ungraceful failover - to make sure the previously active namenode has really stopped running, a mechanism called fencing is used. In the worst case it resorts to 'shoot the other node in the head' (STONITH): a forced shutdown.
- File operations in HDFS
- There are Java APIs (the FileSystem class) for all the usual operations such as create, read, delete and sync (see the sketch after this list)
- Use Flume and Sqoop to move data into HDFS
- Copy data in parallel with distcp
- Hadoop Archives (HAR files) pack many small files into fewer HDFS blocks and can be used as input to MapReduce
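A minimal sketch of the Java FileSystem API mentioned above: it opens a file in HDFS from a URI given on the command line and copies it to standard output. The class name HdfsCat is my own; the configuration is picked up from the usual XML resources on the classpath.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Reads a file from HDFS and copies it to standard output.
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                      // e.g. hdfs://namenode/user/foo/bar.txt
    Configuration conf = new Configuration();  // loads core-site.xml etc. from the classpath
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));             // returns an FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}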
Chap-4 - I/O
- Compression
- Reading compressed data (see the decompression sketch after this list)
- Serialization is implemented natively in Hadoop (the Writable interface) for better performance
- Apache Avro is a project that does this in a language-neutral way and supports multiple languages; it is different from Google Protocol Buffers and Thrift
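For reading compressed data, a sketch using CompressionCodecFactory: it infers the codec from the file extension (for example GzipCodec for .gz) and decompresses the file back into HDFS. The class name FileDecompressor and the paths are illustrative.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Decompresses a compressed file in HDFS, choosing the codec from the file extension.
public class FileDecompressor {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                      // e.g. /data/input.gz
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path inputPath = new Path(uri);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }

    // Strip the compression suffix (e.g. ".gz") to get the output file name.
    String outputUri =
        CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}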
Chapter 5 - Developing a MapReduce Application
Setting up the environment
- The Configuration API, used to read XML resource files (core-site.xml and so on)
- Writing unit tests with MRUnit (see the test sketch after this list)
- Running locally on a small data set
- Using the Tool interface to write a driver that runs our MapReduce job (see the driver sketch after this list)
- Testing the driver
- Running on a cluster
- Packaging a jar
- Launching a job by running the driver
- Debugging a job
- Running multiple jobs in a particular workflow
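A sketch of an MRUnit test for the hypothetical WordCountMapper from the Chapter 2 sketch: MRUnit drives the mapper in-process and checks the emitted key-value pairs, so no cluster is needed.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

// Unit test for the word-count mapper using MRUnit's MapDriver.
public class WordCountMapperTest {

  @Test
  public void splitsLineIntoWords() throws IOException {
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordCount.WordCountMapper())
        .withInput(new LongWritable(0), new Text("hello world"))
        .withOutput(new Text("hello"), new IntWritable(1))
        .withOutput(new Text("world"), new IntWritable(1))
        .runTest();
  }
}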
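And a sketch of a driver built on the Tool interface, again reusing the hypothetical word-count classes. ToolRunner parses the standard generic options (-conf, -D, -fs, -jt) and puts them into the Configuration returned by getConf(), so the same driver runs locally or on a cluster depending on the configuration passed in.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Driver implementing Tool, so ToolRunner handles the generic Hadoop options.
public class WordCountDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>%n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Configuration conf = getConf();  // already populated from xml resources and -D options
    Job job = new Job(conf, "word count");
    job.setJarByClass(getClass());
    job.setMapperClass(WordCount.WordCountMapper.class);
    job.setCombinerClass(WordCount.WordCountReducer.class);
    job.setReducerClass(WordCount.WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new WordCountDriver(), args));
  }
}

After packaging it into a jar, it could be launched with something like: hadoop jar wordcount.jar WordCountDriver -conf conf/local.xml input/ output/ (the jar and config file names here are illustrative).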
Chapter 6 - How MapReduce Works
Chapters 9 - 15
Setting Up a Hadoop Cluster
- Manually
- Using a CDH distribution (See Appendix)
Hadoop Tools:
- Pig: provides richer data structures and transformations than plain map and reduce
- Hive: lets people who are stronger in SQL than in Java run queries over data in Hadoop
- HBase: a distributed, column-oriented database built on top of HDFS, designed to scale
- ZooKeeper: a coordination service built to help distributed applications cope with partial failures in communication between nodes
- Sqoop: transfers data between Hadoop and external applications such as relational databases; it is focused on data movement