Wednesday, 25 November 2015

Datamining with Rattle for R - My Talk at GDG, Trivandrum 2015

Recently, I got a fantastic opportunity to be a speaker at Google Developers Fest 2015, Trivandrum. It was a wonderful topic I enjoyed learning and sharing.

Data Mining with Rattle for R:

And here is me on stage:

With the rest of the speakers/co-ordinators:

Wednesday, 6 May 2015

Learn Awesome Mnemonic Hacks and Help Build New Ones @

Mnemonics are cool techniques that are immensely helpful for remembering facts. Be it
anything from the spelling of English words to quantum physics, you can make a mnemonic to learn it.

For example: 

'SEPARATE' is one of the most commonly misspelled words in the English language.

"There is A RAT in sepARATe" - visualizing and remembering this sentence will help anyone recall the spelling of separate quickly without mistake. 

Another example is the spelling of 'necessary'. Just remember the sentence 'Shirts have 1 collar and 2 sleeves' to avoid miscounting the 'c's and 's's.

It's a clever idea to keep these tricks up your sleeve to impress your friends or teachers with your new exceptional memory powers.

A trick to recall the first 10 elements of the periodic table is to just remember: "Henry Hester Likes Beer But CanNot Obtain Food Now"

Elements: Hydrogen, Helium, Lithium, Beryllium, Boron, Carbon, Nitrogen, Oxygen, Fluorine, Neon

Every student uses such tricks every now and then, but they cannot spend too much time making such short-codes (known as mnemonics) for all topics (because they have loads to study already).

If all of them had a single platform to share these tricks, they would benefit greatly from it and would start thinking creatively about their studies. We can reduce their burden and bring a smile to their faces while studying.

That is why I've built a free platform for students to share and discover mnemonics.

Section-wise quick links :
Check it out right now and start sharing your mnemonics! Do let me know your feedback here.

Data Mining & Analytics with R : Running R Scripts and Data Mining Techniques - Day 2

Warning: These are my messy study notes. Far more legible notes can be found here

1. A Tour Thru Rattle

Transform Tab (nowhere near the full power of the underlying R)
Data Mining Tabs
- Cluster, Associate Model
Log Tab 
- Capture the corresponding R command
  • Working from Left to Right on Tabs 
  • Remember to Click Execute Button
  • 'Save' -> Projects save the current state, all models etc.
  • 'Open' Projects can be restored at a later time
  • You can even load it back to R 

2. First R Program

Load rattle and ggplot2

library(rattle)   # Provides the weather dataset
library(ggplot2)  # Provides the ggplot() function
ggplot -> Grammar of Graphics : Just like English grammar or the grammar of a computer language. A result of Hadley's PhD. Look him up to learn more details.

Then produce a plot using ggplot()
# - tips on elegantly writing repeated code
ds <- weather 

1. ggplot(ds, aes(x=MaxTemp, y=MinTemp)) + geom_point()

aes - aesthetics ( x axis and y axis, colors etc)
geom_point - you want  points as the geometric indicators 

2. ggplot(ds, aes(x=MaxTemp, y=MinTemp)) + geom_point() + ggtitle("Daily Temp Obs")

3. ggplot(ds, aes(x=MaxTemp, y=MinTemp, colour=RainTomorrow)) + geom_point() + ggtitle("Daily Temp Obs")


Google 'R Gallery' to find a lot of graph implementations

3. DMT : Clustering (Classification)

Cluster Analysis: 
A collection of observations .
Has been done for centuries : Classifying people, animals, mammals etc.
I cannot understand each one of you scientifically without any historical background

Cluster - KMeans (Number of clusters :2)

Cluster - KMeans (New csv file audit.csv)

The ideal number of clusters is 12. This is how to choose it

Google  : Curse of dimensionality ( use ewkm for clustering if you have lot of variables)

If you want to do clustering on categoric values eg. male, female. Use:
Transform -> Recode-> Indicator Variable
Transform -> Recode -> As Numeric
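
To make the two recoding options concrete, here is a minimal sketch in Python of what they do conceptually. It is not how Rattle implements them; the function names and the sample `gender` column are made up for illustration.

```python
def indicator_variables(values):
    """One 0/1 column per category (the idea behind 'Indicator Variable')."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


def as_numeric(values):
    """One integer code per category, assigned in alphabetical order
    (the idea behind 'As Numeric')."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Hypothetical categoric column
gender = ["male", "female", "female", "male"]

print(indicator_variables(gender))
print(as_numeric(gender))
```

Indicator variables avoid implying an order between categories, while integer codes are more compact but make the clustering distance treat the codes as real quantities.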

Difference between two cars 

Can imagine number of pistons as their numeric value
Or some parameters that indicate luxury

4. DMT : Association Rule Mining (Recommendation)

It's what Amazon did for suggesting  books. 
The beer and baby diaper example

Link analysis
Market basket analysis
Cross Marketing

Math n CS -> High Distinction
[91%, 75%] [support, confidence]

Gladiator n Patriot -> Sixth Sense
[0.1%, 90%]

Statins n Peritonitis -> Chronic Renal Failure
[0.1%, 32%]

Gladiator n Patriot -> Sixth Sense
[0.1% - support, 90% - confidence]
( association analysis)
support ->
out of 1000 carts, 0.1% contain all 3 of those movies,
i.e. 1 cart in 1000 has these in it

if they have watched gladiator and patriot, 90% of the time they have watched Sixth sense.

lift ->
confidence of the rule / support of the right-hand side (here, Sixth Sense on its own) : The higher the lift the better
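
The three measures can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the five baskets are made up, and lift is computed with the usual definition (confidence of the rule divided by the support of the right-hand side).

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= set(basket)) / len(baskets)


def confidence(baskets, lhs, rhs):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)


def lift(baskets, lhs, rhs):
    """How much more often rhs appears with lhs than it does overall."""
    return confidence(baskets, lhs, rhs) / support(baskets, rhs)


# Hypothetical shopping carts
baskets = [
    {"Gladiator", "Patriot", "Sixth Sense"},
    {"Gladiator", "Patriot", "Sixth Sense"},
    {"Gladiator", "Patriot"},
    {"Sixth Sense"},
    {"Gladiator"},
]

lhs, rhs = {"Gladiator", "Patriot"}, {"Sixth Sense"}
print("support   :", support(baskets, lhs | rhs))
print("confidence:", confidence(baskets, lhs, rhs))
print("lift      :", lift(baskets, lhs, rhs))
```

A lift above 1 means the left-hand side makes the right-hand side more likely than it is in general; a lift of exactly 1 means the two are independent.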

Health Insurance Commission 
6.8 million records x 120 attributes (3.5 GB) 
12 months preprocessing then 2 weeks data mining

Goal : find associations between tests

cmin = min confidence
smin = min support

Hands On

5. DMT: Predictive Data Mining: Decision Trees (Prediction)

Often referred to as supervised learning ( we already have a decision)

Like deciding if we should lend money to a person
-> We will have a model that can be used to arrive at the decision. The model would have been built by
How do we find a good model?
There can be infinite number of models 
1. Write down infinite number of models ( we will take infinite time to search) [2]
2. Measure each model and find the best one

[2] We use heuristic search to see how good a model is

In the room example: Whether a person would be wearing glasses?

  • 30% of females are wearing glasses
  • 60% of males are wearing glasses
  • (60% is not accurate enough) so, we will further divide by age:
  • People above age 42 have an 80% chance of wearing glasses

If this is not effective enough, the algorithm starts taking other parameters and tries to get better models

But how do I choose the best variables to reduce search time and get the best model?

Formula for entropy (disorder): H = -Σ p_i log2(p_i), where p_i is the proportion of observations in class i

Induction Tree - Greedy Algorithm (Heuristics - Goodness)# Important

  • Partition by every variable gender, age, height, shirt colour, shoe colour
  • Check which variable maximise reduction in entropy
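
The greedy step above can be sketched in Python as follows. This is an illustration of the idea only, not of how rpart is implemented; the six people in `rows` and the variable names are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Disorder of a list of class labels: -sum(p * log2(p))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, variable, target):
    """Reduction in entropy of `target` after partitioning rows by `variable`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[variable], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_split(rows, variables, target):
    """Greedy choice: the variable that maximises the reduction in entropy."""
    return max(variables, key=lambda v: information_gain(rows, v, target))


# Hypothetical observations from the room
rows = [
    {"gender": "F", "over_42": True, "glasses": "yes"},
    {"gender": "F", "over_42": False, "glasses": "no"},
    {"gender": "M", "over_42": True, "glasses": "yes"},
    {"gender": "M", "over_42": False, "glasses": "no"},
    {"gender": "M", "over_42": True, "glasses": "yes"},
    {"gender": "F", "over_42": False, "glasses": "no"},
]

for variable in ("gender", "over_42"):
    print(variable, information_gain(rows, variable, "glasses"))
print("split on:", best_split(rows, ["gender", "over_42"], "glasses"))
```

A tree builder would then repeat the same choice inside each partition, which is the "recursive" part of recursive partitioning.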

Hands On Rattle

rpart - recursive partitioning 

The four possible outcomes (a false positive is a type 1 error, a false negative is a type 2 error):
true positive - predicted it will rain and it rains
false positive - predicted it will rain and it doesn't rain
true negative - predicted it won't rain and it doesn't rain
false negative - predicted it won't rain and it rains (I don't want this, I'll get wet)
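
Counting the four outcomes is straightforward. The sketch below is a hedged illustration with made-up predictions and observations; the function name and the "rain"/"dry" labels are my own choices.

```python
def confusion_counts(predicted, actual, positive="rain"):
    """Count true/false positives and negatives for paired predictions."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for p, a in zip(predicted, actual):
        if p == positive:
            # We said it would rain: right (TP) or wrong (FP)?
            counts["TP" if a == positive else "FP"] += 1
        else:
            # We said it would stay dry: wrong (FN) or right (TN)?
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical five days of forecasts versus what happened
predicted = ["rain", "rain", "dry", "dry", "dry"]
actual = ["rain", "dry", "dry", "rain", "dry"]

print(confusion_counts(predicted, actual))
```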

Chance of 'No' is 0.84 (i.e. 84%)

Very widely used - has been there for a long time. 

Democracy doesn't always give us the answer. If it did, the world would still be flat.

Tuesday, 5 May 2015

Data Mining & Analytics with R : Introduction to R, RStudio and Rattle - Day 1

This is a blog post on a workshop I attended : a three-day hands-on Workshop on Data Mining & Analytics with R at Technopark, Trivandrum on 5th May 2015

Taught by : 
Graham Williams - Senior Director and Data Scientist , Australia

Read About 
- Literate Programming 
- Literate Data Mining

We are not writing programs for the computer; they are written to share with other people.
Everything that we write should be written for others.
We should control the computer not vice versa. 

Introducing Data Science

Data Mining 
Started in around 1989.
Lego house -> model -> not real, but we get an idea. Data mining is all about building such models.

Descriptive Analytics - what happens , suggestions in Amazon
Diagnostic Analytics - explain why the above happened
Correlation and Causation - Find people who ended up in hospital after taking a particular drug
Predictive Analytics - machine learning and statistical models - predict what will happen in the future.
Prescriptive Analytics - decide on what to do and how to decide the best interaction

Science brings knowledge, philosophy brings wisdom
Everything begins as science and ends as an art.

The Data Roles 
Data Technician -> Data Analyst (Add value to data) -> Data Miner (Computer Scientist/Statistician, Machine Learning etc) -> Data Scientist (Ability to follow one’s intuitions to draw it all together)

Continued.. Read Day 2 Here. 

Friday, 20 March 2015

Hadoop The Definitive Guide [Book] - Study Notes

Chap-1- Meet Hadoop

  • Requirement and adoption at Yahoo.
  • A framework that can scale to the web.
  • Map and Reduce activity and features like data locality.
  • Can be applied with a variety of algorithms
  • Huge data processing can beat good algorithms

Chap-2 - MapReduce

  • The Mapper Java class and Reducer Java class
  • The Job Java class
  • Jobtracker and tasktracker
  • Hadoop divides the input into fixed-size input splits, or just splits
  • Map tasks write their intermediate output to local disk rather than HDFS, since it can be discarded once the job completes.
  • Outputs of Reduce tasks are stored in HDFS
  • A combiner function can be run on the map output, and the combiner function's output forms the input to the reduce function
  • Hadoop Streaming lets you write map and reduce functions in languages other than Java
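
The flow of splits, map, combiner and reduce can be mimicked in plain Python. This is a toy, single-process sketch and does not use any Hadoop API; the word-count job, the function names and the sample splits are all assumptions made for illustration.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def combine(pairs):
    """Combiner: pre-aggregate one map task's output before it is shuffled."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(pairs):
    """Shuffle: group all values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)


def run_job(splits):
    intermediate = []
    for split in splits:  # one map task per input split
        pairs = [kv for line in split for kv in map_fn(line)]
        intermediate.extend(combine(pairs))
    return dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())


# Two hypothetical input splits, each a list of lines
splits = [["hadoop map reduce", "map map"], ["reduce hadoop"]]
print(run_job(splits))
```

The combiner here is safe because addition is associative and commutative, so summing partial counts per split gives the same totals as summing everything in the reducer.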

Chap-3 - The Hadoop Distributed Filesystem

  • Fault tolerant solution. Same data written at multiple places.
  • Filesystems that manage the storage across a network of machines are called distributed filesystems.
  • Blocks - the block size is the minimum amount of data that can be read or written (for HDFS it's 64 MB by default in the edition I read; Hadoop 2 and later default to 128 MB)
  • Namenodes and Datanodes - An HDFS cluster has a master-worker pattern: a namenode (master) and a number of datanodes (workers). The namenode holds all the metadata and the datanodes hold the blocks. The namenode does not store block locations persistently; they are reconstructed from the datanodes at start time.
  • HDFS federation
  • HDFS High-availability
  • On large clusters the time it takes for a namenode to start from cold can be up to 30 mins
  • Fencing and failover - When the active namenode fails, an entity called the 'failover controller' switches to the standby namenode. ZooKeeper is used to ensure that only one namenode is active.
  • Graceful failover - triggered by the admin
  • Ungraceful failover - in this case, to make sure that the other node has completely stopped running, a mechanism called fencing is used. In the worst case it does 'shoot the other node in the head' - a forced shutdown.
  • File Operations in HDFS
  • There are java endpoints to do all operations like create, delete, sync
  • Use Flume and Sqoop to move data
  • Copy parallel with distcp
  • Hadoop archives (HAR files) pack many files into HDFS blocks more efficiently and can be used as input to MapReduce

Chap - 4 I/O

  • Compression
  • Reading compressed data
  • Serialization is natively implemented in Hadoop for better performance
  • Apache Avro is a project to do this in an improved way and support multiple languages; it is different from Google Protocol Buffers and Thrift

Chapter - 5 - Developing a MapReduce Application

Setting up the Environment
- The Configuration API to read xml resource files etc
- Writing Unit Test with MRUnit
- Running locally on a small data
- Using the Tool interface, write a Driver to run our MapReduce Job (Java file)
- Testing the driver
- Run in Cluster
- Package jar
- Launching a Job run the driver
- Debugging a Job
- Running multiple Jobs in a particular flow

Chapter 6 - How MapReduce Works

Chapter 9 - Chapter 15

Setting Up Hadoop Cluster 
- Manually 
- Using a CDH distribution (See Appendix)

Hadoop Tools : 
  • Pig: Aimed at providing data structures and transformations beyond what plain map and reduce can do
  • Hive: Made to run queries for people who are weak in Java but strong in SQL
  • HBase: Distributed, column-oriented database built on top of HDFS. It is built to scale.
  • ZooKeeper: Built to avoid partial failures of request transfers happening between nodes.
  • Sqoop: To transfer data from external applications, web APIs etc. This is focused on data movement.

Tuesday, 3 March 2015

How To Publish An Apple Watch App To The AppStore

The Apple Watch launch is very near at the time of writing this article. I'm all excited and ready to submit my first Apple Watch compatible application to the AppStore. I'll write down my learning experience here so that you can publish your own Apple Watch application to the AppStore. I'll do this step by step, as the work on my current app progresses. This article will be updated over time until I reach the final step and see it live in the AppStore.

Step 1 : Make the iPhone Part of the Apple Watch App

An Apple Watch app is not much different from an iPhone app. In fact, it is a sub-part of the main application running on the iPhone, and the Watch App merely acts as an extension of the parent app on the iPhone. So essentially, you need an iPhone app anyway. In this scenario, I'm thinking of building an app that will be useful both on the iPhone and on the Apple Watch, instead of solely focusing on the Apple Watch aspect. The app will be very simple, but everything it has will be available on both devices.

Update: To submit an Apple Watch application now, you should use Xcode 6.2 and not Xcode 6.3. Xcode 6.3 comes with Swift 1.2, which is currently in beta and not yet supported for AppStore release.

Also, as of now the Apple AppStore is not accepting Apple Watch applications. See the screenshot from the Apple WatchKit portal below:


Saturday, 31 January 2015

Learn Apple Watch Programming Quickly by Examples - My January Challenge

The Apple Watch is soon to be released, and being a huge Apple Evangelist, I've been very eager to explore the possibilities of what one can make with this 'Most personal device ever made by Apple'. My hopes are high. I'm on a constant mission to explore this area at the earliest. And guess what, my friends and I have been working on a couple of interesting 'Apple Watch' things lately:

1. Creating a course - 'Learn Apple Watch Programming Quickly By Examples'
2. Creating a website for Apple WatchKit Tutorials

Both are in beta and will be public soon. I'm trying to get the latest releases of the WatchKit sdk (which is bundled in Xcode 6.2 beta and higher and is required to build Watch Apps) and trying to publish tutorials on new APIs as and when they come.

Following are my objectives with both these online ventures:

  • Teach things more by examples and less by theory
  • Convey ideas at the simplest form possible
  • Keep user-interaction at the heart of all materials
  • Make the examples in a way encouraging the user to replicate the same
  • Provide quick support in the discussion forum of the course 

Let me know your feedback/ideas, and yes, watch this space for more details coming soon.

P.S. : If you'd like to get a free coupon for my course, please leave a comment with your email address below.
