Monthly Archives: February 2015

Have You Heard of One Minute Test(TM) for Understanding Graphs? Tale of Three Graphs

The tale of three graphs: why does the “right” visualization matter? If you cannot state the conclusions of a graph within a minute of seeing it, the purpose of the visualization is lost.

Take a look at the graph. (Image via the Georgetown University Center on Education and the Workforce analysis of U.S. Census Bureau, American ….)  It compares median salaries by major specialization and by experience.  Use the legend at the bottom of the graph.

Now see the same information from http://chronicle.com/article/Median-Earnings-by-Major-and/127604/ , though I have a screenshot below.

Now, consider the following.

As a third example, consider the third graph, which I am again bringing here for a quick lesson.  But I encourage you to go to the above link and see for yourselves, for another set of rich information that I bring out at the end.

I want you to compare “Alphabetical order” and “Median Salary”, where the subjects are ranked by their median salary level.
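To make the difference between the two orderings concrete, here is a minimal sketch in Python. The major names and salary figures are hypothetical placeholders, not values from the Georgetown or Chronicle data, and the function names are mine. It only shows that the same rows tell the story faster when sorted by the quantity you care about.

```python
# Hypothetical placeholder figures, NOT taken from the Georgetown/Chronicle data.
median_salary = {
    "Major A": 52000,
    "Major B": 75000,
    "Major C": 43000,
    "Major D": 61000,
}


def alphabetical(data):
    """Return (major, salary) rows sorted by major name."""
    return sorted(data.items(), key=lambda row: row[0])


def by_median(data):
    """Return (major, salary) rows sorted by salary, highest first."""
    return sorted(data.items(), key=lambda row: row[1], reverse=True)


def text_bars(rows, unit=5000):
    """Render each row as a crude text bar, one '#' per `unit` dollars."""
    return [f"{name:<8} {'#' * (salary // unit)} {salary}" for name, salary in rows]


alpha_rows = alphabetical(median_salary)
ranked_rows = by_median(median_salary)

print("Alphabetical order:")
print("\n".join(text_bars(alpha_rows)))
print("Ranked by median salary:")
print("\n".join(text_bars(ranked_rows)))
```

In the ranked output the bars shrink steadily from top to bottom, so the highest and lowest paying majors can be read off at a glance. In the alphabetical output the reader has to compare bar lengths one by one.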

Among the three, which one passes the one-minute test for understanding the story?

See the chronicle.com link above to understand why I see deeper considerations for visualization, and identify them.  The key words are “context sensitive” and “mouse over”.

Of course, so far we have talked about the beauty of information.  But the information goes deeper: there are social justice, entrepreneurial interest, and market equilibrium issues.  For the politics of salary ranges, which goes beyond all the interesting facts above, see http://www.bloomberg.com/news/articles/2011-05-24/engineering-undergrads-reap-top-salaries-among-college-majors

This tells the full story by going over the details.

Well, the geek in me is not satisfied and wants to take a look at the detailed data.  Here it is: http://online.wsj.com/public/resources/documents/info-Salaries_for_Colleges_by_Type-sort.html

There could be time differences between this data and the data depicted in the graphs above.  But the idea is clear: some people want to see more data, and a long list of rows of data excites such people.

 

How Data Scientists are Trained/Expected to Deliver in India

This is based on analysis of the site: http://www.iimjobs.com/

– Relevant experience with an organization known for its cutting edge/ best-in-class applicability of data mining and machine learning techniques.

– High level of proficiency in statistical tools like SAS, R

– Expertise in programming languages like Java/C/C++/Python

– Experience with relational databases and SQL is a must

– Relevant experience in Big Data platforms like Hadoop and its eco-system

– Hands on with various data types and structures: structured and unstructured data, static and streaming data, extensive prior experience in integrating data, profiling, validating and cleansing data.

– Masters Degree / PhD in a quantitative field (e.g., Computer Science, Economics, Engineering, Mathematics, Finance, Statistics, Operations Research)

Machine Learning in Business Applications — Microsoft Platform — Part 1

Machine learning platforms in the cloud will replace most of the junior level analyst positions in data science.   Think about what you need to become so that you are not dispensable.

In this series, I will be bringing out different platforms.  Microsoft has become a force to reckon with, thanks to its Azure Cloud and Azure ML.

Microsoft Azure:

Here is an attempt to compress all of Machine Learning into a presentation of 1 hour 18 minutes, using the Azure ML platform and including a case study on predicting whether someone's annual income is less than $50K or not.

SVM Notes – Experts’ Teaching on Logical Foundations of SVM and SVM in R Caret by Original Creator

How does SVM (Support Vector Machine) work? Note that you need to know linear algebra as well as quadratic programming optimization methods, in particular the Karush-Kuhn-Tucker method for constrained optimization.  The simplicity with which the presenter brings out the salient features of SVM is excellent.

Use the R caret package to do your SVM. Here is a video on R caret by its creator, Max Kuhn.  The presentation also discusses SVM applications. The latest caret package reference manual is http://cran.r-project.org/web/packages/caret/caret.pdf

Three different presentation decks from Max Kuhn are posted here. The first two have application code and examples on SVM.  The third is a detailed caret package presentation.

http://www.slideshare.net/kmettler/caret-package-for-r

http://www.slideshare.net/NYCPredictiveAnalytics/the-caret-package-a-unified-interface-for-predictive-models

Here is another presentation deck from Max Kuhn which has more details.

For the complete help pages of the full caret package, look at http://topepo.github.io/caret/Neural_Network.html .  I picked this specific sub-topic page because it confirms that avNNet is also in its list of algorithms.

An example of why ensembling means more than random forest is provided here: http://topepo.github.io/caret/similarity.html

On how to get ROC for SVM, follow http://r.789695.n4.nabble.com/ROC-from-R-SVM-td3318277.html

SVM kills RF… Does it!!! (Our community likes titles like this, a clear sexy dichotomy statement).

SVM vs. RF in Biomedical APPs

A better statement is: SVM outperforms RF (Random Forest) in 80% of the cases in the biomedical applications of the authors' work, all of which are classification applications. These are the two giants of machine learning. The authors’ presentation to the biomedical community is a comprehensive set of SVM notes. Though the authors claim it is a gentle introduction, it is advanced mathematics. Nonetheless, it is one of the easiest, with lots of graphics. Go for the following collection of slides. https://www.med.nyu.edu/chibi/sites/default/files/chibi/Final.pdf

Tuning of SVM parameters: http://stackoverflow.com/questions/20461476/svm-with-cross-validation-in-r-using-caret

 

Today’s News – Mathematics Uses Aspects of Prediction To Solve One of Its Pending Problems! The Brilliance of Dr. Zhang

Analytic computational number theory uses predictive scoring principles and a genetic algorithm.

As always, I wear my goggles to see what is in it for me, for my science of prediction: to see the analytical strategies or, better said in this case, the logic of mathematicians. It is very interesting to see how mathematicians use the idea of high-probability situations to help them home in on their targets! Isn’t it fascinating?  For me it is.

(Photo courtesy Nature magazine. "Tom" Zhang, shown May 13, 2013, in ...)

Today I received a note about an unassuming, hard-working, private, deep-thinking mathematician, who never cared for tenure nor for the big American-dream house and family, and who made a breakthrough on one of the famous unsolved problems about twin prime numbers. It was refreshing to read the articles. Before we go further, here is a popular lecture by Dr. Zhang himself on twin primes and the twin prime conjecture.

 

See Dr. Zhang introducing himself, in connection with his MacArthur prize, in this YouTube video.

The key point of excitement is this. Dr. Zhang proved that there are infinitely many pairs of primes that differ by less than 70,000,000. I will call this bound the size of the sieve: slide a window of that width up and down the integer number line, knock out the non-prime numbers, and no matter how far you go you keep finding windows that still hold at least two primes. Twin primes are the special case where the window is only 2 wide, which remains unproven.  After he figured it out, other people, using his ideas, excitedly tried to find the smallest sieve one can get.

The biggest single improvement was by James Maynard, from the UK.

(Picture: Eleanor Grant. James Maynard of the University of Oxford wrote the second paper proving Erdős’ conjecture on large prime gaps.)

James Maynard proved that the sieve size can be brought down to 600, significantly less than 70,000,000.  The Polymath computational initiative then reduced it to 246.

The full story is captured in one video (New Yorker) and an article (wired.com). Read also the following article, which provides more insights on the analytical strategies. https://www.quantamagazine.org/20141210-prime-gap-grows-after-decades-long-lull/
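To get a feel for what a bound on prime gaps means, here is a minimal sketch in Python. It lists pairs of consecutive primes whose difference is at most a chosen bound, within a finite range. The function names are my own, and a finite search like this only illustrates the statement. It proves nothing about infinitely many pairs, and it is not the method used by Zhang, Maynard, or Polymath.

```python
def primes_up_to(n):
    """Return all primes <= n using the sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0] = is_prime[1] = 0
    p = 2
    while p * p <= n:
        if is_prime[p]:
            # Knock out every multiple of p, starting at p*p.
            count = len(range(p * p, n + 1, p))
            is_prime[p * p :: p] = bytearray(count)
        p += 1
    return [i for i, flag in enumerate(is_prime) if flag]


def consecutive_pairs_within(limit, max_gap):
    """Pairs of consecutive primes <= limit whose difference is <= max_gap."""
    primes = primes_up_to(limit)
    return [(a, b) for a, b in zip(primes, primes[1:]) if b - a <= max_gap]


# Gap bound 2 picks out the twin primes (plus the pair 2, 3).
close_pairs = consecutive_pairs_within(100, 2)
print(close_pairs)

# With a generous bound such as 246, every consecutive pair in a small range qualifies.
print(len(consecutive_pairs_within(100, 246)))
```

Raising `limit` while keeping `max_gap` fixed shows the qualifying pairs thinning out but continuing to appear. The theorems say that, for a bound of 246, they never stop.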

Now what are the analytics strategies here? The following are my observations:

– When strong results are asked for, at least quickly prove the weaker result.

– The computation method here uses a genetic algorithm to locate, by probability score, the most likely candidates, and then determines the sieve size by further simulation.  This was the most surprising analytical strategy, and it led to another surprising strategy.

– Multiplying the scores of the most likely places gives rise to a reordering of the locations and identifies them with better accuracy… to look for primes!