Monthly Archives: March 2012

$200M BIGdata Investments across NSF, HHS/NIH, DOD, DARPA, USGS

http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf

Investments are happening across NSF, HHS/NIH, DOD, DARPA, USGS

“…

Further details about each department’s or agency’s commitments can be found at the
following websites:
NSF: http://www.nsf.gov/news/news_summ.jsp?cntn_id=123607
HHS/NIH: http://www.nih.gov/news/health/mar2012/nhgri-29.htm
DOD: www.DefenseInnovationMarketplace.mil
DARPA: http://www.darpa.mil/NewsEvents/Releases/2012/03/29.aspx
USGS: http://powellcenter.usgs.gov

…”

Also, read how this New York Times article puts it in a broader perspective:

New U.S. Research Will Aim at Flood of Digital Data

From Data Monster & Insight Monster

15-20 Projects – NIH BIGdata Solicitation – Min $250K Offers

The solicitation:

Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA)

“…
The Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) solicitation aims to advance the
core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large, diverse, distributed and heterogeneous data sets so as to: accelerate the progress of scientific discovery and innovation; lead to new fields of inquiry that would not otherwise be possible; encourage the development of new data analytic tools and algorithms; facilitate scalable, accessible, and sustainable data infrastructure; increase understanding of human and social processes and interactions; and promote economic growth and improved health and quality of life. The new knowledge, tools, practices, and infrastructures produced will enable breakthrough discoveries and innovation in science, engineering, medicine, commerce, education, and national security — laying the foundations for US competitiveness for many decades to come.

The phrase “big data” in this solicitation refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.

This solicitation is one component in a long-term strategy to address national big data challenges, which include advances in core techniques and technologies; big data infrastructure projects in various science, biomedical research, health and engineering communities; education and workforce development; and a comprehensive integrative program to support collaborations of multi-disciplinary teams and communities to make advances in the complex grand challenge science, biomedical research, and engineering problems of a computational- and data-intensive world.

Today, US government agencies recognize that the scientific, biomedical and engineering research communities are undergoing a profound transformation with the use of large-scale, diverse, and high-resolution data sets that allow for data-intensive decision-making, including clinical decision making, at a level never before imagined. New statistical and mathematical algorithms, prediction techniques, and modeling methods, as well as multidisciplinary approaches to data collection, data analysis and new technologies for sharing data and information are enabling a paradigm shift in scientific and biomedical investigation. Advances in machine learning, data mining, and visualization are enabling new ways of extracting useful information in a timely fashion from massive data sets, which complement and extend existing methods of hypothesis testing and statistical inference. As a result, a number of agencies are developing big data strategies to align with their missions. This solicitation focuses on common interests in big data research across the National Institutes of Health (NIH) and the National Science Foundation (NSF).
This initiative will build new capabilities to create actionable information that leads to timely and more informed decisions. It will both help to accelerate discovery and innovation, as well as support their transition into practice to benefit society. As the recent President’s Council of Advisors on Science and Technology (PCAST) 2010 review of the Networking Information Technology Research and Development (NITRD) [http://www.nitrd.gov/pcast-2010/report/nitrd-program/pcast-nitrd-report-2010.pdf] program notes, the pipeline of data to knowledge to action has tremendous potential in transforming all areas of national priority. This initiative will also lay the foundations for complementary big data activities — big data infrastructure projects, workforce development, and progress in addressing complex, multi-disciplinary grand challenge problems in science and engineering.

…”

“…Estimated program budget, number of awards and average award size/duration are subject to the availability of funds. An estimated fifteen to twenty projects will be funded, subject to availability of funds. Up to $25,000,000 will be invested in proposals submitted to this solicitation, subject to availability of funds.

All awards under this solicitation made by NIH and / or NSF will be as grants or cooperative agreements or other contract vehicles as determined by the supporting agency. Two sizes of projects are expected to be funded under this solicitation:

Small projects: One or two investigators can ask for up to $250,000 per year for up to three years.
Mid-scale projects: Three or more investigators can ask for funding between $250,001 and $1,000,000 per year for up to five years.
For both types of projects, we encourage scientists from all disciplines to participate. Projects will be awarded depending on the availability of funds and with consideration for creating a balanced overall portfolio, from foundational big data science and engineering to areas of national priority, including health IT, emergency response and preparedness, clean energy, cyberlearning, material genome, national security, and advanced manufacturing.

…”
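As a rough sanity check on the numbers quoted above, here is a back-of-the-envelope sketch in Python. The budget cap, award counts, and per-year limits come from the solicitation text; the enumeration of small/mid-scale mixes is my own hypothetical illustration, not anything NSF/NIH publishes.

```python
# Back-of-the-envelope check of the BIGDATA numbers quoted above. The budget cap,
# award counts, and per-year limits come from the solicitation; the particular
# small/mid-scale mixes enumerated here are a hypothetical illustration.
BUDGET_CAP = 25_000_000                      # "Up to $25,000,000 will be invested"
SMALL_PER_YEAR, SMALL_YEARS = 250_000, 3     # small projects: up to $250K/yr, up to 3 yrs
MID_PER_YEAR, MID_YEARS = 1_000_000, 5       # mid-scale: up to $1M/yr, up to 5 yrs

def max_cost(n_small: int, n_mid: int) -> int:
    """Portfolio cost if every award runs at its maximum size and duration."""
    return n_small * SMALL_PER_YEAR * SMALL_YEARS + n_mid * MID_PER_YEAR * MID_YEARS

for total in range(15, 21):                  # "an estimated fifteen to twenty projects"
    mixes = [(s, total - s) for s in range(total + 1)
             if max_cost(s, total - s) <= BUDGET_CAP]
    print(f"{total} projects -> feasible (small, mid) mixes at max funding: {mixes}")
```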

From Data Monster & Insight Monster

$6.5B Worth of 938,664 Fraudulent IRS Debit Card Refunds via Identity Theft – A Predictive Analytics Opportunity?

This is rampant in Florida; it is becoming the new drug-dealing-style racket.

How can the IRS, which tries to make refunds faster and easier for people who do not have checking accounts, do so without opening the door to this fraud?

– Mail the check (the old method), why not? Of course, it is snail mail and takes time to cash.

– Disassemble the debit card process and keep only its more secure components. For example, the IRS could restrict the network it recognizes to branded debit card issuers that meet strict selection criteria, and/or require one more form of ID that confirms the individual (for example, a driver's license, voter ID, or passport). Even then, it is very easy for people to lose the debit card, because the credited amount is de-linked from personal information and stays usable by whoever holds it. This creates a moral hazard in using debit cards loaded with IRS tax refunds.

– What else can the IRS do to bring the fraud losses down to an incredibly low amount?

– On the user's side, enroll in programs like my-spy.com; unfortunately, it is very expensive, at $150 per year!

– How can analytics play a role in identifying potentially fraudulent situations, so that more safeguards can be put in place, or at least a second ID check applied a bit more cautiously? (A rough sketch follows this list.)

– Is it a predictive-analytics opportunity?
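
One way analytics could play a role, sketched minimally below: score each refund request before a debit card is loaded, and route high-risk requests to a second ID check. This is purely illustrative; the features, data, and threshold are hypothetical and have no relation to any actual IRS process.

```python
# Minimal, purely illustrative refund-fraud risk score; the features, data, and
# threshold are hypothetical and do not reflect any actual IRS process.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features per refund request, e.g.:
#   refunds_to_same_address, days_filed_after_season_open, refund_amount_zscore
X = rng.normal(size=(5000, 3))
# Hypothetical labels: pretend fraud concentrates where many refunds share an address.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=5000) > 1.5).astype(int)

model = LogisticRegression().fit(X, y)

new_requests = rng.normal(size=(5, 3))
risk = model.predict_proba(new_requests)[:, 1]
needs_second_id_check = risk > 0.5        # hypothetical policy threshold
print(np.round(risk, 2), needs_second_id_check)
```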

http://www.cnn.com/2012/03/20/us/tax-refund-scam/index.html

From Data Monster & Insight Monster

SAS Visual Analytics – SAS answers BIG data needs with a powerful solution

An exciting new development has happened in the analytics world.

SAS has announced its answer to BIG data analysis. The interesting thing is that it can plug and play into any of the BIG data hardware/software architecture platforms: the most important being IBM, another being Oracle, or a third-party platform that is redefining the Hadoop architecture.

http://www.sas.com/technologies/bi/visual-analytics.html

These sentences and phrases were culled from the demo section of the web page above:

– It is a lot easier to understand content in pictures than in raw data
– Explore billions of rows of data in seconds
– It can be used by different types of people, not just analysts
– Once developed, share insights and reports across delivery methods, including mobile devices
– Develop insight at the speed of light
– Drag and drop measures onto a palette and let the auto-charting algorithm pick the visual
– In-memory SAS LASR component (service)
– Quickly and precisely process a billion rows of data in a few seconds
– Overcomes the difficulty of pre-building CUBEs to summarize data and of changing a CUBE's hierarchy structure
– Restructuring the CUBE is no longer a limiting resource hog
– Access all data and configure the visuals in the shortest possible time

BIG data exploration is becoming mainstream with SAS Visual Analytics: built around auto-charting algorithms, with real-time hierarchy creation in data CUBEs, it provides visualization of analytic correlations, trends, and exceptions, and also helps share and/or publish to mobile devices in a snap.
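
To make the "auto charting" idea concrete, here is a toy heuristic in Python that picks a chart type from the data types and cardinality of the selected fields. This is only an illustration of the concept; it is not SAS code and does not reflect how SAS Visual Analytics or the LASR server actually works.

```python
# Toy auto-charting heuristic: pick a chart type from the data types and cardinality
# of the fields a user drops onto the palette. Not SAS code; no relation to how
# SAS Visual Analytics or the LASR server actually works.
import pandas as pd

def auto_chart(df: pd.DataFrame, fields: list) -> str:
    numeric = [f for f in fields if pd.api.types.is_numeric_dtype(df[f])]
    categorical = [f for f in fields if f not in numeric]

    if len(numeric) == 2 and not categorical:
        return "scatter plot"
    if len(numeric) == 1 and not categorical:
        return "histogram"
    if len(numeric) == 1 and len(categorical) == 1:
        # few categories -> bar chart; many distinct values (e.g. dates) -> line chart
        return "bar chart" if df[categorical[0]].nunique() <= 12 else "line chart"
    if not numeric and len(categorical) == 1:
        return "frequency bar chart"
    return "crosstab / heat map"

df = pd.DataFrame({"region": ["E", "W", "E", "S"],
                   "sales": [10.0, 12.5, 9.0, 14.2],
                   "units": [1, 2, 1, 3]})
print(auto_chart(df, ["region", "sales"]))   # bar chart
print(auto_chart(df, ["sales", "units"]))    # scatter plot
```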

Note, though, that Hadoop preprocessing still happens independently of SAS Visual Analytics.

Go SAS.

From Data Monster & Insight Monster

On Average, $500K Is the Incremental Profit Lost Due to the Incompetence of an Analyst

image courtesy: Mediabistro.com

So much is at stake, I feel, every time I do a project.

It takes a lot of experience and training to be a great analyst. Convincing business managers is such an important task in executing analytical projects.

You should be thankful to be working with such people, because they are the ones who bring sense into all the complex modeling you might do. The common denominator is that everyone is smart and can figure out whether something makes sense or not.

However, it requires a great toolbox, the practical wisdom to interpret human behavior, and a focus on the requirements.

In the end, this discussion is all about the value of analytics, not really about the analyst's mistakes, though that value comes through the analysts.

Here is a simple example: when the results came back from the campaign, the response-analysis lift chart looked flat. Everyone was getting anxious and wringing their hands, because a lot of brand image and customer value depended on the promised success of the model and the pilot campaign that the client was willing to invest in.

One needs to dig into the data at a detailed level to see what is going on, and looking at the results in the right way showed that they were spot on. The point is that you cannot look at the data with a run-of-the-mill textbook approach, because it will not surface the results for correct interpretation. You may end up working in all kinds of interesting new data situations where the consumer dynamics and market dynamics were not captured well in the model, and yet the analytical methods will still yield a way to analyze the data and get the right interpretation, or at least a differential story.
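
For the mechanics, here is a minimal lift-table sketch in Python with made-up scores and responses (all column names and data are hypothetical): rank the scored customers into deciles and compare each decile's response rate to the overall rate, rather than eyeballing a single flat-looking chart.

```python
# Minimal lift-table sketch with hypothetical scores and responses: rank scored
# customers into deciles and compare each decile's response rate to the overall rate.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
score = rng.uniform(size=n)                         # hypothetical model scores
response = rng.binomial(1, 0.02 + 0.08 * score)     # responders concentrate at high scores

df = pd.DataFrame({"score": score, "response": response})
df["decile"] = pd.qcut(df["score"], 10, labels=False)   # 0 = lowest scores, 9 = highest

lift = (df.groupby("decile")["response"].mean()
          .sort_index(ascending=False)                  # best-scored decile first
          .to_frame("response_rate"))
lift["lift"] = lift["response_rate"] / df["response"].mean()
print(lift.round(3))   # lift well above 1.0 in the top deciles, below 1.0 at the bottom
```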

The rising importance of analytics, and the time and money being invested in it from every direction, brings out an important question.

Many stories from my past experience nagged me: what could be the total lifetime opportunity value of a failed analytics project? What happens if a couple of them occur before the analyst gets fired?

I felt it is mostly attributable to the analyst. Sometimes the analyst has to follow the reasoning maniacally to protect not only oneself but also the organization.

So I recast the question: what is the cost of an incompetent analyst to an organization? Because most of the damage is done in the early stages of an analyst's employment, the estimate I am going to quote covers only the first three years of employment. It is, in a way, a censored list of mistakes, so that it is not skewed by everything one might get wrong over the long tail of an analyst's tenure distribution, and stays relevant for the hiring organization.


This is important because there are many innocent-looking strategic judgment errors, activity errors, and directional mistakes that are led by the input or work of an analyst. By the time the manager realizes the mistake, the time has passed and the loss is already sculpted into the brand value of the analyst, the owning department, and the organization.

The loss due to such analysts comes to more than $500,000, above and beyond the net labor cost of retaining the analyst for three years. The estimate is much higher for a consulting company, where the total value of the client relationship is at stake.

So how can companies protect against this nearly invisible loss?

– Make sure the analysts are best of breed; not even a Ph.D. is enough to assure that quality. What really matters is the overall attitude to life, work ethic, team behavior, interest in continuous learning, the ability to focus on the needs of the work, and the ability to connect data intelligence with human-behavior intelligence. No wonder one can say that these things happen automatically if you have passion for your field, in this case data science.
– During the interview, ask questions that help you understand the following:

– What kind of invisible strategic errors the candidate is likely to make, how to spot them, and how to redress them before the expected event happens
– Whether the candidate can communicate and understand how invisible strategic errors can happen
– Play the game of estimating, during the interview, what would have happened had it not been for a given mistake

Now imagine the opportunity lost if a leader has some dimension of incompetence that is not well understood, and the rest of management does not redress the issue. Tough one.

OK, once an analyst is hired, how do you address the analyst's needs so that these factors keep getting stronger? You need a third party who will oversee such talent and personalize the process. It is worth it: even a good analyst will benefit from this and contribute hundreds of thousands of dollars back, every year, directly to the organization's profits.

Just some thoughts on that point:

– Get a third party to keep certifying the decision analytics of the analysts
– Enroll them in online training programs from renowned universities or institutes, especially if they are managers who have to build and manage analytics teams
– Hire the right talented people
– Above all, train the analyst and foster a great team attitude so that people have fun together
– Ask them to specialize in an applied area of analytics and measure their progress

I am getting responses from people saying that it can be millions of dollars; of course, depending on how big the company is, it can be billions, especially for highly strategic insights projects.

Think of the Netflix case: the company has struggled for the last nine months to arrive at the right strategic decisions on pricing its offers to consumers and on its strategic relationships with content providers. (Update as of April 2012.)

They lost $12 billion in market value messing around with difficult choices, not $500K per analyst; that may not have been an analyst problem at all; perhaps it was managerial.

PS1: Someone asked how one would estimate this more scientifically.

One way to start.

Use a sample of non-competent analysts who ended up getting fired. Use measures such as the direct and indirect costs of retaining the analyst and the net profit from the ROI calculations. Audit the various reports that provide data on QA problems, modeling problems, interpretation problems, application problems, client communications, the application domain area, and so on for these analysts. Then compute the simulated net profit had the work been done without those problems; the difference between that simulated net profit and the net profit estimated from the work actually delivered gives the opportunity lost. Or you can compare the non-competent analysts against the competent ones. Note that the data are going to be censored. I have not even touched the brand value.
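
A stripped-down version of that calculation, with purely hypothetical placeholder figures, just to show the structure of the estimate:

```python
# Stripped-down structure of the opportunity-lost estimate described above.
# All figures are hypothetical placeholders, not measured values.
simulated_net_profit = 1_400_000   # net profit had the work been done without the QA,
                                   # modeling, interpretation, and communication problems
actual_net_profit = 820_000        # net profit estimated from the work actually delivered

opportunity_lost = simulated_net_profit - actual_net_profit
print(f"Opportunity lost on this project: ${opportunity_lost:,}")   # $580,000 here

# Repeat over an audited (and censored) sample of projects, average across analysts,
# and compare non-competent vs. competent analysts; brand value is still not included.
```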

One has to make a lot of simplifying assumptions, as you can see.

Also, the above points relate to the opportunity lost, in dollar terms, across the Fortune 5000 companies in the USA and similar comparable segments around the world.

Well, to conclude and lay this controversial, highly subjective piece to rest: I found this very interesting and hilarious.

From Data Monster & Insight Monster

Familiarize or Recall: Survey Design and Implementation Concepts

(1) An auto analyst is conducting a satisfaction survey, sampling from a list of 10,000 new car buyers. The list includes 2,500 Ford buyers, 2,500 GM buyers, 2,500 Honda buyers, and 2,500 Toyota buyers. The analyst selects a sample of 400 car buyers, by randomly sampling 100 buyers of each brand.

Is this an example of a simple random sample?
(A) Yes, because each buyer in the sample was randomly sampled.
(B) Yes, because each buyer in the sample had an equal chance of being sampled.
(C) Yes, because car buyers of every brand were equally represented in the sample.
(D) No, because every possible 400-buyer sample did not have an equal chance of being chosen.
(E) No, because the population consisted of purchasers of four different brands of car.
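
To see the distinction in code, here is a toy sketch of the setup in question (1): drawing exactly 100 buyers per brand is stratified sampling, whereas under a simple random sample of 400 every 400-buyer subset of the 10,000 is equally likely and the brand counts vary.

```python
# Toy sketch of question (1): 100 buyers per brand is a stratified sample, not a
# simple random sample, because not every 400-buyer subset of the 10,000 is possible.
import numpy as np

rng = np.random.default_rng(1)
brands = np.repeat(["Ford", "GM", "Honda", "Toyota"], 2500)   # the 10,000 buyers

# Stratified: exactly 100 from each brand.
stratified = np.concatenate([
    rng.choice(np.where(brands == b)[0], size=100, replace=False)
    for b in ["Ford", "GM", "Honda", "Toyota"]
])

# Simple random sample: every 400-buyer subset is equally likely.
srs = rng.choice(len(brands), size=400, replace=False)

print(np.unique(brands[stratified], return_counts=True))  # always 100 / 100 / 100 / 100
print(np.unique(brands[srs], return_counts=True))         # counts vary around 100
```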

(2) Which of the following statements are true?
I. Random sampling is a good way to reduce response bias.
II. To guard against bias from undercoverage, use a convenience sample.
III. Increasing the sample size tends to reduce survey bias.
IV. To guard against nonresponse bias, use a mail-in survey.
(A) I only
(B) II only
(C) III only
(D) IV only
(E) None of the above.
(the above two questions are from: http://stattrek.com/survey-research/sampling-methods.aspx)

(3) Usually, sample size problems in marketing do not use the power concept (controlling for type II error, i.e., false negatives: failing to recognize a unit as an opportunity when it actually is one), because
I. The cost of controlling for such errors is not worth it
II. The commonly available tools do not have such results
III. It is applicable only in simplest cases of application

(A) I only
(B) II only
(C) III only
(D) I and II only
(E) II and III only
(F) All of the above
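
For context on question (3), here is a sketch of what using the power concept actually involves when sizing a test that compares two response rates (normal approximation; the rates, alpha, and power below are hypothetical choices):

```python
# Sketch of a power-based sample size for comparing two response rates
# (normal approximation to the two-proportion test); p1, p2, alpha, and power
# are hypothetical choices.
from math import sqrt
from scipy.stats import norm

p1, p2 = 0.020, 0.025        # control vs. test response rates
alpha, power = 0.05, 0.80    # type I error rate and desired power (1 - type II error rate)

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
p_bar = (p1 + p2) / 2

n_per_group = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
               / (p1 - p2) ** 2)
print(round(n_per_group))    # roughly 13,800 per group to detect this small a difference
```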

(4) Choose the right ones:
I. Margin of error provides a way of protecting the conclusions, because the results are not based on complete enumeration
II. Non-probability samples are a quick way of testing a survey design and implementation
III. Analytics based on multi-stage sampling provide a proper estimate of the population distribution
IV. In analytics, a key question a manager should ask is “How are the weights used to balance the samples?”

(A) I, II, and III only
(B) II, III, and IV only
(C) I, II, and IV only
(D) All of the above

(5) Key questions a manager should ask in analytical projects that are based on samples:
I. How is the representativeness of key segments addressed with proper sampling?
II. How do we know that weighting to adjust for representativeness is all that is needed?
III. What is the total (economic or success) value of the key segments that are being addressed?
IV. Which is minimized in building the model: error in the estimate or bias in the estimate?
V. What is the margin of error?

(A) I and II
(B) I and III
(C) I and IV
(D) I, II, III
(E) I, III, IV
(F) II, III, IV
(G) All

(6) Margin of error is
I. affected by sample sizes but generally not affected by population sizes
II. a function of the level of confidence we can have regarding the estimate
III. minimized effectively by stratified sampling when estimates differ among segments

(A) I and II
(B) I and III
(C) II and III
(D) All of the above
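
A quick numerical check of statements I and II, worked for a proportion: the margin of error shrinks with the sample size n and grows with the confidence level, but (absent a finite-population correction) does not depend on the population size. The sample sizes below are arbitrary.

```python
# Margin of error for a proportion: it depends on the sample size n and the confidence
# level, but (without a finite-population correction) not on the population size.
from math import sqrt
from scipy.stats import norm

def margin_of_error(p_hat: float, n: int, confidence: float = 0.95) -> float:
    z = norm.ppf(1 - (1 - confidence) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)

for n in (100, 400, 1600):
    print(f"n={n:5d}  95% MoE={margin_of_error(0.5, n):.3f}"
          f"  99% MoE={margin_of_error(0.5, n, 0.99):.3f}")
# n=  100  95% MoE=0.098  99% MoE=0.129
# n=  400  95% MoE=0.049  99% MoE=0.064
# n= 1600  95% MoE=0.024  99% MoE=0.032
```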

From Data Monster & Insight Monster