## Archive for the ‘**Statistics**’ Category

## Aligning Scores

For the past few days, the IITs have been in the news again. HRD minister Kapil Sibal wants class twelfth performance to carry weight in selection as well. He wants a minimum of eighty percent marks to qualify for the test. While this proposal has been a topic of hot discussion, I kept thinking about its execution from a statistical perspective.

The burning question, we will all agree, is that no two board examinations are alike in how strictly they award marks. While it is relatively easy to get 90% in CBSE and ICSE, it is next to impossible to get similar marks in a few state boards, such as the Bihar Intermediate examination.

Now, if we want the screening rule to be unbiased, we need to account for the reality that 80% in one board examination is not 80% in another. This can be done by creating an unbiased methodology to project all the scores onto the same axis, that is, by aligning the scores of the different board examinations.
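To make "projecting scores onto the same axis" concrete, here is a minimal sketch of one simple possibility: replacing each raw mark with its percentile rank within its own board. This is only an illustration, not necessarily the methodology this post has in mind; the marks and the function name `percentile_rank` are made up.

```python
import numpy as np

def percentile_rank(scores):
    """Return each score's percentile rank (0-100) within its own board."""
    scores = np.asarray(scores, dtype=float)
    # Share of students in the same board scoring at or below each score.
    return np.array([np.mean(scores <= s) * 100 for s in scores])

# Hypothetical marks (in percent) from a lenient and a strict board.
lenient_board = [95, 90, 88, 85, 80]
strict_board = [75, 70, 65, 60, 55]

lenient_aligned = percentile_rank(lenient_board)
strict_aligned = percentile_rank(strict_board)

print("lenient board:", lenient_aligned)
print("strict board: ", strict_aligned)
```

On the percentile axis the topper of the strict board (75%) and the topper of the lenient board (95%) land on the same point, which is the kind of alignment a fair cut-off would need. The sketch assumes the two boards have comparable student populations, which is itself a strong assumption.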

Before getting into the details of approaches to correct for this bias, let us list a few business scenarios where we might have to do a similar task.

Competitor Analysis:

a) An auto finance company gets a number of requests to refinance auto loans. In this situation, the company would have data on the interest rate charged by the previous financier. It would also have complete application data and bureau data for the customer. The interest rate charged is a function of the application data and an internally developed risk score based on bureau data. Now, if an auto finance company is able to align its internal risk score with a competitor's risk score, it certainly has an edge over that competitor. The details will be clear from the paper linked below.

b) A similar problem for an insurance company: please have a look at this patented method for aligning two scores. It can potentially help a company align the premium it charges with a competitor's premium. Unfortunately, as with all patent filings, the methodology is not easy to grasp in one reading. Allow me to digress for a while: it would be really helpful if the law required patentees to file an easy-to-understand document along with the usual patent filing. It is understandable why patentees construct their claims the way they do; it is necessary for proving infringement.

There are various examples in the competitor analysis domain where we would have to perform a similar task. Here is another scenario: say the risk team has developed a risk scorecard that is one of the inputs for pricing. Earlier, the risk team used the traditional FICO score, but in the newer model they have used the NextGen score for the internal risk scorecard. The pricing team applies rules on the FICO score and the internal risk score. With the new risk score, the pricing team has to apply rules on the NextGen score and the internal score, but they do not have access to the NextGen scores of previous customers to reprice them.

Having given a few examples where aligning scores is of paramount importance, we still have to solve the problem of aligning marks. The details will be in the next post.

## Suicide Pandemic

The media, offline (news channels and newspapers) as well as online, have been busy counting the suicides and heart attacks that followed the death of YSR. As any thinking mind would expect, the reported death toll has varied from just over 100 to 344.

Recently, two weeks ago, I observed that the count of swine flu victims in Pune was reduced by 2. The officials said that the H1N1 virus had been wrongly blamed for two deaths. Likewise, it is even more difficult to establish the causality of these suicides and heart attacks. Tougher still is to deny the causality while sitting far from AP, when people residing there would swear by a number greater than a hundred.

The number 344 (46 suicides + 298 heart attacks) came from Kadapa MP Y S Jaganmohan Reddy. I can think of the following reasons for people like YSJR coming up with such a high number:

a) A sense of self-importance on divulging information unknown to the whole world.

b) Proving that YSR was a superstar, though he was; YSJR worked for the superstar, and only a star works for a superstar.

c) The information has an element of surprise or shock in it, which is why we all like to discuss it, and so did YSJR.

Doubt over the validity of the numbers is not what made me write this blog entry. I set out to write it because I have seriously started suspecting that AP has been afflicted with a suicide pandemic. This suicide virus must have been the reason behind all 46 suicides; it must also have been the reason behind farmers committing suicide.

The first time I came across the idea of suicide being infectious was in Orhan Pamuk's novel Snow. It made me think, and I realized that suicide must be infectious. It becomes easier for a person to commit suicide when he has thousands of cases before him. The more the cases, the easier it becomes. After all, with all due respect to the *beauty of world*, a person doesn't lose anything by committing suicide, and in fact he unburdens himself of all the expectations and failures that have wounded his heart throughout his existence. Once a person realizes this, it becomes easier. Farmers in AP have attained this wisdom.

A few days ago, I came across a video featuring Vandana Shiva in which she blamed Monsanto for farmers' suicides. The facts that need to be known are these: Monsanto sells Bt cotton seeds; farmers in AP are using them; Bt cotton requires a regular supply of water and is highly sensitive to it; it is not *adjusting* like traditional seeds, which still give good results when the monsoon is late. So farmers didn't get much from their land, and a few of them committed suicide. Now Vandana Shiva is blaming Monsanto for that. Monsanto, though, has more logical arguments in its favour: it has stated the preconditions on the seed package, namely that the result is better with good farming, so it must not be held responsible for the illiteracy of farmers.

I have one more argument in favour of Monsanto: farmers in AP might be infected with a suicide virus, and as the days pass it is becoming easier for them to commit suicide. Think of a hypothetical situation: say a few people in your family have committed suicide; won't it be easier for you? Has research not shown us that children of divorced parents are more likely to divorce? It's about the psychology of overcoming the mental threshold.

If this is true, then we must act on it, because this local suicide epidemic might become a pandemic, and would be more deadly than the Spanish flu of 1918.

## Forecasting season of sales

A few facts, courtesy of a news item in ET.

- During the mango season, which typically stretches from April to June, battery makers such as Eveready and Nippo observe a steep spike in sales.
- GSKCH has observed a huge spurt in sales of its health drink, Horlicks, during the exam period, which is typically March.
- Dabur India also sells half of its annual volume of Shankha Pushpi (a drink that makes people intelligent :)) between January and March, the phase before board examinations. Sales of its Chyawanprash pick up at the onset of winter.
- Reebok, Nike and Puma, the sports brands, observe a sharp spurt in sales (particularly of women's apparel and shoes) at the beginning of the year.

While I leave the task of finding and establishing the causality behind the above observations to the imaginative minds of readers, here, from the perspective of a data analyst, I will explore the famous Box-Jenkins air-passenger data to see the trick to finding patterns and seasonality in data.

Looking at the plot of monthly air passenger volume, we can infer the following:

- There seems to be a pattern in the number of passengers travelling.
- The number of travellers seems to peak in June or July, and then goes down till the end of the year.
- There has been a continuous year-on-year (YOY) increase in the number of travellers.
- The amplitude of the seasonal swings increases with the overall trend.
- There is seasonality, but we cannot be sure of the lag (the time after which the data repeats its pattern) merely by looking at the plot.

Having observed these traits in the behaviour of air passengers, the next question that pops up is this: intuitively it does seem possible to forecast the number of air passengers for each month of 1961, but how?

To answer the *How*, let us first plot the best-fit trend line, linear or exponential, on the above data.

We see that an exponential function fits the data well. Now, to remove the trend from the data, we subtract the trend from the actual series. In addition to this transformation, we also need to remove the varying, increasing amplitude; that is, we need to stabilise the variance. This can be done through a logarithmic transformation. In this particular example, however, the first transformation has made the series negative at a few data points, so a simple log won't work. Taking the log of the magnitude is a viable idea here. For the moment, suppose we have stabilised the variance.
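The detrending step can be sketched as follows. The series here is synthetic (an exponential trend times a 12-month cycle), not the actual Box-Jenkins data, and all variable names are illustrative. The last step shows a commonly used alternative to the "log of magnitude" idea: taking logs before removing the trend, which avoids negative values altogether. That alternative is my assumption for the sake of illustration, not the procedure described above.

```python
import numpy as np

# Synthetic stand-in for a monthly series (NOT the real air-passenger data):
# exponential growth multiplied by a 12-month seasonal pattern.
months = np.arange(144)
series = 100.0 * np.exp(0.01 * months) * (1.0 + 0.2 * np.sin(2 * np.pi * months / 12))

# Fit an exponential trend y = a * exp(b * t) via a straight-line fit on log(y).
b, log_a = np.polyfit(months, np.log(series), 1)
trend_fit = np.exp(log_a + b * months)

# Step described in the text: subtract the trend from the actual series.
# The result dips below zero, and its swings keep growing over time.
residual = series - trend_fit

# Alternative (an assumption, not the post's method): work on the log scale,
# which removes the trend and steadies the amplitude in one step.
log_residual = np.log(series) - np.log(trend_fit)

half = len(months) // 2
print("residual spread, first vs second half:    ",
      residual[:half].std(), residual[half:].std())
print("log-residual spread, first vs second half:",
      log_residual[:half].std(), log_residual[half:].std())
```

The spread of the plain residual grows from the first half to the second, while the spread of the log-scale residual stays roughly constant, which is what "stabilising the variance" means in practice.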

Our next step is to find the lag, the time period after which history repeats itself, that is, after which the curve starts repeating. This can be done by looking at the autocorrelations. Basically, autocorrelation measures how the data is correlated with itself. We lag the variable by one data point and see its correlation with the unlagged variable. Then we lag it by two data points and observe its correlation with the unlagged variable, and so on. The lag at which we get the maximum correlation is the period after which we say the variable repeats itself.
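A minimal sketch of this lag search, again on synthetic data: a noisy sine wave with a 12-step cycle stands in for the detrended, variance-stabilised series. The helper name `autocorrelation` and the choice to scan lags 1 to 18 are assumptions made for the illustration.

```python
import numpy as np

def autocorrelation(x, lag):
    """Correlation between the series and a copy of itself shifted by `lag`."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Synthetic detrended series with a 12-step cycle plus a little noise.
rng = np.random.default_rng(0)
t = np.arange(144)
detrended = np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(t.size)

# Correlation of the series with itself at each candidate lag.
acf = {lag: autocorrelation(detrended, lag) for lag in range(1, 19)}
best_lag = max(acf, key=acf.get)

print("lag with the highest autocorrelation:", best_lag)
```

On real data the picture is usually less clean: if any trend is left in the series, very short lags tend to show high correlation too, so it helps to look at the whole set of autocorrelations rather than only the single maximum.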

Up till now, we have been trying to perform all the necessary calculation

## Interaction Variable

Suppose that, in a laboratory experiment, we are trying to figure out the sweetness of tea as a function of the variables ‘quantity of sugar’ and ‘frequency of stirring’. After a full day of methodical experimentation, we have entered the data in our experiment book.

Now we run a linear regression analysis on our data to get the following equation, with an adjusted R^2 value of 0.90.

Sweetness = 1.93 sugar + 5.37 stirring freq – 6.43

Though the adjusted R^2 is good enough, we create one more variable, sugar * sf (where sf is the stirring frequency). We run the regression model again with the assumed relationship Sweetness = c1 * sugar + c2 * sugar * sf + c3.

Eureka! We now have an adjusted R^2 of 0.9966.
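The comparison between the two models can be reproduced in spirit with a short sketch. The data below is simulated with a built-in sugar * sf effect, since the original measurements are not available, so the adjusted R^2 values will not match the figures quoted above. The coefficients, sample size and the helper `adjusted_r2` are all assumptions.

```python
import numpy as np

# Simulated experiment (an assumption; not the post's measurements).
rng = np.random.default_rng(42)
sugar = rng.uniform(1, 5, 50)
sf = rng.uniform(1, 10, 50)  # stirring frequency
sweetness = 2.0 * sugar + 1.5 * sugar * sf + 1.0 + rng.normal(0, 0.5, 50)

def adjusted_r2(features, y):
    """Fit ordinary least squares with an intercept; return adjusted R^2."""
    X = np.column_stack(features + [np.ones_like(y)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    n, p = len(y), len(features)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - p - 1)

# Model 1: Sweetness = c1 * sugar + c2 * sf + c3
additive = adjusted_r2([sugar, sf], sweetness)
# Model 2: Sweetness = c1 * sugar + c2 * sugar * sf + c3
interaction = adjusted_r2([sugar, sugar * sf], sweetness)

print(f"additive model:    {additive:.4f}")
print(f"interaction model: {interaction:.4f}")
```

Both models use two explanatory variables, so the jump in adjusted R^2 comes purely from choosing the product term over the plain stirring frequency.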

Now the valid question would be: how could you have thought of adding a variable like sugar * sf? True, the choices are plenty; we could have tried sugar * sugar * sf or sf * sf.

The answer lies in exploratory data analysis. It is always helpful and insightful to plot each explanatory variable against the dependent variable and see how they change with respect to each other.

Looking at the plot above, it would not have been difficult to think of the equation that gave us such a good result. Variables like these are called interaction variables. Think of the experiment we just tried: it seems logical that stirring would have more effect on sweetness when the sugar quantity is high, and vice versa.

A good look at, or understanding of, the explanatory variables is the best guide to creating interaction variables, but when higher-order interaction effects exist this gets cumbersome. Automatic Interaction Detector (AID) and CHAID are statistical methods that have been developed to save us from this strenuous mental exercise.

A real-life example:

- Sale of opera tickets: A statistical profile of opera ticket buyers reveals that they are both highly educated and upper income. This information can be leveraged to build a model for opera ticket buyers, but as we know, not everyone in the upper income segment is highly educated, nor does everyone who is highly educated belong to the high income strata. In this case we would like to have a third variable that reflects the fact that a person is both highly educated and upper income. This third variable is an interaction variable.
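For 0/1 indicator variables like these, the interaction variable is simply the product of the two flags. A tiny sketch with made-up customer records (the field names are hypothetical):

```python
# Hypothetical customer records; 1 means the attribute applies, 0 means it does not.
customers = [
    {"id": "A", "highly_educated": 1, "upper_income": 1},
    {"id": "B", "highly_educated": 1, "upper_income": 0},
    {"id": "C", "highly_educated": 0, "upper_income": 1},
    {"id": "D", "highly_educated": 0, "upper_income": 0},
]

for customer in customers:
    # The product is 1 only when BOTH flags are 1.
    customer["educated_and_upper_income"] = (
        customer["highly_educated"] * customer["upper_income"]
    )

flags = [c["educated_and_upper_income"] for c in customers]
print(flags)
```

Only customer A, who is both highly educated and upper income, gets a 1 on the new variable; a model can then give this segment its own effect on ticket purchases.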