Machine Learning Resources

Suppose there is a large number ‘p’ of predictors. What is the best approach to find out whether any of the p predictors help predict the response ‘y’?


Suppose Null hypothesis (H0) is:

β1 = β2 = … = βp = 0

And Alternate hypothesis (Ha) is: 

At least one of the βi is not equal to 0

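Best approach: To find out whether any of the ‘p’ predictors help predict ‘y’, use the F-statistic for the overall regression. (This approach works when p < n, where n is the number of observations. For p > n, the least-squares model cannot be fit, so other high-dimensional methods are needed.)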
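As a minimal sketch of what the F-statistic measures, the function below uses the standard formula F = ((TSS - RSS) / p) / (RSS / (n - p - 1)), where TSS is the total sum of squares and RSS is the residual sum of squares of the fitted model. The function name, argument names, and the example numbers are assumptions made for illustration; they do not come from the original answer.

```python
def f_statistic(tss: float, rss: float, n: int, p: int) -> float:
    """Overall F-statistic for H0: all p slope coefficients are zero.

    tss: total sum of squares, sum((y_i - mean(y)) ** 2)
    rss: residual sum of squares of the model fitted with p predictors
    n:   number of observations
    p:   number of predictors (the intercept is not counted)
    """
    df_resid = n - p - 1  # residual degrees of freedom
    if p <= 0 or df_resid <= 0:
        # Mirrors the p < n caveat: without residual degrees of freedom
        # the statistic is undefined.
        raise ValueError("need p >= 1 and n > p + 1 to compute the F-statistic")
    return ((tss - rss) / p) / (rss / df_resid)


# Hypothetical numbers, for illustration only.
f_value = f_statistic(tss=100.0, rss=50.0, n=106, p=5)
print(f_value)  # 20.0
```

When H0 is true, F is expected to be close to 1. A value much larger than 1 is evidence that at least one predictor is related to the response. How large is large enough depends on n and p, and is judged from the p-value of the F-distribution with p and n - p - 1 degrees of freedom.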

Side Note: Individual t-statistics can be misleading in this scenario

If p is large, let’s say p = 200, and none of the variables (x1, …, xp) is predictive of the response variable y (i.e. the null hypothesis above is true), about 5% of the p-values associated with the individual variables will still fall below 0.05 by chance. With p = 200, that is roughly 10 variables. In reality, these variables with low p-values do not have any predictive power; the low p-values are due to chance alone. Therefore, if we use the individual t-statistics and p-values to conclude that variables have predictive power, we may draw the wrong conclusion.
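The arithmetic behind this can be sketched as follows. The sketch makes two simplifying assumptions that are not stated in the original: the p individual tests are independent, and under the null hypothesis each p-value is uniformly distributed on [0, 1]. All function names are hypothetical.

```python
import random


def expected_false_positives(p: int, alpha: float = 0.05) -> float:
    """Expected number of p-values below alpha when H0 is true for all p tests."""
    return p * alpha


def prob_at_least_one(p: int, alpha: float = 0.05) -> float:
    """Chance that at least one of p independent null tests looks significant."""
    return 1.0 - (1.0 - alpha) ** p


def simulate_false_positives(p: int, alpha: float = 0.05,
                             trials: int = 2000, seed: int = 0) -> float:
    """Average count of 'significant' results per trial.

    Assumption: each null p-value is drawn as an independent Uniform(0, 1).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += sum(rng.random() < alpha for _ in range(p))
    return total / trials


print(expected_false_positives(200))   # about 10 spurious predictors
print(prob_at_least_one(200))          # very close to 1
print(simulate_false_positives(200))   # simulated average, near 10
```

Under these assumptions, seeing at least one individually significant predictor out of 200 is almost certain even when no predictor is related to y, which is why individual p-values cannot answer the overall question.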

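Because the F-statistic accounts for the number of predictors, it doesn’t suffer from the above problem: if the null hypothesis is true, there is only a 5% chance that the F-statistic gives a p-value below 0.05, regardless of how many predictors there are.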
