
# R2 in regression modelling?

Why can't R2 be compared across models with different dependent variables, e.g. a model for Y=Income with a model for Y=Income per capita, even if they have the same independent variables?

### 2 Answers

- OPM, Lv 7, 9 years ago, Favorite Answer
What you have is a "model selection" problem. I am assuming you are using frequency based statistics and not Bayesian statistics.

You are unfortunately missing the meaning of r^2. r^2 is a measure of association GIVEN that you have the true model. It means that you are assuming each model is the one and only true model and the others are false.

There is a single, always-correct solution to your problem. You most likely won't like it, as you have probably been trained in frequency-based statistics and this solution is a Bayesian one. You will likely face a steep learning curve to do it.

To find the best model, you have to solve the inverse problem: showing how probable it is that the data you actually have could have been created by each model.

Let's imagine the following models:

y=x+a

y=bx+cz+d

y=mq+cz+e

w=rq+n

What you need to solve is the probability of the data existing given each model. The last one is rather spurious because its dependent variable is w rather than y; it would be weird to test a true model of gravity against the quality of a good model of liver disease. So part of your problem is that you are not comparing comparable things.

So let's drop the last model and restrict ourselves to the first three, for simplicity.

For notation purposes, p() is the probability of something, the | sign means "given", and Integral() means the sum over all possible choices. D means data, P means parameters, and M means model.

So p(D|M) = Integral p(D|P,M) p(P|M) dP, taken over the set of all possible parameter values, which is usually the real numbers.

The model with the highest probability will also be the closest to the true model.
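As a rough sketch of what computing p(D|M) can look like for the three models, here is one special case where the integral has a closed form. The Gaussian noise with a known standard deviation, the Gaussian prior on the parameters, the simulated data, and the function name `log_evidence` are all assumptions made for illustration, not part of the answer itself.

```python
import numpy as np

def log_evidence(y, X, offset=0.0, noise_sd=1.0, prior_sd=10.0):
    """log p(D|M) for y = offset + X @ P + noise.

    Assumed for this sketch: P ~ N(0, prior_sd^2 I) and
    noise ~ N(0, noise_sd^2 I). Integrating P out then gives
    y ~ N(offset, noise_sd^2 I + prior_sd^2 X X^T).
    """
    n = len(y)
    cov = noise_sd**2 * np.eye(n) + prior_sd**2 * (X @ X.T)
    resid = y - offset
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

# Simulated data, generated from the second model (y = b*x + c*z + d).
rng = np.random.default_rng(0)
n = 200
x, z, q = rng.normal(size=(3, n))
ones = np.ones(n)
y = 2.0 * x - 1.5 * z + 0.5 + rng.normal(size=n)

# Every model is scored on the SAME data y.
log_ev = {
    # slope on x is fixed at 1, so only the intercept a is a parameter
    "y = x + a": log_evidence(y, ones[:, None], offset=x),
    "y = b*x + c*z + d": log_evidence(y, np.column_stack([x, z, ones])),
    "y = m*q + c*z + e": log_evidence(y, np.column_stack([q, z, ones])),
}
best = max(log_ev, key=log_ev.get)

for name, value in log_ev.items():
    print(f"{name:20s} log p(D|M) = {value:.1f}")
print("highest evidence:", best)
```

Note that all three scores are probabilities of the same y. A model for w would be scored on different data, so its number could not be put on the same scale.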

r^2 is valid only if the underlying model is the true model; a higher or lower r^2 does not give evidence of truth. r^2 is a statement about statistical noise. It is true that r^2 and the above method are related, but the relationship is neither straightforward nor linear.

It is also true that there is a relationship between r^2 and truth, but as above, it is neither linear nor straightforward.

The problem with all frequency-based systems is that they depend upon you actually matching the true long-run frequencies in nature. To do that, you must run regressions only where you believe there is a theoretical relationship between the variables.

Also, the other answerer was correct: you are comparing completely different denominators. But as I stated above, the problem is in fact deeper than that.

- Anonymous, 9 years ago
Income per capita is not the same as income: it is income divided by population.

R2 is a ratio whose bottom part (denominator) is the sum of squared deviations of Y from its mean.

Ratios are comparable only if the bottom part is the same.
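The point about the bottom part can be checked numerically. Below is a minimal sketch on made-up data (the variables `population` and `education`, the coefficients, and the helper `r_squared` are all hypothetical) that fits both dependent variables on the same independent variables and prints the denominator each R2 is divided by.

```python
import numpy as np

def r_squared(y, X):
    """Return (R^2, SS_tot) for an OLS fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)   # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # the "bottom part" of the ratio
    return 1.0 - ss_res / ss_tot, ss_tot

# Made-up data, for illustration only.
rng = np.random.default_rng(1)
n = 500
population = rng.uniform(1e4, 1e6, size=n)
education = rng.normal(12, 2, size=n)
income = population * (2000 * education + rng.normal(0, 5000, size=n))
income_pc = income / population

# Same independent variables for both models.
X = np.column_stack([population, education])
r2_income, sst_income = r_squared(income, X)
r2_pc, sst_pc = r_squared(income_pc, X)

print(f"Y = income:            R2 = {r2_income:.3f}, SS_tot = {sst_income:.3e}")
print(f"Y = income per capita: R2 = {r2_pc:.3f}, SS_tot = {sst_pc:.3e}")
```

The two SS_tot values measure variation in different quantities, so each R2 is a fraction of a different total and the two numbers do not rank the models against each other.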