The larger the number of factors, the larger the collinearity (what @nntaleb calls "the curse of dimensionality"). How many factors start being problematic has no magic-number answer, as we will see. We can also have polynomial, or non-linear, regression... 32/n
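A quick sketch of the collinearity point, on my own toy data (not from the talk): two nearly identical factors make the fitted coefficients unstable, even though the fit itself looks fine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
f1 = rng.normal(size=n)
f2 = f1 + 0.01 * rng.normal(size=n)   # nearly a copy of f1
y = f1 + rng.normal(size=n)           # only f1 actually matters

X = np.column_stack([np.ones(n), f1, f2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 1))              # the two slopes can offset wildly
print(round(np.linalg.cond(X)))       # huge condition number = collinearity
```

The individual slopes are meaningless (only their sum is pinned down), which is why piling on correlated factors tells you less and less.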
Data is almost never Gaussian; most of the time it is a mixture of many things. Raphaël tells us most of the standard tools assume Gaussianity, which is not going to work most of the time, as he himself has often tested... 33/n
Raphaël's type of joke: "If the P-square is negative, P is an imaginary number". This is the type of joke he would crack at RWRI, and only he would laugh at it... 34/n
P-square at the bottom of this slide, for those who want to get it... 35/n
His data set is the stock market (for the past decades)... 36/n
This is the most correlated data set Raphaël could produce before running into a microphone technical issue, so an unexpected break... 37/n
Raphaël rebooted his laptop and now he is back with us. 38/n
The R-square is 42%. Is this good enough for prediction? 39/n
Raphaël: "I hate VB...". He will make us hate it as well. 40/n
From Dave... Overfitting! 41/n
Raphaël extracted from his data set the data up to May 2004; now he will use that to predict the returns of what comes next via linear regression, with the fitted error shifted by 1 month, so that he can calibrate his model. 42/n
It does not replicate and the error is huge. The P-square is negative... If we had predicted 0, we would have beaten this model's prediction, even for the months immediately after May 2004. 43/n
The fact that the P-square is negative means that using the model increases the variance. Here we should look for models that increase the P-square rather than the R-square. 44/n
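A minimal sketch of what I understand the P-square to be (my own toy data, not Raphaël's market data): fit on one window, score on the next, and compare against the naive "predict 0" baseline from tweet 43.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 11))    # 11 hypothetical factors (pure noise here)
y = rng.normal(size=n)          # "returns": also pure noise

train, test = slice(0, 60), slice(60, n)
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
pred = X1[test] @ beta

# out-of-sample "P-square": 1 - MSE(model) / MSE(always predicting 0)
p2 = 1 - np.mean((y[test] - pred) ** 2) / np.mean(y[test] ** 2)
print(round(p2, 2))             # typically negative: the model adds variance
```

With 11 irrelevant factors fit on 60 points, the in-sample R-square can look respectable while the P-square goes negative, i.e. the model does worse than predicting zero.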
Coming back to my original question: the first model had 11 factors. Chances are that removing factors reduces overfitting. By the way, Raphaël is trying to get a model with a positive P-square (so some predictive value) for real estate, based on SP500 values and other data. 45/n
Cross-validation is useful for this: 1) identify which factors matter; 2) decide whether to include non-linearity (once we have determined the factors, check if non-linear regression improves the model). Cross-validation is much more telling than a high R... 46/n
Linear regression can be fooled easily; models that pass the cross-validation test, not so much. So the goodness of a fit is measured more by the P-square than by the R-square. For extreme events, often a single factor gives the best fit. We are going to wrap up soon... 47/n
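A hand-rolled cross-validation sketch (my own toy data again): one real factor plus nine noise factors, judged by out-of-fold R-square rather than the in-sample one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
y = signal.copy()
X = np.column_stack([signal + 0.5 * rng.normal(size=n),  # the real factor
                     rng.normal(size=(n, 9))])           # 9 noise factors

def cv_r2(X, y, k=5):
    """Mean out-of-fold R^2 of an OLS fit (k-fold cross-validation)."""
    folds = np.array_split(np.arange(len(y)), k)
    scores = []
    for f in folds:
        mask = np.ones(len(y), bool)
        mask[f] = False                                   # hold fold f out
        A = np.column_stack([np.ones(mask.sum()), X[mask]])
        beta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        pred = np.column_stack([np.ones(len(f)), X[f]]) @ beta
        scores.append(1 - np.mean((y[f] - pred) ** 2) / np.var(y[f]))
    return float(np.mean(scores))

print(round(cv_r2(X[:, :1], y), 2), round(cv_r2(X, y), 2))
```

The 10-factor model always wins on in-sample R-square, but out of fold the single real factor holds its own, which is exactly the point of the talk.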
Any 2-factor explanation has a better R-square than a 1-factor one, by definition. But does that make it a better model? Fuhgeddaboudit! Raphaël recommends starting with 2 factors and adding to that, to see if it improves the P-square, which is the value we really want to improve. 48/n
Thought experiment for traders to think of the markets: we are in a cheap Thai casino, and we observe a bias in the roulette. Then the croupier changes, and the bias changes. So we can observe some patterns, but these are not stable, unless we get to understand the bias... 49/n
...of all roulettes made by that manufacturer. Prediction is the most difficult exercise in the world. The real way to do it is to depict 2 or 3 scenarios, without assigning them particular probabilities, and take positions consistent with those scenarios. 50/n
Raphaël talks about the current situation, and how odd it is that Trump getting covid didn't move the markets. Are we in a metastable equilibrium, in which nothing can really affect the market now? Not sure, but that can be a working hypothesis/scenario. 51/n
What is a p-value, for instance for a random correlation of 42%? The standard for publication is a p-value of 5%: at most a 5% chance that such a correlation arises at random. But by definition, if 20 monkeys each search for something, one of them will find such a potentially "non-random" correlation. 52/n
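A quick simulation of the "20 monkeys" point (my own numbers, not Raphaël's): with 20 unrelated series, the best-looking one clears the 5% significance bar far more often than 5% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, monkeys, trials = 30, 20, 1000
hits = 0
for _ in range(trials):
    y = rng.normal(size=n)
    X = rng.normal(size=(monkeys, n))           # 20 unrelated "findings"
    r = np.array([np.corrcoef(x, y)[0, 1] for x in X])
    # rough two-sided 5% threshold for |r| with n=30 is about 0.36
    if np.max(np.abs(r)) > 0.36:
        hits += 1
print(round(hits / trials, 2))                  # far above 0.05
```

Each single test is honest at the 5% level; it is the selection of the best of 20 that manufactures the "significant" correlation, which is the mechanism behind p-hacking.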
There is a lot of such p-hacking. "To be a good statistician, you need to understand how your brain works". For instance, we are fooled by optical illusions. But optical illusions are the result of our brain interpreting data; without that interpretation we would be blind. Day 2 END. 53/53












