No. Indeed, cross-validation shows you that the randomness involved in picking the samples has an effect.
You could plot a learning curve to answer this question. I would say that this is a good idea for this specific problem, since we provided only a few data samples.
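A minimal sketch of such a learning curve, assuming a generic regression setup (`make_regression` stands in for the actual dataset, and `Ridge` is just a placeholder model):

```python
# Sketch: use a learning curve to check whether more data would help.
# The synthetic dataset and Ridge model are stand-ins, not the original setup.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    Ridge(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of the training set
    cv=5,
)

# If the mean test score is still improving at the largest training size,
# collecting more samples is likely to help.
print(train_sizes)
print(test_scores.mean(axis=1))
```

Plotting the mean and standard deviation of `test_scores` against `train_sizes` makes the trend easier to read.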
To be sure, we need to compare with the training scores. A large gap between the two scores, with the testing score being the lower one, would mean that the model overfits.
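A quick sketch of that comparison, again on a synthetic stand-in dataset with an unregularized tree (a model that typically overfits):

```python
# Sketch: compare train and test scores from cross-validation to spot
# overfitting. The data and model here are illustrative stand-ins.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=150, n_features=5, noise=20.0, random_state=0)

cv_results = cross_validate(
    DecisionTreeRegressor(random_state=0), X, y,
    cv=5,
    return_train_score=True,  # needed to get the training scores as well
)

gap = cv_results["train_score"].mean() - cv_results["test_score"].mean()
# A large gap together with a low test score indicates overfitting.
print(f"train: {cv_results['train_score'].mean():.2f}  "
      f"test: {cv_results['test_score'].mean():.2f}  gap: {gap:.2f}")
```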
I am not sure what you mean here. If you could elaborate on your question, I might be able to help.
This is where being aware of the process that generated the dataset is super useful and not trivial. You could find yourself in the same situation in a real-life application, in industry for instance.
So here the right question to answer is: what is the reason for such a large variance? Since these data come from my personal training rides, I have a couple of intuitions.
The difficult part in this prediction problem is when the cyclist is descending, or when there is a lot of variation in the velocity and hence in the power output. The folds where the model works better correspond to more intensive training sessions, where the cyclist's power was planned to be more constant, without large variations. This is not the case for the bad folds.
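One way to trace each fold back to the kind of ride it contains is to group the cross-validation by session. A minimal sketch, where the `sessions` labels are entirely hypothetical (in practice they would come from the ride metadata):

```python
# Sketch: group cross-validation folds by training session so that each
# fold score can be linked to a specific ride. Data and session labels
# are hypothetical stand-ins.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_validate

X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
# Hypothetical session labels: 4 rides of 30 samples each.
sessions = np.repeat([0, 1, 2, 3], 30)

cv_results = cross_validate(
    Ridge(), X, y,
    cv=GroupKFold(n_splits=4),
    groups=sessions,  # each fold holds out one full session
)

# Each entry corresponds to one held-out session: a much lower score on
# one fold points at the kind of ride the model struggles with.
print(cv_results["test_score"])
```

Grouping by session also avoids leaking samples from the same ride into both the train and test splits.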
So collecting a larger range of training sessions might allow us to better model the problem and to reduce the error on rides where speed variations happen.
However, to make this kind of analysis you need to dig deeper into the process that produces the data, and this is no longer only a data science problem.