Why do I get this weird error?

georgwille · 24 April 2022 15:39

For the last part of the exercise, I defined goodness of fit like so, together with the given code.

def goodness_fit_measure(true_values, predictions):
    return sum((true_values-predictions)**2)

Why do I get this weird error for the “return” line:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

The solution uses .ravel there, which was somewhat unexpected.

ArturoAmorQ · 25 April 2022 09:36

The error says that you cannot apply operations such as addition or subtraction to objects of different types. In this case, one of your columns seems to be of string type. To debug your code, check true_values.dtype and try printing both variables.
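A minimal sketch of such a check, on made-up values. It assumes one possible situation: predictions is a single-column DataFrame, which is what you get when the data is selected with double brackets. In that case the column itself is numeric, but the built-in sum iterates over the column labels, which are strings, and that would produce exactly this message:

```python
import pandas as pd

# Made-up predictions (an assumption, not the exercise data). The column keeps
# the feature's name because it is imagined as computed from a one-column frame.
predictions = pd.DataFrame({"Flipper Length (mm)": [3700.0, 3800.0, 4600.0]})

print(type(predictions))   # a DataFrame, not a Series
print(predictions.dtypes)  # the column holds floats, not strings

# Iterating over a DataFrame yields its column labels, so the built-in sum
# ends up evaluating 0 + "Flipper Length (mm)".
try:
    sum(predictions)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If the types printed for your own variables show a DataFrame where you expected a Series or a 1D array, this is a likely culprit.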

If you are still struggling after that, you can look at the Solution notebook, located below the Jupyter notebook and above the page’s forum (click on Solution > Look at the solution to display it). It usually gives some additional information, but try to do the exercises without looking at it!

The numpy.ravel function returns a 1D array containing all the elements of the input array, with the same dtype. For example, the following array

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

is “flattened” by ravel into the following array

[ 0  1  2 ..., 12 13 14]
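The flattening described above can be reproduced in a few lines. The second half is only a sketch of how np.ravel could be used in a goodness-of-fit function; it keeps the sum of squared errors from the original question, uses made-up values, and is not necessarily what the course solution does:

```python
import numpy as np
import pandas as pd

grid = np.arange(15).reshape(3, 5)  # the 3x5 array from the example
flat = np.ravel(grid)               # 1D array with the same 15 values
print(grid.shape, flat.shape)       # (3, 5) (15,)


def goodness_fit_measure(true_values, predictions):
    # Flatten both inputs so that an (n,) Series and an (n, 1) DataFrame
    # are subtracted element by element as plain 1D arrays.
    errors = np.ravel(true_values) - np.ravel(predictions)
    return np.sum(errors ** 2)


# Made-up values: a 1D target and single-column predictions.
true_values = pd.Series([1.0, 2.0, 3.0])
predictions = pd.DataFrame({"pred": [1.0, 2.0, 5.0]})
score = goodness_fit_measure(true_values, predictions)
print(score)  # 4.0
```

Using np.sum instead of the built-in sum also avoids iterating over column labels.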

DanielFinol · 4 May 2022 15:10

Hi:

Why was the function ‘ravel’ used in this function? Target and target_predicted are one-dimensional arrays, right? Do pandas Series not behave like 1-D arrays?

Was there a more direct/intuitive way to write a solution similar to the original poster’s, if one didn’t know about ‘ravel’?

Thanks,

glemaitre58 · 4 May 2022 20:59

Maybe np.reshape is a bit better known. However, you need to provide the new length, which should be something along the lines of np.prod(X.shape).
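As a small sketch on a made-up array (X here is just an illustrative name), the length can be spelled out as suggested, or NumPy can infer it when -1 is passed:

```python
import numpy as np

# Made-up column vector, shaped like a single-feature data matrix.
X = np.arange(3).reshape(3, 1)

flat_explicit = np.reshape(X, np.prod(X.shape))  # new length given explicitly
flat_inferred = np.reshape(X, -1)                # -1 lets NumPy infer the length

print(X.shape, flat_explicit.shape, flat_inferred.shape)  # (3, 1) (3,) (3,)
```

Both calls give the same 1D result as np.ravel(X).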

andrewjohnlowe · 11 May 2022 18:23

I too had difficulties with this exercise. I have not encountered numpy.ravel or numpy.reshape before, so had no knowledge of these and therefore could not make use of either. I went through the code in the notebook and determined that there is an easy fix. If the first block of code supplied is:

import pandas as pd

penguins = pd.read_csv("../datasets/penguins_regression.csv")
feature_name = "Flipper Length (mm)"
target_name = "Body Mass (g)"
data, target = penguins[feature_name], penguins[target_name]

(Note the change to the last line!)

Then we can write a function for the goodness of fit measure very simply, for example:

def goodness_fit_measure(true_values, predictions):
    return sum((true_values - predictions)**2)

Is the learning objective for this exercise to understand the parametrization of a linear model and determine how we may quantify the goodness of fit of the model, or is it about software carpentry? If the former, I think the exercise would benefit from doing away with the need to be familiar with numpy.ravel or numpy.reshape. True, we may be clients of a method or function that returns data in a form that is less than ideal for downstream processing, and being able to deal with such situations is a skill well worth acquiring, but I do not believe that was the intended focus of this exercise.

ArturoAmorQ · 12 May 2022 09:06

Defining the data as penguins[[feature_name]] instead of penguins[feature_name] is done for consistency with the scikit-learn API. A bit later in this Module (in the notebook Linear regression with non-linear link between data and target) we mention that:

In scikit-learn, by convention data (also called X in the scikit-learn documentation) should be a 2D matrix of shape (n_samples, n_features). If data is a 1D vector, you need to reshape it into a matrix with a single column if the vector represents a feature or a single row if the vector represents a sample.

You are right that the goal is not about teaching numpy functions or similar, so maybe we can add a small hint for the next session of the MOOC.
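The difference between the two ways of selecting the data can be checked on a small made-up frame (the values below are placeholders, not the real dataset):

```python
import pandas as pd

feature_name = "Flipper Length (mm)"
target_name = "Body Mass (g)"

# Placeholder rows standing in for penguins_regression.csv.
penguins = pd.DataFrame(
    {feature_name: [181.0, 186.0, 195.0], target_name: [3750.0, 3800.0, 3250.0]}
)

data_2d = penguins[[feature_name]]  # DataFrame, shape (n_samples, 1)
data_1d = penguins[feature_name]    # Series, shape (n_samples,)
print(data_2d.shape, data_1d.shape)  # (3, 1) (3,)

# Converting between the two forms when needed.
back_to_2d = data_1d.to_frame()          # Series -> single-column DataFrame
back_to_1d = data_2d.squeeze("columns")  # single-column DataFrame -> Series
```

Keeping data 2D matches what scikit-learn estimators expect, while the 1D form is the one that subtracts cleanly from a 1D target.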

andrewjohnlowe · 12 May 2022 10:07

Yeah, I noticed that later in the module defining the data as penguins[[feature_name]] instead of penguins[feature_name] is required for the scikit-learn API, so I’m not sure what the best way forward is to improve the exercise. Perhaps a code snippet or, as you suggest, a small hint. I think that would be helpful.