"Mysterious" line of code

Marc_In_Singapore · 11 June 2021 10:58

Regularization of linear regression model

I am new to Python environment / libraries, and I take the opportunity of the course to also learn the language.

I find this line of code mysterious… What does est[-1] point to?

coefs = [est[-1].coef_ for est in cv_results["estimator"]]

I try to break it down in pieces to “peel off the onion” but to no avail… I’m likely spending too much time on such details but I might as well dig a bit deeper.

cv_results["estimator"]

tanh_lines · 11 June 2021 13:42

[-1] is negative indexing. [0] is the first item from the left, [1] is second, and so on. [-1] just means the first item from the right (end of the list). In the current situation, it would grab you the LinearRegression model.

[est[-1].coef_ for est in cv_results["estimator"]] is a list comprehension. It’s saying something like: for each item in cv_results["estimator"], extract the last element (the LinearRegression model) and then get coef_ from it.
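
To make both ideas concrete, here is a minimal sketch in plain Python. The names `steps` and `fake_pipelines` are made up for illustration; the `SimpleNamespace` objects are only stand-ins for fitted models that carry a `coef_` attribute, not real scikit-learn estimators.

```python
from types import SimpleNamespace

# Negative indexing on an ordinary list.
steps = ["preprocessing", "model"]
print(steps[0])   # first item from the left
print(steps[-1])  # first item from the right

# Hypothetical stand-ins: each inner list plays the role of one fitted
# pipeline whose last element exposes a coef_ attribute.
fake_pipelines = [
    ["preprocessor", SimpleNamespace(coef_=[0.5, 1.0])],
    ["preprocessor", SimpleNamespace(coef_=[0.4, 1.1])],
]

# Same shape as the line in question: take the last element of each
# item, then read its coef_ attribute.
coefs = [est[-1].coef_ for est in fake_pipelines]
print(coefs)  # [[0.5, 1.0], [0.4, 1.1]]
```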

glemaitre58 · 11 June 2021 16:48

I am just adding that a scikit-learn pipeline can be indexed like a Python list. This is something that might not be obvious to a scikit-learn beginner.
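
A small sketch of what that indexing looks like, assuming a reasonably recent scikit-learn is installed. The pipeline is the same kind as the one in this exercise; the variable names are only for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

first_step = pipe[0]    # the PolynomialFeatures transformer
last_step = pipe[-1]    # the LinearRegression predictor
print(type(first_step).__name__)
print(type(last_step).__name__)

# len() works as on a list, and slicing gives back a smaller Pipeline.
print(len(pipe))
preprocessing_only = pipe[:-1]
print(isinstance(preprocessing_only, Pipeline))

# The same step can also be reached by its auto-generated name.
print(last_step is pipe.named_steps["linearregression"])
```

Once the pipeline is fitted, `pipe[-1].coef_` is then the coefficient array of the final LinearRegression, which is what the line in the question collects for each fold.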

Marc_In_Singapore · 15 June 2021 01:36

Thanks a lot. Got it.

ThomasLoock · 15 June 2021 05:13

Just a remark to the first example:
Naming a variable in Python “list” is not a good choice as this overwrites the built-in data type list.

Marc_In_Singapore · 15 June 2021 06:07

Thanks. It is indeed good practice to call the list another name, e.g. mylist. However, could you please elaborate more with an example, as I am trying to identify the case where overwriting the built-in data type happens (overwriting would be surprising though for a structured language such as Python without the interpreter screaming).

echidne · 15 June 2021 13:11

I agree with Thomas: giving a variable the same name as a builtin is not a good idea at all. The creators of Python assume that users of the language are mature and leave them a lot of freedom, so in Python the names of builtins are not reserved.
In your case, the builtin list() takes any iterable (a tuple, a string, another list…) and converts it to a list.
The code:

aTuple = (123, 'xyz', 'zara', 'abc')
aList = list(aTuple)
print("List elements : ", aList)

will display in your terminal:

List elements :  [123, 'xyz', 'zara', 'abc']

but if you write

list = [1,2,3]

and you then try to call the builtin function as

list(aTuple)

that will raise an error:

TypeError: 'list' object is not callable

If you overwrite a builtin function by mistake, you can recover it by deleting your own variable:

del list

and the name list refers to the builtin function again.
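
Here is a small self-contained sketch of the same thing, written so it can be run as a script. The function name `call_shadowed_list` is made up for the example. It also shows that the builtin itself is never destroyed: it stays reachable through the standard `builtins` module, only the name `list` is hidden.

```python
import builtins


def call_shadowed_list(values):
    # Inside this function the name "list" is rebound to a list object,
    # which hides the builtin for the rest of the function body.
    list = [1, 2, 3]
    try:
        return list(values)
    except TypeError as exc:
        return f"TypeError: {exc}"


message = call_shadowed_list((123, "xyz"))
print(message)

# The builtin is untouched and still available via the builtins module.
recovered = builtins.list((123, "xyz"))
print(recovered)  # [123, 'xyz']
```

Note that `del list` only helps in the scope where the shadowing variable was created (for example at the top level of a notebook or an interactive session).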

ThomasLoock · 15 June 2021 13:18

Python has a set of keywords: reserved words that cannot be used as variable names, function names, or any other identifiers.
“list” is not such a reserved word and can therefore be used as a variable name.
But doing so hides the built-in class list in that scope, so a usage like

letters = list('abcdef')

is not possible anymore.
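
If you want to check which names are really reserved, the standard library `keyword` module can tell you. A minimal sketch (the variable names are just for illustration):

```python
import builtins
import keyword

# "for" is a reserved keyword, "list" is not.
print(keyword.iskeyword("for"))
print(keyword.iskeyword("list"))

# "list" is an ordinary name that lives in the builtins module.
print(hasattr(builtins, "list"))

# Assigning to a keyword is rejected before the code even runs...
try:
    compile("for = 1", "<example>", "exec")
    keyword_assignment_compiles = True
except SyntaxError:
    keyword_assignment_compiles = False
print(keyword_assignment_compiles)

# ...whereas assigning to "list" compiles without complaint.
compile("list = 1", "<example>", "exec")
```

This is why the interpreter stays silent when you write `list = [1, 2, 3]`: for Python it is just another assignment.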

echidne · 15 June 2021 14:09

To complete the answer of @tanh_lines:

coefs = [est[-1].coef_ for est in cv_results["estimator"]]

is equivalent to

coefs = []  # create an empty list
for est in cv_results["estimator"]:  # for each fitted pipeline in cv_results
    coefs.append(est[-1].coef_)  # append the coef_ array of the pipeline's last step (the LinearRegression) to "coefs"

PS: the Markdown renderer seems to have a problem, since it shows bold letters where I did not put any…

Marc_In_Singapore · 16 June 2021 03:47

Thanks to all. All clear now.

I really like the compactness of this line of code.

coefs = [est[-1].coef_ for est in cv_results["estimator"]]

Reminds me of Lisp. I still have my 1986 INRIA Lisp book!

Alvin19 · 26 June 2021 15:45

Hi @echidne and everyone,

I would like to add to this code. When I display the weights_linear_regression variable, it outputs a data frame. What does the column named 1 represent? Is it the estimator value generated for each of the cv=10 folds? And what does the estimator value correspond to in the linear equation?

Lastly, just to confirm my understanding: does the make_pipeline function do the preprocessing and then fit the regression model (LinearRegression in this case) on the data and target, and predict with it?

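from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

linear_regression = make_pipeline(PolynomialFeatures(degree=2),
                                  LinearRegression())
cv_results = cross_validate(linear_regression, data, target,
                            cv=10, scoring="neg_mean_squared_error",
                            return_train_score=True,
                            return_estimator=True)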

Sorry to ask so many questions in one go.

glemaitre58 · 27 June 2021 14:12

It would be better to ask the question on a new topic to be honest.

I assume that you refer to:

import pandas as pd

coefs = [est[-1].coef_ for est in cv_results["estimator"]]
weights_linear_regression = pd.DataFrame(coefs, columns=feature_names)

So each column is named after a feature, while each row corresponds to one fold of the cross-validation.
If you don’t provide feature_names as the column names, pandas will use a range index from 0 to n_features - 1 (so a column named 1 would simply hold the second coefficient).

The principle of the pipeline is that the preprocessing is applied first and its output is then given to the predictor. So here, we expand the features using PolynomialFeatures (which increases the number of features) and then provide this new dataset to LinearRegression. By using a Pipeline, you do not have to transform the data yourself.
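
To illustrate both points, here is a rough sketch on synthetic data. The data and target below are randomly generated stand-ins, not the course dataset, and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data: 100 samples, 2 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))
target = data[:, 0] ** 2 + 3 * data[:, 1] + rng.normal(scale=0.1, size=100)

linear_regression = make_pipeline(PolynomialFeatures(degree=2),
                                  LinearRegression())
cv_results = cross_validate(linear_regression, data, target,
                            cv=10, return_estimator=True)

# One row per fold, one column per (expanded) feature.
coefs = [est[-1].coef_ for est in cv_results["estimator"]]
weights = pd.DataFrame(coefs)  # no column names given
print(weights.shape)           # (10, 6)
print(list(weights.columns))   # [0, 1, 2, 3, 4, 5]

# The pipeline does the same work as transforming by hand, then fitting.
pipeline = make_pipeline(PolynomialFeatures(degree=2),
                         LinearRegression()).fit(data, target)
expanded = PolynomialFeatures(degree=2).fit_transform(data)
manual = LinearRegression().fit(expanded, target)
same_coefficients = np.allclose(pipeline[-1].coef_, manual.coef_)
print(same_coefficients)
```

With 2 input features and degree=2, PolynomialFeatures produces 6 columns (the constant, the two features, their squares and their product), which is why the frame has 6 columns even though the input only had 2.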