California housing dataset issues

JavierPerezAlvaro · 27 May 2021 15:59

Some attribute values look too big to be true:

Row	Feature
1914	AveRooms = 141.909091
1979	AveRooms = 132.533333
19006	AveOccup = 1243.333333
etc	etc

Should we be worried about the quality of this dataset?

glemaitre58 · 27 May 2021 17:46

There is probably something wrong with your code. I get the following:

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
data, target = housing.data, housing.target
data.head()


	MedInc 	HouseAge 	AveRooms 	AveBedrms 	Population 	AveOccup 	Latitude 	Longitude
0 	8.3252 	41.0 	6.984127 	1.023810 	322.0 	2.555556 	37.88 	-122.23
1 	8.3014 	21.0 	6.238137 	0.971880 	2401.0 	2.109842 	37.86 	-122.22
2 	7.2574 	52.0 	8.288136 	1.073446 	496.0 	2.802260 	37.85 	-122.24
3 	5.6431 	52.0 	5.817352 	1.073059 	558.0 	2.547945 	37.85 	-122.25
4 	3.8462 	52.0 	6.281853 	1.081081 	565.0 	2.181467 	37.85 	-122.25

JavierPerezAlvaro · 27 May 2021 17:49

Those are the first 5 data rows. If you look at row 1914, for example, you’ll see some attribute values that don’t make sense.

Thanks!

glemaitre58 · 27 May 2021 17:55

In [5]: from sklearn.datasets import fetch_california_housing
   ...: 
   ...: housing = fetch_california_housing(as_frame=True)
   ...: data, target = housing.data, housing.target
   ...: data.loc[[1914], :]
Out[5]: 
      MedInc  HouseAge    AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
1914   1.875      33.0  141.909091  25.636364        30.0  2.727273     38.91     -120.1

Looking at the info can give some details:

In [6]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB

Since the dtypes are all float64, the columns cannot contain strings such as AveOccup = 1243.333333.

JavierPerezAlvaro · 27 May 2021 18:33

I didn’t mean that the dataset contains strings such as AveOccup = 1243.333333. What I meant is that the attribute AveRooms takes the value 141.909091 at row 1914, for example.

I don’t think that an average of 142 rooms per house makes any sense.

Thanks for looking into this!

glemaitre58 · 27 May 2021 19:43

OK, I see. It could be an artifact of how the census data was collected. Dealing with potential artifacts and noisy data is part of machine learning.

Ideally, such a sample should not have an impact on the rule learned during training. If a similar sample is encountered during testing, our model will probably not predict it correctly.
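
One way to see how much a single block like row 1914 can distort a non-robust statistic is to compare the mean and the median of AveRooms with and without it. This is only a toy sketch: the six values are copied from the outputs in this thread, and it says nothing about how any particular model reacts to such a sample.

```python
from statistics import mean, median

# AveRooms for rows 0-4 (from data.head()) and for row 1914, as printed in this thread.
head_ave_rooms = [6.984127, 6.238137, 8.288136, 5.817352, 6.281853]
row_1914 = 141.909091

with_outlier = head_ave_rooms + [row_1914]

# Compare each statistic before and after adding the extreme row.
print(f"mean:   {mean(head_ave_rooms):.2f} -> {mean(with_outlier):.2f}")
print(f"median: {median(head_ave_rooms):.2f} -> {median(with_outlier):.2f}")
```

On this toy sample the mean more than quadruples, while the median moves by well under one room.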

Mirzon · 27 May 2021 20:59

The column AveRooms seems to have been obtained by dividing the TotalRooms column from the original dataset (1561 for row 1914) by the households column from the same dataset (11 for row 1914). This households column seems to be defined as:

Total number of households, a group of people residing within a home unit, for a block

This means that AveRooms is the number of rooms per household in that block, not per house. If there are many empty houses (or households that were not counted in the census, for instance because they are not US citizens), this value could be correct, albeit less useful, and not an error in the dataset.

https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
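
As a quick sanity check on the arithmetic above, here is a minimal sketch that uses only the figures quoted in this thread for row 1914 (1561 total rooms, 11 households, population 30). The variable names are mine and are not taken from the dataset.

```python
# Raw per-block counts for row 1914, as quoted in this thread.
total_rooms = 1561  # TotalRooms in the original dataset
households = 11     # households in the original dataset
population = 30     # Population column in the scikit-learn version

# Hypothetical variable names; the ratios follow the column definitions discussed above.
ave_rooms = total_rooms / households
ave_occup = population / households

print(f"AveRooms = {ave_rooms:.6f}")  # AveRooms = 141.909091
print(f"AveOccup = {ave_occup:.6f}")  # AveOccup = 2.727273
```

Both values match the row printed by `data.loc[[1914], :]` earlier in the thread.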

ogrisel · 28 May 2021 06:17

Thanks for your analysis @Mirzon. Would you be interested in opening an issue on the scikit-learn repo to document this issue?

For reference, the code in scikit-learn that causes this problem is:

github.com

scikit-learn/scikit-learn/blob/15a949460/sklearn/datasets/_california_housing.py#L157

else:
    cal_housing = joblib.load(filepath)

feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

target, data = cal_housing[:, 0], cal_housing[:, 1:]

# avg rooms = total rooms / households
data[:, 2] /= data[:, 5]

# avg bed rooms = total bed rooms / households
data[:, 3] /= data[:, 5]

# avg occupancy = population / households
data[:, 5] = data[:, 4] / data[:, 5]

# target in units of 100,000
target = target / 100000.0

And here is the documentation for that dataset on the scikit-learn website:

https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

We should at least improve the scikit-learn documentation to give more precise descriptions of those engineered columns, which are derived from the original dataset.

We should probably also consider adding an option to the loader to return the original features without preprocessing (fetch_california_housing(preprocessed=False) with preprocessed=True by default).
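
Until such an option exists, the raw counts can be approximately recovered by inverting the three divisions in the loader snippet above. This is a minimal sketch under that assumption: `recover_raw_counts` is a hypothetical helper, not part of scikit-learn, and `preprocessed=` remains only a proposal.

```python
def recover_raw_counts(ave_rooms, ave_bedrms, population, ave_occup):
    """Invert the loader's ratios (hypothetical helper, not a scikit-learn API).

    Only * and / are used, so plain floats work, and element-wise use on
    NumPy arrays or pandas Series should work as well.
    """
    # AveOccup = population / households, so households = population / AveOccup
    households = population / ave_occup
    return {
        "households": households,
        # AveRooms = total rooms / households
        "total_rooms": ave_rooms * households,
        # AveBedrms = total bedrooms / households
        "total_bedrooms": ave_bedrms * households,
    }


# Row 1914 as printed earlier in the thread.
raw = recover_raw_counts(
    ave_rooms=141.909091,
    ave_bedrms=25.636364,
    population=30.0,
    ave_occup=2.727273,
)

# Round because the printed inputs are themselves truncated to six decimals.
print({name: round(value) for name, value in raw.items()})
```

For row 1914 this gives 11 households and 1561 total rooms, matching the figures from the original dataset quoted above.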

Mirzon · 28 May 2021 07:29

Sure, I’ll make sure to do so. (EDIT: pull request submitted)

github.com/scikit-learn/scikit-learn

DOC Improve the description of california_housing

scikit-learn:main ← Whidou:california_housing

opened 08:59AM - 28 May 21 UTC

Whidou

+16 -10

Hello, I found the description of the `california_housing` dataset a bit confusing and would like to suggest a few improvements.

#### What does this implement/fix? Explain your changes.

- Fix the source URL, the current one does not point to a source explaining the dataset, nor providing a link to it. This new URL comes from a comment in `_california_housing.py` pointing to the source.
- Add the target's unit, it is hard to guess that the target is expressed in hundreds of thousands of dollars ($100,000) without looking at the source code.
- Explain why the average number of rooms and bedrooms sometimes contain arbitrarily large values.

#### Any other comments?

Regarding this last point it could also be worth considering providing access to the raw data without preprocessing, possibly through an argument to `fetch_california_housing`.

#### Reference Issues/PRs

This issue was first raised in the SciKit-Learn MOOC forum: https://mooc-forums.inria.fr/moocsl/t/california-housing-dataset-issues/2824/8?u=Mirzon

When creating an issue on this repo, the "Documentation improvement" category was accompanied by a suggestion to submit a pull request instead, so here it is.

Regarding row 1914, I dug a bit deeper, and it appears to relate to an area next to Lake Tahoe mostly filled with vacation resorts (map here).
I guess that could explain the high number of rooms and low population.

ogrisel · 31 May 2021 12:48

Very interesting. Thanks!