California housing dataset issues

Some attribute values look too big to be true:

Row Feature
1914 AveRooms = 141.909091
1979 AveRooms = 132.533333
19006 AveOccup = 1243.333333
etc etc

Should we be worried about the quality of this dataset?
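As a rough sketch of how such rows can be surfaced, the snippet below flags values far above the column median. It uses a toy frame built only from rows quoted in this thread. The helper name `flag_extreme` and the factor of 10 are illustrative assumptions, not a standard rule.

```python
import pandas as pd

# Toy frame built only from rows quoted in this thread (rows 0-4 and 1914).
# With scikit-learn installed, the full frame would instead come from
# fetch_california_housing(as_frame=True).data
data = pd.DataFrame(
    {
        "AveRooms": [6.984127, 6.238137, 8.288136, 5.817352, 6.281853, 141.909091],
        "AveOccup": [2.555556, 2.109842, 2.802260, 2.547945, 2.181467, 2.727273],
    },
    index=[0, 1, 2, 3, 4, 1914],
)


def flag_extreme(frame, column, factor=10.0):
    """Return rows whose value in `column` exceeds `factor` times the column median.

    Hypothetical helper; the threshold is an arbitrary illustrative choice.
    """
    threshold = factor * frame[column].median()
    return frame[frame[column] > threshold]


suspicious_rooms = flag_extreme(data, "AveRooms")
suspicious_occup = flag_extreme(data, "AveOccup")
print(suspicious_rooms)
print(suspicious_occup)
```

On the full dataset the same call would also be expected to pick up rows such as 1979 and 19006, although that is not verified here.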

There is probably something wrong with your code. I get the following:

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
data, target = housing.data, housing.target
data.head()

	MedInc 	HouseAge 	AveRooms 	AveBedrms 	Population 	AveOccup 	Latitude 	Longitude
0 	8.3252 	41.0 	6.984127 	1.023810 	322.0 	2.555556 	37.88 	-122.23
1 	8.3014 	21.0 	6.238137 	0.971880 	2401.0 	2.109842 	37.86 	-122.22
2 	7.2574 	52.0 	8.288136 	1.073446 	496.0 	2.802260 	37.85 	-122.24
3 	5.6431 	52.0 	5.817352 	1.073059 	558.0 	2.547945 	37.85 	-122.25
4 	3.8462 	52.0 	6.281853 	1.081081 	565.0 	2.181467 	37.85 	-122.25

Those are the first 5 data rows. If you look at row 1914, for example, you'll see some attribute values that don't make sense.

Thanks!

In [5]: from sklearn.datasets import fetch_california_housing
   ...: 
   ...: housing = fetch_california_housing(as_frame=True)
   ...: data, target = housing.data, housing.target
   ...: data.loc[[1914], :]
Out[5]: 
      MedInc  HouseAge    AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
1914   1.875      33.0  141.909091  25.636364        30.0  2.727273     38.91     -120.1

Looking at the info can give some details:

In [6]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB

Since the dtypes are all float64, the columns cannot contain a string such as "AveOccup = 1243.333333".

I didn't mean that the dataset contains a string such as "AveOccup = 1243.333333". What I meant is that the attribute AveRooms takes the value 141.909091 at row 1914, for example.

I don't think that an average of 142 rooms per house makes any sense.

Thanks for looking into this!

OK, I see :slight_smile: It could be an artifact of how the census data was collected. Dealing with potential artifacts and noisy data is part of machine learning :slight_smile:

It means that such a sample should not have an impact on the rule learned during training. If one is encountered during testing, our model will probably not predict correctly for it.
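To give a feel for how much a single extreme sample can pull on a statistic, here is a minimal sketch using only the AveRooms values quoted in this thread. It compares the mean and the median with and without row 1914. This is an illustration of sensitivity to one outlier, not a claim about how any particular model behaves.

```python
from statistics import mean, median

# AveRooms for rows 0-4, as shown in data.head() earlier in the thread.
typical = [6.984127, 6.238137, 8.288136, 5.817352, 6.281853]
# AveRooms for row 1914.
outlier = 141.909091

mean_without = mean(typical)
mean_with = mean(typical + [outlier])
median_without = median(typical)
median_with = median(typical + [outlier])

# The mean moves a lot when the outlier is added; the median barely changes.
print("mean:  ", mean_without, "->", mean_with)
print("median:", median_without, "->", median_with)
```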

The column AveRooms seems to have been obtained by dividing the TotalRooms column from the original dataset (1561 for row 1914) by the households column from the same dataset (11 for row 1914). This households column seems to be defined as

Total number of households, a group of people residing within a home unit, for a block

Which means that AveRooms is the number of rooms per household in that block, not per house. If there are many empty houses (or households not counted in the census because they are not US citizens or something), this value could possibly be correct, albeit less useful, and not an error in the dataset. :thinking:

https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
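The division described above can be checked directly with the numbers quoted for row 1914. The AveOccup line assumes that column is population divided by households, which the thread does not state but which is consistent with the values shown for that row.

```python
# Values quoted in this thread for row 1914.
total_rooms = 1561   # TotalRooms in the original dataset
households = 11      # households in the original dataset
population = 30      # Population column in the scikit-learn frame

ave_rooms = total_rooms / households
# Assumption: AveOccup is population / households.
ave_occup = population / households

print(round(ave_rooms, 6))  # 141.909091, matches AveRooms for row 1914
print(round(ave_occup, 6))  # 2.727273, matches AveOccup for row 1914
```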

Thanks for your analysis @Mirzon. Would you be interested in opening an issue on the scikit-learn repo to document this issue?

For reference the code in scikit-learn that causes this problem is:

and here is the documentation for that dataset on the scikit-learn website:

https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

We should at least improve the scikit-learn documentation to give more precise descriptions of those engineered columns, which are derived from the original dataset.

We should probably also consider adding an option to the loader to return the original features without preprocessing (fetch_california_housing(preprocessed=False) with preprocessed=True by default).

Sure, I'll make sure to do so. (EDIT: pull request submitted)

Regarding row 1914, I dug a bit deeper, and it appears to relate to an area next to Lake Tahoe mostly filled with vacation resorts (map here).
I guess that could explain the high number of rooms and low population.

Very interesting. Thanks!