Some attribute values look too big to be true:
| Row | Feature |
|---|---|
| 1914 | AveRooms = 141.909091 |
| 1979 | AveRooms = 132.533333 |
| 19006 | AveOccup = 1243.333333 |
| etc | etc |
Should we be worried about the quality of this dataset?
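For reference, these rows can be listed with something like the snippet below (the cut-offs are arbitrary, just chosen to surface the suspicious values):

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
data = housing.data

# Arbitrary cut-offs, only meant to surface the rows listed above
suspicious = data[(data["AveRooms"] > 100) | (data["AveOccup"] > 1000)]
print(suspicious[["AveRooms", "AveOccup"]])
```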
There is probably something wrong with your code. I get the following:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
data, target = housing.data, housing.target
data.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
Those are the first 5 data rows. If you look at row 1914, for example, you'll see some attribute values that don't make sense.
Thanks!
In [5]: from sklearn.datasets import fetch_california_housing
...:
...: housing = fetch_california_housing(as_frame=True)
...: data, target = housing.data, housing.target
...: data.loc[[1914], :]
Out[5]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
1914 1.875 33.0 141.909091 25.636364 30.0 2.727273 38.91 -120.1
Looking at the info can give some details:
In [6]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
dtypes: float64(8)
memory usage: 1.3 MB
Since the dtypes are float64, these columns cannot contain any strings such as AveOccup = 1243.333333.
I didn't mean that the dataset contains strings such as AveOccup = 1243.333333. What I meant is that the attribute AveRooms takes the value 141.909091 at row 1914, for example.
I don't think that an average of 142 rooms per house makes any sense.
Thanks for looking into this!
OK, I see. These could be artifacts from when the census data was collected. Dealing with potential artifacts and noisy data is part of machine learning.
It means that such a sample should not have an outsized impact on the rules learned during training. If a similar sample is encountered during testing, our model will probably not predict correctly for it.
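For instance, one simple way to limit the influence of such samples before training is to cap each feature at a high quantile. This is only a sketch, and the 99.9% threshold is an arbitrary choice:

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
data = housing.data

# Cap each column at its 99.9% quantile (arbitrary threshold) so that
# extreme samples such as row 1914 weigh less during training.
data_capped = data.copy()
for column in data.columns:
    cap = data[column].quantile(0.999)
    data_capped[column] = data[column].clip(upper=cap)

print(data.loc[1914, "AveRooms"], "->", data_capped.loc[1914, "AveRooms"])
```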
The column AveRooms seems to have been obtained by dividing the TotalRooms column from the original dataset (1561 for row 1914) by the households column from the same dataset (11 for row 1914). This households column seems to be defined as "Total number of households, a group of people residing within a home unit, for a block", which means that AveRooms is the number of rooms per household in that block, not per house. If there are many empty houses (or households not counted in the census because they are not US citizens or something), this value could possibly be correct, albeit less useful, and not an error in the dataset.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
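As a quick sanity check of that hypothesis, the numbers quoted above for row 1914 do reproduce the value reported by the loader:

```python
# Values for row 1914 taken from the original dataset referenced above
total_rooms = 1561
households = 11

print(total_rooms / households)  # 141.9090909..., the AveRooms value for row 1914
```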
Thanks for your analysis @Mirzon. Would you be interested in opening an issue on the scikit-learn repo to document this behavior?
For reference, the code in scikit-learn that causes this problem is:
and here is the documentation for that dataset on the scikit-learn website:
https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset
We should at least improve the scikit-learn documentation to give more precise descriptions of those engineered columns that are derived from the original dataset.
We should probably also consider adding an option to the loader to return the original features without preprocessing (fetch_california_housing(preprocessed=False), with preprocessed=True by default).
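In the meantime, the original counts can be approximately reconstructed from the derived columns, since AveOccup = Population / households and AveRooms = TotalRooms / households. A sketch (small floating-point rounding errors are expected):

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
data = housing.data

# Invert the preprocessing applied by the loader
households = data["Population"] / data["AveOccup"]
total_rooms = data["AveRooms"] * households

# Row 1914 should give back roughly 11 households and 1561 total rooms
print(households.loc[1914], total_rooms.loc[1914])
```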
Sure, I'll make sure to do so. (EDIT: pull request submitted)
Regarding row 1914, I dug a bit deeper, and it appears to relate to an area next to Lake Tahoe mostly filled with vacation resorts (map here).
I guess that could explain the high number of rooms and low population.
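For anyone who wants to check the location themselves, the stored coordinates for that row can be turned into a map link (the OpenStreetMap URL used here is just one convenient option):

```python
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
row = housing.data.loc[1914]

# Row 1914 is at latitude 38.91, longitude -120.1, next to Lake Tahoe
print(f"https://www.openstreetmap.org/?mlat={row['Latitude']}&mlon={row['Longitude']}")
```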
Very interesting. Thanks!