More background info about the dataset

I think it would be helpful to provide students with some more background information about the dataset that is being used. For example, that it’s from the USA, which even the OpenML page doesn’t mention. Also: how was it obtained from the raw census data? There is a remark at some point that retired people were filtered out. I’d like to know such things right from the start, before even looking at the data.

Not sure about this one to be honest. How much more information would you want to have?

I added the fact that it was from the US.

My personal opinion: I think it is fine to have a link to the OpenML page for people who want to know more.

In a similar situation, I feel I would rather dive into the dataset than read some description upfront, but different people like different things …

Also, I think that in a lot of cases you don’t really know what is in the data beyond a very generic and sometimes misleading description. I can think of a case last year where I was involved with hospital data, for example. I would even argue this part can be realistic :wink:

A link to OpenML is fine indeed, but even the OpenML page doesn’t say much about the data.

The reason why I think that provenance information matters is that it permits students to critically inspect what they compute. And that is a good habit to develop. Example: if I know from the start that retired people have been filtered out, I can think about how this could/should impact the age distribution.

I do agree of course that real life is messy as well. But then, if you want future generations of doctors and hospital data managers to do a better job, consider the possibility that they are following your MOOC and try to teach them better practices :wink:

To be honest, I was not able to find a page with more detailed information than the OpenML one …

Example: if I know from the start that retired people have been filtered out, I can think about how this could/should impact the age distribution.

It seems like we are wired differently (which is completely fine, not trying to say my way is better at all :wink:). The retired people thing is actually something I found out the other way around.

  • I plotted the age distribution, then thought “that’s weird, where are the old people?”
  • then I checked the OpenML page, which shows the query they used to filter the dataset (IIRC you can see from the query that they filtered out non-working people with something like hours-per-week > 0); a minimal sketch of this kind of check is shown below
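
For concreteness, here is a minimal sketch of that workflow, assuming the data is the OpenML “adult” census dataset and that the columns are named "age" and "hours-per-week" (both are assumptions, not a statement about the exact CSV used in the MOOC):

```python
# Sketch: plot the age distribution, then check whether non-working people
# were filtered out. Assumes the OpenML "adult" dataset and the column
# names "age" and "hours-per-week".
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True)
df = adult.frame

# 1. The age distribution: a sharp drop well before typical retirement age
#    is the first hint that retired people are missing.
df["age"].plot.hist(bins=50)
plt.xlabel("age")
plt.show()

# 2. A strictly positive minimum here is consistent with a filter such as
#    hours-per-week > 0 having been applied upstream.
print(df["hours-per-week"].min())
```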

Yes, we are kind of re-enacting Arnold vs. Bourbaki in a different context.

Note, however, that your approach of “let the data speak for itself” or “scientists are detectives” has its limits. If you go to the extreme of removing all contextual information about the dataset, which in particular means removing the labels on the columns, you will probably not be able to find anything of interest. Any data analysis relies on at least some contextual information. But then, why should the information in a CSV file be preferred over everything else?

This seems very interesting background material, thanks!

In the context of the first public version of the MOOC, with its launch approaching, I think this is good enough for now … so I am marking it as solved, although “solved” is not quite the right term here.