The California Housing Dataset
This real-world housing dataset from Google contains census data from housing blocks across the state of California.
https://csvbase.com/mdfarragher/California-Housing
17,000 rows
Details
In machine learning circles, the California Housing dataset is a bit of a classic. It’s the dataset used in the second chapter of Aurélien Géron’s excellent machine learning book Hands-On Machine Learning with Scikit-Learn and TensorFlow.
The dataset serves as an excellent introduction to building machine learning apps because it requires only rudimentary data cleaning, has an easily understandable list of variables, and is small enough for fast training and experimentation. It was compiled by R. Kelley Pace and Ronald Barry for their 1997 paper titled Sparse Spatial Autoregressions. They built it using the 1990 California census data.
The dataset contains one record per census block group, the smallest geographical unit for which the U.S. Census Bureau publishes sample data. A census block group typically has a population of around 600 to 3,000 people.
Data Schema
The dataset can be downloaded in CSV, Parquet, XLSX or JSONL format and has the following schema:
| Column name | Column type | Missing data? |
|---|---|---|
| Row ID | Integer | Not allowed |
| longitude | Float | Allowed |
| latitude | Float | Allowed |
| housing_median_age | Integer | Allowed |
| total_rooms | Integer | Allowed |
| total_bedrooms | Integer | Allowed |
| population | Integer | Allowed |
| households | Integer | Allowed |
| median_income | Float | Allowed |
| median_house_value | Integer | Allowed |
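As a rough illustration of how this schema could be applied when reading the CSV export, the sketch below converts each cell to its declared type and turns empty cells into `None`, since every data column allows missing values. It uses only the Python standard library. The helper names (`SCHEMA`, `to_int`, `parse_rows`), the `csvbase_row_id` header, and the two sample rows are assumptions made for illustration; the sample values are not guaranteed to match the real file.

```python
import csv
import io

def to_int(cell):
    # Accept "15" as well as "15.0", in case an export writes integers as floats.
    return int(float(cell))

# Column types copied from the schema table above. The Row ID column is left
# out because its header name in the CSV export is an assumption here.
SCHEMA = {
    "longitude": float,
    "latitude": float,
    "housing_median_age": to_int,
    "total_rooms": to_int,
    "total_bedrooms": to_int,
    "population": to_int,
    "households": to_int,
    "median_income": float,
    "median_house_value": to_int,
}

def parse_rows(fileobj):
    """Yield one dict per record, with cells converted to their schema type.

    Empty or absent cells become None, because the schema allows missing
    data in every column except the row ID.
    """
    for raw in csv.DictReader(fileobj):
        row = {}
        for name, cast in SCHEMA.items():
            cell = (raw.get(name) or "").strip()
            row[name] = cast(cell) if cell else None
        yield row

# Illustrative sample only: the header for the row ID and the values below
# are assumptions, and the second row has a deliberately empty total_bedrooms.
SAMPLE_CSV = """csvbase_row_id,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
1,-114.31,34.19,15,5612,1283,1015,472,1.4936,66900
2,-114.47,34.4,19,7650,,1129,463,1.82,80100
"""

records = list(parse_rows(io.StringIO(SAMPLE_CSV)))

# Count missing values per column, a typical first step before cleaning.
missing = {name: sum(r[name] is None for r in records) for name in SCHEMA}
```

To run the same function on a downloaded file, pass an open file handle instead of the `io.StringIO` wrapper, for example `parse_rows(open(path, newline=""))`, where `path` points to the local CSV.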