A ‘bellwether’, defined by Wikipedia, is one that indicates or leads trends.

What region of the country (counties or cities) might be best representative of the rest of the country? The answer to this question might be of interest for a variety of applications like market research or polling. I wanted to address this question using the American Community Survey dataset from Kaggle. The ACS data is census data grouped into defined population-size areas called PUMAs (public use microdata areas). As a test case I attempted to predict the peoples’ occupations based on a model I built for each PUMA in the dataset using machine learning algorithms (support vector machines and principle component analysis).

figure1 results
TL;DR: More data is better, in this case.

full analysis pipeline