NYC Subway

The NYC subway map, showing all of the stations that collect turnstile data.

When one thinks of the epitome of a highly functional, massive public transportation system, a few railways come to mind. From Le Métro to The Tube to the NYC Subway (**cough** BART **cough**), each inevitably generates a plethora of ridership data waiting for a data scientist to discover meaningful patterns in it, a demonstration of the power of big data.

Every New Yorker (or tourist) who passes through a turnstile has the time and place logged, and the treasure trove is fully available online. The dataset is organized so that each turnstile machine records its entry and exit counts in four-hour intervals throughout the day.

Step 1: Clean the messy data

The first step in analyzing the data to map out the ridership profile of a subway station is to load it into a pandas DataFrame inside a Jupyter notebook. One great advantage of pandas DataFrames is how they handle missing data: missing entries are represented as NaN values. This comes in handy because the first thing any data scientist will tell you about examining a dataset is that the data is almost never clean, especially free public records.
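As a minimal sketch of that loading step, the snippet below reads a few made-up rows into a DataFrame and counts the missing values. The column names (STATION, SCP, DATE, TIME, ENTRIES, EXITS), the sample values, and the helper name load_turnstile_csv are assumptions for illustration, not taken from the actual files.

```python
import io

import pandas as pd

# Made-up sample rows; column names and values are assumptions for illustration.
SAMPLE_CSV = """STATION,SCP,DATE,TIME,ENTRIES,EXITS
14 ST-UNION SQ,00-00-00,01/01/2024,00:00:00,1000,500
14 ST-UNION SQ,00-00-00,01/01/2024,04:00:00,1040,
14 ST-UNION SQ,00-00-00,01/01/2024,08:00:00,1300,710
"""


def load_turnstile_csv(source):
    """Read turnstile rows and combine DATE and TIME into one timestamp column."""
    df = pd.read_csv(source)
    # Strip stray whitespace that public CSV headers sometimes carry.
    df.columns = df.columns.str.strip()
    df["DATETIME"] = pd.to_datetime(
        df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
    )
    return df.drop(columns=["DATE", "TIME"])


df = load_turnstile_csv(io.StringIO(SAMPLE_CSV))

# The empty EXITS field in the second row is read as NaN.
print(df.isna().sum())
```

In a notebook, the same helper could be pointed at a downloaded file path instead of the in-memory sample.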

For example, some turnstiles reset back to zero once they pass some arbitrary threshold, whereas others simply err out and jump to an arbitrary, nonsensically large number. On top of that, some turnstile machines become faulty and periodically have missing data points. In situations like these, I either filled in the missing data points with, for example, the mean or mode, computed with NumPy array calculations, or dropped the data altogether.
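One way to sketch this kind of cleanup is shown below. It assumes the raw readings are running counter totals, so the riders per interval are the differences between consecutive readings. The cap MAX_PLAUSIBLE, the helper name interval_counts, and the sample readings are all hypothetical, and the mean fill is just one of the options mentioned above.

```python
import numpy as np
import pandas as pd

# Assumed cap on riders through one turnstile in one interval (hypothetical value).
MAX_PLAUSIBLE = 10_000


def interval_counts(cumulative, max_plausible=MAX_PLAUSIBLE):
    """Turn running counter readings into per-interval counts.

    Negative differences (counter resets) and implausibly large jumps
    are replaced with NaN so they can be filled or dropped later.
    """
    counts = cumulative.diff().iloc[1:]  # the first reading has no previous value
    bad = (counts < 0) | (counts > max_plausible)
    return counts.mask(bad)


# Made-up readings: a reset to near zero, then a jump to a huge number.
readings = pd.Series([1000, 1040, 1300, 5, 60, 9_000_000, 9_000_100], dtype="float64")
counts = interval_counts(readings)

# Option 1: fill the gaps with the mean of the valid intervals.
filled = counts.fillna(np.nanmean(counts.to_numpy()))

# Option 2: drop the suspect intervals altogether.
dropped = counts.dropna()

print(counts)
```

Which option is appropriate depends on how many intervals are affected; filling with the mean keeps the time series evenly spaced, while dropping avoids inventing values.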

Step 2: Analyze the cleaned data

After cleaning the data, I took on the role of an advertising company whose business model was a pay-as-you-go (pun intended), pay-per-view scheme. Succinctly, advertisers would pay retroactively for their subway ads: a fixed rate multiplied by the real-time number of riders that went in and out of a subway station. In addition, knowing a station's ridership profile (e.g., morning commuters exiting 14 St-Union Square from 5 to 9 am) and its turnstile counts would allow advertising companies to bid on ad slots with greater transparency.

In the event that the projections don't pan out, my company plans to eat the loss. This would boost client confidence because clients would not be paying for an ad that subway riders are not viewing. It's also possible that this business plan will encourage smaller businesses, which are forced to be more strategically frugal, to get in on the action thanks to the delayed payments and the lower prices at subway stations with a more modest ridership profile.
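The pricing scheme described above can be sketched as a fixed rate multiplied by the riders counted at a station, alongside a simple ridership profile showing when most riders exit. The rate, the interval counts, and the names RATE_PER_RIDER and ad_charge are hypothetical values chosen only to illustrate the arithmetic.

```python
import pandas as pd

# Hypothetical fixed rate in dollars per rider; not a figure from the analysis.
RATE_PER_RIDER = 0.002


def ad_charge(entries, exits, rate=RATE_PER_RIDER):
    """Retroactive charge: fixed rate times riders going in and out."""
    return rate * (entries + exits)


# Made-up cleaned counts for one station over one day, by interval start hour.
counts = pd.DataFrame(
    {
        "entries": [50, 400, 300, 350, 900, 200],
        "exits": [20, 1500, 600, 400, 380, 100],
    },
    index=pd.Index([1, 5, 9, 13, 17, 21], name="interval_start_hour"),
)

# Ridership profile: the share of the day's exits that falls in each interval.
exit_share = counts["exits"] / counts["exits"].sum()
peak_exit_hour = exit_share.idxmax()

# Charge for the whole day, based on observed riders rather than projections.
charge = ad_charge(counts["entries"].sum(), counts["exits"].sum())

print(f"Peak exit interval starts at hour {peak_exit_hour}")
print(f"Charge for the day: ${charge:.2f}")
```

Because the charge is computed from observed counts, a day with fewer riders than projected simply produces a smaller bill.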

I was able to experiment with the DataFrame by visualizing chunks of the data using matplotlib with seaborn.
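A plot of this kind might look like the sketch below, which draws entries and exits per interval as grouped bars. The data values and column names are made up for illustration, and the Agg backend line is only there so the script runs outside a notebook.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up interval counts in long form: one row per (interval, direction).
profile = pd.DataFrame(
    {
        "interval_start_hour": [1, 5, 9, 13, 17, 21] * 2,
        "riders": [50, 400, 300, 350, 900, 200, 20, 1500, 600, 400, 380, 100],
        "direction": ["entries"] * 6 + ["exits"] * 6,
    }
)

fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(
    data=profile, x="interval_start_hour", y="riders", hue="direction", ax=ax
)
ax.set_xlabel("Interval start hour")
ax.set_ylabel("Riders")
ax.set_title("Ridership profile for one station (sample data)")
fig.tight_layout()

# Write the PNG into memory; pass a filename to savefig to save it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```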