First Week at Metis — Analysing New York City’s MTA data

Kelsey Heng
Jul 15, 2019
Image adapted from ny.curbed

Week one at Metis data science bootcamp has come to an end. My career in the lab, which ended only the week before, already feels like it was years ago. We dove headfirst into Project Benson (Metis named all its projects after TV detectives) on the first day, using the very first skill we acquired: Exploratory Data Analysis.

Drawing up graphs was my favorite part of my previous role as a biomedical researcher: getting into the nitty-gritty of producing graphs that tell a story. This is also what started my journey to become a data scientist. I began looking into different statistical methods that would depict a story accurately. Along the way, I ventured into the world of coding 6 months ago and eventually found Metis. As a kinesthetic learner, I saw the project-based curriculum as the ideal way to learn, which brings us back to Project Benson.

For this project, our fictional client is a non-profit organization, WomenTechWomenYes (yay to women in tech!). They engaged us to analyze New York’s MTA subway data and identify foot traffic patterns, so that they can optimize their outreach efforts and fill the event space of their annual gala in July. Target participants are individuals passionate about increasing the participation of women in technology and about building awareness.

Project goals and approach

We aimed to identify stations with high traffic and ideal demographics, such as areas with a larger female population, a higher education level and a higher income. We took higher turnstile counts as a proxy for higher foot traffic, on the reasoning that busier stations would allow us to reach more participants.

Cleaning and organizing data

Datasets were extracted or obtained from the following websites:
- MTA Turnstile Data from data.ny.gov
- MTA Geographical Data from data.ny.gov
- Demographic Data from the U.S. Census Bureau
- Mapping New York City Census Data on Kaggle.com

We looked at a year’s worth of MTA turnstile data. After browsing through our dataset, we found that some of the columns were not useful to us, so we dropped them to reduce processing time: ‘Line Name’, ‘Division’ and ‘Description’. Information on subway lines and the description of the subway schedule was not relevant to us at this point.

We realized that the values for Entries and Exits were recorded cumulatively, so each reading is a running counter rather than a count for that interval. Therefore, for each turnstile we subtracted the previous reading from the current one to get the entries and exits within each time interval. Results were recorded in a new column, Traffic.
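As a rough illustration of this step, here is a minimal pandas sketch on made-up readings. The column names (turnstile, datetime, entries, exits) and the choice to define traffic as entries plus exits per interval are assumptions for the example, not necessarily what our notebook used.

```python
import pandas as pd

# Toy cumulative readings for two hypothetical turnstiles, A and B
readings = pd.DataFrame({
    "turnstile": ["A", "A", "A", "B", "B", "B"],
    "datetime": pd.to_datetime([
        "2019-05-01 00:00", "2019-05-01 04:00", "2019-05-01 08:00",
        "2019-05-01 00:00", "2019-05-01 04:00", "2019-05-01 08:00",
    ]),
    "entries": [1000, 1040, 1200, 500, 510, 590],
    "exits": [800, 810, 900, 300, 330, 400],
})

# Sort so that each turnstile's readings are in time order
readings = readings.sort_values(["turnstile", "datetime"])

# Difference from the previous reading, computed within each turnstile
grouped = readings.groupby("turnstile")
readings["entries_diff"] = grouped["entries"].diff()
readings["exits_diff"] = grouped["exits"].diff()

# Assumption: traffic is the sum of entries and exits in the interval
readings["traffic"] = readings["entries_diff"] + readings["exits_diff"]

# The first reading of each turnstile has no previous value, so drop it
readings = readings.dropna(subset=["traffic"])

print(readings[["turnstile", "datetime", "traffic"]])
```

Grouping by turnstile before taking the difference matters: without it, the first reading of one turnstile would be subtracted from the last reading of another.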

Then, we removed any outliers above the 99th percentile. These numbers did not look humanly possible: some worked out to almost a hundred commuters passing through a single turnstile in one second.
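A percentile cutoff like this can be sketched in a few lines of pandas. The data below is invented purely to show the mechanics, and the column name traffic is an assumption.

```python
import pandas as pd

# 199 plausible interval counts plus one impossible spike (made-up values)
traffic = pd.DataFrame({"traffic": list(range(1, 200)) + [1_000_000]})

# Value below which 99% of the observations fall
cutoff = traffic["traffic"].quantile(0.99)

# Keep only the rows at or below the cutoff
cleaned = traffic[traffic["traffic"] <= cutoff]

print(f"cutoff: {cutoff:.2f}")
print(f"rows before: {len(traffic)}, rows after: {len(cleaned)}")
```

A percentile rule is blunt, since it also trims the busiest legitimate readings, but for a one-week project it is a quick way to get rid of the impossible values.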

Exploratory data analysis

To give us a head start, we looked at the demographics of New York and found that every county has a slightly larger female population.

Although New York County did not have the largest female population, the average income of its residents is about twice that of its neighboring county.

On top of the high average income, subway stations in New York County would be busier than the others, as most people commute to NYC for their jobs. We first looked at the stations with the highest foot traffic.

With this information, we narrowed our investigation to the top 5 stations by total traffic, namely:
1. 34 St Penn Station
2. 23 St
3. Fulton St
4. Times Square 42 St
5. 42 St Port Authority
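A ranking like the one above comes from summing traffic per station and taking the largest totals. Here is a minimal sketch with placeholder station names (S1 to S6) and invented numbers; the column names are assumptions.

```python
import pandas as pd

# Made-up per-interval traffic for six placeholder stations
traffic = pd.DataFrame({
    "station": ["S1", "S2", "S3", "S1", "S2", "S4", "S5", "S6", "S6"],
    "traffic": [500, 300, 100, 450, 350, 50, 700, 20, 25],
})

# Total traffic per station over the whole period
station_totals = traffic.groupby("station")["traffic"].sum()

# The five stations with the highest totals, largest first
top_stations = station_totals.nlargest(5)

print(top_stations)
```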

We wanted to know how traffic varied from month to month so that we could suggest a better period for event outreach. May, August and October showed peaks of higher foot traffic.

(I was unable to insert a figure legend; the station colours match those in the graph above.)

Following that, we wanted to find out the busiest days at these stations. Wednesday, Thursday and Friday have the highest foot traffic, while traffic on the weekends is approximately 30% lower than on weekdays. This suggests that a large proportion of weekday commuters are professionals working in the area rather than tourists.
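The weekday versus weekend comparison can be sketched as below. The daily totals are invented for one hypothetical week, so the percentage printed is only an example of the calculation and not our actual result.

```python
import pandas as pd

# One made-up week of daily totals, Monday 6 May to Sunday 12 May 2019
daily = pd.DataFrame({
    "date": pd.date_range("2019-05-06", periods=7, freq="D"),
    "traffic": [100, 100, 110, 110, 110, 70, 70],
})

# Label each date and flag Saturday (5) and Sunday (6)
daily["day_name"] = daily["date"].dt.day_name()
daily["is_weekend"] = daily["date"].dt.dayofweek >= 5

# Average traffic per day of the week
by_day = daily.groupby("day_name")["traffic"].mean()

# Average weekday traffic versus average weekend traffic
weekday_mean = daily.loc[~daily["is_weekend"], "traffic"].mean()
weekend_mean = daily.loc[daily["is_weekend"], "traffic"].mean()
drop_pct = (weekday_mean - weekend_mean) / weekday_mean * 100

print(by_day)
print(f"weekend traffic is {drop_pct:.1f}% lower than weekday traffic")
```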

To maximize outreach efforts, we wanted to find out which times of day have the most commuters. Rush hours have almost 100% more foot traffic than non-rush hours, which means traffic roughly doubles. (We were unable to pull out traffic data for only the top 5 stations to chart a bar graph. The bar graph shown consists of total traffic across all stations.)

Just for fun and practice, we made a little interactive map for our client.

https://github.com/kencheah/projectbenson/blob/master/map.html

Conclusion

From the exploratory data analysis conducted, we would recommend that our client, WomenTechWomenYes, target these stations in the month of May if they would like to organize the outreach closer to the date of the gala. The best days would be Wednesday to Friday. Evening hours from 4pm to 8pm would be preferred over the morning rush hours, as commuters are not rushing to get to work and are more likely to stop.

Further exploration ideas

One week was definitely too short to put out a detailed analysis (on top of our limited skill set!). If given more time, we would like to look at more demographic data, such as the locations of technology hubs, universities and wealthier neighborhoods.

This marks the end of week 1! It has been a great (and intense) learning experience, and completing a project while still new to coding was definitely a confidence booster. I will strive to make future blog posts better than this one and to move away from academic writing.

And…week 2 and project 2 begin!

Kelsey Heng

Neuroscience researcher turned analytics consultant. Huge love for data storytelling, turning numbers into fun facts!