Where the data science related jobs are (Part 2)
This post is a continuation of Where the data science related jobs are (part 1). In this installment, we're going to analyze the dataset from part 1. The dataset contains employment information for non-U.S. workers, specifically H1B (nonimmigrant) workers. The U.S. Department of Labor requires that H1B workers be paid the Prevailing Wage, i.e. the same wage that a U.S. worker would be paid for the same or similar position.
We can take advantage of this dataset to gain insights into the general U.S. job market. In particular, we're going to look at data science related jobs. The reasoning rests on an assumption: U.S. companies hire foreign workers only after they have exhausted the pool of qualified U.S. applicants. If that holds, a company that hires foreign workers for a role likely also hires many U.S. workers for it. Also, because H1B workers are paid the Prevailing Wage, we can generalize their wage information to the broader U.S. job market.
In part 1, we curated, cleaned and enriched the dataset. Now, let's probe the dataset (available here as dataScienceJobs.csv). The dataset contains the following fields that are relevant to our current analysis.
Submitted_Date: Timestamp reflecting when the H1B application was received by the government
Employer_Name: Name of the U.S. company filing the H1B application
Work_State: Full name of the state where the H1B job is located
Work_State_Code: Two-letter abbreviation of the state where the H1B job is located
Job_Category: Unofficial job subcategory assigned to the Job Title listed on the application
Offered_Salary_Adjusted: Annual salary offered to the foreign worker beneficiary of the H1B application
Prevailing_Salary_Adjusted: Annual salary (prevailing wage) for similar jobs
Census_2015: Population census for the year 2015
Load data
Here's a snapshot of the dataset.
# Load data
import pandas as pd
pd.options.mode.chained_assignment = None # For now, let's turn off pandas' chained-assignment warning
dsJobs = pd.read_csv("dataScienceJobs.csv")
print("\n")
print("dsJobs has {0} rows and {1} columns".format(dsJobs.shape[0], dsJobs.shape[1]))
print("\n")
print(dsJobs.head())
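Before querying, it can help to confirm that the fields listed above are actually present and that the timestamp parses as a date. The following is a minimal sketch, not part of the original analysis: it uses a two-row in-memory stand-in for dataScienceJobs.csv, and every value in it (employer names, salaries, populations) is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for dataScienceJobs.csv; all values are invented.
sample_csv = io.StringIO(
    "Submitted_Date,Employer_Name,Work_State,Work_State_Code,Job_Category,"
    "Offered_Salary_Adjusted,Prevailing_Salary_Adjusted,Census_2015\n"
    "2015-03-02,Example Corp,New Jersey,NJ,Data Scientist,95000,90000,8900000\n"
    "2015-04-10,Sample LLC,California,CA,Data Analyst,80000,78000,39000000\n"
)

# Fields the analysis relies on, as listed in the post
expected = [
    "Submitted_Date", "Employer_Name", "Work_State", "Work_State_Code",
    "Job_Category", "Offered_Salary_Adjusted", "Prevailing_Salary_Adjusted",
    "Census_2015",
]

# parse_dates converts the timestamp column to a datetime dtype on load
sample = pd.read_csv(sample_csv, parse_dates=["Submitted_Date"])

# Any expected field that is absent from the loaded frame
missing = [col for col in expected if col not in sample.columns]
print("Missing columns:", missing)
print(sample.dtypes)
```

With the real file, the same check would apply by passing "dataScienceJobs.csv" to pd.read_csv in place of the in-memory sample.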
Query data
Let's find out the Top 10 states for data science related H1B jobs.
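# Work on a copy so that adding a column does not modify a slice of dsJobs
data = dsJobs[["Work_State_Code", "Census_2015"]].copy()
# Each row is one H1B job, so it contributes 10000 / population to its state's total
data["Job_Per_10000"] = 10000 * (1 / data["Census_2015"])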
data = data.groupby(['Work_State_Code']).sum()
state_data = data.reset_index()
state_data.sort_values(by="Job_Per_10000", ascending=False, inplace=True)
print("\n")
print(state_data.head(10).reset_index(drop=True))
Except for Illinois (IL), all of these states are located on either the East Coast or the West Coast. In fact, East Coast states dominate.
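The per-capita figure used in the query may look odd at first, since it sums 10000 / population over rows instead of counting jobs. The sketch below, on made-up toy data, shows that the two are equivalent as long as every row for a state carries the same Census_2015 value (an assumption about the dataset, not something verified here).

```python
import pandas as pd

# Toy data: three jobs in NJ, one in CA; populations are rounded, illustrative values
toy = pd.DataFrame({
    "Work_State_Code": ["NJ", "NJ", "NJ", "CA"],
    "Census_2015": [9000000, 9000000, 9000000, 39000000],
})

# Approach used in the post: each row contributes 10000 / population, then sum per state
toy["Job_Per_10000"] = 10000 / toy["Census_2015"]
summed = toy.groupby("Work_State_Code")["Job_Per_10000"].sum()

# Equivalent direct calculation: 10000 * (number of jobs) / (state population)
counts = toy.groupby("Work_State_Code").size()
population = toy.groupby("Work_State_Code")["Census_2015"].first()
direct = 10000 * counts / population

print(pd.DataFrame({"summed": summed, "direct": direct}))
```

One side effect of calling sum() on the whole grouped frame, as the query does, is that the Census_2015 column in the result becomes population multiplied by job count, so that column should not be read as a population after grouping.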
Let's see the Bottom 10 states for data science related H1B jobs.
print("\n")
print(state_data.tail(10).reset_index(drop=True))
It seems that U.S. Territories and Outlying Areas dominate here.
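If the goal is to compare states only, the territories can be dropped before ranking. This is a sketch on made-up numbers, assuming territories appear in Work_State_Code under their USPS codes; the real dataset may encode outlying areas differently, so the set below would need checking against it.

```python
import pandas as pd

# USPS codes for the five inhabited U.S. territories (assumed encoding)
TERRITORY_CODES = {"AS", "GU", "MP", "PR", "VI"}

# Made-up per-capita values, for illustration only
toy_states = pd.DataFrame({
    "Work_State_Code": ["NJ", "PR", "WY", "GU"],
    "Job_Per_10000": [0.9, 0.01, 0.02, 0.005],
})

# Keep only rows whose code is not a territory
states_only = toy_states[~toy_states["Work_State_Code"].isin(TERRITORY_CODES)]

# Lowest-ranked states after the territories are removed
bottom = states_only.sort_values("Job_Per_10000").head(2)
print(bottom.reset_index(drop=True))
```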
Map data
Let's make an interactive map of the data in plotly.
# Learn about API authentication here: https://plot.ly/python/getting-started
# Find your api_key here: https://plot.ly/settings/api
#!pip install plotly # Uncomment this line if you don't already have plotly
import pandas as pd
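# Note: in plotly 4 and later, these two modules moved to the separate
# chart-studio package (chart_studio.plotly and chart_studio.tools)
import plotly.plotly as py
import plotly.tools as tls
# Replace the placeholders with your own username and API key
tls.set_credentials_file(username='your_username', api_key='your_api_key')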
scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
[0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
# One choropleth trace: states are shaded by H1B jobs per 10,000 people
data = [dict(
    type='choropleth',
    colorscale = scl,
    autocolorscale = False,
    locations = state_data.Work_State_Code,
    z = state_data['Job_Per_10000'].astype(float),
    locationmode = 'USA-states',
    text = state_data.Work_State_Code,
    hoverinfo = 'location+z',
    marker = dict(
        line = dict (
            color = 'rgb(255,255,255)',
            width = 2
        )
    ),
    colorbar = dict(
        title = "H1B Jobs Per 10,000 people"
    )
)]
layout = dict(
    title = 'Data science related Jobs<br>(Hover for breakdown)',
    width = 700,
    height = 700,
    geo = dict(
        scope='usa',
        projection=dict( type='albers usa' ),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)',
        countrywidth= 2.5
    )
)
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False, filename='ds-jobs-map')
The map provides an interactive view of H1B jobs per 10,000 people. From the map, Washington DC is the hottest market for data science related H1B jobs (to see it, zoom in using the plotly interactive widget at the top right of the graph), followed by New Jersey and Delaware. New York is also very strong, finishing well ahead of California. In general, data science related H1B jobs appear to be concentrated around the coasts of the U.S. With the exception of Illinois, Texas and Minnesota, the middle part of America looks relatively barren as far as data science related H1B jobs are concerned.
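For readers without a plotly account, a similar map can likely be written to a local HTML file instead of being uploaded. The sketch below is one way to do that, not the method used for the map above: build_choropleth and render_offline are hypothetical helper names, the two data points are made up, and the rendering step assumes a plotly installation that provides plotly.offline.plot.

```python
def build_choropleth(codes, values, title):
    """Return a plotly-style figure dict for a U.S. state choropleth."""
    trace = dict(
        type="choropleth",
        locations=list(codes),           # two-letter state codes
        z=[float(v) for v in values],    # value that drives the shading
        locationmode="USA-states",
        colorbar=dict(title="H1B Jobs Per 10,000 people"),
    )
    layout = dict(
        title=title,
        geo=dict(scope="usa", projection=dict(type="albers usa")),
    )
    return dict(data=[trace], layout=layout)


def render_offline(fig, path="ds-jobs-map.html"):
    """Write the figure to a local HTML file; plotly is imported only when called."""
    import plotly.offline as offline
    return offline.plot(fig, filename=path, auto_open=False)


# Made-up values for illustration only
fig = build_choropleth(["NJ", "CA"], [0.9, 0.3], "Data science related jobs")
# render_offline(fig)  # uncomment to write ds-jobs-map.html locally
```

With the real data, state_data.Work_State_Code and state_data["Job_Per_10000"] would take the place of the two toy lists.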
The U.S. Coasts are the hotbeds of data science related jobs
Are you in the market for a data science related job? If so, you may have just gotten some relocation ideas! Before you start packing though, let's finish this story in Tableau and find out the top-paying states and companies for data science related jobs.