Where the data science related jobs are (Part 2)

This post is a continuation of Where the data science related jobs are (part 1). In this installment, we're going to analyze the dataset from part 1, which contains employment information for non-U.S. workers, specifically H1B (nonimmigrant) workers. The U.S. Department of Labor requires that H1B workers be paid the Prevailing Wage, i.e. the same wage that a U.S. worker would be paid for the same or similar position.

We can take advantage of this dataset to gain insights into the general U.S. job market. In particular, we're going to look at data science related jobs. If the prevailing wage requirement works as intended, U.S. companies hire foreign workers because there are no qualified U.S. applicants to fill the roles. So a company that hires foreign workers likely also hires many U.S. workers; after all, it only turned to non-U.S. workers once it had exhausted the pool of qualified U.S. applicants. And because H1B workers are paid the Prevailing Wage, we can generalize their wage information to the broader U.S. job market.

In part 1, we curated, cleaned and enriched the dataset. Now, let's probe the dataset (available here as dataScienceJobs.csv). The dataset contains the following fields that are relevant to our current analysis.

  • Submitted_Date: Timestamp reflecting when the H1B application was received by the government
  • Employer_Name: Name of the U.S. company filing the H1B application
  • Work_State: Full name of the state where the H1B job is located
  • Work_State_Code: Two letter state abbreviation where the H1B job is located
  • Job_Category: Unofficial job subcategory assigned to the Job Title listed on the application
  • Offered_Salary_Adjusted: Annual salary offered to the foreign worker beneficiary of the H1B application
  • Prevailing_Salary_Adjusted: Annual salary (prevailing wage) for similar jobs
  • Census_2015: 2015 population of the state where the H1B job is located

Load data

Here's a snapshot of the dataset.

In [1]:
# Load data
import pandas as pd
pd.options.mode.chained_assignment = None # For now, suppress pandas' SettingWithCopyWarning
dsJobs = pd.read_csv("dataScienceJobs.csv")
print("\n")
print("dsJobs has {0} rows and {1} columns".format(dsJobs.shape[0], dsJobs.shape[1]))
print("\n")
print(dsJobs.head())

dsJobs has 283323 rows and 10 columns


        Submitted_Date            Employer_Name      Work_City  \
0  2002-01-14 10:08:00  World Data Incorporated     Washington   
1  2002-01-14 10:14:00  World Data Incorporated  Washington DC   
2  2002-01-14 13:04:00  STMicroelectronics Inc.       San Jose   
3  2002-01-14 13:09:00   Network Associates Inc    Santa Clara   
4  2002-01-14 15:53:00              Aquila Inc.    Kansas City   

             Work_State      Job_Category Work_State_Code  Price_Deflator  \
0  District Of Columbia    market analyst              DC           100.0   
1  District Of Columbia    market analyst              DC           100.0   
2            California    market analyst              CA           120.0   
3            California  business analyst              CA           120.0   
4              Missouri    market analyst              MO            94.9   

   Offered_Salary_Adjusted  Prevailing_Salary_Adjusted  Census_2015  
0             35608.000000                37482.000000     672228.0  
1             35608.000000                37482.000000     672228.0  
2            116666.666667               104450.833333   39144818.0  
3             91666.666667                87450.000000   39144818.0  
4            105374.077977                85238.145416    6083672.0  
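
A quick practical note: read_csv leaves Submitted_Date as a plain string. If you want to filter or resample by date later on, here's a minimal sketch for converting it (assuming pandas can infer the "2002-01-14 10:08:00" format shown above):

# Parse Submitted_Date into real timestamps; errors="coerce" turns anything
# unparseable into NaT instead of raising.
dsJobs["Submitted_Date"] = pd.to_datetime(dsJobs["Submitted_Date"], errors="coerce")
print("Applications span", dsJobs["Submitted_Date"].min(), "to", dsJobs["Submitted_Date"].max())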

Query data

Let's find out the Top 10 states for data science related H1B jobs.

In [2]:
data = dsJobs[["Work_State_Code", "Census_2015"]]
# Each row is one H1B application, so summing 10000/population within a state
# gives (number of applications) * 10000 / population, i.e. jobs per 10,000 residents
data["Job_Per_10000"] = 10000 * (1 / data["Census_2015"])
data = data.groupby(['Work_State_Code']).sum()
state_data = data.reset_index()
state_data.sort_values(by="Job_Per_10000", ascending=False, inplace=True)
print("\n")
print(state_data.head(10).reset_index(drop=True))

  Work_State_Code   Census_2015  Job_Per_10000
0              DC  1.590491e+09      35.196392
1              NJ  2.499196e+11      31.144183
2              DE  1.842679e+09      20.593403
3              NY  7.564368e+11      19.303093
4              VA  1.089202e+11      15.499238
5              CT  1.939078e+10      15.038071
6              MA  6.776757e+10      14.679689
7              CA  2.191249e+12      14.300233
8              IL  1.821490e+11      11.014001
9              MD  3.431457e+10       9.511519

Except for Illinois (IL), all of these states are located on either the East Coast or the West Coast. In fact, East Coast states dominate.
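
As a sanity check on the per-capita numbers above, here's an equivalent sketch that counts applications per state and divides by that state's population once, instead of summing 10000/population row by row. It assumes Census_2015 carries a single 2015 population value per state, which is what the snapshot above suggests.

# Jobs per 10,000 residents, computed as count(state) * 10000 / population(state)
jobs = dsJobs.groupby("Work_State_Code").size()
pop = dsJobs.groupby("Work_State_Code")["Census_2015"].first()
check = (10000 * jobs / pop).sort_values(ascending=False)
print(check.head(10))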

Let's see the Bottom 10 states for data science related H1B jobs.

In [3]:
print("\n")
print(state_data.tail(10).reset_index(drop=True))

  Work_State_Code  Census_2015  Job_Per_10000
0              PW      41836.0       0.956114
1              AS     222076.0       0.720474
2              MS  640359262.0       0.715161
3              WV  204698208.0       0.601910
4              WY   19927638.0       0.580099
5              MH     157902.0       0.569974
6              PR  628826942.0       0.520986
7              FM     517745.0       0.482863
8              MT   48548603.0       0.455008
9              MP      53883.0       0.185587

It seems that U.S. Territories and Outlying Areas dominate here: of the bottom 10, only Mississippi (MS), West Virginia (WV), Wyoming (WY) and Montana (MT) are states, while PW (Palau), AS (American Samoa), MH (Marshall Islands), PR (Puerto Rico), FM (Federated States of Micronesia) and MP (Northern Mariana Islands) are territories or outlying areas.

Map data

Let's make an interactive map of the data in plotly.

In [4]:
# Learn about API authentication here: https://plot.ly/python/getting-started
# Find your api_key here: https://plot.ly/settings/api
#!pip install plotly # Uncomment this line if you don't already have plotly
import pandas as pd
import plotly.plotly as py
import plotly.tools as tls
tls.set_credentials_file(username='samueledeh', api_key='2spdso18wk')

scl = [[0.0, 'rgb(242,240,247)'],[0.2, 'rgb(218,218,235)'],[0.4, 'rgb(188,189,220)'],\
            [0.6, 'rgb(158,154,200)'],[0.8, 'rgb(117,107,177)'],[1.0, 'rgb(84,39,143)']]
    
data = [dict(
    type='choropleth',
    colorscale = scl,
    autocolorscale = False,
    locations = state_data.Work_State_Code,
    z = state_data['Job_Per_10000'].astype(float),
    locationmode = 'USA-states',
    text = state_data.Work_State_Code,
    hoverinfo = 'location+z',
    marker = dict(
        line = dict (
            color = 'rgb(255,255,255)',
            width = 2
        )
    ),
    colorbar = dict(
        title = "H1B Jobs Per 10,000 people"
    )
)]

layout = dict(
    title = 'Data science related Jobs<br>(Hover for breakdown)',
    width = 700,
    height = 700,
    geo = dict(
        scope = 'usa',
        projection = dict(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)',
        countrywidth = 2.5
    )
)

fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False, filename='ds-jobs-map')
Out[4]:

The map provides an interactive view of H1B jobs per 10,000 people. From the map, Washington DC is the hottest market for data science related H1B jobs (to see it, zoom in using the plotly interactive widget at the top right of the graph), followed by New Jersey and Delaware. New York is also very strong, finishing well ahead of California. In general, it seems that data science related H1B jobs are concentrated along the coasts of the U.S. With the exception of Illinois, Texas and Minnesota, the middle of the country looks relatively barren as far as data science related H1B jobs are concerned.
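
A reproducibility note: plotly.plotly, used above, is the legacy online-plotting interface (it was later split out into the separate chart_studio package). If you're on a recent Plotly release, roughly the same choropleth can be drawn offline; here's a sketch assuming plotly 4+ and the state_data frame from In [2]:

import plotly.graph_objects as go

fig = go.Figure(go.Choropleth(
    locations=state_data["Work_State_Code"],
    z=state_data["Job_Per_10000"].astype(float),
    locationmode="USA-states",
    colorscale="Purples",  # stands in for the custom purple scale above
    marker=dict(line=dict(color="rgb(255,255,255)", width=2)),
    colorbar=dict(title="H1B Jobs Per 10,000 people"),
))
fig.update_layout(
    title_text="Data science related Jobs<br>(Hover for breakdown)",
    width=700, height=700,
    geo=dict(scope="usa", projection=dict(type="albers usa"),
             showlakes=True, lakecolor="rgb(255, 255, 255)"),
)
fig.show()  # renders locally; no Plotly credentials needed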

Are you in the job market for a data science related job? If so, you may have just gotten a few relocation ideas! Before you start packing, though, let's finish this story in Tableau and find out the top paying states and companies for data science related jobs.