Where the data science related jobs are (Part 1)¶
Wouldn’t it be great if you knew where the data science related jobs are and what the salary trends for those jobs look like across different states and companies? If you're a data professional, you could milk that for your job search. Even if you're not in the job market, this could make for an interesting coffee chat.
Along the way you'll see ideas that you can use in your own data science project. In real-world data science projects, you don't get a neatly packaged dataset you can browse. You have to get your own data, clean it, enrich it and do all that good stuff. In fact, this will take up 70% or more of your time.
In this post, we're going to curate, clean and enrich messy work visa data from the U.S. Department of Labor (DOL). In the next installment, Where the data science related jobs are (part 2), we'll probe the prepped dataset for insights and trends in data science related jobs, including:
- The Top 10 and Bottom 10 States for Data Science Related Jobs
- The Top Paying States for Data Science Related Jobs
- The Top Paying Companies for Data Science Related Jobs
And more.
When a U.S. company wants to hire a non-U.S. worker, the company is required to file a work visa (H1B) or permanent residency (green card) application with the DOL. As part of the application, the company must disclose how much it will pay the non-U.S. worker. The company is also required to disclose the Prevailing Wage
for the job, i.e. how much U.S. workers in the same or similar positions are being paid. The idea is to ensure that hiring non-U.S. workers does not negatively affect the job market for U.S. workers.
We'll take advantage of this publicly available data to explore information about data science related jobs. For this analysis, we're going to focus on H1B applications from 2002 to 2015.
Set up¶
First, let's import the packages we'll need and set things up.
import re
import time
import pickle
import platform
import numpy as np
import pandas as pd
import multiprocessing as mp
from urllib.request import urlopen
# For now, let's turn off pandas' chained assignment warning
pd.options.mode.chained_assignment = None
# Snippets like this are used throughout the code to get an idea of the runtime for different parts of the pipeline
total_time = time.time()
Let's also get some system information to help with any troubleshooting issues in the likely event you're running this code in a different environment than the one I used. I tested this code on an Amazon EC2 m4.2xlarge instance.
!python --version
print(platform.platform())
print("cpu cores: {0}".format(mp.cpu_count()))
Load data¶
The Python script to download the data is here. The data is fairly large at 864 MB. Let's read it in and see what it looks like.
h1bdataDF = pickle.load(open('h1bdataDF.pkl', 'rb'))
print("\n")
print(h1bdataDF.head(5))
print("\n")
print("h1bdataDF has {0} rows and {1} columns".format(h1bdataDF.shape[0], h1bdataDF.shape[1]))
print("\n")
Filter data¶
So we have about 5 million records. Let's remove any duplicate applications there may be in the records.
h1bdataDF2 = h1bdataDF.drop_duplicates(['Case_Number'], keep='last')
print("\n")
print("h1bdataDF2 has {0} rows and {1} columns".format(h1bdataDF2.shape[0], h1bdataDF2.shape[1]))
Let's see the available values in the Approval_Status column.
h1bdataDF.Approval_Status.unique()
For our purposes, it makes sense to keep only applications that were certified. A denied application, for example, doesn't give us much confidence about the quality of the application. Before we do that, we first need to remove rows with None in the Approval_Status column.
filteredDF = h1bdataDF2.dropna(subset=['Approval_Status'])
print("filteredDF: {0}".format(filteredDF.shape))
certifiedDF = filteredDF[filteredDF.Approval_Status.str.lower().str.contains('certified')]
print("certifiedDF: {0}".format(certifiedDF.shape))
Categorize data science related jobs¶
We'll use the Job_Title column to select data science related jobs. Before we do anything with that column, let's first make sure it contains only strings.
rows_with_strings = certifiedDF[["Job_Title"]].apply(
lambda row :
any([ isinstance(e, str) for e in row ])
, axis=1)
df_with_no_strings = certifiedDF[["Job_Title"]][~rows_with_strings]
print(df_with_no_strings.head())
It seems some job titles were entered as floats. Let's clean out the rows with float job titles.
certifiedDF2 = certifiedDF[rows_with_strings]
certifiedDF2.shape
Now, we're ready to select and categorize data science related jobs. I chose the categorization criteria based on my view and research of the data science field. A better approach might be to do some pre-clustering to see if interesting patterns emerge around the job titles, but there isn't much context beyond the job titles themselves, so clustering is unlikely to give great results. In any case, my categorization criteria aren't perfect, and I'm open to your ideas on this.
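To make the pre-clustering idea concrete, here's a rough, optional sketch (not part of the pipeline) that clusters the unique job titles using TF-IDF over character n-grams and k-means. It assumes scikit-learn is installed, and the cluster count is an arbitrary starting guess.
# Optional sketch: cluster unique job titles to look for natural groupings.
# Assumes scikit-learn is installed; n_clusters=20 is an arbitrary starting guess.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
titles = certifiedDF2.Job_Title.str.lower().unique()  # clustering all unique titles can be slow; sample if needed
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
X = vectorizer.fit_transform(titles)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
print(titles[kmeans.labels_ == 0][:10])  # peek at a few titles from one cluster
With that caveat out of the way, here's the dictionary-based categorization.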
start_time = time.time()
titles_dict = {'data analyst': 'data analyst', 'business analyst':'business analyst',
'informatic': 'informatics', 'operations research':'operations research',
'data scien': 'data scientist', 'intelligence': 'business intelligence',
'analytics': 'business intelligence', 'data consultan': 'business intelligence',
'data manage': 'business intelligence','reporting & analysis': 'business intelligence',
'data min': 'business intelligence','reporting and analysis': 'business intelligence',
'database analyst':'data architect/engineer', 'data architect': 'data architect/engineer',
'data engineer': 'data architect/engineer','data warehous': 'data architect/engineer',
'database admin': 'data architect/engineer', 'datawarehous': 'data architect/engineer',
'data model': 'data architect/engineer','etl developer':'data architect/engineer',
'market research analyst': 'market analyst', 'market analyst': 'market analyst',
'customer insight': 'market analyst', 'market insight': 'market analyst',
'consumer insight': 'market analyst', 'marketing insight': 'market analyst',
'insights analyst': 'market analyst', 'analytic insight': 'market analyst',
'strategy and insight': 'market analyst', 'mkt. res. analyst': 'market analyst',
'marketing & insights': 'market analyst', 'global insights': 'market analyst',
'insights & analytic':'market analyst', 'strategy and insight': 'market analyst',
'marketing & insight': 'market analyst','insights & analytic':'market analyst',
'health care analyst':'market analyst', 'healthcare analyst': 'market analyst',
'quantitative analyst':'market analyst', 'financial analyst':'market analyst',
'marketing analyst':'market analyst', 'management analyst':'market analyst',
               'business development': 'market analyst', 'data strateg': 'market analyst',
'decision scien':'data scientist'}
titles_list = []
for index, row in certifiedDF2.iterrows():
title = row.Job_Title.lower()
currentTitle = [val for key,val in titles_dict.items() if key in title]
if currentTitle:
titles_list.append(currentTitle[0])
else:
titles_list.append("non data science related")
certifiedDF2["Job_Title_revised"] = titles_list
# Filter out non data science jobs
dsJobsDF = certifiedDF2[certifiedDF2.Job_Title_revised != "non data science related"]
dsJobsDF.reset_index(drop=True, inplace=True)
print("dsJobsDF shape: {0}".format(dsJobsDF.shape))
print("Time to categorize data science related jobs --- %s seconds ---" % (time.time() - start_time))
Standardize salary¶
Let's look at the Wage_Rate_Unit column and see if any standardization is necessary.
print(dsJobsDF.Wage_Rate_Unit.str.lower().unique())
Wage rates are reported as hourly, weekly, bi-weekly, monthly, or yearly. Let's standardize everything to yearly.
start_time = time.time()
dsJobsDF2 = dsJobsDF
# Using a regex, remove special characters such as dollar signs and hyphens from the wage columns.
# Such characters are common in columns that contain monetary data, like the wage columns here.
# Before applying regex, ensure the columns are of string type.
# And remove rows with empty cells in the wage columns.
dsJobsDF2['Wage_Rate'] = dsJobsDF2['Wage_Rate'].astype(str)
dsJobsDF2['Prevailing_Wage'] = dsJobsDF2['Prevailing_Wage'].astype(str)
dsJobsDF2['Wage_Rate'].replace('', np.nan, inplace=True)
dsJobsDF2 = dsJobsDF2.dropna(subset=['Wage_Rate'])
dsJobsDF2['Prevailing_Wage'].replace('', np.nan, inplace=True)
dsJobsDF2 = dsJobsDF2.dropna(subset=['Prevailing_Wage'])
# Remove '$' signs and anything from a hyphen onward (i.e. keep the lower bound of a salary range) in Wage_Rate
dsJobsDF2['Wage_Rate'] = dsJobsDF2['Wage_Rate'].map(lambda x: re.sub(r'\$|-.*', '', x))
# Same cleanup for the Prevailing_Wage column
dsJobsDF2['Prevailing_Wage'] = dsJobsDF2['Prevailing_Wage'].map(lambda x: re.sub(r'\$|-.*', '', x))
# Convert the cleaned wage columns to float
dsJobsDF2[['Wage_Rate','Prevailing_Wage']] = dsJobsDF2[['Wage_Rate','Prevailing_Wage']].apply(pd.to_numeric)
print('dsJobsDF2: {0}'.format(dsJobsDF2.shape))
print("Time to clean wage columns --- %s seconds ---" % (time.time() - start_time))
# Now we can standardize to yearly
start_time = time.time()
salary_offered = []
salary_prevailing = []
for index, row in dsJobsDF2.iterrows():
if row.Wage_Rate_Unit.lower()[0] == 'y':
salary_offered.append(row.Wage_Rate)
salary_prevailing.append(row.Prevailing_Wage)
elif row.Wage_Rate_Unit.lower()[0] == 'm':
salary_offered.append(row.Wage_Rate * 12)
salary_prevailing.append(row.Prevailing_Wage * 12)
elif row.Wage_Rate_Unit.lower() in ['bi', '2 weeks', 'bi-weekly']:
salary_offered.append(row.Wage_Rate * 26) # Assumes 52 work wks/yr
salary_prevailing.append(row.Prevailing_Wage * 26)
elif row.Wage_Rate_Unit.lower()[0] == 'w':
salary_offered.append(row.Wage_Rate * 52) # Assumes 52 work wks/yr
salary_prevailing.append(row.Prevailing_Wage * 52)
else:
salary_offered.append(row.Wage_Rate * 2080) # This handles the hourly rates. Assumes 52 work wks of 40 hrs/wk
salary_prevailing.append(row.Prevailing_Wage * 2080)
dsJobsDF3 = dsJobsDF2
dsJobsDF3["Salary_Offered"] = salary_offered
dsJobsDF3["Salary_Prevailing"] = salary_prevailing
# Remove any rows missing wage information
dsJobsDF3 = dsJobsDF3.dropna(subset=['Salary_Offered'])
dsJobsDF3 = dsJobsDF3.dropna(subset=['Salary_Prevailing'])
print('dsJobsDF3: {0}'.format(dsJobsDF3.shape))
dsJobsDF3.reset_index(drop=True, inplace=True) # Resets row index so they start from 0
print("Time to standardize salary --- %s seconds ---" % (time.time() - start_time))
Let's get some descriptive statistics on the Salary_Offered column to see the range of salaries we're dealing with here.
print(dsJobsDF3["Salary_Offered"].describe())
print("\n")
print(dsJobsDF3["Salary_Offered"].median())
Wow, the maximum annual salary is about $1.3 billion! At the other extreme, the minimum is $15.40. These are almost certainly data entry errors; people are rarely paid these kinds of salaries.
The median salary of $60,000 is a more reasonable reflection of the center of the distribution. For our analysis, we'll treat a salary as unlikely (an outlier) if it's less than $23,760 or more than 10 times the median. I'm assuming a data professional makes at least twice the federal poverty line, which was $11,880 in 2016 (2 x 11,880 = 23,760).
The idea is that we want to focus on salaries that are typical for the population. Investigating the outlying values could be a study in and of itself.
# Let's keep salaries that are greater than 23760 but less than 10 * the median salary
med_offered = dsJobsDF3["Salary_Offered"].median()
med_prevailing = dsJobsDF3["Salary_Prevailing"].median()
dsJobsDF3 = dsJobsDF3[(dsJobsDF3.Salary_Offered > 23760) & (dsJobsDF3.Salary_Offered < 10 * med_offered)]
dsJobsDF3 = dsJobsDF3[(dsJobsDF3.Salary_Prevailing > 23760) & (dsJobsDF3.Salary_Prevailing < 10 * med_prevailing)]
dsJobsDF3.shape
Add price parity information¶
It's a good idea to adjust salaries to reflect regional price parity, i.e. differences in the cost of living across states. Let's get regional price parity data from the Bureau of Economic Analysis, www.bea.gov.
start_time = time.time()
url = 'https://www.bea.gov/newsreleases/regional/rpp/2015/xls/rpp0615.xlsx'
response = urlopen(url)
parityDF = pd.read_excel(response, skiprows=3)
parityDF.drop(parityDF.columns[[1,2,3,4,5,6,9]], axis=1, inplace=True)
parityDF.columns = ["State", "2012", "2013"]
parityDF['Price_Deflator'] = (parityDF['2012'] + parityDF['2013']) / 2
parityDF.drop(parityDF.columns[[1,2]], axis=1, inplace=True)
print("parityDF shape: {0}".format(dsJobsDF.shape))
print("Time to read and clean BEA data --- %s seconds ---" % (time.time() - start_time))
Here's what the price parity data looks like.
print(parityDF.head())
We're almost ready to merge the price parity information with the salary data. The common key between the two data frames is the state, so let's take a look at the state columns in both data frames to make sure they match.
print(dsJobsDF3.Work_State.unique())
print(parityDF.State.unique())
Notice two things. First, the price parity dataset contains extraneous rows: maximum, minimum, range and nan. Second, the two datasets use different formats for state names.
Let's address the second issue and convert the state abbreviations in the work visa dataset to full state names, using the list here.
dsJobsDF3 = dsJobsDF3.dropna(subset=['Employer_State'])
print('dsJobsDF3: {0}'.format(dsJobsDF3.shape))
response = urlopen('https://raw.githubusercontent.com/sedeh/github.io/master/resources/states.txt')
lines = response.readlines()
state_names_dict = {}
for line in lines:
state_code, state_name = line.decode().split(":")
state_names_dict[state_code.strip()] = state_name.strip()
state_names_list = []
for index, row in dsJobsDF3.iterrows():
try:
state_names_list.append(state_names_dict[row.Work_State])
except KeyError:
state_names_list.append(state_names_dict[row.Employer_State])
dsJobsDF4 = dsJobsDF3
dsJobsDF4["Work_State_Code"] = dsJobsDF4.Work_State
dsJobsDF4["Work_State"] = state_names_list
We can now remove the extraneous rows in the price parity data and merge the parity and work visa datasets.
parityDF2 = parityDF[:-5]
parityDF2 = parityDF2.dropna(subset=['State'])
print('parityDF2: {0}'.format(parityDF2.shape))
dsJobsDF5 = dsJobsDF4.merge(parityDF2,how='left', left_on='Work_State', right_on='State')
print('dsJobsDF5: {0}'.format(dsJobsDF5.shape))
We're ready to use the price parity information to adjust salaries for regional cost of living differences. Notice that not all states have price parity information; there are nan values in the Price_Deflator column below. We should replace nan with 100.0 so that salaries for those states remain unchanged after the adjustment.
print('nan present')
print(dsJobsDF5.Price_Deflator.unique())
print("\n")
dsJobsDF5['Price_Deflator'] = dsJobsDF5['Price_Deflator'].fillna(100.0)
print('nan replaced with 100.0')
print(dsJobsDF5.Price_Deflator.unique())
We're ready to adjust salaries with the price parity information. This isn't perfect, because we're using price parity data from 2012 and 2013 to adjust salaries from 2002 to 2015. However, relative price levels among states tend to be fairly stable from year to year; for example, New York has been more expensive to live in than Iowa for as long as we can remember.
start_time = time.time()
salary_offered_adjusted = []
salary_prevailing_adjusted = []
for index, row in dsJobsDF5.iterrows():
salary_offered_adjusted.append(row.loc['Salary_Offered'] / (row.loc['Price_Deflator'] / 100))
salary_prevailing_adjusted.append(row.loc['Salary_Prevailing'] / (row.loc['Price_Deflator'] / 100))
dsJobsDF6 = dsJobsDF5
dsJobsDF6["Offered_Salary_Adjusted"] = salary_offered_adjusted
dsJobsDF6["Prevailing_Salary_Adjusted"] = salary_prevailing_adjusted
print("Time to adjust salary for inflation --- %s seconds ---" % (time.time() - start_time))
Add population information¶
Simply knowing that California has 30,000 data science related jobs compared to, say, 3,500 for Alaska really doesn't tell us all that much. Ideally, we want a relative measure that gives us a sense of the number of H1B jobs per capita (we'll sketch that calculation right after the census merge below). So let's add population information from census.gov to help us do that.
start_time = time.time()
url = 'http://www.census.gov/popest/data/state/totals/2015/tables/NST-EST2015-01.xlsx'
response = urlopen(url)
xl = pd.ExcelFile(response)
sheet_names = xl.sheet_names
DF = xl.parse(sheet_names[0], skiprows=8)
DF.columns = ["State", "Census", "Estimates Base", "2010", "2011", "2012", "2013", "2014", "Census_2015"]
censusDF = DF[["State", "Census_2015"]]
censusDF = censusDF.iloc[0:51,]
print(censusDF.head())
print("\n")
# Remove the dot prefix from state names in the State column
censusDF['State'] = censusDF['State'].map(lambda x: re.sub(r'\.', '', x))
censusDF['State'] = censusDF['State'].str.title()
# Add population of US territories not included in the data source
temp = pd.DataFrame([['Puerto Rico', 3474182],
['Guam', 159358],
['Virgin Islands', 106405],
['American Samoa', 55519],
['Northern Mariana Islands', 53883],
['Palau', 20918],
['Federated States Of Micronesia', 103549],
['Marshall Islands', 52634]], columns=["State", "Census_2015"])
censusDF = censusDF.append(temp, ignore_index=True)
print(censusDF.head())
print("\n")
print("censusDF: {0}".format(censusDF.shape))
print("Time to read and clean census data --- %s seconds ---" % (time.time() - start_time))
Now we can add the census data to our work visa data.
dsJobsDF7 = dsJobsDF6.merge(censusDF,how='left', left_on='Work_State', right_on='State')
dsJobsDF7.shape
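As a quick preview of how the census column will be used in part 2, here's a rough per-capita sketch: count jobs per work state and divide by the 2015 population. Treat this as an optional sanity check; the real analysis comes in the next post.
# Optional sketch: H1B data science related jobs per 100,000 residents by state
state_counts = dsJobsDF7.groupby('Work_State').size()
state_pop = dsJobsDF7.groupby('Work_State')['Census_2015'].first()
jobs_per_100k = (state_counts / state_pop * 100000).sort_values(ascending=False)
print(jobs_per_100k.head())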
Prune data¶
At this point, we can drop columns we don't need. Here are the columns in our dataset.
dsJobsDF7.columns
dsJobsDF_final = dsJobsDF7.drop(dsJobsDF7.columns[[1,3,4,5,6,7,8,9,10,13,15,16,18,22]], axis=1)
dsJobsDF_final.rename(columns={'Job_Title_revised':'Job_Category'}, inplace=True)
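A quick aside: dropping by positional index works, but it breaks silently if the column order ever changes. An equivalent, more explicit approach is to keep a named list of the columns you need; the keep-list below is only illustrative (it uses column names referenced in this post), not the exact set produced by the drop above.
# Illustrative alternative: select the columns to keep by name
keep_cols = ['Case_Number', 'Submitted_Date', 'Job_Title_revised', 'Work_State',
             'Salary_Offered', 'Salary_Prevailing', 'Offered_Salary_Adjusted',
             'Prevailing_Salary_Adjusted', 'Census_2015']
dsJobsDF_named = dsJobsDF7[keep_cols].rename(columns={'Job_Title_revised': 'Job_Category'})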
Here's our final dataset after dropping some columns.
print(dsJobsDF_final.columns)
print("\n")
print("dsJobsDF_final has {0} rows and {1} columns".format(dsJobsDF_final.shape[0], dsJobsDF_final.shape[1]))
Persist data¶
The data is now ready for prime time! We're going to explore it in Where the data science related jobs are (part 2). Before we save the data, let's quickly convert the date strings in the Submitted_Date column to datetime, and then we can save away.
print(type(dsJobsDF_final.Submitted_Date[0]))
s = dsJobsDF_final["Submitted_Date"]
ts = pd.Series([pd.to_datetime(date_string) for date_string in s])
dsJobsDF_final["Submitted_Date"] = ts
print(type(dsJobsDF_final.Submitted_Date[0]))
dsJobsDF_final.to_csv('dataScienceJobs.csv', index=False)
print("Total time --- %s minutes ---" % ((time.time() - total_time) / 60.0))