Where the data science related jobs are (Part 1)

Wouldn’t it be great if you knew where the data science related jobs are and what the salary trends for those jobs look like across different states and companies? If you're a data professional, you could milk that knowledge for your job search. Even if you're not in the job market, it makes for an interesting coffee chat.

Along the way you'll see ideas that you can use in your own data science project. In real-world data science projects, you don't get a neatly packaged dataset you can browse. You have to get your own data, clean it, enrich it and do all that good stuff. In fact, this will take up 70% or more of your time.

In this post, we're going to curate, clean and enrich messy work visa data from the U.S. Department of Labor (DOL). In the next installment, Where the data science related jobs are (part 2), we'll probe the prepped dataset for insights and trends in data science related jobs, including:

  • The Top 10 and Bottom 10 States for Data Science Related Jobs
  • The Top Paying States for Data Science Related Jobs
  • The Top Paying Companies for Data Science Related Jobs

And more.

When a U.S. company wants to hire a non-U.S. worker, the company is required to file a work visa (H1B) or permanent residency (green card) application with the DOL. As part of the application, the company must disclose how much it will pay the non-U.S. worker. The company is also required to disclose the Prevailing Wage for the job, i.e. how much U.S. workers in the same or similar positions are being paid. The idea is to ensure that hiring non-U.S. workers does not negatively affect the job market for U.S. workers.

We'll take advantage of this publicly available data to explore information about data science related jobs. For this analysis, we're going to focus on H1B applications from 2002 to 2015.

Set up

First, let's import the packages we'll need and set things up.

In [1]:
import re
import time
import pickle
import platform
import numpy as np
import pandas as pd
import multiprocessing as mp
from urllib.request import urlopen

# For now, let's turn off pandas' chained-assignment warning
pd.options.mode.chained_assignment = None 

# Snippets like this are used throughout the code to get an idea of the runtime for different parts of the pipeline
total_time = time.time() 

Let's also get some system information to help with troubleshooting in the likely event that you're running this code in a different environment than the one I used. I tested this code on an Amazon EC2 m4.2xlarge instance.

In [2]:
!python --version
print(platform.platform())
print("cpu cores: {0}".format(mp.cpu_count()))
Python 3.5.2 :: Anaconda 4.1.1 (64-bit)
Linux-3.13.0-92-generic-x86_64-with-debian-jessie-sid
cpu cores: 8

Load data

The Python script to download the data is here. The data is fairly large at 864 MB. Let's read it in and see what it looks like.

In [3]:
h1bdataDF = pickle.load(open('h1bdataDF.pkl', 'rb'))
print("\n")
print(h1bdataDF.head(5))
print("\n")
print("h1bdataDF has {0} rows and {1} columns".format(h1bdataDF.shape[0], h1bdataDF.shape[1]))
print("\n")

        Submitted_Date      Case_Number                Employer_Name  \
0  2002-01-14 09:46:00  I-02014-0000001             Monsanto Company   
1  2002-01-14 09:49:00  I-02014-0000003            Priceline.com LLC   
2  2002-01-14 09:58:00  I-02014-0000005     Vestrust Securities L.P.   
3  2002-01-14 10:08:00  I-02014-0000007  Cleveland Clinic Foundation   
4  2002-01-14 10:08:00  I-02014-0000009      World Data Incorporated   

  Employer_City Employer_State Employer_Postal_Code  \
0     St. Louis             MO           63167        
1       Norwalk             CT           06954        
2  Coral Gables             FL           33134        
3     Cleveland             OH           44195        
4    Washington             DC           20036        

                        Job_Title Approval_Status     Wage_Rate  \
0             MOLECULAR BIOLOGIST       Certified   66700.00000   
1     Principal Software Engineer       Certified  123936.00000   
2  Associate Sales & Distribution       Certified   40000.00000   
3              resident physician       Certified   35995.00000   
4         Market Research Analyst       Certified   35608.00000   

  Wage_Rate_Unit Part_Time     Work_City Work_State Prevailing_Wage  
0           Year         N     ST. LOUIS         MO     38869.00000  
1           Year         N       Norwalk         CT     70138.00000  
2           Year         N  Coral Gables         FL     35298.00000  
3           Year         N     Cleveland         OH     37208.00000  
4           Year         N    Washington         DC     37482.00000  


h1bdataDF has 5231213 rows and 14 columns


Filter data

So we have about 5 million records. Let's remove any duplicate applications that may be in the data.

In [4]:
h1bdataDF2 = h1bdataDF.drop_duplicates(['Case_Number'], keep='last')
print("\n")
print("h1bdataDF2 has {0} rows and {1} columns".format(h1bdataDF2.shape[0], h1bdataDF2.shape[1]))

h1bdataDF2 has 5133079 rows and 14 columns

Let's see the unique values in the Approval_Status column.

In [5]:
h1bdataDF.Approval_Status.unique()
Out[5]:
array(['Certified', 'Denied', None, '32000.00000', 'Pending', 'Hold',
       'Debarred', '', 'roye@fragomen.com', 'omboko@jacksonlewis.com',
       'aespiritusanto@fragomen.c', 'mkwok@mltsf.com', 'DENIED',
       'CERTIFIED', 'WITHDRAWN', 'CERTIFIED-WITHDRAWN',
       'PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED', 'REJECTED',
       'INVALIDATED'], dtype=object)

For our purpose, it makes sense to keep only applications that were certified. For example, a denied application does not give us a lot of confidence about the quality of the application. Before we do that, we first need to remove rows with None in the Approval_Status column.

In [6]:
filteredDF = h1bdataDF2.dropna(subset=['Approval_Status'])
print("filteredDF: {0}".format(filteredDF.shape))
certifiedDF = filteredDF[filteredDF.Approval_Status.str.lower().str.contains('certified')]
print("certifiedDF: {0}".format(certifiedDF.shape))
filteredDF: (5133076, 14)
certifiedDF: (4882580, 14)

We'll use the Job_Title column to select data science related jobs. Before we start doing anything with that column, let's first make sure it contains only strings.

In [7]:
rows_with_strings  = certifiedDF[["Job_Title"]].apply(
       lambda row : 
          any([ isinstance(e, str) for e in row ])
       , axis=1) 
    
df_with_no_strings = certifiedDF[["Job_Title"]][~rows_with_strings]
print(df_with_no_strings.head())
        Job_Title
1780173   66007.9
1802719     75483
1817997     27.89
1839158     71400
1891782    541613

It seems some job titles were entered as numbers. Let's clean out all of those non-string job titles.

In [8]:
certifiedDF2 = certifiedDF[rows_with_strings]
certifiedDF2.shape
Out[8]:
(4882528, 14)
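
As an aside, here's a slightly simpler equivalent (just a sketch): test each Job_Title value directly with isinstance instead of applying a row-wise lambda.

# Keep only rows whose Job_Title is actually a string
is_string = certifiedDF['Job_Title'].map(lambda v: isinstance(v, str))
certifiedDF2 = certifiedDF[is_string]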

Now we're ready to select and categorize data science related jobs. I chose the categorization criteria based on my view of and research into the data science field. A better approach might be to do some pre-clustering to see whether interesting patterns emerge around the job titles, but there isn't much context or synopsis accompanying the job titles, so clustering is unlikely to give great results. In any case, my categorization criteria aren't perfect, and I'm open to your ideas on this.

In [9]:
start_time = time.time()

titles_dict = {'data analyst': 'data analyst', 'business analyst':'business analyst',
               'informatic': 'informatics', 'operations research':'operations research',
               'data scien': 'data scientist', 'intelligence': 'business intelligence', 
               'analytics': 'business intelligence', 'data consultan': 'business intelligence', 
               'data manage': 'business intelligence','reporting & analysis': 'business intelligence',
               'data min': 'business intelligence','reporting and analysis': 'business intelligence',
               'database analyst':'data architect/engineer', 'data architect': 'data architect/engineer',
               'data engineer': 'data architect/engineer','data warehous': 'data architect/engineer',
               'database admin': 'data architect/engineer', 'datawarehous': 'data architect/engineer', 
               'data model': 'data architect/engineer','etl developer':'data architect/engineer', 
               'market research analyst': 'market analyst', 'market analyst': 'market analyst', 
               'customer insight': 'market analyst', 'market insight': 'market analyst',
               'consumer insight': 'market analyst', 'marketing insight': 'market analyst',
               'insights analyst': 'market analyst', 'analytic insight': 'market analyst',
               'strategy and insight': 'market analyst', 'mkt. res. analyst': 'market analyst',
               'marketing & insights': 'market analyst', 'global insights': 'market analyst',
               'insights & analytic': 'market analyst', 'marketing & insight': 'market analyst',
               'health care analyst': 'market analyst', 'healthcare analyst': 'market analyst',
               'quantitative analyst': 'market analyst', 'financial analyst': 'market analyst',
               'marketing analyst': 'market analyst', 'management analyst': 'market analyst',
               'business development': 'market analyst', 'data strateg': 'market analyst',
               'decision scien': 'data scientist'}

titles_list = []

for index, row in certifiedDF2.iterrows():
    title = row.Job_Title.lower()
    currentTitle = [val for key,val in titles_dict.items() if key in title]
    if currentTitle:
        titles_list.append(currentTitle[0])
    else:
        titles_list.append("non data science related")
        
certifiedDF2["Job_Title_revised"] = titles_list

# Filter out non data science jobs
dsJobsDF = certifiedDF2[certifiedDF2.Job_Title_revised != "non data science related"]
dsJobsDF.reset_index(drop=True, inplace=True)

print("dsJobsDF shape: {0}".format(dsJobsDF.shape))
print("Time to categorize data science related jobs --- %s seconds ---" % (time.time() - start_time))
dsJobsDF shape: (285421, 15)
Time to categorize data science related jobs --- 499.87189054489136 seconds ---
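
The row-by-row loop above takes a while (about eight minutes here). As an aside, here's a minimal vectorized sketch of the same categorization, assuming the titles_dict defined above. Ties are broken by the leftmost regex match rather than by dictionary lookup order, so a handful of rows could land in a different (but still data science related) category.

# Build one regex that matches any key in titles_dict, then map the first
# match back to its category. re.escape guards keys like 'mkt. res. analyst'.
pattern = '(' + '|'.join(re.escape(k) for k in titles_dict) + ')'
matched_key = certifiedDF2['Job_Title'].str.lower().str.extract(pattern, expand=False)
certifiedDF2['Job_Title_revised'] = matched_key.map(titles_dict).fillna('non data science related')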

Standardize salary

Let's look at the Wage_Rate_Unit column and see whether any standardization of the wages is necessary.

In [10]:
print(dsJobsDF.Wage_Rate_Unit.str.lower().unique())
['year' 'hour' '2 weeks' 'month' 'week' 'bi-weekly' 'yr' 'hr' 'wk' 'mth'
 'bi']

Wage rates are reported hourly, weekly, bi-weekly, monthly and yearly. Let's standardize them to yearly.

In [11]:
start_time = time.time()

dsJobsDF2 = dsJobsDF

# Using regex, strip dollar signs from the wage columns and drop everything from
# the first hyphen onward (keeping the lower bound of any wage range).
# Such characters are common in columns that contain monetary data.
# Before applying the regex, ensure the columns are of string type,
# and remove rows with empty cells in the wage columns.
dsJobsDF2['Wage_Rate'] = dsJobsDF2['Wage_Rate'].astype(str)
dsJobsDF2['Prevailing_Wage'] = dsJobsDF2['Prevailing_Wage'].astype(str)
dsJobsDF2['Wage_Rate'].replace('', np.nan, inplace=True)
dsJobsDF2 = dsJobsDF2.dropna(subset=['Wage_Rate'])
dsJobsDF2['Prevailing_Wage'].replace('', np.nan, inplace=True)
dsJobsDF2 = dsJobsDF2.dropna(subset=['Prevailing_Wage'])

# Remove '$' and anything after a '-' from the Wage_Rate column
dsJobsDF2['Wage_Rate'] = dsJobsDF2['Wage_Rate'].map(lambda x: re.sub('\$|-.*', '', x))
# Remove '$' and anything after a '-' from the Prevailing_Wage column
dsJobsDF2['Prevailing_Wage'] = dsJobsDF2['Prevailing_Wage'].map(lambda x: re.sub('\$|-.*', '', x))
        
# Convert the cleaned wage columns to float
dsJobsDF2[['Wage_Rate','Prevailing_Wage']] = dsJobsDF2[['Wage_Rate','Prevailing_Wage']].apply(pd.to_numeric)

print('dsJobsDF2: {0}'.format(dsJobsDF2.shape))
print("Time to clean wage columns --- %s seconds ---" % (time.time() - start_time))

# Now we can standardize to yearly
start_time = time.time()

salary_offered = []
salary_prevailing = []
for index, row in dsJobsDF2.iterrows():
    if row.Wage_Rate_Unit.lower()[0] == 'y':
        salary_offered.append(row.Wage_Rate)
        salary_prevailing.append(row.Prevailing_Wage)
    elif row.Wage_Rate_Unit.lower()[0] == 'm':
        salary_offered.append(row.Wage_Rate * 12)
        salary_prevailing.append(row.Prevailing_Wage * 12)
    elif row.Wage_Rate_Unit.lower() in ['bi', '2 weeks', 'bi-weekly']:
        salary_offered.append(row.Wage_Rate * 26) # Assumes 52 work wks/yr
        salary_prevailing.append(row.Prevailing_Wage * 26)
    elif row.Wage_Rate_Unit.lower()[0] == 'w':
        salary_offered.append(row.Wage_Rate * 52) # Assumes 52 work wks/yr
        salary_prevailing.append(row.Prevailing_Wage * 52)
    else:
        salary_offered.append(row.Wage_Rate * 2080) # This handles the hourly rates. Assumes 52 work wks of 40 hrs/wk
        salary_prevailing.append(row.Prevailing_Wage * 2080)

dsJobsDF3 = dsJobsDF2
dsJobsDF3["Salary_Offered"] = salary_offered
dsJobsDF3["Salary_Prevailing"] = salary_prevailing

# Remove any rows without salary information
dsJobsDF3 = dsJobsDF3.dropna(subset=['Salary_Offered'])
dsJobsDF3 = dsJobsDF3.dropna(subset=['Salary_Prevailing'])

print('dsJobsDF3: {0}'.format(dsJobsDF3.shape))
dsJobsDF3.reset_index(drop=True, inplace=True) # Resets row index so they start from 0
print("Time to standardize salary --- %s seconds ---" % (time.time() - start_time))
dsJobsDF2: (285413, 15)
Time to clean wage columns --- 1.7312259674072266 seconds ---
dsJobsDF3: (285413, 17)
Time to standardize salary --- 37.64387631416321 seconds ---
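
As an aside, the wage-unit loop above can also be written as a unit-to-multiplier mapping. This is a minimal vectorized sketch assuming the cleaned dsJobsDF2 from the cell above and the unit spellings printed earlier; it produces the same annual figures.

# Map each wage-rate unit to an annual multiplier and scale both wage columns.
unit_multiplier = {'year': 1, 'yr': 1, 'month': 12, 'mth': 12,
                   '2 weeks': 26, 'bi-weekly': 26, 'bi': 26,  # 26 pay periods/yr
                   'week': 52, 'wk': 52,                      # 52 work wks/yr
                   'hour': 2080, 'hr': 2080}                  # 52 wks * 40 hrs/wk
multiplier = dsJobsDF2['Wage_Rate_Unit'].str.lower().map(unit_multiplier)
dsJobsDF2['Salary_Offered'] = dsJobsDF2['Wage_Rate'] * multiplier
dsJobsDF2['Salary_Prevailing'] = dsJobsDF2['Prevailing_Wage'] * multiplier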

Let's get some descriptive statistics on the Salary_Offered column to see the range of salaries we're dealing with here.

In [12]:
print(dsJobsDF3["Salary_Offered"].describe())
print("\n")
print(dsJobsDF3["Salary_Offered"].median())
count    2.854130e+05
mean     1.623086e+05
std      4.644013e+06
min      1.538000e+01
25%      4.761100e+04
50%      6.000000e+04
75%      7.631500e+04
max      1.322880e+09
Name: Salary_Offered, dtype: float64


60000.0

Wow, the maximum annual salary is $1.3 billion! At the other extreme, the minimum is $15.38. These are likely data entry errors; we rarely run into people who are paid these kinds of salaries.

The median salary, $60,000, is a more reasonable reflection of the center of the distribution. For our analysis, we'll consider any salary unlikely (an outlier) if it's less than $23,760 or more than 10 times the median. I'm assuming that a data professional makes at least 2 times the federal poverty line, which is $11,880 in 2016 (2 x 11,880 = 23,760).

The idea is that we want to consider salaries that are typical for the population. Investigating the outlying values can be a study in and of itself.

In [13]:
# Let's keep salaries that are greater than 23760 but less than 10 * the median salary
med_offered = dsJobsDF3["Salary_Offered"].median()
med_prevailing = dsJobsDF3["Salary_Prevailing"].median()
dsJobsDF3 = dsJobsDF3[(dsJobsDF3.Salary_Offered > 23760) & (dsJobsDF3.Salary_Offered < 10 * med_offered)]
dsJobsDF3 = dsJobsDF3[(dsJobsDF3.Salary_Prevailing > 23760) & (dsJobsDF3.Salary_Prevailing < 10 * med_prevailing)]
dsJobsDF3.shape
Out[13]:
(283338, 17)

Add price parity information

It's a good idea to adjust salaries to reflect regional price parity, i.e. differences in the cost of living across states. Let's get regional price parity data from the Bureau of Economic Analysis, www.bea.gov.

In [14]:
start_time = time.time()

url = 'https://www.bea.gov/newsreleases/regional/rpp/2015/xls/rpp0615.xlsx'
response = urlopen(url)
parityDF = pd.read_excel(response, skiprows=3)
parityDF.drop(parityDF.columns[[1,2,3,4,5,6,9]], axis=1, inplace=True)
parityDF.columns = ["State", "2012", "2013"]
parityDF['Price_Deflator'] = (parityDF['2012'] + parityDF['2013']) / 2 
parityDF.drop(parityDF.columns[[1,2]], axis=1, inplace=True)

print("parityDF shape: {0}".format(dsJobsDF.shape))
print("Time to read and clean BEA data --- %s seconds ---" % (time.time() - start_time))
dsJobsDF shape: (285421, 15)
Time to read and clean BEA data --- 0.9652659893035889 seconds ---

Here's what the price parity data looks like.

In [15]:
print(parityDF.head())
        State  Price_Deflator
0     Alabama           93.65
1      Alaska          113.15
2     Arizona          103.80
3    Arkansas           93.30
4  California          120.00

We're almost ready to merge the price parity information with the salary data. The common column between the two data frames is the state, so let's take a look at the state columns in both data frames to make sure they match.

In [16]:
print(dsJobsDF3.Work_State.unique())
['DC' 'CA' 'MO' 'TX' 'NY' 'IL' 'MN' 'MD' 'NJ' 'MA' 'FL' 'CT' 'WA' 'NC' 'VA'
 'PA' 'DE' 'KY' 'GA' 'AZ' 'OH' 'IN' 'TN' 'NH' 'OR' 'MS' 'AL' 'NV' 'MI' 'WI'
 'SC' 'CO' 'OK' 'KS' 'LA' 'ID' 'AR' 'IA' 'RI' 'HI' 'UT' 'WY' 'AK' 'WV' 'PW'
 'NE' 'PR' 'ND' 'SD' 'VT' 'NM' 'ME' 'MT' 'VI' 'FM' 'AS' 'MH' 'GU' 'MP']
In [17]:
print(parityDF.State.unique())
['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'South Carolina'
 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Vermont' 'Virginia'
 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming' nan 'Maximum' 'Minimum'
 'Range' 'Source: U.S. Bureau of Economic Analysis']

Notice two things. First, the price parity dataset contains extraneous rows: 'Maximum', 'Minimum', 'Range', a source note and NaN. Second, the two datasets use different formats for state names.

Let's address the second issue first and convert the state abbreviations in the work visa dataset to full state names using the list here.

In [18]:
dsJobsDF3 = dsJobsDF3.dropna(subset=['Employer_State'])
print('dsJobsDF3: {0}'.format(dsJobsDF3.shape))
response = urlopen('https://raw.githubusercontent.com/sedeh/github.io/master/resources/states.txt')
lines = response.readlines()
state_names_dict = {}
for line in lines:
    state_code, state_name = line.decode().split(":")
    state_names_dict[state_code.strip()] = state_name.strip()

state_names_list = []

for index, row in dsJobsDF3.iterrows():
    try:
        state_names_list.append(state_names_dict[row.Work_State])
    except KeyError:
        # Fall back to the employer's state when the work-state code isn't in the lookup
        state_names_list.append(state_names_dict[row.Employer_State])
dsJobsDF4 = dsJobsDF3
dsJobsDF4["Work_State_Code"] = dsJobsDF4.Work_State
dsJobsDF4["Work_State"] = state_names_list
dsJobsDF3: (283323, 17)
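
As an aside, the abbreviation-to-name conversion above can be vectorized with map. A minimal sketch, assuming the state_names_dict built above:

# Map work-state codes to full names; where a code is missing from the lookup,
# fall back to the name derived from the employer's state.
work_names = dsJobsDF3['Work_State'].map(state_names_dict)
employer_names = dsJobsDF3['Employer_State'].map(state_names_dict)
dsJobsDF4 = dsJobsDF3
dsJobsDF4['Work_State_Code'] = dsJobsDF4['Work_State']
dsJobsDF4['Work_State'] = work_names.fillna(employer_names)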

We can now remove the extraneous rows in the price parity data and merge the parity and work visa datasets.

In [19]:
parityDF2 = parityDF[:-5]
parityDF2 = parityDF2.dropna(subset=['State'])
print('parityDF2: {0}'.format(parityDF2.shape))
dsJobsDF5 = dsJobsDF4.merge(parityDF2,how='left', left_on='Work_State', right_on='State')
print('dsJobsDF5: {0}'.format(dsJobsDF5.shape))
parityDF2: (51, 2)
dsJobsDF5: (283323, 20)

We're ready to use the price parity information to adjust salaries for regional cost of living differences. Notice that not all states have price parity information; there are NaN values in the Price_Deflator column below. We'll replace NaN with 100.0 so salaries for those states remain unchanged after the adjustment.

In [20]:
print('nan present')
print(dsJobsDF5.Price_Deflator.unique())
print("\n")
where_are_NaNs = np.isnan(dsJobsDF5.Price_Deflator)
dsJobsDF5.Price_Deflator[where_are_NaNs] = 100.0
print('nan replaced with 100.0')
print(dsJobsDF5.Price_Deflator.unique())
nan present
[    nan  120.     94.9   102.9   122.75  107.5   104.05  118.1   122.1
  114.1   105.35  115.95  110.     97.65  109.8   104.95  108.25   94.75
   97.9   103.8    95.35   97.35   96.6   112.75  105.15   92.3    93.65
  104.9   100.6    99.2   108.65   95.8    96.65   97.3    99.3    93.3
   96.05  104.8   124.35  103.6   102.3   113.15   94.3    96.4    97.1
   93.95  107.    101.3   104.4   100.3 ]


nan replaced with 100.0
[ 100.    120.     94.9   102.9   122.75  107.5   104.05  118.1   122.1
  114.1   105.35  115.95  110.     97.65  109.8   104.95  108.25   94.75
   97.9   103.8    95.35   97.35   96.6   112.75  105.15   92.3    93.65
  104.9   100.6    99.2   108.65   95.8    96.65   97.3    99.3    93.3
   96.05  104.8   124.35  103.6   102.3   113.15   94.3    96.4    97.1
   93.95  107.    101.3   104.4   100.3 ]
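
As an aside, the same NaN replacement can be written without chained indexing; a minimal equivalent sketch:

# Fill missing price deflators with 100.0, i.e. the national average
dsJobsDF5['Price_Deflator'] = dsJobsDF5['Price_Deflator'].fillna(100.0)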

We're ready to adjust salaries with the price parity information. This isn't perfect because we're using price parity data from 2012 and 2013 to adjust salary data from 2002 to 2015. However, relative price levels tend to be fairly consistent from year to year across states. For example, New York has been more expensive to live in than Iowa for as long as we can remember.

In [21]:
start_time = time.time()

salary_offered_adjusted = []
salary_prevailing_adjusted = []

for index, row in dsJobsDF5.iterrows():
    salary_offered_adjusted.append(row.loc['Salary_Offered'] / (row.loc['Price_Deflator'] / 100))
    salary_prevailing_adjusted.append(row.loc['Salary_Prevailing'] / (row.loc['Price_Deflator'] / 100))
    
dsJobsDF6 = dsJobsDF5
dsJobsDF6["Offered_Salary_Adjusted"] = salary_offered_adjusted
dsJobsDF6["Prevailing_Salary_Adjusted"] = salary_prevailing_adjusted

print("Time to adjust salary for inflation --- %s seconds ---" % (time.time() - start_time))
Time to adjust salary for inflation --- 59.78936195373535 seconds ---
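
For what it's worth, the loop above takes about a minute, and the same adjustment can be expressed as a vectorized division (an equivalent sketch). A deflator of 120 means prices in that state are 20% above the national average, so dividing by deflator / 100 expresses salaries in national-average dollars.

# Express salaries in national-average dollars
dsJobsDF5['Offered_Salary_Adjusted'] = dsJobsDF5['Salary_Offered'] / (dsJobsDF5['Price_Deflator'] / 100)
dsJobsDF5['Prevailing_Salary_Adjusted'] = dsJobsDF5['Salary_Prevailing'] / (dsJobsDF5['Price_Deflator'] / 100)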

Add population information

Simply knowing that California has 30,000 data science related jobs compared to, say, 3,500 for Alaska does not tell us all that much. Ideally, we want a relative measure, such as the number of H1B jobs per capita. So let's add population information from census.gov to help us do that.

In [22]:
start_time = time.time()

url = 'http://www.census.gov/popest/data/state/totals/2015/tables/NST-EST2015-01.xlsx'
response = urlopen(url)
xl = pd.ExcelFile(response)
sheet_names = xl.sheet_names
DF = xl.parse(sheet_names[0], skiprows=8)
DF.columns = ["State", "Census", "Estimates Base", "2010", "2011", "2012", "2013", "2014", "Census_2015"]
censusDF = DF[["State", "Census_2015"]]
censusDF = censusDF.iloc[0:51,]
print(censusDF.head())
print("\n")

# Remove dot sign from the State column
censusDF['State'] = censusDF['State'].map(lambda x: re.sub('\.', '', x))
censusDF['State'] = censusDF['State'].str.title()

# Add population of US territories not included in the data source
temp = pd.DataFrame([['Puerto Rico', 3474182],
                   ['Guam', 159358],
                   ['Virgin Islands', 106405],
                   ['American Samoa', 55519],
                   ['Northern Mariana Islands', 53883],
                   ['Palau', 20918],
                   ['Federated States Of Micronesia', 103549],
                   ['Marshall Islands', 52634]], columns=["State", "Census_2015"])
censusDF = censusDF.append(temp, ignore_index=True)
print(censusDF.head())
print("\n")
print("censusDF: {0}".format(censusDF.shape))
print("Time to read and clean census data --- %s seconds ---" % (time.time() - start_time))
         State  Census_2015
0     .Alabama    4858979.0
1      .Alaska     738432.0
2     .Arizona    6828065.0
3    .Arkansas    2978204.0
4  .California   39144818.0


        State  Census_2015
0     Alabama    4858979.0
1      Alaska     738432.0
2     Arizona    6828065.0
3    Arkansas    2978204.0
4  California   39144818.0


censusDF: (59, 2)
Time to read and clean census data --- 0.40192627906799316 seconds ---

Now we can add the census data to our work visa data.

In [23]:
dsJobsDF7 = dsJobsDF6.merge(censusDF,how='left', left_on='Work_State', right_on='State')
dsJobsDF7.shape
Out[23]:
(283323, 24)
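
As a preview of how the population column will be used in part 2, here's a minimal sketch of a per capita count (jobs per 100,000 residents), assuming the merged dsJobsDF7 above; the actual exploration is left for the next installment.

# Count applications per work state and scale by the state's 2015 population
state_counts = dsJobsDF7.groupby('Work_State').size()
state_population = dsJobsDF7.groupby('Work_State')['Census_2015'].first()
jobs_per_100k = state_counts / state_population * 1e5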

Prune data

At this point, we can drop columns we don't need. Here are the columns in our dataset.

In [24]:
dsJobsDF7.columns
Out[24]:
Index(['Submitted_Date', 'Case_Number', 'Employer_Name', 'Employer_City',
       'Employer_State', 'Employer_Postal_Code', 'Job_Title',
       'Approval_Status', 'Wage_Rate', 'Wage_Rate_Unit', 'Part_Time',
       'Work_City', 'Work_State', 'Prevailing_Wage', 'Job_Title_revised',
       'Salary_Offered', 'Salary_Prevailing', 'Work_State_Code', 'State_x',
       'Price_Deflator', 'Offered_Salary_Adjusted',
       'Prevailing_Salary_Adjusted', 'State_y', 'Census_2015'],
      dtype='object')
In [25]:
dsJobsDF_final = dsJobsDF7.drop(dsJobsDF7.columns[[1,3,4,5,6,7,8,9,10,13,15,16,18,22]], axis=1)
dsJobsDF_final.rename(columns={'Job_Title_revised':'Job_Category'}, inplace=True)

Here's our final dataset after dropping some columns.

In [26]:
print(dsJobsDF_final.columns)
print("\n")
print("dsJobsDF_final has {0} rows and {1} columns".format(dsJobsDF_final.shape[0], dsJobsDF_final.shape[1]))
Index(['Submitted_Date', 'Employer_Name', 'Work_City', 'Work_State',
       'Job_Category', 'Work_State_Code', 'Price_Deflator',
       'Offered_Salary_Adjusted', 'Prevailing_Salary_Adjusted', 'Census_2015'],
      dtype='object')


dsJobsDF_final has 283323 rows and 10 columns

Persist data

The data is now ready for prime time! We're going to explore it in Where the data science related jobs are (part 2). Before we save the data, let's quickly convert the date strings in the Submitted_Date column to datetime, and then we can save away.

In [27]:
print(type(dsJobsDF_final.Submitted_Date[0]))
s = dsJobsDF_final["Submitted_Date"]
ts = pd.Series([pd.to_datetime(date_string) for date_string in s])
dsJobsDF_final["Submitted_Date"] = ts
print(type(dsJobsDF_final.Submitted_Date[0]))
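
As an aside, pd.to_datetime accepts a Series directly, so the conversion above can be written as an equivalent one-liner (sketch):

dsJobsDF_final['Submitted_Date'] = pd.to_datetime(dsJobsDF_final['Submitted_Date'])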


In [28]:
dsJobsDF_final.to_csv('dataScienceJobs.csv', index=False)
In [29]:
print("Total time --- %s minutes ---" % ((time.time() - total_time) / 60.0))
Total time --- 12.587802549203237 minutes ---