Han Zhang
One way to understand how a city government works is by looking at who it employs and how its employees are compensated.
The SF Salaries dataset contains the names, job titles, and compensation of San Francisco city employees on an annual basis from 2011 to 2014, and it includes both numerical and categorical features. It matches the Lab One requirements and is free to download at kaggle.com. The main idea of this analysis is to understand the income structure and make further predictions by visualizing the features and exploring the relationships among them.
By visualizing the features and exploring their relationships, we can help the government respond to public sentiment by adjusting related policies.
Overall, the analysis can also help government and private organizations make future hiring decisions, and help job seekers match the best price for their work.
=======================================================================
Dataset: SF Salaries. URL: https://www.kaggle.com/kaggle/sf-salaries.
# load the SF-salaries dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
#use notebook visualization
%matplotlib inline
# never print matching warnings
warnings.filterwarnings("ignore")
# Read CSV (comma-separated) file into DataFrame
df_salariesInput = pd.read_csv('/Users/hanzhang/myJupyter/My Practices/Lab_One/Salaries.csv')
df_salariesInput.head()
BasePay, OvertimePay, OtherPay, and Benefits have the wrong types: they are objects, but they should be numeric, so we convert them. Since these columns may contain non-numeric values, the errors='coerce' argument is used to coerce invalid values to NaN.
# convert the pay columns to numeric
for column in ['BasePay', 'OvertimePay', 'OtherPay', 'Benefits']:
    df_salariesInput[column] = pd.to_numeric(df_salariesInput[column], errors='coerce')
print(df_salariesInput.info())
Based on the DataFrame information above, especially the non-null counts compared to the RangeIndex, some features have no missing values, such as "JobTitle" and "TotalPay", while others do, such as "Benefits".
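To make the missing values concrete, we can count them per column (a quick supplementary check, not part of the original workflow):
# number of missing values in each column
print(df_salariesInput.isnull().sum())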
Let's then take a closer look at them.
print('Agency unique values:')
print(df_salariesInput['Agency'].unique())
print('Status unique values:')
print(df_salariesInput['Status'].unique())
print('Notes unique values:')
print(df_salariesInput['Notes'].unique())
The Agency column contains a single value throughout, so it gives us no useful information, and the Notes column seems to be missing most of its values. Status contains the work schedule and has two options: Part-Time (PT) or Full-Time (FT). Later, we'll use the rows with meaningful Status values.
Let's drop Agency, Notes and Id columns using drop method.
# axis is either 0 for rows, 1 for columns
df_salariesInput.drop(["Id", "Notes", "Agency"], axis = 1, inplace = True)
df_salariesInput.head()
Since we have quite a few missing values in Benefits and BasePay, we will simply drop the rows with missing values in those two features.
#df_salariesInput = df_salariesInput[pd.notnull(df_salariesInput['Benefits'])]
#df_salariesInput.info()
df = df_salariesInput.dropna(subset=['Benefits'])
#df_salariesInput.info()
df_dropNan = df.dropna(subset=['BasePay'])
df_dropNan.info()
Now the only missing values are in Status, which we will keep and use later.
df_dropNan["JobTitle"] = df_salariesInput.JobTitle.str.replace("\\d+|\b\w\b","").str.lower()
# create a data description table
data_tab = pd.DataFrame()
data_tab['Features'] = df_dropNan.columns
data_tab['Description'] = ['employee name', 'job title', 'the base rate of pay for a job',
                           'pay received for time worked beyond normal working hours',
                           'supplemental pay, awards, etc.',
                           'non-wage compensation provided to employees in addition to their normal wages or salaries',
                           'total of base, overtime, and other pay',
                           'TotalPay plus Benefits',
                           'calendar year of the record',
                           'whether the position is part-time or full-time']
data_tab['Scales'] = ['nominal'] * 2 + ['ratio'] * 6 + ['interval'] + ['nominal']
data_tab['Discrete/Continuous'] = ['discrete'] * 2 + ['continuous'] * 6 + ['discrete'] * 2
#data_tab['Range'] = [repr(set(df_salariesInput.EmployeeName))] + [repr(set(df_salariesInput.JobTitle))] + ['n'] * 6 + ['2011, 2012,2013,2014'] + ['FT, PT, NaN']
data_tab
df_dropNan[df_dropNan['TotalPay'] < 0]
# list the top 5 people whose TotalPay is 0.
df_dropNan[df_dropNan['TotalPay'] == 0].head(5)
print("People work for no money: ",len(df_dropNan[df_dropNan['TotalPay'] == 0]))
So we have 73 people with 0 TotalPay. Perhaps they’re retired, on disability, or something else?
df_dropNan[(df_dropNan['TotalPay'] < 100) & (df_dropNan['TotalPay'] > 0)].head()
We have 372 people with a total pay under $100. This probably includes people who were paid for, say, an hour of their time.
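The exact count behind that figure (a small supplementary check, not part of the original workflow):
# count employees with a TotalPay strictly between 0 and 100 dollars
print(len(df_dropNan[(df_dropNan['TotalPay'] > 0) & (df_dropNan['TotalPay'] < 100)]))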
Now let's take a look at the categorical variables. The representation of the JobTitle feature needs to be fixed a bit, since the same job written in uppercase and lowercase is treated as two different jobs.
# Let's fix it and list the top 10 most popular jobs.
df_dropNan["JobTitle"] = df_dropNan.JobTitle.str.replace(r"\d+|\b\w\b", "").str.lower()
df_dropNan["JobTitle"].value_counts().head(10)
fullTime = df_dropNan[df_dropNan['Status'] == 'FT']
partTime = df_dropNan[df_dropNan['Status'] == 'PT']
unique_job_titles = df_dropNan.JobTitle.unique()
# subtract 1 to exclude the "Not provided" placeholder title
print('All unique job titles:', len(df_dropNan['JobTitle'].unique()) - 1)
print('Full-time unique job titles:', len(fullTime['JobTitle'].unique()) - 1)
print('Part-time unique job titles:', len(partTime['JobTitle'].unique()) - 1)
The values in this column are highly varied, so we need to simplify them. I'm going to split all job titles into individual words and count their frequency. Then I'll look through the top 200 words and try to divide them into several job groups. Each group corresponds to a set of words that appear in the job titles of people working in that group. Finally, I'll assign each person to one of the groups, and the rest will be placed in a group called 'Other'.
# Resource:https://www.kaggle.com/dmitriy19/explore-sf-salary-data-quick-eda
from collections import Counter
job_titles = df_dropNan['JobTitle'].unique()[:-1] # deleting the last element "Not provided"
words_in_titles = []
for job_title in job_titles:
    words_in_titles += job_title.lower().split()
# a little cleaning: drop pure numbers and very short words
words = []
for word in words_in_titles:
    if not word.isdigit() and len(word) > 3:
        words.append(word)
words_count = Counter(words)
words_count.most_common(20)
# define our own job groups
job_groups = {'Fire' : ['fire'],
'Animal' : ['animal'],
'Mayor' : ['mayor'],
'Library' : ['librar'],
'Parking' : ['parking'],
'Clerk' : ['clerk'],
'Porter' : ['porter'],
'Engineer and Tech': ['engineer', 'programmer', 'electronic', 'tech'],
'Court' : ['court', 'legal', "attorney's", 'atty', 'eligibility'],
'Police' : ['sherif', 'officer', 'police', 'probation', "sheriff's", 'sergeant'],
'Medical' : ['nurse', 'medical', 'health', 'physician', 'therapist', 'psychiatric',
'treatment', 'hygienist','anesthetist','nursing','forensic toxicologist', 'patient'],
'Public Works' : ['public'],
'Food Service' : ['food'],
'Architectural' : ['architect'],
'Law': ['attorney'],
'Mechanic':['mechanic'],
'Transportation':['transit','transportation','airport'],
'Finance':['investments']}
def transform_func(title):
    """Map a job title to one of the predefined job groups, or 'Other'."""
    title = title.lower()
    for key, value in job_groups.items():
        for each_value in value:
            if title.find(each_value) != -1:
                return key
    return 'Other'
df_dropNan['JobGroup'] = df_dropNan['JobTitle'].apply(transform_func)
df_dropNan.head(10)
df_dropNan['JobGroup'].unique()
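As a quick sanity check of the mapping (the example titles below are only illustrative):
# a few illustrative titles and the groups they map to
print(transform_func('deputy sheriff'))    # Police
print(transform_func('registered nurse'))  # Medical
print(transform_func('gardener'))          # Other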
Since we dropped the rows with missing Benefits, we are only dealing with data gathered in 2012, 2013, and 2014.
df_dropNan['Year'].unique()
tp_2011 = df_dropNan[df_dropNan['Year'] == 2011]  # empty after dropping rows with missing Benefits
tp_2012 = df_dropNan[df_dropNan['Year'] == 2012]
tp_2013 = df_dropNan[df_dropNan['Year'] == 2013]
tp_2014 = df_dropNan[df_dropNan['Year'] == 2014]
fig, ax = plt.subplots(figsize=(9, 6))
sns.kdeplot(tp_2012['TotalPay'].dropna(), label="2012", shade=True, ax=ax)
sns.kdeplot(tp_2013['TotalPay'].dropna(), label="2013", shade=True, ax=ax)
sns.kdeplot(tp_2014['TotalPay'].dropna(), label="2014", shade=True, ax=ax)
plt.xlabel('TotalPay')
plt.ylabel('Density')
title = plt.title('Total Pay Distribution for Each Year')
There are clearly outliers at the higher end of the range, and the violin plot below makes these outliers easier to see than the density plot above.
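The summary statistics make the long upper tail explicit (a quick supplementary check, not part of the original workflow):
# summary of TotalPay shows the spread and the large maximum
print(df_dropNan['TotalPay'].describe())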
# violin plot of TotalPay by year
sns.violinplot(x='Year', y='TotalPay', data=df_dropNan)
Observation:
It appears that 2014 may have had a very slight increase relative to the previous two years.
Next, let's look at Benefits and how they have changed over the years.
ax = sns.violinplot(x="Year", y="Benefits", data=df_dropNan)
# set data limit on y-axis
ax.set_ylim((0, df_dropNan.Benefits.max()))
Observation:
g = sns.FacetGrid(df_dropNan, col="JobGroup", col_wrap=4, size=4.5, dropna=True)
res = g.map(sns.kdeplot, 'TotalPay', shade=True)
Observations:
Employees in the Finance group (job titles containing "investments") receive quite high salaries (see the group means computed below).
Most of the graphs have more than one spike. This likely reflects the pay structure within that specific job group: some people hold senior posts, others junior ones.
Some groups have a single spike (Architectural, Fire, Mayor, etc.), which suggests that people in these groups generally earn similar salaries.
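We can check the group-level pay directly (a supplementary check, not part of the original workflow):
# mean TotalPay per job group, highest first
print(df_dropNan.groupby('JobGroup')['TotalPay'].mean().sort_values(ascending=False).head(10))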
fig, ax = plt.subplots(figsize=(9, 6))
ft_job = df_dropNan[df_dropNan['Status'] == 'FT']
pt_job = df_dropNan[df_dropNan['Status'] == 'PT']
notprovided = df_dropNan[(df_dropNan['Status'] != "FT") & (df_dropNan['Status'] != "PT")]
sns.kdeplot(ft_job.TotalPay, label="Full-Time", shade=True, ax=ax)
sns.kdeplot(pt_job.TotalPay, label="Part-Time", shade=True, ax=ax)
sns.kdeplot(notprovided.TotalPay, label="NaN", shade=True, ax=ax)
plt.xlabel('TotalPay')
plt.ylabel('Density')
title = plt.title('Total Pay Distribution')
Observations:
Part Time (PT) employees are compensated according to how much the city uses them - which is much less than full-time employees.
The story for the blank Status entries is a little more complicated. The distribution is bimodal, so it looks like many of the blank entries should be full time and some should be part time.
Let's take a closer look at the Law job group. We will divide its employees into two groups, FT and PT (dropping the rows with missing Status here).
ft_law = df_dropNan[(df_dropNan['Status'] == 'FT') & (df_dropNan['JobGroup'] == 'Law')]
pt_law = df_dropNan[(df_dropNan['Status'] == 'PT') & (df_dropNan['JobGroup'] == 'Law')]
fig, ax = plt.subplots(figsize=(9, 6))
sns.kdeplot(ft_law['TotalPay'].dropna(), label="Full-Time", shade=True, ax=ax)
sns.kdeplot(pt_law['TotalPay'].dropna(), label="Part-Time", shade=True, ax=ax)
plt.xlabel('TotalPay')
plt.ylabel('Density')
title = plt.title('Law Total Pay Distribution')
Observations:
# ft_job = df_dropNan[df_dropNan['Status'] == 'FT']
# we have 22334 full-time records available
len(ft_job)
ft_job.Year.unique()
ft_job.info()
Now we can see why there are so many missing values in Status: only records from 2014 have a Status assigned.
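A quick tabulation of Status by Year backs this up (a supplementary check, not part of the original workflow):
# how many FT, PT, and missing Status records there are in each year
print(pd.crosstab(df_dropNan['Year'], df_dropNan['Status'].fillna('missing')))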
ft2014 = ft_job[(ft_job.Year == 2014) & (ft_job.Status == 'FT') ]
fig, ax = plt.subplots(figsize=(9, 6))
sns.kdeplot(ft2014['BasePay'], label="BasePay", shade=True, ax=ax)
sns.kdeplot(ft2014['Benefits'], label="Benefits", shade=True, ax=ax)
sns.kdeplot(ft2014['OvertimePay'], label="OvertimePay", shade=True, ax=ax)
plt.xlabel('Pay')
plt.ylabel('Density')
title = plt.title('Full Time Jobs Pay Distribution')
Observations:
We see a few spikes in Benefits and BasePay. Perhaps these arise from different types of jobs, or different seniority levels.
Interestingly, the lowest Benefits value is negative.
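We can pull up those rows directly (a supplementary check, not part of the original workflow):
# full-time 2014 rows where Benefits is negative
ft2014[ft2014['Benefits'] < 0].head()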
# get the top ten jobs
top_ten_occupations = df_dropNan.JobTitle.value_counts().sort_values(ascending=False).head(10).index
top_ten_occupations
# aggregate by job title and pick out the BasePay, Benefits, and Overtime features
pay_by_occupation = (df_dropNan[df_dropNan.JobTitle.isin(top_ten_occupations)]
.groupby('JobTitle')[['BasePay', 'Benefits', 'OvertimePay']]
.aggregate('mean')
)
ax = pay_by_occupation.plot(kind='barh', figsize=(8,8))
ax.set_xlabel('Mean Pay')
Recreation leader stands out in the chart above with a notably low mean base pay. Let's see what's happening.
print("Number of employee as recreation leader:")
len(df_dropNan[df_dropNan.JobTitle == 'recreation leader'])
print("Number of employee as part-time recreation leader:")
len(df_dropNan[(df_dropNan.JobTitle == 'recreation leader') & (df_dropNan.Status == 'PT')])
People working as recreation leaders generally get a relatively low base pay. Part-time positions make up almost 35% of them, which might be a factor.
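The share quoted above can be computed directly (a small check, not part of the original workflow):
# fraction of recreation leaders whose Status is part-time
rec = df_dropNan[df_dropNan.JobTitle == 'recreation leader']
print('Part-time share of recreation leaders:', round(len(rec[rec.Status == 'PT']) / len(rec), 3))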
Let's select the BasePay from the top 2 jobs, firefighter and police officer, in 2014.
df_combine = df_dropNan[((df_dropNan.JobTitle == 'firefighter') | (df_dropNan.JobTitle == 'police officer')) & (df_dropNan.Year == 2014)]
g = sns.factorplot(x="JobTitle", y="BasePay", hue="Status", data=df_combine, kind="box",size=4, aspect=.7);
Observations:
The IQR in the boxplot for police officer is much wider than that for firefighter, for both PT and FT. It is not clear why.
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df_dropNan.corr(), cmap=cmap, annot=True)
f.tight_layout()
plt.title('Correlation Matrix Graph')
Observations:
Scatterplot matrices are a great way to roughly determine whether there are linear correlations between multiple variables, and they are particularly helpful for pinpointing specific variables that behave similarly.
pay_columns = df_dropNan.columns[2:df_dropNan.columns.get_loc('Year')]
df_temp1 = df_dropNan[(df_dropNan['JobGroup'] == 'Medical') |(df_dropNan['JobGroup'] == 'Police')|
(df_dropNan['JobGroup'] == 'Fire') |(df_dropNan['JobGroup'] == 'Engineer and Tech')]
pd.plotting.scatter_matrix(df_temp1[pay_columns], figsize=(8,8))
Observation:
From the matrix of scatter plots, we can see that these features on their own do not separate the four job groups very well. Therefore, we will transform the features into two principal components using PCA in Section 4.
# Reference: http://seaborn.pydata.org/generated/seaborn.jointplot.html
df1 = pd.DataFrame(df_dropNan, columns = ['OvertimePay', 'BasePay'])
g = sns.jointplot("OvertimePay", "BasePay", data=df1, kind="reg")
Pearson correlation coefficient = 0.28
OvertimePay and BasePay have a small positive linear correlation.
df2 = pd.DataFrame(df_dropNan, columns = ['OvertimePay', 'Benefits'])
g = sns.jointplot("OvertimePay", "Benefits", data=df2, kind="reg")
Pearson correlation coefficient = 0.3
OvertimePay and Benefits have a small positive linear correlation.
df3 = pd.DataFrame(df_dropNan, columns = ['BasePay', 'Benefits'])
g = sns.jointplot("BasePay", "Benefits", data=df3, kind="reg")
Pearson correlation coefficient = 0.92
That is to say, BasePay and Benefits have a strong positive linear correlation: people with a high BasePay are very likely to also receive high Benefits.
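The three coefficients quoted above can be reproduced in a single call (a supplementary check, not part of the original workflow):
# pairwise Pearson correlations among the three pay components
print(df_dropNan[['BasePay', 'OvertimePay', 'Benefits']].corr(method='pearson'))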
In this section, four job groups are chosen for analysis: 'Fire', 'Police', 'Engineer and Tech', and 'Medical', which are replaced with the numbers 1, 2, 3, 4 (in that order) for later use.
From the pairplot below, we can see that the job groups are not easily separated linearly in most of the attribute pairs.
from sklearn.decomposition import PCA
raw_data = pd.DataFrame(df_dropNan)  # the data at this point has 111886 rows × 11 columns
df_temp = df_dropNan[(df_dropNan['JobGroup'] == 'Medical') |(df_dropNan['JobGroup'] == 'Police')|
(df_dropNan['JobGroup'] == 'Fire') |(df_dropNan['JobGroup'] == 'Engineer and Tech')]
sns.pairplot(df_temp, size=3,hue='JobGroup')
Now, we will use PCA to reduce the feature dimension and check whether types will be linearly separated in the reduced space.
First, the data need to be fixed a bit.
# replace the four job groups with number 1,2,3,4
df_temp.replace(to_replace = ['Fire', 'Police', 'Engineer and Tech','Medical'], value = range(1,5), inplace = True)
# get rid of all the str features, 39103*8 matrix
df_temp.drop(["Status",'JobTitle','EmployeeName'], axis = 1, inplace = True)
df_temp.head()
Now, all we have is numerical data.
The idea of PCA is to map the original data to a new space with lower dimensionality, without using any prior information. Because it is a purely mathematical transformation, the meaning of the attributes in the new space is not directly interpretable.
Transpose the data to an 8 × 39103 matrix, where every column is an instance and every row is one of the 8 attributes. PCA then reduces the dimensionality to an n_components × 39103 matrix, where n_components is 2.
# set reduced dimensionality = 2 and plot
# Reference: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html
pca = PCA(n_components=2)
pca.fit(df_temp.T)
# print components
print ('pca:', pca.components_)
print ('pca shape:', pca.components_.shape)
# components_ : array, shape (n_components, n_features), 39103 by 2 matrix here
x_new = pca.components_.T[:,0]
y_new = pca.components_.T[:,1]
jobgroup = np.array(df_temp.JobGroup)
index = np.arange(df_temp.JobGroup.count())
Plot it!
Also print the explained variance ratio of the first two components. Naturally, if the proportion of variation explained by the first k principal components is large, then not much information is lost by considering only the first k principal components.
type_1 = plt.scatter(x=x_new[index[jobgroup == 1]], y=y_new[index[jobgroup == 1]], color='turquoise', s=2)
type_2 = plt.scatter(x=x_new[index[jobgroup == 2]], y=y_new[index[jobgroup == 2]], color='darkorange', s=2)
type_3 = plt.scatter(x=x_new[index[jobgroup == 3]], y=y_new[index[jobgroup == 3]], color='r', s=2)
type_4 = plt.scatter(x=x_new[index[jobgroup == 4]], y=y_new[index[jobgroup == 4]], color='mediumpurple', s=2)
plt.legend((type_1, type_2, type_3, type_4), ('Fire', 'Police', 'Engineer and Tech', 'Medical'), loc='lower left')
plt.xlabel("attribute_1")
plt.ylabel("attribute_2")
plt.title('PCA of Salaries of 4 Job Groups')
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))
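For comparison, the more conventional PCA workflow fits on the data itself, with rows as samples and standardized columns. The sketch below assumes df_temp as prepared above and is not the method used in this notebook:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# JobGroup is the label here, so keep only the numeric pay/year features
features = df_temp.drop('JobGroup', axis=1)
scaled = StandardScaler().fit_transform(features)      # zero mean, unit variance per column
projected = PCA(n_components=2).fit_transform(scaled)  # shape: (n_samples, 2)
print(projected.shape)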
Observation:
Kaggle. SF Salaries. https://www.kaggle.com/kaggle/sf-salaries (Accessed Aug. 2017)