STEM Salaries (W22)
Project Leaders: Radi Akbar, "Alex" Yoon Sung Ji
Team 1: Alex Shandilis, Alyssa Flak, Gwenyth Jones
Team 2: Boren Ke, Jonathon Sonneborn, Daniel Malis
Team 3: Tejas Maire, Andrew Black
One of the main reasons people go to college is the prospect of improving their welfare. According to the BLS, software developers, quality assurance analysts, and testers are among the fastest-growing occupations in terms of employment and demand (BLS, 2019). Although the computer science industry is growing, compensation differs across individuals and jobs within it. Our project draws on a Kaggle data set that records individual-level compensation along with each person's traits. Our goal is to understand compensation trends for STEM jobs by exploring different dimensions of the data and presenting them in a Tableau visualization dashboard. The first team explores compensation trends across race, gender, and company size. The second team investigates pay across education level, geography, and job title. The third team analyzes serial trends in the compensation of each job.
One particularly useful aspect of this data set was its location data, stored as strings containing city, country, and other information. However, all of this was consolidated into a single column, with no latitude or longitude values of the kind that geolocation-based visualizations rely on. We therefore needed to parse the strings and generate new, more useful columns that visualization tools like Tableau could handle more easily. Using Python, we found two particularly useful methods for achieving this.
First, for locations in the United States, the location string contained the city name and the two-letter state code, separated by a comma, while entries for other countries contained the full country name as well as the full province or state name. This made it possible to use a simple Python algorithm that identifies U.S. entries by the number of commas in the location string and then splits on the comma to extract the city and state. Second, we used the third-party Nominatim geocoding API through a Python library to generate new columns for city, state/province, country, latitude, and longitude for every location entry in the data set. This method gave us a data set covering locations worldwide rather than only the United States.
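The comma-based rule described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual script: the function name and the example strings are hypothetical, and it assumes that U.S. entries contain exactly one comma ("City, ST") while all other entries contain at least two ("City, Region, Country").

```python
def parse_location(location):
    """Split a raw location string into city, state, and country fields.

    Assumption: U.S. entries look like "City, ST" (one comma) and all
    other entries look like "City, Region, Country" (two or more commas).
    """
    parts = [part.strip() for part in location.split(",")]
    if len(parts) == 2:
        # One comma: treat as a U.S. city plus two-letter state code.
        city, state = parts
        return {"city": city, "state": state, "country": "United States"}
    if len(parts) >= 3:
        # Two or more commas: first part is the city, last is the country.
        return {"city": parts[0], "state": parts[1], "country": parts[-1]}
    # No comma at all: leave the entry unparsed rather than guess.
    return {"city": location.strip(), "state": None, "country": None}


print(parse_location("Seattle, WA"))
print(parse_location("Toronto, Ontario, Canada"))
```

Entries that do not follow either pattern are returned with empty state and country fields so they can be reviewed by hand.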
Extracting this information gave us the columns we needed to build a wider variety of interesting and useful visualizations in Tableau and the other tools we tried.
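The geocoding step could look something like the sketch below. The write-up does not say which Python package was used to reach Nominatim, so the use of geopy here is an assumption, as are the function names. The first function geocodes each distinct location string only once, which matters because many rows share the same location and public geocoding services limit request rates.

```python
def geocode_unique(locations, geocode):
    """Geocode each distinct location string once and return a lookup table.

    `geocode` is any callable that takes a query string and returns an
    object with `latitude`, `longitude`, and a `raw` dict, or None when
    nothing is found. geopy's Nominatim.geocode has this shape.
    """
    lookup = {}
    for location in locations:
        if location in lookup:
            continue  # repeated strings reuse the earlier result
        result = geocode(location)
        if result is None:
            lookup[location] = None
            continue
        address = result.raw.get("address", {})
        lookup[location] = {
            # Smaller places may be reported as a town or village.
            "city": address.get("city") or address.get("town") or address.get("village"),
            "state": address.get("state"),
            "country": address.get("country"),
            "latitude": result.latitude,
            "longitude": result.longitude,
        }
    return lookup


def make_nominatim_geocoder(user_agent):
    """Build a rate-limited geocoder (assumes the geopy package is installed)."""
    from functools import partial
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # addressdetails=True asks Nominatim to include the address breakdown.
    lookup_one = partial(geolocator.geocode, addressdetails=True)
    return RateLimiter(lookup_one, min_delay_seconds=1)
```

With geopy installed, a call such as geocode_unique(location_strings, make_nominatim_geocoder("stem-salaries-demo")) would return one record per distinct string, which can then be joined back onto the original rows. The user agent string here is a placeholder.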
At the start, we looked through the data to see how much of it would not be useful for our analysis. After cleaning that up, we added some variables to help parse the data. We then separated the data into two categories, large and small companies. Using these new sheets, we made visualizations of each characteristic we wanted to examine, comparing total yearly compensation across all of them.
Our team chose to focus on three main categorical variables: gender, education, and race. We spent most of our time cleaning the data to account for null values, which were especially common for education and race. We dropped the rows with null values for gender, but for education and race we used the sklearn KNeighborsClassifier to predict the missing values in the data set. When creating the visualizations, we looked at how years of experience affected total yearly compensation, categorized by gender, education, or race.
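One way to fill missing categories with KNeighborsClassifier is sketched below. The team's actual features and settings are not stated, so the feature columns (years of experience and total yearly compensation), the neighbor count, and the function name are all assumptions made for illustration. The classifier is trained on rows whose label is known and then predicts a label for each row where it is missing.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def fill_missing_labels(features, labels, n_neighbors=5):
    """Fill None entries in `labels` with predictions from a k-NN classifier.

    `features` is a 2-D array of numeric columns (hypothetical here, e.g.
    years of experience and total yearly compensation).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=object)
    missing = np.array([label is None for label in labels])
    if not missing.any() or missing.all():
        return labels  # nothing to fill, or nothing to learn from

    # Never ask for more neighbors than there are labeled rows.
    k = min(n_neighbors, int((~missing).sum()))
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(features[~missing], labels[~missing].astype(str))

    filled = labels.copy()
    filled[missing] = model.predict(features[missing])
    return filled
```

Because k-NN relies on distances, features measured on very different scales would normally be standardized first. Imputed labels are predictions rather than observations, so it helps to keep a flag column marking which rows were filled.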
As expected, salary trends upward with education level across STEM careers. However, our visualizations revealed that this pattern is more prominent for some careers than for others. For example, the salary of a data scientist rises with each increase in education level, while the salary of a software engineering manager does not rise as smoothly. One possible explanation is that a large share of software engineering managers hold a doctoral degree, suggesting that a doctorate is simply the norm for this career. That would explain why the education trend does not appear there, and it is an important factor to keep in mind.
The data set offers some weak evidence that certain races earn less, but this could be due to a lack of data. In terms of work experience, however, Asian males with 16 years of experience made around 400k on average, while Asian females made only 300k. This pattern appeared only among Asian employees, but it amounts to a staggering difference of 100k between the two genders after 16 years of work.
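A comparison like this comes down to averaging compensation within groups, and the concern about a lack of data can be checked by reporting the size of each group next to its average. The sketch below illustrates that idea under assumptions: the column names are hypothetical and the rows are made-up values, not figures from the data set.

```python
from collections import defaultdict


def mean_compensation_by_group(rows, keys):
    """Return {group: (mean compensation, row count)} for the given key columns."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        group = tuple(row[key] for key in keys)
        totals[group][0] += row["totalyearlycompensation"]
        totals[group][1] += 1
    return {group: (total / count, count) for group, (total, count) in totals.items()}


# Made-up rows for illustration only.
rows = [
    {"gender": "Female", "yearsofexperience": 16, "totalyearlycompensation": 200000},
    {"gender": "Female", "yearsofexperience": 16, "totalyearlycompensation": 300000},
    {"gender": "Male", "yearsofexperience": 16, "totalyearlycompensation": 280000},
]
for group, (mean, count) in mean_compensation_by_group(rows, ["gender", "yearsofexperience"]).items():
    print(group, round(mean), count)
```

A group average built from only a handful of rows deserves less weight than one built from hundreds, which is why the count is returned alongside the mean.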
The results of our data visualizations were not that surprising to us. One thing we discovered was that individuals with a PhD or master's degree tended to have a higher salary than those with a bachelor's degree or no college degree. We also observed that gender had more of an impact than race on total yearly compensation. In our visualizations, males made a higher salary than females, but salary was around the same for all races. The dashboards can be found at public.tableau.com/app/profile/andrew.black6379