Chapter 3 Data transformation

3.1 Cleaning

The data in column Title.Classification and Level are categorical, thus need to be changed from initial readings which is characters. The classification is transformed in a following basis: Competitive - 1, Pending Classification - 2, Labor - 3, Exempt - 4 and Non-Competitive - 5. The level have 0-4 integer leveling, however M1 ~ M4 exist representing minor 1 ~ minor 4. The record with these minor level is few in number thus converted to corresponding higher levels for analysis. There are also special cases which occur less than ten times, these cases are dropped.

Next, since the salary of a job posting is being represented as an interval, we choose the mean value of the interval as the expected salary of each job posting for analysis. This becomes a new column Salary.Mean

There’s also two date column: ‘Posted From’ and ‘Post Until’ being imported as character, thus we need to convert them to date values. This enable us to calculate date difference and do further analysis.

3.2 Transformation

As stated in data collection part, this dataset is relatively clean and not much needed to be done for transformation. The few things we done is adding columns of mean value and difference value for some numrical column for data presenting. Another added column is the converted meaningful NA value of the Post Until column. We also did covert the department name of each job posting to a shorter version to fit the size of the graph. Some mis-spelling in original data is corrected in this process.

Newly generated columns are the following:

Column Name Description
Salary Mean Mean value of salary band of the job post. (num)
Filled If the job position is filled or not. (Boolean)
Post Difference Posting date numbered in days untile withdrawn for unfilled job positions. (num)