Chapter 2 Data sources

2.1 Collection

The choice of this dataset is made by the group while choosing the topic for the project. We first decide to do something with the topic of job offer, then come across this dataset. The dataset is collected and provided by the Department of Citywide Administrative Services (DCAS) and published on NYC Open data (https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t) There were a handful of other choice available on the internet, but we choose DCAS’s data. This dataset has around three thousand rows each as a job post (till Nov 16) which is large in quantity, yet it also has 30 columns which contain a vast range of different information about each job post. More important is, the data is directly from the officals without pre-modifications made. Compared to other datasets on the web which could possibly contain underlying bias, this one win in both quantity and quality.

2.2 Data set intro

2.2.1 General information

This dataset contains various information about current job postings available on the City of New York’s official jobs site (http://www.nyc.gov/html/careers/html/search/search.shtml). As it is updating quickly on a weekly basis, we’re using the Nov 16th version of the data which stores in the resource folder of the project. The dataset contains a total of 2838 rows with 30 columns. Each row in dataset represents a jobpost once posted on New York’s job site from April 18, 2013 till November 16, 2021.

Part of important column used in analysis is described below:

Column Name Description
Agency Name of the New York City agency (“agency” or “hiring agency”) where a job vacancy exists. (Char)
# Of Positions The total number of vacancies to be filled under the job ID listed. (num)
Business Title The “business title”, or “office title”, for the job posting listed. (Char)
Title Classification The civil service title jurisdiction classification that corresponds to the civil service title posted. (Factor)
Level The civil service title assignment level that the posted position is being filled at. (Factor)
Job Category The occupational group in which the posted job belongs (Char)
Salary Range From The lowest salary on a job posting for a position within the salary band for the related civil service title. (num)
Salary Range To The highest salary on a job posting for a position within the salary band for the related civil service title. (num)
Salary Frequency The frequency of proposed salary. Possible salary frequency values include “hourly”, “daily”, and “annual”. (Factor)
Hours/Shift Projected working hours, working days and shift information. (Char)
Posting Date The date and time that a job vacancy was posted. (Date)
Post Until The last date that a job vacancy will be posted; blank cells indicate job vacancy posts that will remain listed until the position is filled (Date)

2.2.2 Columns information

Looking at the columns, there are self-explanatory ones but there’s also columns with vague name. Here we introduce some columns need explanations. The first important column is the job id column correspond to the number representing the job post in the agency’s system. However, this id is not unique since one job posting could have two types, external and internal, recorded in column ‘posting type’. Next is the Title Classification, This is a title jurisdiction classification, each job falls in one of the following five categories represented by a number : Competitive - 1, Pending Classification - 2, Labor - 3, Exempt - 4 and Non-Competitive - 5. The job title also has a level assignment correspond to the work duties assigned to that position, this info is recorded in ‘Level’ column. With 0-1 level means interns and part-time workers, 2 meaning normal full-time worker, and 3-4 meaning lower to higher management worker.

There’s two column with very similar name, Work Location and Work Location 1. This is not a duplication however, Work Location column records the location of agency posting the job, and Work Location 1 records the actual working place where employee goes every day.

The last three column are all with dates, while date format is in MM/DD/YY. There’s one column here need special care: ‘Post Until’ column records the last data that, a job vacancy was be posted. There needs to be a vacancy inorder to be posted, thus a NA value here indicate the job is fully filled. Only when a job posting being withdrawn without filled will it leave record in this column.

2.3 Issues and Problems

No known issues is given by the data provider, but when check data we found there’s some complete duplicating rows.