Data Wrangling (Pre-processing): Simple Employee Dataset

A full data wrangling pipeline built on a course assignment (DGIST BE202). Covers missing values, text cleaning, datetime parsing, and aggregation using pandas.
Published

March 12, 2026

Complete a full data wrangling workflow on this messy dataset:

  1. Explore the data and identify ALL problems
  2. Handle missing values appropriately
  3. Clean text inconsistencies in department and status
  4. Convert hire_date to datetime and extract year
  5. Filter to show only ‘Active’ employees from ‘Engineering’
  6. Calculate the average salary by department
Code
import numpy as np
import pandas as pd
Code
# Create challenge dataset
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['Kim', 'Lee', 'Park', 'Choi', 'Jung', 'Kang', 'Yoon', 'Jang'],
    'department': ['Engineering', 'ENGINEERING', ' Sales', 'engineering', np.nan, 'Sales', 'HR', 'hr '],
    'salary': [75000, 80000, 55000, np.nan, 90000, 52000, 48000, np.nan],
    'hire_date': ['2020-01-15', '2019-06-20', '2021-03-10', '2020-11-05',
                  '2018-08-22', '2022-02-14', '2021-09-01', '2020-04-18'],
    'status': ['Active', 'active', 'Active', 'ACTIVE', 'Inactive', ' active', 'Active', 'inactive']
})
print("Messy employee data:")
print(employees)
Messy employee data:
   emp_id  name   department   salary   hire_date    status
0       1   Kim  Engineering  75000.0  2020-01-15    Active
1       2   Lee  ENGINEERING  80000.0  2019-06-20    active
2       3  Park        Sales  55000.0  2021-03-10    Active
3       4  Choi  engineering      NaN  2020-11-05    ACTIVE
4       5  Jung          NaN  90000.0  2018-08-22  Inactive
5       6  Kang        Sales  52000.0  2022-02-14    active
6       7  Yoon           HR  48000.0  2021-09-01    Active
7       8  Jang          hr       NaN  2020-04-18  inactive

1 EDA

Code
print("=== Data Exploration ===")
print(f"Shape: {employees.shape}")
print(f"\nHead:\n{employees.head()}")
print("\nInfo:")
employees.info()  # prints directly; embedding it in an f-string would print None
print(f"\nDescription:\n{employees.describe()}")
=== Data Exploration ===
Shape: (8, 6)

Head:
   emp_id  name   department   salary   hire_date    status
0       1   Kim  Engineering  75000.0  2020-01-15    Active
1       2   Lee  ENGINEERING  80000.0  2019-06-20    active
2       3  Park        Sales  55000.0  2021-03-10    Active
3       4  Choi  engineering      NaN  2020-11-05    ACTIVE
4       5  Jung          NaN  90000.0  2018-08-22  Inactive

Info:
<class 'pandas.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   emp_id      8 non-null      int64  
 1   name        8 non-null      str    
 2   department  7 non-null      str    
 3   salary      6 non-null      float64
 4   hire_date   8 non-null      str    
 5   status      8 non-null      str    
dtypes: float64(1), int64(1), str(4)
memory usage: 516.0 bytes

Description:
        emp_id        salary
count  8.00000      6.000000
mean   4.50000  66666.666667
std    2.44949  17270.398567
min    1.00000  48000.000000
25%    2.75000  52750.000000
50%    4.50000  65000.000000
75%    6.25000  78750.000000
max    8.00000  90000.000000

1.1 Problems

  • Missing values (NaN) - the department and salary columns (see the Non-Null Count above)
  • Inconsistent representation - department and status mix upper/lower case and carry stray whitespace; the other columns are fine by construction
  • Wrong data type - hire_date is stored as a string, but a time series should be a pandas datetime
Code
print("=== Identity Problems ===")
print(f"1. Missing values:\n{employees.isna().sum()}")

print(f"\n2.1. Unique departments:\n{employees['department'].unique()}")
print(f"\n2.2. Unique status:\n{employees['status'].unique()}")

print(f"\n3. hire_date dtype: {employees['hire_date'].dtype}")
=== Identity Problems ===
1. Missing values:
emp_id        0
name          0
department    1
salary        2
hire_date     0
status        0
dtype: int64

2.1. Unique departments:
<StringArray>
['Engineering', 'ENGINEERING',      ' Sales', 'engineering',           nan,
       'Sales',          'HR',         'hr ']
Length: 8, dtype: str

2.2. Unique status:
<StringArray>
['Active', 'active', 'ACTIVE', 'Inactive', ' active', 'inactive']
Length: 6, dtype: str

3. hire_date dtype: str

1.2 Solution

  • Missing values - fill the NaNs in department and salary
  • Inconsistent representation - normalize department and status (strip whitespace, unify case)
  • Datetime - convert hire_date from string to pandas datetime
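The whole plan, which the following sections walk through step by step, can also be sketched as one chained pipeline. This is only a sketch on a trimmed-down version of the dataset; the column choices mirror the list above.

```python
import numpy as np
import pandas as pd

employees = pd.DataFrame({
    'department': ['Engineering', 'ENGINEERING', ' Sales', np.nan],
    'salary': [75000, 80000, np.nan, 90000],
    'hire_date': ['2020-01-15', '2019-06-20', '2021-03-10', '2018-08-22'],
    'status': ['Active', ' active', 'ACTIVE', 'Inactive'],
})

clean = (
    employees
    # 1. missing values: categorical -> 'Unknown', numeric -> mean
    .assign(department=lambda d: d['department'].fillna('Unknown'),
            salary=lambda d: d['salary'].fillna(d['salary'].mean()))
    # 2. identity: strip whitespace, unify case
    .assign(department=lambda d: d['department'].str.strip().str.title(),
            status=lambda d: d['status'].str.strip().str.lower())
    # 3. datetime: parse, then extract the year
    .assign(hire_date=lambda d: pd.to_datetime(d['hire_date']),
            year=lambda d: d['hire_date'].dt.year)
)
print(clean)
```

Within a single `assign`, later keyword lambdas see the columns created by earlier ones, which is why `year` can read the already-converted `hire_date`.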

2 Handling Missing Values

Code
# Count missing values per column
print("Missing values per column:")
print(employees.isna().sum())

# Percentage of missing values
print("\nMissing percentage per column:")
print((employees.isna().sum() / len(employees) * 100).round(1))
Missing values per column:
emp_id        0
name          0
department    1
salary        2
hire_date     0
status        0
dtype: int64

Missing percentage per column:
emp_id         0.0
name           0.0
department    12.5
salary        25.0
hire_date      0.0
status         0.0
dtype: float64
  • Since the missing percentages are small, I will fill the NAs instead of dropping rows.
    • Categorical variable department: fill with “Unknown”
    • Numerical variable salary: fill with the “mean” or the “median”, chosen from the distribution
Code
print(employees['salary'].describe())
print(f"\nskew: {employees['salary'].skew()}")
count        6.000000
mean     66666.666667
std      17270.398567
min      48000.000000
25%      52750.000000
50%      65000.000000
75%      78750.000000
max      90000.000000
Name: salary, dtype: float64

skew: 0.25657530620432983
  • Choose the “mean” to fill the salary NAs, because
    • the mean is close to the median (50%), so the distribution is roughly symmetric,
    • and the skew (≈0.26) is small.
Code
# Drop vs Fill examples
print(f"Original rows: {len(employees)}")

# Drop rows with any NaN
employees_dropped = employees.dropna()
print(f"After dropna(): {len(employees_dropped)} rows")

# Fill in the "department"
employees_filled = employees.copy()
employees_filled['department'] = employees_filled['department'].fillna('Unknown')

# Fill in the "salary"
employees_filled['salary'] = employees_filled['salary'].fillna(employees_filled['salary'].mean())

print(f"\nAfter filling:\n{employees_filled}")
print(f"After filling: {employees_filled.isna().sum().sum()} missing values")
Original rows: 8
After dropna(): 5 rows

After filling:
   emp_id  name   department        salary   hire_date    status
0       1   Kim  Engineering  75000.000000  2020-01-15    Active
1       2   Lee  ENGINEERING  80000.000000  2019-06-20    active
2       3  Park        Sales  55000.000000  2021-03-10    Active
3       4  Choi  engineering  66666.666667  2020-11-05    ACTIVE
4       5  Jung      Unknown  90000.000000  2018-08-22  Inactive
5       6  Kang        Sales  52000.000000  2022-02-14    active
6       7  Yoon           HR  48000.000000  2021-09-01    Active
7       8  Jang          hr   66666.666667  2020-04-18  inactive
After filling: 0 missing values
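`dropna()` above removed three rows because it drops a row if any column is missing; `dropna(subset=...)` limits the check to chosen columns. A quick sketch on a minimal frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', np.nan, 'Sales'],
    'salary': [75000, 80000, np.nan],
})

print(len(df.dropna()))                   # 1: any missing column drops the row
print(len(df.dropna(subset=['salary'])))  # 2: only a missing salary drops it
```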

3 Text Cleaning

Code
print("Unique departments:")
print(employees_filled['department'].unique())

print("\nValue counts (before cleaning):")
print(employees_filled['department'].value_counts())
Unique departments:
<StringArray>
['Engineering', 'ENGINEERING',      ' Sales', 'engineering',     'Unknown',
       'Sales',          'HR',         'hr ']
Length: 8, dtype: str

Value counts (before cleaning):
department
Engineering    1
ENGINEERING    1
 Sales         1
engineering    1
Unknown        1
Sales          1
HR             1
hr             1
Name: count, dtype: int64
  • department
    • Problems: stray whitespace and inconsistent casing
    • Fix: strip the whitespace and convert to title case
Code
employees_clean = employees_filled.copy()
employees_clean['department'] = employees_clean['department'].str.strip().str.title()

print("After cleaning:")
print(employees_clean['department'].value_counts())
After cleaning:
department
Engineering    3
Sales          2
Hr             2
Unknown        1
Name: count, dtype: int64
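Note in the output above that `.str.title()` renders “HR” as “Hr”. If acronyms should keep their casing, one option is normalizing to a lowercase key and mapping to canonical names. A sketch; the canonical list here is an assumption:

```python
import pandas as pd

# Hypothetical canonical spellings, keyed by the normalized form
canonical = {'engineering': 'Engineering', 'sales': 'Sales', 'hr': 'HR'}

departments = pd.Series(['Engineering', 'ENGINEERING', ' Sales', 'hr '])
cleaned = departments.str.strip().str.lower().map(canonical)
print(cleaned.tolist())  # ['Engineering', 'Engineering', 'Sales', 'HR']
```

Values missing from the mapping become NaN, so a trailing `.fillna('Unknown')` would cover unexpected entries.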
Code
print("Unique status:")
print(employees_filled['status'].unique())

print("\nValue counts (before cleaning):")
print(employees_filled['status'].value_counts())
Unique status:
<StringArray>
['Active', 'active', 'ACTIVE', 'Inactive', ' active', 'inactive']
Length: 6, dtype: str

Value counts (before cleaning):
status
Active      3
active      1
ACTIVE      1
Inactive    1
 active     1
inactive    1
Name: count, dtype: int64
  • status
    • Problems: stray whitespace and inconsistent casing
    • Fix: strip the whitespace and convert to lowercase
Code
employees_clean['status'] = employees_clean['status'].str.strip().str.lower()

print("After cleaning:")
print(employees_clean['status'].value_counts())
After cleaning:
status
active      6
inactive    2
Name: count, dtype: int64
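With only two levels left, status can also be expressed as a boolean column, which makes later filters shorter. A sketch; the name `is_active` is an assumption:

```python
import pandas as pd

status = pd.Series(['Active', ' active', 'ACTIVE', 'Inactive', 'inactive'])
is_active = status.str.strip().str.lower() == 'active'
print(is_active.tolist())  # [True, True, True, False, False]
```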

4 Datetime Handling

Code
# Check the hire_date column
print("hire_date values:")
print(employees_clean['hire_date'])
print(f"\nData type: {employees_clean['hire_date'].dtype}")
hire_date values:
0    2020-01-15
1    2019-06-20
2    2021-03-10
3    2020-11-05
4    2018-08-22
5    2022-02-14
6    2021-09-01
7    2020-04-18
Name: hire_date, dtype: str

Data type: str
  • hire_date is a string, not a pandas datetime
    • Convert it with pd.to_datetime
  • The format is consistent (YYYY-MM-DD)
    • The default parser handles it; format='mixed' is not needed
  • Lastly, extract date components (year and month are both possible; only year is needed here)
Code
employees_clean['hire_date'] = pd.to_datetime(employees_clean['hire_date'])
employees_clean['year'] = employees_clean['hire_date'].dt.year

print("Extracted components:")
employees_clean[['name', 'department', 'hire_date', 'year']]
Extracted components:
   name   department  hire_date  year
0   Kim  Engineering 2020-01-15  2020
1   Lee  Engineering 2019-06-20  2019
2  Park        Sales 2021-03-10  2021
3  Choi  Engineering 2020-11-05  2020
4  Jung      Unknown 2018-08-22  2018
5  Kang        Sales 2022-02-14  2022
6  Yoon           Hr 2021-09-01  2021
7  Jang           Hr 2020-04-18  2020
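Here every date parses cleanly; with real-world data it can help to pass `errors='coerce'` so malformed entries become NaT instead of raising. A quick sketch:

```python
import pandas as pd

dates = pd.Series(['2020-01-15', 'not-a-date', '2021-09-01'])
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed.isna().sum())  # 1: the malformed entry became NaT
```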

5 Some Applications

Code
print("After cleaning, employee data:")
print(employees_clean)
After cleaning, employee data:
   emp_id  name   department        salary  hire_date    status  year
0       1   Kim  Engineering  75000.000000 2020-01-15    active  2020
1       2   Lee  Engineering  80000.000000 2019-06-20    active  2019
2       3  Park        Sales  55000.000000 2021-03-10    active  2021
3       4  Choi  Engineering  66666.666667 2020-11-05    active  2020
4       5  Jung      Unknown  90000.000000 2018-08-22  inactive  2018
5       6  Kang        Sales  52000.000000 2022-02-14    active  2022
6       7  Yoon           Hr  48000.000000 2021-09-01    active  2021
7       8  Jang           Hr  66666.666667 2020-04-18  inactive  2020

5.1 Filter to show only ‘Active’ employees from ‘Engineering’

Code
print("Active employees in the Engineering department:")
print(
    employees_clean[
        (employees_clean["status"] == 'active') & (employees_clean["department"] == 'Engineering')
    ]
)
Active employees in the Engineering department:
   emp_id  name   department        salary  hire_date  status  year
0       1   Kim  Engineering  75000.000000 2020-01-15  active  2020
1       2   Lee  Engineering  80000.000000 2019-06-20  active  2019
3       4  Choi  Engineering  66666.666667 2020-11-05  active  2020
  • In the “Engineering” department, all employees are “active”.
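The same filter can also be written with `DataFrame.query`, which some find more readable for compound conditions. A sketch on a minimal frame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Kim', 'Park', 'Jung'],
    'department': ['Engineering', 'Sales', 'Engineering'],
    'status': ['active', 'active', 'inactive'],
})

result = df.query("status == 'active' and department == 'Engineering'")
print(result['name'].tolist())  # ['Kim']
```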

5.2 Calculate the average salary by department

Code
print("Average salary by department:")
print(employees_clean.groupby("department")["salary"].mean().round(2))
Average salary by department:
department
Engineering    73888.89
Hr             57333.33
Sales          53500.00
Unknown        90000.00
Name: salary, dtype: float64
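`groupby` can also report several statistics at once via `agg`, which gives more context than the mean alone (for example, the Unknown average rests on a single employee). A sketch on a minimal frame:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', 'Engineering', 'Sales', 'Sales'],
    'salary': [75000, 80000, 55000, 52000],
})

summary = df.groupby('department')['salary'].agg(['mean', 'median', 'count'])
print(summary)
```

The `count` column flags groups whose average is based on very few rows.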