Data Wrangling (Pre-processing): Simple Employee Dataset

A full data wrangling pipeline built on a course assignment (DGIST BE202). Covers missing values, text cleaning, datetime parsing, and aggregation using pandas.
Published

March 12, 2026

Complete a full data wrangling workflow on this messy dataset:

  1. Explore the data and identify ALL problems
  2. Handle missing values appropriately
  3. Clean text inconsistencies in department and status
  4. Convert hire_date to datetime and extract year
  5. Filter to show only ‘Active’ employees from ‘Engineering’
  6. Calculate the average salary by department
Code
import numpy as np
import pandas as pd
Code
# Create challenge dataset
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['Kim', 'Lee', 'Park', 'Choi', 'Jung', 'Kang', 'Yoon', 'Jang'],
    'department': ['Engineering', 'ENGINEERING', ' Sales', 'engineering', np.nan, 'Sales', 'HR', 'hr '],
    'salary': [75000, 80000, 55000, np.nan, 90000, 52000, 48000, np.nan],
    'hire_date': ['2020-01-15', '2019-06-20', '2021-03-10', '2020-11-05',
                  '2018-08-22', '2022-02-14', '2021-09-01', '2020-04-18'],
    'status': ['Active', 'active', 'Active', 'ACTIVE', 'Inactive', ' active', 'Active', 'inactive']
})
print("Messy employee data:")
print(employees)
Messy employee data:
   emp_id  name   department   salary   hire_date    status
0       1   Kim  Engineering  75000.0  2020-01-15    Active
1       2   Lee  ENGINEERING  80000.0  2019-06-20    active
2       3  Park        Sales  55000.0  2021-03-10    Active
3       4  Choi  engineering      NaN  2020-11-05    ACTIVE
4       5  Jung          NaN  90000.0  2018-08-22  Inactive
5       6  Kang        Sales  52000.0  2022-02-14    active
6       7  Yoon           HR  48000.0  2021-09-01    Active
7       8  Jang          hr       NaN  2020-04-18  inactive

1 EDA

Code
print("=== Data Exploration ===")
print(f"Shape: {employees.shape}")
print(f"\nHead:\n{employees.head()}")
print("\nInfo:")
employees.info()  # prints directly; embedding it in an f-string would print None
print(f"\nDescription:\n{employees.describe()}")
=== Data Exploration ===
Shape: (8, 6)

Head:
   emp_id  name   department   salary   hire_date    status
0       1   Kim  Engineering  75000.0  2020-01-15    Active
1       2   Lee  ENGINEERING  80000.0  2019-06-20    active
2       3  Park        Sales  55000.0  2021-03-10    Active
3       4  Choi  engineering      NaN  2020-11-05    ACTIVE
4       5  Jung          NaN  90000.0  2018-08-22  Inactive

Info:
<class 'pandas.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   emp_id      8 non-null      int64  
 1   name        8 non-null      str    
 2   department  7 non-null      str    
 3   salary      6 non-null      float64
 4   hire_date   8 non-null      str    
 5   status      8 non-null      str    
dtypes: float64(1), int64(1), str(4)
memory usage: 516.0 bytes

Description:
        emp_id        salary
count  8.00000      6.000000
mean   4.50000  66666.666667
std    2.44949  17270.398567
min    1.00000  48000.000000
25%    2.75000  52750.000000
50%    4.50000  65000.000000
75%    6.25000  78750.000000
max    8.00000  90000.000000

1.1 Problems

  • Missing values (NaN) - the department and salary columns (see the Non-Null Count above)
  • Inconsistent representation - department and status mix upper/lower case and carry stray whitespace; the other columns are fine by construction
  • Wrong data type - hire_date is stored as a string, but a time series should be a pandas datetime
Code
print("=== Identity Problems ===")
print(f"1. Missing values:\n{employees.isna().sum()}")

print(f"\n2.1. Unique departments:\n{employees['department'].unique()}")
print(f"\n2.2. Unique status:\n{employees['status'].unique()}")

print(f"\n3. hire_date dtype: {employees['hire_date'].dtype}")
=== Identity Problems ===
1. Missing values:
emp_id        0
name          0
department    1
salary        2
hire_date     0
status        0
dtype: int64

2.1. Unique departments:
<StringArray>
['Engineering', 'ENGINEERING',      ' Sales', 'engineering',           nan,
       'Sales',          'HR',         'hr ']
Length: 8, dtype: str

2.2. Unique status:
<StringArray>
['Active', 'active', 'ACTIVE', 'Inactive', ' active', 'inactive']
Length: 6, dtype: str

3. hire_date dtype: str

1.2 Solution

  • Missing values - fill the NaNs in department and salary
  • Inconsistent representation - normalize department and status (strip whitespace, unify case)
  • Datetime - convert hire_date from string to pandas datetime
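The whole plan, which the following sections walk through step by step, can also be sketched as one chained pipeline. This is only a sketch on a trimmed-down version of the dataset; the column choices mirror the list above.

```python
import numpy as np
import pandas as pd

employees = pd.DataFrame({
    'department': ['Engineering', 'ENGINEERING', ' Sales', np.nan],
    'salary': [75000, 80000, np.nan, 90000],
    'hire_date': ['2020-01-15', '2019-06-20', '2021-03-10', '2018-08-22'],
    'status': ['Active', ' active', 'ACTIVE', 'Inactive'],
})

clean = (
    employees
    # 1. missing values: categorical -> 'Unknown', numeric -> mean
    .assign(department=lambda d: d['department'].fillna('Unknown'),
            salary=lambda d: d['salary'].fillna(d['salary'].mean()))
    # 2. identity: strip whitespace, unify case
    .assign(department=lambda d: d['department'].str.strip().str.title(),
            status=lambda d: d['status'].str.strip().str.lower())
    # 3. datetime: parse, then extract the year
    .assign(hire_date=lambda d: pd.to_datetime(d['hire_date']),
            year=lambda d: d['hire_date'].dt.year)
)
print(clean)
```

Within a single `assign`, later keyword lambdas see the columns created by earlier ones, which is why `year` can read the already-converted `hire_date`.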

2 Handling Missing Values

Code
# Count missing values per column
print("Missing values per column:")
print(employees.isna().sum())

# Percentage of missing values
print("\nMissing percentage per column:")
print((employees.isna().sum() / len(employees) * 100).round(1))
Missing values per column:
emp_id        0
name          0
department    1
salary        2
hire_date     0
status        0
dtype: int64

Missing percentage per column:
emp_id         0.0
name           0.0
department    12.5
salary        25.0
hire_date      0.0
status         0.0
dtype: float64
  • Since the missing percentages are small, I will fill the NAs instead of dropping rows.
    • Categorical variable department: fill with “Unknown”
    • Numerical variable salary: fill with the “mean” or the “median”, chosen from the distribution
Code
print(employees['salary'].describe())
print(f"\nskew: {employees['salary'].skew()}")
count        6.000000
mean     66666.666667
std      17270.398567
min      48000.000000
25%      52750.000000
50%      65000.000000
75%      78750.000000
max      90000.000000
Name: salary, dtype: float64

skew: 0.25657530620432983
  • Choose the “mean” to fill the salary NAs, because
    • the mean is close to the median (50%), so the distribution is roughly symmetric,
    • and the skew (≈0.26) is small.
Code
# Drop vs Fill examples
print(f"Original rows: {len(employees)}")

# Drop rows with any NaN
employees_dropped = employees.dropna()
print(f"After dropna(): {len(employees_dropped)} rows")

# Fill in the "department"
employees_filled = employees.copy()
employees_filled['department'] = employees_filled['department'].fillna('Unknown')

# Fill in the "salary"
employees_filled['salary'] = employees_filled['salary'].fillna(employees_filled['salary'].mean())

print(f"\nAfter filling:\n{employees_filled}")
print(f"After filling: {employees_filled.isna().sum().sum()} missing values")
Original rows: 8
After dropna(): 5 rows

After filling:
   emp_id  name   department        salary   hire_date    status
0       1   Kim  Engineering  75000.000000  2020-01-15    Active
1       2   Lee  ENGINEERING  80000.000000  2019-06-20    active
2       3  Park        Sales  55000.000000  2021-03-10    Active
3       4  Choi  engineering  66666.666667  2020-11-05    ACTIVE
4       5  Jung      Unknown  90000.000000  2018-08-22  Inactive
5       6  Kang        Sales  52000.000000  2022-02-14    active
6       7  Yoon           HR  48000.000000  2021-09-01    Active
7       8  Jang          hr   66666.666667  2020-04-18  inactive
After filling: 0 missing values
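`dropna()` above removed three rows because it drops a row if any column is missing; `dropna(subset=...)` limits the check to chosen columns. A quick sketch on a minimal frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', np.nan, 'Sales'],
    'salary': [75000, 80000, np.nan],
})

print(len(df.dropna()))                   # 1: any missing column drops the row
print(len(df.dropna(subset=['salary'])))  # 2: only a missing salary drops it
```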

3 Text Cleaning

Code
print("Unique departments:")
print(employees_filled['department'].unique())

print("\nValue counts (before cleaning):")
print(employees_filled['department'].value_counts())
Unique departments:
<StringArray>
['Engineering', 'ENGINEERING',      ' Sales', 'engineering',     'Unknown',
       'Sales',          'HR',         'hr ']
Length: 8, dtype: str

Value counts (before cleaning):
department
Engineering    1
ENGINEERING    1
 Sales         1
engineering    1
Unknown        1
Sales          1
HR             1
hr             1
Name: count, dtype: int64
  • department
    • Problems: stray whitespace and inconsistent casing
    • Fix: strip the whitespace and convert to title case
Code
employees_clean = employees_filled.copy()
employees_clean['department'] = employees_clean['department'].str.strip().str.title()

print("After cleaning:")
print(employees_clean['department'].value_counts())
After cleaning:
department
Engineering    3
Sales          2
Hr             2
Unknown        1
Name: count, dtype: int64
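Note in the output above that `.str.title()` renders “HR” as “Hr”. If acronyms should keep their casing, one option is normalizing to a lowercase key and mapping to canonical names. A sketch; the canonical list here is an assumption:

```python
import pandas as pd

# Hypothetical canonical spellings, keyed by the normalized form
canonical = {'engineering': 'Engineering', 'sales': 'Sales', 'hr': 'HR'}

departments = pd.Series(['Engineering', 'ENGINEERING', ' Sales', 'hr '])
cleaned = departments.str.strip().str.lower().map(canonical)
print(cleaned.tolist())  # ['Engineering', 'Engineering', 'Sales', 'HR']
```

Values missing from the mapping become NaN, so a trailing `.fillna('Unknown')` would cover unexpected entries.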
Code
print("Unique status:")
print(employees_filled['status'].unique())

print("\nValue counts (before cleaning):")
print(employees_filled['status'].value_counts())
Unique status:
<StringArray>
['Active', 'active', 'ACTIVE', 'Inactive', ' active', 'inactive']
Length: 6, dtype: str

Value counts (before cleaning):
status
Active      3
active      1
ACTIVE      1
Inactive    1
 active     1
inactive    1
Name: count, dtype: int64
  • status
    • Problems: stray whitespace and inconsistent casing
    • Fix: strip the whitespace and convert to lowercase
Code
employees_clean['status'] = employees_clean['status'].str.strip().str.lower()

print("After cleaning:")
print(employees_clean['status'].value_counts())
After cleaning:
status
active      6
inactive    2
Name: count, dtype: int64
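With only two levels left, status can also be expressed as a boolean column, which makes later filters shorter. A sketch; the name `is_active` is an assumption:

```python
import pandas as pd

status = pd.Series(['Active', ' active', 'ACTIVE', 'Inactive', 'inactive'])
is_active = status.str.strip().str.lower() == 'active'
print(is_active.tolist())  # [True, True, True, False, False]
```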

4 Datetime Handling

Code
# Check the hire_date column
print("hire_date values:")
print(employees_clean['hire_date'])
print(f"\nData type: {employees_clean['hire_date'].dtype}")
hire_date values:
0    2020-01-15
1    2019-06-20
2    2021-03-10
3    2020-11-05
4    2018-08-22
5    2022-02-14
6    2021-09-01
7    2020-04-18
Name: hire_date, dtype: str

Data type: str
  • hire_date is a string, not a pandas datetime
    • Convert it with pd.to_datetime
  • The format is consistent (YYYY-MM-DD)
    • The default parser handles it; format='mixed' is not needed
  • Lastly, extract date components (year and month are both possible; only year is needed here)
Code
employees_clean['hire_date'] = pd.to_datetime(employees_clean['hire_date'])
employees_clean['year'] = employees_clean['hire_date'].dt.year

print("Extracted components:")
employees_clean[['name', 'department', 'hire_date', 'year']]
Extracted components:
   name   department  hire_date  year
0   Kim  Engineering 2020-01-15  2020
1   Lee  Engineering 2019-06-20  2019
2  Park        Sales 2021-03-10  2021
3  Choi  Engineering 2020-11-05  2020
4  Jung      Unknown 2018-08-22  2018
5  Kang        Sales 2022-02-14  2022
6  Yoon           Hr 2021-09-01  2021
7  Jang           Hr 2020-04-18  2020
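Here every date parses cleanly; with real-world data it can help to pass `errors='coerce'` so malformed entries become NaT instead of raising. A quick sketch:

```python
import pandas as pd

dates = pd.Series(['2020-01-15', 'not-a-date', '2021-09-01'])
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed.isna().sum())  # 1: the malformed entry became NaT
```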

5 Some Applications

Code
print("After cleaning, employee data:")
print(employees_clean)
After cleaning, employee data:
   emp_id  name   department        salary  hire_date    status  year
0       1   Kim  Engineering  75000.000000 2020-01-15    active  2020
1       2   Lee  Engineering  80000.000000 2019-06-20    active  2019
2       3  Park        Sales  55000.000000 2021-03-10    active  2021
3       4  Choi  Engineering  66666.666667 2020-11-05    active  2020
4       5  Jung      Unknown  90000.000000 2018-08-22  inactive  2018
5       6  Kang        Sales  52000.000000 2022-02-14    active  2022
6       7  Yoon           Hr  48000.000000 2021-09-01    active  2021
7       8  Jang           Hr  66666.666667 2020-04-18  inactive  2020

5.1 Filter to show only ‘Active’ employees from ‘Engineering’

Code
print("Active employees in the Engineering department:")
print(
    employees_clean[
        (employees_clean["status"] == 'active') & (employees_clean["department"] == 'Engineering')
    ]
)
Active employees in the Engineering department:
   emp_id  name   department        salary  hire_date  status  year
0       1   Kim  Engineering  75000.000000 2020-01-15  active  2020
1       2   Lee  Engineering  80000.000000 2019-06-20  active  2019
3       4  Choi  Engineering  66666.666667 2020-11-05  active  2020
  • In the “Engineering” department, all employees are “active”.
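The same filter can also be written with `DataFrame.query`, which some find more readable for compound conditions. A sketch on a minimal frame:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Kim', 'Park', 'Jung'],
    'department': ['Engineering', 'Sales', 'Engineering'],
    'status': ['active', 'active', 'inactive'],
})

result = df.query("status == 'active' and department == 'Engineering'")
print(result['name'].tolist())  # ['Kim']
```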

5.2 Calculate the average salary by department

Code
print("Average salary by department:")
print(employees_clean.groupby("department")["salary"].mean().round(2))
Average salary by department:
department
Engineering    73888.89
Hr             57333.33
Sales          53500.00
Unknown        90000.00
Name: salary, dtype: float64
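`groupby` can also report several statistics at once via `agg`, which gives more context than the mean alone (for example, the Unknown average rests on a single employee). A sketch on a minimal frame:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Engineering', 'Engineering', 'Sales', 'Sales'],
    'salary': [75000, 80000, 55000, 52000],
})

summary = df.groupby('department')['salary'].agg(['mean', 'median', 'count'])
print(summary)
```

The `count` column flags groups whose average is based on very few rows.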