Code
import numpy as np
import pandas as pdMarch 12, 2026
Invalid Date
Complete a full data wrangling workflow on this messy dataset:
department and statushire_date to datetime and extract year# Create challenge dataset
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 5, 6, 7, 8],
'name': ['Kim', 'Lee', 'Park', 'Choi', 'Jung', 'Kang', 'Yoon', 'Jang'],
'department': ['Engineering', 'ENGINEERING', ' Sales', 'engineering', np.nan, 'Sales', 'HR', 'hr '],
'salary': [75000, 80000, 55000, np.nan, 90000, 52000, 48000, np.nan],
'hire_date': ['2020-01-15', '2019-06-20', '2021-03-10', '2020-11-05',
'2018-08-22', '2022-02-14', '2021-09-01', '2020-04-18'],
'status': ['Active', 'active', 'Active', 'ACTIVE', 'Inactive', ' active', 'Active', 'inactive']
})
print("Messy employee data:")
print(employees)Messy employee data:
emp_id name department salary hire_date status
0 1 Kim Engineering 75000.0 2020-01-15 Active
1 2 Lee ENGINEERING 80000.0 2019-06-20 active
2 3 Park Sales 55000.0 2021-03-10 Active
3 4 Choi engineering NaN 2020-11-05 ACTIVE
4 5 Jung NaN 90000.0 2018-08-22 Inactive
5 6 Kang Sales 52000.0 2022-02-14 active
6 7 Yoon HR 48000.0 2021-09-01 Active
7 8 Jang hr NaN 2020-04-18 inactive
=== Data Exploration ===
Shape: (8, 6)
Head:
emp_id name department salary hire_date status
0 1 Kim Engineering 75000.0 2020-01-15 Active
1 2 Lee ENGINEERING 80000.0 2019-06-20 active
2 3 Park Sales 55000.0 2021-03-10 Active
3 4 Choi engineering NaN 2020-11-05 ACTIVE
4 5 Jung NaN 90000.0 2018-08-22 Inactive
<class 'pandas.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 emp_id 8 non-null int64
1 name 8 non-null str
2 department 7 non-null str
3 salary 6 non-null float64
4 hire_date 8 non-null str
5 status 8 non-null str
dtypes: float64(1), int64(1), str(4)
memory usage: 516.0 bytes
Info:
None
Description:
emp_id salary
count 8.00000 6.000000
mean 4.50000 66666.666667
std 2.44949 17270.398567
min 1.00000 48000.000000
25% 2.75000 52750.000000
50% 4.50000 65000.000000
75% 6.25000 78750.000000
max 8.00000 90000.000000
department and salary column (Non-Null Count)department and status. The other columns are okay because of their propertyhire_date is now “string”===Identity Problems ===
1. Missing values:
emp_id 0
name 0
department 1
salary 2
hire_date 0
status 0
dtype: int64
2.1. Unique departments:
<StringArray>
['Engineering', 'ENGINEERING', ' Sales', 'engineering', nan,
'Sales', 'HR', 'hr ']
Length: 8, dtype: str
2.2. Unique status:
<StringArray>
['Active', 'active', 'ACTIVE', 'Inactive', ' active', 'inactive']
Length: 6, dtype: str
3. hire_data dtype: str
department, salarydepartment, status have to be identified.Missing values per column:
emp_id 0
name 0
department 1
salary 2
hire_date 0
status 0
dtype: int64
Missing percentage per column:
emp_id 0.0
name 0.0
department 12.5
salary 25.0
hire_date 0.0
status 0.0
dtype: float64
department: fill with “Unknown”salary: fill with the “mean” or “median”, and it will be decided from the distributioncount 6.000000
mean 66666.666667
std 17270.398567
min 48000.000000
25% 52750.000000
50% 65000.000000
75% 78750.000000
max 90000.000000
Name: salary, dtype: float64
skew: 0.25657530620432983
salary NA, since
# Drop vs Fill examples
print(f"Original rows: {len(employees)}")
# Drop rows with any NaN
employees_dropped = employees.dropna()
print(f"After dropna(): {len(employees_dropped)} rows")
# Fill in the "department"
employees_filled = employees.copy()
employees_filled['department'] = employees_filled['department'].fillna('Unknown')
# Fill in the "salary"
employees_filled['salary'] = employees_filled['salary'].fillna(employees_filled['salary'].mean())
print(f"\n\rAfter filling: \n\r{employees_filled}")
print(f"After filling: {employees_filled.isna().sum().sum()} missing values")Original rows: 8
After dropna(): 5 rows
After filling:
emp_id name department salary hire_date status
0 1 Kim Engineering 75000.000000 2020-01-15 Active
1 2 Lee ENGINEERING 80000.000000 2019-06-20 active
2 3 Park Sales 55000.000000 2021-03-10 Active
3 4 Choi engineering 66666.666667 2020-11-05 ACTIVE
4 5 Jung Unknown 90000.000000 2018-08-22 Inactive
5 6 Kang Sales 52000.000000 2022-02-14 active
6 7 Yoon HR 48000.000000 2021-09-01 Active
7 8 Jang hr 66666.666667 2020-04-18 inactive
After filling: 0 missing values
Unique departments:
<StringArray>
['Engineering', 'ENGINEERING', ' Sales', 'engineering', 'Unknown',
'Sales', 'HR', 'hr ']
Length: 8, dtype: str
Value counts (before cleaning):
department
Engineering 1
ENGINEERING 1
Sales 1
engineering 1
Unknown 1
Sales 1
HR 1
hr 1
Name: count, dtype: int64
departments
After cleaning:
department
Engineering 3
Sales 2
Hr 2
Unknown 1
Name: count, dtype: int64
Unique status:
<StringArray>
['Active', 'active', 'ACTIVE', 'Inactive', ' active', 'inactive']
Length: 6, dtype: str
Value counts (before cleaning):
status
Active 3
active 1
ACTIVE 1
Inactive 1
active 1
inactive 1
Name: count, dtype: int64
status
hire_date values:
0 2020-01-15
1 2019-06-20
2 2021-03-10
3 2020-11-05
4 2018-08-22
5 2022-02-14
6 2021-09-01
7 2020-04-18
Name: hire_date, dtype: str
Data type: str
year, month column (Not essential, I will create only year)Extracted components:
| name | department | hire_date | year | |
|---|---|---|---|---|
| 0 | Kim | Engineering | 2020-01-15 | 2020 |
| 1 | Lee | Engineering | 2019-06-20 | 2019 |
| 2 | Park | Sales | 2021-03-10 | 2021 |
| 3 | Choi | Engineering | 2020-11-05 | 2020 |
| 4 | Jung | Unknown | 2018-08-22 | 2018 |
| 5 | Kang | Sales | 2022-02-14 | 2022 |
| 6 | Yoon | Hr | 2021-09-01 | 2021 |
| 7 | Jang | Hr | 2020-04-18 | 2020 |
After cleaning, employee data:
emp_id name department salary hire_date status year
0 1 Kim Engineering 75000.000000 2020-01-15 active 2020
1 2 Lee Engineering 80000.000000 2019-06-20 active 2019
2 3 Park Sales 55000.000000 2021-03-10 active 2021
3 4 Choi Engineering 66666.666667 2020-11-05 active 2020
4 5 Jung Unknown 90000.000000 2018-08-22 inactive 2018
5 6 Kang Sales 52000.000000 2022-02-14 active 2022
6 7 Yoon Hr 48000.000000 2021-09-01 active 2021
7 8 Jang Hr 66666.666667 2020-04-18 inactive 2020
Active employees in the Engineering department:
emp_id name department salary hire_date status year
0 1 Kim Engineering 75000.000000 2020-01-15 active 2020
1 2 Lee Engineering 80000.000000 2019-06-20 active 2019
3 4 Choi Engineering 66666.666667 2020-11-05 active 2020