Data Exploration and Preparation
Hands-on exercises with the California Fire Incidents dataset using Orange and Python
Objectives
By the end of this practical work, you will be able to:
- Load and explore a real-world dataset in Orange
- Identify and handle missing values
- Deal with imbalanced classes using resampling techniques
- Extract useful features from date/time columns
- Build and evaluate a basic classification model
Prerequisites
- Orange Data Mining installed (available from the official Orange website)
- Basic understanding of Python (optional, for advanced exercises)
- The California Fire Incidents dataset
Install required Python packages (if using Python Script widget):
pip install pandas numpy scikit-learn
Instructions
Step 1: Load the Dataset
Download and load the California Fire Incidents dataset into Orange:
- Download the dataset: California_Fire_Incidents.csv
- In Orange, add a File widget to the canvas
- Double-click the widget and select the downloaded CSV file
- Connect a Data Table widget to view the data
Note: Take a moment to explore the dataset. How many rows and columns does it have? What types of features are present?
Step 2: Explore the Data
Add visualization widgets to understand the data:
- Connect a Distributions widget to see feature distributions
- Connect a Scatter Plot widget to explore relationships
- If you have the Geo add-on installed, use a Geo Map widget to visualize fire locations
Questions to answer:
- What is the distribution of MajorIncident (True vs False)?
- Which counties have the most fires?
- Is there a correlation between AcresBurned and PersonnelInvolved?
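If you prefer to check these questions programmatically, the same statistics can be computed with pandas. The sketch below uses a tiny made-up sample in place of the real CSV, so only the column names are assumed from the dataset:

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset
df = pd.DataFrame({
    "MajorIncident": [True, False, False, True, False, False],
    "Counties": ["Butte", "Shasta", "Butte", "Napa", "Butte", "Napa"],
    "AcresBurned": [1500.0, 20.0, 300.0, 5000.0, 10.0, 80.0],
    "PersonnelInvolved": [400, 15, 90, 1200, 8, 30],
})

# Class distribution of MajorIncident (proportions)
print(df["MajorIncident"].value_counts(normalize=True))

# Counties with the most fires
print(df["Counties"].value_counts().head())

# Pearson correlation between AcresBurned and PersonnelInvolved
print(df["AcresBurned"].corr(df["PersonnelInvolved"]))
```

With the real file, replace the inline DataFrame with `pd.read_csv(...)` pointing at your downloaded CSV.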
Step 3: Handle Missing Values
Use the Impute widget to handle missing values:
- Connect an Impute widget after your File widget
- Choose an imputation strategy: Average/Most frequent (mean for numeric features, mode for categorical)
- Compare the data before and after imputation
Alternatively, use the Python Script widget for custom imputation:
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Fill numeric with median
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())
# Fill categorical with 'Unknown'
df['Counties'] = df['Counties'].fillna('Unknown')
out_data = table_from_frame(df)
Step 4: Remove Duplicates
Use a Python Script widget to remove duplicate rows:
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
original_count = len(df)
df = df.drop_duplicates()
new_count = len(df)
print(f"Removed {original_count - new_count} duplicates")
out_data = table_from_frame(df)
Step 5: Handle Imbalanced Classes
The MajorIncident target is imbalanced (~66% False, ~34% True). Use undersampling:
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Separate by class
major = df[df['MajorIncident'] == True]
non_major = df[df['MajorIncident'] == False]
# Undersample majority class
non_major_sampled = non_major.sample(n=len(major), random_state=42)
# Combine
balanced_df = pd.concat([major, non_major_sampled])
print(f"Balanced dataset: {len(balanced_df)} rows")
out_data = table_from_frame(balanced_df)
Step 6: Feature Engineering - Date Features
Extract useful features from the Started date column:
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Convert to datetime
df['Started'] = pd.to_datetime(df['Started'], errors='coerce')
# Extract features
df['Month'] = df['Started'].dt.month
df['DayOfWeek'] = df['Started'].dt.dayofweek
df['Season'] = df['Month'].map({
12:1, 1:1, 2:1, # Winter
3:2, 4:2, 5:2, # Spring
6:3, 7:3, 8:3, # Summer
9:4, 10:4, 11:4 # Fall
})
# Drop original date (prevents data leakage)
df = df.drop(columns=['Started', 'Extinguished'], errors='ignore')
out_data = table_from_frame(df)
Step 7: Build a Classification Model
Now build and evaluate a Random Forest model:
- Add a Select Columns widget to set MajorIncident as target
- Add a Random Forest widget (or other classifier)
- Add a Test and Score widget to evaluate
- Connect: Data → Select Columns → Test and Score ← Random Forest
Expected Output: You should see accuracy metrics. Aim for >70% accuracy with balanced data.
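The same evaluation pattern can be reproduced outside Orange with scikit-learn. This is only a sketch of the general workflow, using synthetic features in place of the prepared fire dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned dataset: 200 rows, 4 numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # binary target derived from the features

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy, comparable to what Test and Score reports
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2f}")
```

With the real data, `X` would be the engineered feature columns and `y` the MajorIncident target.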
Expected Output
After completing this practical work, you should have:
- A cleaned dataset with no missing values or duplicates
- A balanced dataset with equal class distribution
- New date-based features (Month, DayOfWeek, Season)
- A trained Random Forest model with evaluation metrics
Deliverables
- Orange Workflow: Save your Orange workflow (.ows file)
- Cleaned Dataset: Export the processed data as CSV
- Model Evaluation: Screenshot of Test and Score results
- Report: Brief summary of your findings (1-2 paragraphs)
Bonus Challenges
- Challenge 1: Merge with weather data (weather-sf.csv) to add climate features
- Challenge 2: Try different classifiers (SVM, Neural Network) and compare results
- Challenge 3: Use feature importance to identify the most predictive features
- Challenge 4: Create visualizations showing fire patterns by season or location
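For Challenge 3, one way to inspect feature importances is scikit-learn's impurity-based importances on a fitted Random Forest. The sketch below uses synthetic data, and the feature names are illustrative placeholders, not a claim about which dataset columns matter:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only the third feature determines the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 2] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance value per feature; the values sum to 1
feature_names = ["AcresBurned", "PersonnelInvolved", "Month"]  # placeholder names
for name, imp in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On the real dataset, fit the classifier on your engineered features and rank the columns by their importance values.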