By Example
Session 2
Hands-on data exploration and feature engineering using California fire incidents
2026 WayUp
Learning by doing with real-world datasets
A main exploration scenario is prepared, but you can propose your own ideas and experiments along the way
California Fire Incidents
Real data about fire incidents in California, including location, resources deployed, and outcomes
Understanding the key features
| Feature | Description |
|---|---|
| MajorIncident | Target variable (boolean) |
| Latitude | GPS coordinate |
| Longitude | GPS coordinate |
| Counties | Geographic region |
| AcresBurned | Fire size metric |
| Started | Fire start date/time |
| Extinguished | Fire end date/time |
| PersonnelInvolved | Count of firefighters |
| Engines | Number of engines deployed |
| Helicopters | Number of helicopters deployed |
| StructuresDestroyed | Damage count |
Understanding composition and quality
How many rows? How many columns? What data types?
What questions could this dataset answer?
Are there missing values? Duplicates? Outliers?
Which features relate to potential target variables?
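A quick way to answer most of these questions at once is a short pandas check inside Orange's Python Script widget; a minimal sketch, where `in_data` is the table connected to the widget:

from Orange.data.pandas_compat import table_to_frame

df = table_to_frame(in_data)

print(df.shape)               # rows, columns
print(df.dtypes)              # data type of each column
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of duplicate rows
print(df.describe())          # quick scan for outliers in numeric columns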
Defining the machine learning task
The target is defined by the MajorIncident feature (boolean)
Now that you know the target, revisit the dataset with this specific question in mind
Use scatterplots, distributions, and geomaps to understand patterns
Remove duplicates and obviously irrelevant features
Using pandas in Orange's Python Script widget
Orange doesn't have a built-in deduplicate widget, so we use a Python Script with pandas
from Orange.data.pandas_compat import table_from_frame, table_to_frame
# Convert Orange table to pandas DataFrame
df = table_to_frame(in_data)
# Remove duplicate rows
df = df.drop_duplicates()
# Convert back to Orange table
out_data = table_from_frame(df)
Assessing class distribution
66% non-major incidents (False)
34% major incidents (True)
This is imbalanced, but not critically so
Ratio: approximately 2:1 (non-major to major)
A model that always predicts "False" would achieve 66% accuracy
This baseline matters: algorithms that optimize plain accuracy can learn to favor the majority class
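You can read these shares off Orange's Distributions widget, or compute them directly; a sketch:

from Orange.data.pandas_compat import table_to_frame

df = table_to_frame(in_data)
# The largest share below is exactly the accuracy of a model
# that always predicts the majority class
print(df['MajorIncident'].value_counts(normalize=True))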
Strategies to balance your dataset
Oversampling: duplicate minority class samples to match the majority class size
Pro: No data loss
Con: Risk of overfitting
Undersampling: randomly remove majority class samples to match the minority class
Pro: Fast, balanced dataset
Con: Data loss
Class weighting: give a higher penalty to misclassifying the minority class
Pro: No data modification
Con: Not all algorithms support it
Collect more data: find additional minority class examples
Pro: The most robust option when available
Con: Often not feasible
Python Script widget in Orange
Use Orange's Python Script widget with this code to balance the classes
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
# Convert to pandas
df = table_to_frame(in_data)
# Separate by class
# After table_to_frame, a discrete target typically arrives as strings
# ("True"/"False"), so compare as strings rather than booleans
is_major = df['MajorIncident'].astype(str) == 'True'
major = df[is_major]
non_major = df[~is_major]
# Undersample majority class to match minority
non_major_sampled = non_major.sample(n=len(major), random_state=42)
# Combine both classes
balanced_df = pd.concat([major, non_major_sampled])
# Shuffle the dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)
# Convert back to Orange table
out_data = table_from_frame(balanced_df)
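If you would rather avoid the data loss of undersampling, the oversampling strategy from the previous slide is a small variation; a sketch, reusing the same `major`/`non_major` split as above:

# Oversample the minority class with replacement to match the majority
major_resampled = major.sample(n=len(non_major), replace=True, random_state=42)
balanced_df = pd.concat([major_resampled, non_major])
# Shuffle so the classes are interleaved
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)
out_data = table_from_frame(balanced_df)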
Testing with Random Forest
Look at accuracy, precision, recall, and F1-score
Which features contribute most to predictions?
Can we remove irrelevant or redundant features?
Weather? Terrain? Vegetation data?
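Orange's Test and Score widget reports these metrics in the GUI; if you prefer to script the same check, a rough sketch with Orange 3's scripting API (treat the exact call signatures as an assumption and adjust to your version):

import Orange

# Assumes in_data has MajorIncident set as the class (target) variable
learner = Orange.classification.RandomForestLearner()
results = Orange.evaluation.CrossValidation(in_data, [learner], k=5)

print("Accuracy:", Orange.evaluation.CA(results))
print("AUC:     ", Orange.evaluation.AUC(results))
print("F1:      ", Orange.evaluation.F1(results))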
Three main strategies when you cannot recover data
Drop rows or columns with missing values
Use when: Small % missing, plenty of data remains
Fill with statistical values (mean, median, mode) or use regression models
Use when: Missing values are random, not systematic
Add a boolean column indicating "value was missing"
Use when: Missingness itself is informative
Use the Impute widget for GUI-based imputation, or Python Script for custom logic
# Python Script alternative for custom imputation
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Fill numeric columns with median
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())
# Fill categorical with mode or 'Unknown'; cast to object first, since
# pandas categoricals reject labels that are not existing categories
df['Counties'] = df['Counties'].astype('object').fillna('Unknown')
out_data = table_from_frame(df)
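The third strategy, keeping a record that the value was missing, can be combined with the fill; a minimal sketch:

from Orange.data.pandas_compat import table_from_frame, table_to_frame

df = table_to_frame(in_data)
# Record missingness before filling, in case the gap itself is informative
df['AcresBurnedMissing'] = df['AcresBurned'].isna().astype(int)
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())
out_data = table_from_frame(df)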
Extracting useful temporal features
Convert timestamps into meaningful categorical or cyclical features
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Convert to datetime
df['Started'] = pd.to_datetime(df['Started'])
# Extract useful features
df['Month'] = df['Started'].dt.month
df['DayOfWeek'] = df['Started'].dt.dayofweek
df['Hour'] = df['Started'].dt.hour
# Create season feature (1=Winter, 2=Spring, 3=Summer, 4=Fall)
df['Season'] = df['Month'].map({
12:1, 1:1, 2:1, # Winter
3:2, 4:2, 5:2, # Spring
6:3, 7:3, 8:3, # Summer
9:4, 10:4, 11:4 # Fall
})
# Drop original date column
df = df.drop(columns=['Started'])
out_data = table_from_frame(df)
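Month, day of week, and hour above are plain integers; for the cyclical option mentioned earlier, a sin/cos encoding keeps December and January close together. A minimal sketch, continuing from the same `df`:

import numpy as np

# Cyclical encoding: month 12 and month 1 map to nearby points
df['Month_sin'] = np.sin(2 * np.pi * df['Month'] / 12)
df['Month_cos'] = np.cos(2 * np.pi * df['Month'] / 12)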
Enriching with weather information
Download and merge: Daily weather in California (1998-2020)
Use the Merge Data widget to join datasets
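The Merge Data widget matches rows on a shared column in the GUI. If you need more control, a pandas sketch in the Python Script widget (connect both tables so they appear in `in_datas`; the weather column names are assumptions, and a real join will likely also need a location key such as Counties):

import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame

fires = table_to_frame(in_datas[0])    # fire incidents
weather = table_to_frame(in_datas[1])  # daily weather

# Build a shared date key; 'Date' in the weather table is an assumption
fires['Date'] = pd.to_datetime(fires['Started']).dt.date
weather['Date'] = pd.to_datetime(weather['Date']).dt.date

merged = fires.merge(weather, on='Date', how='left')
out_data = table_from_frame(merged)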
Complete preparation pipeline
Compare metrics before and after each enhancement
Which features are most predictive?
What worked? What didn't? Why?
1. Understand your target first. Re-explore the data with your specific prediction question in mind.
2. Clean systematically. Duplicates, missing values, and outliers all affect model quality.
3. Feature engineering is critical. Extract useful information from dates, text, and complex fields.
4. External data adds value. Weather, terrain, and contextual data can dramatically improve predictions.