Data Exploration and Preparation
Hands-on exercises with the California Fire Incidents dataset using Orange and Python
Objectives
By the end of this practical work, you will be able to:
- Load and explore a real-world dataset in Orange
- Identify and handle missing values
- Deal with imbalanced classes using resampling techniques
- Extract useful features from date/time columns
- Build and evaluate a basic classification model
Prerequisites
- Orange Data Mining installed (available from the official Orange website)
- Basic understanding of Python (optional, for advanced exercises)
- The California Fire Incidents dataset
Install required Python packages (if using Python Script widget):
pip install pandas numpy scikit-learn
Instructions
Step 1: Load the Dataset
Download and load the California Fire Incidents dataset into Orange:
- Download the dataset: California_Fire_Incidents.csv
- In Orange, add a File widget to the canvas
- Double-click the widget and select the downloaded CSV file
- Connect a Data Table widget to view the data
Note: Take a moment to explore the dataset. How many rows and columns does it have? What types of features are present?
Step 2: Explore the Data
Add visualization widgets to understand the data:
- Connect a Distributions widget to see feature distributions
- Connect a Scatter Plot widget to explore relationships
- If you have the Geo add-on installed, use a Geo Map widget to visualize fire locations
Questions to answer:
- What is the distribution of MajorIncident (True vs False)?
- Which counties have the most fires?
- Is there a correlation between AcresBurned and PersonnelInvolved?
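If you prefer to check these questions programmatically, the same statistics can be computed with pandas. The sketch below uses a tiny made-up sample in place of the real CSV, so only the column names are assumed from the dataset:

```python
import pandas as pd

# Tiny made-up sample standing in for the real dataset
df = pd.DataFrame({
    "MajorIncident": [True, False, False, True, False, False],
    "Counties": ["Butte", "Shasta", "Butte", "Napa", "Butte", "Napa"],
    "AcresBurned": [1500.0, 20.0, 300.0, 5000.0, 10.0, 80.0],
    "PersonnelInvolved": [400, 15, 90, 1200, 8, 30],
})

# Class distribution of MajorIncident (proportions)
print(df["MajorIncident"].value_counts(normalize=True))

# Counties with the most fires
print(df["Counties"].value_counts().head())

# Pearson correlation between AcresBurned and PersonnelInvolved
print(df["AcresBurned"].corr(df["PersonnelInvolved"]))
```

With the real file, replace the inline DataFrame with `pd.read_csv(...)` pointing at your downloaded CSV.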
Step 3: Handle Missing Values
Use the Impute widget to handle missing values:
- Connect an Impute widget after your File widget
- Choose an imputation strategy: Average/Most frequent (mean for numeric features, mode for categorical)
- Compare the data before and after imputation
Alternatively, use the Python Script widget for custom imputation:
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Fill numeric with median
df['AcresBurned'] = df['AcresBurned'].fillna(df['AcresBurned'].median())
# Fill categorical with 'Unknown'
df['Counties'] = df['Counties'].fillna('Unknown')
out_data = table_from_frame(df)
Step 4: Remove Duplicates
Use a Python Script widget to remove duplicate rows:
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
original_count = len(df)
df = df.drop_duplicates()
new_count = len(df)
print(f"Removed {original_count - new_count} duplicates")
out_data = table_from_frame(df)
Step 5: Handle Imbalanced Classes
The MajorIncident target is imbalanced (~66% False, ~34% True). Use undersampling:
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Separate by class
major = df[df['MajorIncident'] == True]
non_major = df[df['MajorIncident'] == False]
# Undersample majority class
non_major_sampled = non_major.sample(n=len(major), random_state=42)
# Combine
balanced_df = pd.concat([major, non_major_sampled])
print(f"Balanced dataset: {len(balanced_df)} rows")
out_data = table_from_frame(balanced_df)
Step 6: Feature Engineering - Date Features
Extract useful features from the Started date column:
import pandas as pd
from Orange.data.pandas_compat import table_from_frame, table_to_frame
df = table_to_frame(in_data)
# Convert to datetime
df['Started'] = pd.to_datetime(df['Started'], errors='coerce')
# Extract features
df['Month'] = df['Started'].dt.month
df['DayOfWeek'] = df['Started'].dt.dayofweek
df['Season'] = df['Month'].map({
12:1, 1:1, 2:1, # Winter
3:2, 4:2, 5:2, # Spring
6:3, 7:3, 8:3, # Summer
9:4, 10:4, 11:4 # Fall
})
# Drop original date (prevents data leakage)
df = df.drop(columns=['Started', 'Extinguished'], errors='ignore')
out_data = table_from_frame(df)
Step 7: Build a Classification Model
Now build and evaluate a Random Forest model:
- Add a Select Columns widget to set MajorIncident as target
- Add a Random Forest widget (or other classifier)
- Add a Test and Score widget to evaluate
- Connect: Data → Select Columns → Test and Score ← Random Forest
Expected Output: You should see accuracy metrics. Aim for >70% accuracy with balanced data.
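The same evaluation pattern can be reproduced outside Orange with scikit-learn. This is only a sketch of the general workflow, using synthetic features in place of the prepared fire dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned dataset: 200 rows, 4 numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # binary target derived from the features

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy, comparable to what Test and Score reports
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2f}")
```

With the real data, `X` would be the engineered feature columns and `y` the MajorIncident target.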
Expected Output
After completing this practical work, you should have:
- A cleaned dataset with no missing values or duplicates
- A balanced dataset with equal class distribution
- New date-based features (Month, DayOfWeek, Season)
- A trained Random Forest model with evaluation metrics
Deliverables
- Orange Workflow: Save your Orange workflow (.ows file)
- Cleaned Dataset: Export the processed data as CSV
- Model Evaluation: Screenshot of Test and Score results
- Report: Brief summary of your findings (1-2 paragraphs)
Bonus Challenges
- Challenge 1: Merge with weather data (weather-sf.csv) to add climate features
- Challenge 2: Try different classifiers (SVM, Neural Network) and compare results
- Challenge 3: Use feature importance to identify the most predictive features
- Challenge 4: Create visualizations showing fire patterns by season or location
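For Challenge 3, one way to inspect feature importances is scikit-learn's impurity-based importances on a fitted Random Forest. The sketch below uses synthetic data, and the feature names are illustrative placeholders, not a claim about which dataset columns matter:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only the third feature determines the target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 2] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance value per feature; the values sum to 1
feature_names = ["AcresBurned", "PersonnelInvolved", "Month"]  # placeholder names
for name, imp in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

On the real dataset, fit the classifier on your engineered features and rank the columns by their importance values.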