December | 2025 | Jon's Code Blog

From Presence to Balanced Training Data: Generating Absence Points for PathWild

Posted: December 30, 2025 in AI / ML
Tags: ai, artificial-intelligence, machine-learning

In my previous post, I documented how I transformed raw GPS telemetry data from three elk tracking studies into structured training datasets. I ended with 4,650 points from South Bighorn, 94,591 from Southern GYE, and 104,913 from National Elk Refuge—all representing locations where elk were actually present. But for a binary classification model, presence data alone isn’t enough. I needed absence data: locations where elk were NOT present.

This post details how I built an absence generation system that creates negative examples using four complementary strategies, added parallel processing to handle large datasets, and validated the approach across all three datasets. The result: three balanced training datasets (each within a point or two of an exact 1:1 presence-to-absence ratio) totaling over 400,000 samples, ready for XGBoost training.

The Problem: Presence-Only Data

When I finished processing the GPS collar data, I had three CSV files full of presence points—locations where elk were definitively observed. But machine learning models need both positive and negative examples to learn what distinguishes elk habitat from non-habitat.

The challenge: Elk don’t come with labeled absence data. I can’t know for certain where elk were NOT present at any given time. I needed to generate plausible absence points that would help the model learn meaningful patterns.

This is a classic problem in species distribution modeling. Simply generating random points across Wyoming wouldn’t work: that would include lakes and reservoirs, urban areas, and other obviously unsuitable locations. I needed a more deliberate approach that would create high-quality negative examples.

The Strategy: Four Complementary Approaches

After researching species distribution modeling literature (particularly Elith & Leathwick 2009 and Barbet-Massin et al. 2012), I designed a multi-strategy approach that combines different types of absence data. These papers emphasize that pseudo-absence selection is one of the most critical factors affecting model performance, and that no single strategy works best for all situations.

As Barbet-Massin et al. (2012) note: “The selection of pseudo-absences is a critical step in species distribution modeling, and the method used can significantly influence model predictions.” They recommend generating large numbers of pseudo-absences (10,000+ or at least 1,000 across multiple datasets) and using multiple sampling strategies to capture different aspects of the species-environment relationship.

Elith & Leathwick (2009) further emphasize that background points should represent the “available habitat” from which species select, not just random geographic space. This informed my approach of combining environmentally-constrained pseudo-absences with random background sampling.

Strategy 1: Environmental Pseudo-Absences (40%)

Concept: Sample from environmentally suitable but unused habitat.

These represent locations that are physically suitable for elk (elevation 6,000-13,500 ft, moderate slopes, water nearby) but where elk chose not to be. This helps the model learn subtle preferences beyond basic habitat requirements. Elk use high alpine areas up to 13,500+ ft in summer, so the suitable range extends well above 12,000 ft.

Criteria:

≥2km from any presence point (spatial separation)
Elevation: 6,000-13,500 ft (suitable range; elk use high alpine areas in summer)
Slope: <45° (not too steep)
Water distance: <5 miles (accessible water)
Within Wyoming study area

Pros:

Most informative: Represents “available but unused” habitat, teaching the model subtle behavioral preferences
High signal-to-noise: Clear distinction from presence points while maintaining environmental similarity
Literature-supported: Aligns with Barbet-Massin et al.’s recommendation for environmentally-constrained pseudo-absences
Model learning: Helps model distinguish between suitable habitat that elk use vs. suitable habitat they avoid

Cons:

Computationally expensive: Requires checking multiple environmental constraints (elevation, slope, water) for each candidate
May be incomplete: With dense presence data, finding enough suitable-but-unused locations can be challenging
Requires environmental data: Needs DEM, slope, and water source data for best results (though defaults work)
Spatial separation requirement: The 2km minimum distance can be difficult to satisfy with very dense presence data

Literature Alignment: This strategy aligns with Barbet-Massin et al.’s (2012) finding that environmentally-constrained pseudo-absences often outperform pure random sampling. They note that “pseudo-absences should be selected from areas environmentally similar to presences but where the species was not observed”—exactly what this strategy does. Elith & Leathwick (2009) also emphasize that background points should represent available habitat, not just geographic space.

Why 40%? This is the largest component because it represents the most informative type of absence—places elk could be but aren’t, suggesting behavioral preferences the model should learn. Barbet-Massin et al. found that environmentally-constrained pseudo-absences generally produce better model performance than random background points.

Strategy 2: Unsuitable Habitat Absences (30%)

Concept: Sample from areas elk physically cannot or will not inhabit.

These are high-confidence absences because elk simply can’t survive in these conditions. This helps the model learn hard boundaries and extreme conditions.

Criteria:

Elevation <4,000 ft OR >14,000 ft (very low or extreme high elevations)
Slope >60° (too steep)
Urban areas, water bodies, barren land (NLCD codes: 11-12, 21-24, 31)
Water distance >10 miles (too remote)

Note: Elk use elevations up to 13,500+ ft in summer, utilizing high alpine meadows and slopes for food and cooler temperatures. They drop lower in winter or when pressured by hunters. Only very extreme elevations (>14,000 ft) are considered unsuitable.
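The criteria above can be read as a single predicate. The following is a minimal sketch, not the PathWild implementation: the function name `is_unsuitable_habitat` and its argument names are hypothetical, it assumes the elevation, slope, NLCD class, and water distance have already been looked up for the point, and it assumes that meeting any one criterion is enough to count as unsuitable.

```python
# NLCD classes named in the criteria: 11-12 (open water, perennial ice/snow),
# 21-24 (developed), 31 (barren land).
UNSUITABLE_NLCD = {11, 12, 21, 22, 23, 24, 31}


def is_unsuitable_habitat(
    elevation_ft: float,
    slope_deg: float,
    nlcd_code: int,
    water_dist_mi: float,
) -> bool:
    """Return True if ANY unsuitable-habitat criterion is met (hypothetical helper)."""
    return (
        elevation_ft < 4000          # very low elevation
        or elevation_ft > 14000      # extreme high elevation
        or slope_deg > 60            # too steep
        or nlcd_code in UNSUITABLE_NLCD
        or water_dist_mi > 10        # too remote from water
    )
```

A forested point (NLCD 42) at 8,000 ft on a 20° slope two miles from water fails every test and is therefore not an unsuitable-habitat absence; the same point on developed land (NLCD 23) is.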

Pros:

High confidence: These are true absences—elk physically cannot be in these conditions (very low elevations or extreme high elevations above 14,000 ft)
Clear boundaries: Helps model learn hard limits (e.g., elk don’t use very low elevations or extreme alpine zones)
Easier to generate: Fewer constraints mean faster generation, especially with parallel processing
Reduces false positives: By explicitly including unsuitable habitat, we reduce the chance of the model predicting presence in impossible locations

Cons:

Less informative: Model learns obvious boundaries rather than subtle preferences
May oversimplify: Real habitat suitability is rarely binary (suitable/unsuitable)
Requires land cover data: Best results need NLCD data to identify urban/water/barren areas
Potential bias: If unsuitable habitat is overrepresented, model may be too conservative

Literature Alignment: While not explicitly recommended in the core papers, this strategy addresses a key concern raised by Elith & Leathwick (2009): ensuring that background points represent available habitat. By explicitly including unsuitable habitat as absences, we help the model learn what habitat is truly unavailable, not just unused. This is particularly important for mobile species like elk that can access most of the landscape.

Why 30%? These provide clear negative examples that help the model establish boundaries. They’re easier to generate (fewer constraints) but less informative than pseudo-absences. The 30% balance ensures the model learns hard limits without overemphasizing obvious absences.

Strategy 3: Random Background Points (20%)

Concept: Pure random sampling of available habitat.

This represents “available habitat” vs “used habitat” (presence points). It’s the simplest approach but provides important baseline information.

Criteria:

≥500m from presence points (minimal separation)
Within study area
No other filters
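Random background sampling with a minimum-distance filter is essentially rejection sampling. Here is a minimal, dependency-free sketch of the idea. It is not the `RandomBackgroundGenerator` itself: the function name is hypothetical, a bounding box stands in for the study area polygon, and coordinates are assumed to be in a projected CRS with meter units.

```python
import math
import random


def sample_background_points(presence_xy, bounds, n_samples,
                             min_dist_m=500.0, max_attempts=10_000, seed=42):
    """Rejection-sample points that are at least min_dist_m from every presence.

    presence_xy: list of (x, y) tuples in meters (assumed projected CRS).
    bounds: (min_x, min_y, max_x, max_y), a stand-in for the study area.
    """
    rng = random.Random(seed)
    min_x, min_y, max_x, max_y = bounds
    accepted = []
    attempts = 0
    while len(accepted) < n_samples and attempts < max_attempts:
        attempts += 1
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        # Brute-force check against every presence point
        if all(math.hypot(x - px, y - py) >= min_dist_m for px, py in presence_xy):
            accepted.append((x, y))
    return accepted
```

The function can return fewer than `n_samples` points if the attempt budget runs out, which is the same failure mode discussed later in this post.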

Pros:

Simple and fast: Minimal constraints mean rapid generation
Geographic diversity: Samples the full range of available habitat
Literature standard: Barbet-Massin et al. (2012) recommend random sampling as a baseline method
Robust baseline: Provides a control against which other strategies can be compared
No data requirements: Works without environmental data files

Cons:

Less informative: Doesn’t distinguish between suitable and unsuitable habitat
May include unsuitable areas: Random sampling can include locations elk can’t access
Lower signal-to-noise: Less clear distinction from presence points compared to constrained methods
Potential bias: If the study area includes a lot of unsuitable habitat, random sampling will draw many absences from it, in proportion to its share of the area

Literature Alignment: This is the most commonly recommended approach in the literature. Barbet-Massin et al. (2012) found that “random sampling within the study area, excluding known presence points” is a reliable baseline method. They recommend generating large numbers (10,000+ or at least 1,000 across multiple datasets) of random pseudo-absences. Elith & Leathwick (2009) also emphasize that background points should represent the “available habitat” from which species make selections—random sampling within the study area achieves this.

Why 20%? Provides geographic diversity and helps the model understand the full range of available habitat, not just extremes. While less informative than constrained methods, it serves as an important baseline and ensures geographic coverage. Barbet-Massin et al. note that random sampling often performs well, especially when combined with other strategies.

Strategy 4: Temporal Absences (10%)

Concept: Same locations as presence points, but different time periods.

This is particularly powerful for datasets with timestamps. If an elk was at a location in summer, that same location during winter represents an absence (elk migrate seasonally). This helps the model learn temporal patterns.

Criteria:

Same coordinates as presence points
Different season (summer presence → winter absence, etc.)
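As a rough illustration of the two criteria, the sketch below turns one timestamped presence record into a temporal absence. Everything here is an assumption made for the example: the meteorological month-to-season mapping, the "opposite season" pairing, the `presence` label column, and the function name. Only the `absence_strategy` column name appears elsewhere in this post.

```python
from datetime import date

# Assumed meteorological seasons (Northern Hemisphere)
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}
# Assumed pairing used to pick the "different season"
OPPOSITE_SEASON = {
    "summer": "winter", "winter": "summer",
    "spring": "fall", "fall": "spring",
}


def temporal_absence(lon: float, lat: float, observed: date) -> dict:
    """Same coordinates as the presence, relabeled to the opposite season."""
    season = SEASON_BY_MONTH[observed.month]
    return {
        "longitude": lon,
        "latitude": lat,
        "season": OPPOSITE_SEASON[season],
        "presence": 0,                      # hypothetical label column
        "absence_strategy": "temporal",
    }
```

A July presence becomes a winter absence at the same coordinates.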

Pros:

Temporal learning: Explicitly teaches the model that habitat suitability varies by season
High confidence: Same location, different time = clear absence (for migratory species)
No spatial constraints: Uses existing presence locations, so no distance checking needed
Fast generation: No random sampling or constraint checking required
Species-specific: Captures seasonal migration patterns unique to elk

Cons:

Limited applicability: Only works for datasets with timestamps
Species-dependent: Less useful for non-migratory species
May confuse model: If temporal patterns aren’t strong, this adds noise
Small proportion: Limited to 10% because not all datasets have temporal data

Literature Alignment: While not explicitly covered in the core papers, this strategy addresses temporal variation in habitat use—a key factor in species distribution modeling. Elith & Leathwick (2009) emphasize that “species distributions are dynamic, changing over time in response to environmental conditions”. By using temporal absences, we explicitly encode this temporal dimension into the training data. This is particularly relevant for migratory species like elk, where the same location can be suitable in one season but unsuitable in another.

Why 10%? Only applicable to datasets with timestamps, but provides valuable temporal learning signal. The 10% proportion ensures temporal patterns are represented without overwhelming the model with season-specific examples. For non-migratory species or datasets without timestamps, this strategy would be skipped entirely.

Literature Alignment: Why This Multi-Strategy Approach Works

The four-strategy approach I implemented aligns with key findings from the species distribution modeling literature:

Key Findings from Barbet-Massin et al. (2012)

Their comprehensive review of pseudo-absence selection methods found:

Large numbers matter: They recommend generating 10,000+ pseudo-absences or at least 1,000 across multiple datasets. My implementation generates absences equal to presence points (1:1 ratio), which for large datasets like Southern GYE (94,591 points) far exceeds this recommendation.
Multiple strategies outperform single methods: The paper notes that “combining different pseudo-absence selection strategies can improve model performance”. My 40/30/20/10 split combines four complementary approaches rather than relying on a single method.
Environmentally-constrained pseudo-absences often perform best: The study found that pseudo-absences selected from environmentally suitable areas (similar to Strategy 1) generally outperform pure random sampling. This informed my decision to make environmental pseudo-absences the largest component (40%).
Random sampling is a reliable baseline: While constrained methods often perform better, random sampling within the study area (Strategy 3) is consistently reliable and provides geographic diversity. This is why I include it at 20%.

Key Findings from Elith & Leathwick (2009)

Their review emphasizes several principles that informed my design:

Background points should represent available habitat: The paper emphasizes that background points should represent the available habitat from which species make selections, not just random geographic space. My environmental pseudo-absences (Strategy 1) and random background points (Strategy 3) both sample from available habitat, while unsuitable habitat absences (Strategy 2) deliberately sample areas that are not available to elk at all.
Spatial separation matters: They note that pseudo-absences should be spatially separated from presence points to avoid ambiguous cases. My implementation uses distance constraints (2km for environmental, 500m for background) to ensure clear spatial separation.
Temporal variation is important: The paper emphasizes that “species distributions are dynamic, changing over time in response to environmental conditions”. My temporal absences (Strategy 4) explicitly encode this temporal dimension.

Why the 40/30/20/10 Split?

The proportions I chose balance several factors:

40% Environmental: Largest component because Barbet-Massin et al. found environmentally-constrained pseudo-absences generally perform best. This provides the most informative learning signal.
30% Unsuitable: Ensures the model learns hard boundaries without overemphasizing obvious absences. This addresses Elith & Leathwick’s concern about representing truly unavailable habitat.
20% Random: Provides geographic diversity and serves as a reliable baseline. Barbet-Massin et al. found random sampling often performs well, especially when combined with other methods.
10% Temporal: Captures seasonal patterns without overwhelming the model. Only applicable to datasets with timestamps, so kept small.

This multi-strategy approach addresses the core challenge identified in the literature: no single pseudo-absence selection method works best for all situations. By combining four complementary strategies, I create a robust training dataset that captures different aspects of the species-environment relationship.

Implementation: Building the Absence Generator System

I implemented this as a modular, extensible system in Python. The architecture follows object-oriented design principles with a base class and strategy-specific subclasses.

Base Class: `AbsenceGenerator`

The foundation is an abstract base class that handles common functionality:

from abc import ABC

import geopandas as gpd


class AbsenceGenerator(ABC):
    """Abstract base class for generating absence points."""
    
    def __init__(
        self,
        presence_data: gpd.GeoDataFrame,
        study_area: gpd.GeoDataFrame,
        min_distance_meters: float = 500.0,
        crs: str = "EPSG:4326"
    ):
        self.presence_data = presence_data.copy()
        self.study_area = study_area.copy()
        self.min_distance_meters = min_distance_meters
        self.crs = crs
        
        # Convert to UTM for accurate distance calculations
        self.utm_crs = "EPSG:32613"  # UTM Zone 13N for Wyoming
        self.presence_utm = self.presence_data.to_crs(self.utm_crs)

Key design decisions:

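UTM projection for distances: WGS84 (lat/lon) isn’t suitable for distance calculations, so I convert to UTM Zone 13N (EPSG:32613) for meter-based distances. Zone 13N covers Wyoming east of 108°W; the western part of the state, including the Jackson area, falls in Zone 12N, so distances there pick up some extra distortion from being projected outside their native zone.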
Copying data: Each generator gets its own copy to avoid side effects during parallel processing.
Flexible CRS: Supports different coordinate systems, though we default to WGS84 for compatibility.
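For reference, the UTM zone for a point follows directly from its longitude, and the WGS84 UTM EPSG code follows from the zone. The helper below is a small sketch with a hypothetical name. It ignores the special-case zones around Norway and Svalbard, which do not matter for Wyoming. Recent GeoPandas versions also provide `estimate_utm_crs()` on a GeoSeries or GeoDataFrame, which may be a more convenient option.

```python
import math


def utm_epsg_for(lon: float, lat: float) -> int:
    """EPSG code of the WGS84 UTM zone containing (lon, lat)."""
    # Zones are 6 degrees wide, numbered 1..60 starting at 180°W
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)
    # 326xx = northern hemisphere, 327xx = southern hemisphere
    return (32600 if lat >= 0 else 32700) + zone
```

A point at 107°W, 43°N maps to EPSG:32613, while a point near Jackson, Wyoming (about 110.8°W) maps to EPSG:32612.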

The base class also implements distance constraint checking:

def check_distance_constraint(
    self,
    candidate_point: Point,
    min_distance_meters: Optional[float] = None
) -> bool:
    """Check if candidate point is far enough from all presence points."""
    if min_distance_meters is None:
        min_distance_meters = self.min_distance_meters
    
    # Convert candidate to UTM for distance calculation
    candidate_gdf = gpd.GeoDataFrame(
        geometry=[candidate_point],
        crs=self.crs
    ).to_crs(self.utm_crs)
    
    candidate_utm = candidate_gdf.geometry.iloc[0]
    
    # Calculate distances to all presence points
    distances = self.presence_utm.geometry.distance(candidate_utm)
    min_distance = distances.min()
    
    return min_distance >= min_distance_meters

This is the computational bottleneck: for each candidate absence point, we check distance to ALL presence points. With 94,591 presence points, that’s 94,591 distance calculations per candidate. This is why parallel processing became essential.
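One common way to cut the per-candidate cost is a spatial index, so that each candidate is compared only with nearby presence points. The sketch below uses a plain grid of square cells and is an illustration only, not what `check_distance_constraint` does. The class name `GridIndex` is hypothetical and coordinates are assumed to be projected meters. Tree-based indexes such as Shapely's `STRtree` or SciPy's `cKDTree` serve the same purpose.

```python
import math
from collections import defaultdict


class GridIndex:
    """Bucket projected (x, y) points into square cells of side cell_size_m."""

    def __init__(self, points_xy, cell_size_m):
        self.cell = cell_size_m
        self.buckets = defaultdict(list)
        for x, y in points_xy:
            self.buckets[self._key(x, y)].append((x, y))

    def _key(self, x, y):
        return (math.floor(x / self.cell), math.floor(y / self.cell))

    def is_far_enough(self, x, y):
        """True if no indexed point lies closer than cell_size_m to (x, y)."""
        cx, cy = self._key(x, y)
        # Any point closer than one cell width must sit in the same cell
        # or one of its 8 neighbours, so only those 9 buckets are scanned.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for px, py in self.buckets.get((cx + dx, cy + dy), ()):
                    if math.hypot(x - px, y - py) < self.cell:
                        return False
        return True
```

With the cell size set to the minimum distance (for example 2,000 m), the check gives the same answer as the brute-force scan while touching only the presence points in nine cells.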

Strategy Implementation: Environmental Pseudo-Absences

The environmental generator adds habitat suitability checks:

class EnvironmentalPseudoAbsenceGenerator(AbsenceGenerator):
    """Generate pseudo-absences from environmentally suitable but unused habitat."""
    
    def _is_environmentally_suitable(self, point: Point) -> bool:
        """Check if point meets environmental suitability criteria."""
        lon, lat = point.x, point.y
        
        # Check elevation (6,000-13,500 ft; elk use high alpine areas in summer)
        elevation_m = self._sample_raster(self.dem, lon, lat, default=2500.0)
        elevation_ft = elevation_m * 3.28084
        if not (6000 <= elevation_ft <= 13500):
            return False
        
        # Check slope (<45°)
        slope_deg = self._sample_raster(self.slope, lon, lat, default=15.0)
        if slope_deg >= 45.0:
            return False
        
        # Check water distance (<5 miles)
        water_dist_mi = self._calculate_water_distance(point)
        if water_dist_mi > 5.0:
            return False
        
        return True

The generator loads environmental data (DEM, slope, water sources) if available, but gracefully falls back to defaults if files aren’t present. This allows the system to work even without complete environmental datasets.

The Sequential Problem: Hitting Limits

My initial implementation worked perfectly for the small South Bighorn dataset (4,650 points). But when I tried the Southern GYE dataset (94,591 points), I hit a wall:

Generating 37,836 environmental pseudo-absences...
  Generated 9,557/37,836 points...
⚠ Only generated 9,557/37,836 environmental absences after 10,000 attempts

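The generator was hitting the max_attempts=10,000 limit and stopping early. With a cap of 10,000 attempts it could never reach a target of 37,836 points, no matter how many candidates passed. The result: only 38,565 absences generated instead of 94,591, a 2.45:1 class imbalance that would bias the model.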

Why was this happening?

Dense presence data: With 94,591 presence points, finding locations ≥2km from ANY presence point is computationally expensive
Multiple constraints: Each candidate must pass distance, elevation, slope, and water checks
Sequential processing: One candidate at a time, checking 94,591 distances each

The sequential algorithm was simply too slow. I needed to parallelize.
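The log output above also allows a quick back-of-envelope estimate of how many attempts the environmental strategy would have needed. This is a sketch under one assumption the log does not confirm: that the acceptance rate seen in the first 10,000 attempts (9,557 accepted) would have held for the rest of the run. The helper name `attempts_needed` is hypothetical.

```python
def attempts_needed(target: int, accepted: int, attempts: int) -> int:
    """Estimate attempts required to reach target at the observed acceptance rate."""
    # Integer ceiling of target / (accepted / attempts)
    return (target * attempts + accepted - 1) // accepted


needed = attempts_needed(target=37_836, accepted=9_557, attempts=10_000)
print(needed)  # 39590
```

Under that assumption the target needs roughly 39,590 attempts, about four times the fixed cap, so the attempt budget has to grow along with any gain in raw speed.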

Parallel Processing: The Solution

I initially considered stratified sampling (using a subset of the data), but that felt wasteful—I’d be throwing away 47% of my carefully collected GPS data. Instead, I implemented parallel processing to speed up generation while using all the data.

Architecture: Worker-Based Parallelism

The parallel implementation uses Python’s multiprocessing.Pool to distribute work across CPU cores:

def _generate_parallel(
    self,
    n_samples: int,
    max_attempts: int,
    n_processes: Optional[int] = None,
    strategy_name: str = "absence"
) -> gpd.GeoDataFrame:
    """Generate absence points using parallel processing."""
    if n_processes is None:
        n_processes = min(cpu_count(), 8)  # Cap at 8 to avoid overhead
    
    if n_processes == 1:
        # Fall back to sequential
        points = self._generate_worker(n_samples, max_attempts, seed=42)
    else:
        # Split work across processes
        # divmod keeps the total exact even when n_samples < n_processes
        samples_per_process, remaining_samples = divmod(n_samples, n_processes)
        
        # Distribute remaining samples
        worker_args = []
        for i in range(n_processes):
            worker_n_samples = samples_per_process
            if i < remaining_samples:
                worker_n_samples += 1
            
            # Use different seeds for each worker
            seed = 42 + i
            worker_args.append((worker_n_samples, max_attempts, seed))
        
        # Generate in parallel
        with Pool(processes=n_processes) as pool:
            results = pool.starmap(self._generate_worker, worker_args)
        
        # Combine results
        points = []
        for result in results:
            points.extend(result)

Key design decisions:

Auto-detect cores: Defaults to number of CPU cores (capped at 8 to avoid overhead)
Even work distribution: Splits target samples across processes, handling remainders
Reproducible: Each worker uses a different seed (42, 43, 44…), so results are deterministic for a given number of processes
Graceful fallback: If n_processes=1, uses sequential processing
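The work-splitting step can be isolated into a small pure function, which makes it easy to test on its own. This is a sketch with a hypothetical name (`split_work`), not code from the generator. It returns one (sample count, seed) pair per worker and skips workers that would have nothing to do when there are fewer samples than processes.

```python
def split_work(n_samples: int, n_processes: int, base_seed: int = 42):
    """Return (n_samples, seed) pairs per worker that sum exactly to n_samples."""
    per_worker, remainder = divmod(n_samples, n_processes)
    plan = []
    for i in range(n_processes):
        # The first `remainder` workers take one extra sample each
        count = per_worker + (1 if i < remainder else 0)
        if count > 0:  # skip idle workers when n_samples < n_processes
            plan.append((count, base_seed + i))
    return plan
```

For the 37,836 environmental targets on 8 workers, this gives four workers 4,730 samples and four workers 4,729.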

Worker Function: Pickleable and Stateless

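The worker function must be pickleable (for multiprocessing) and must not depend on shared mutable state, so that each worker runs independently. Because it is a bound method, multiprocessing pickles the generator instance along with it, so everything stored on self has to be pickleable too: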

def _generate_worker(
    self,
    n_samples: int,
    max_attempts: int,
    seed: Optional[int] = None
) -> list:
    """Worker function for parallel generation."""
    if seed is not None:
        np.random.seed(seed)
    
    absence_points = []
    attempts = 0
    
    while len(absence_points) < n_samples and attempts < max_attempts:
        attempts += 1
        
        # Sample random point
        point = self._sample_random_point_in_study_area()
        if point is None:
            continue
        
        # Check distance constraint
        if not self.check_distance_constraint(point):
            continue
        
        # Check additional constraints (subclass-specific)
        if hasattr(self, '_is_environmentally_suitable'):
            if not self._is_environmentally_suitable(point):
                continue
        
        absence_points.append(point)
    
    return absence_points

Each worker:

Generates a subset of the total samples
Uses its own random seed for reproducibility
Checks all constraints independently
Returns a list of valid points

The main process then combines results from all workers.
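The worker above seeds NumPy's global generator with `np.random.seed`. An alternative that avoids shared global state is to give each worker its own generator object; NumPy offers `np.random.default_rng(seed)` for this. The sketch below shows the same idea with the standard library's `random.Random` so it runs without dependencies. The function `draw_candidates` and its `bounds` argument are hypothetical.

```python
import random


def draw_candidates(seed: int, n: int, bounds):
    """Draw n (lon, lat) candidates from a worker-local random generator."""
    rng = random.Random(seed)  # local generator, independent of random.seed()
    min_x, min_y, max_x, max_y = bounds
    return [
        (rng.uniform(min_x, max_x), rng.uniform(min_y, max_y))
        for _ in range(n)
    ]
```

Two calls with the same seed return identical candidates regardless of what other code does to the global random state, and different seeds give different streams.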

Adaptive max_attempts: Scaling with Dataset Size

I also implemented adaptive max_attempts calculation that scales with dataset size:

def _calculate_adaptive_max_attempts(self, n_samples: int) -> int:
    """Calculate adaptive max_attempts based on dataset size."""
    n_presence = len(self.presence_data)
    
    # Base max_attempts
    base_max_attempts = 10000
    
    # Scale with dataset size
    if n_presence > 50000:
        # Very large dataset: scale aggressively
        scale_factor = max(3.0, n_samples / 5000.0)
    elif n_presence > 10000:
        # Large dataset: moderate scaling
        scale_factor = max(2.0, n_samples / 10000.0)
    else:
        # Small dataset: minimal scaling
        scale_factor = max(1.0, n_samples / 10000.0)
    
    max_attempts = int(base_max_attempts * scale_factor)
    max_attempts = min(max_attempts, 1000000)  # Cap at 1M
    
    return max_attempts

For the Southern GYE dataset (94,591 presence points, 37,836 target absences), this calculates:

scale_factor = max(3.0, 37836 / 5000) = 7.5672
max_attempts = int(10000 * 7.5672) = 75,672

This gives the generator enough attempts to find valid points, even with dense presence data.
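To make the tiers easy to check by hand, here is the same rule as a standalone function. It mirrors the method above but takes the presence count as an argument instead of reading it from `self`; the name `adaptive_max_attempts` and the `base` and `cap` parameters are introduced only for this sketch.

```python
def adaptive_max_attempts(n_presence: int, n_samples: int,
                          base: int = 10_000, cap: int = 1_000_000) -> int:
    """Standalone version of the tiered scaling rule shown above."""
    if n_presence > 50_000:
        scale = max(3.0, n_samples / 5_000)    # very large dataset
    elif n_presence > 10_000:
        scale = max(2.0, n_samples / 10_000)   # large dataset
    else:
        scale = max(1.0, n_samples / 10_000)   # small dataset
    return min(int(base * scale), cap)


print(adaptive_max_attempts(94_591, 37_836))  # 75672
print(adaptive_max_attempts(4_650, 1_860))    # 10000
```

Note that in the parallel path shown earlier, every worker is handed the full max_attempts value, so the combined budget across 8 workers can be up to eight times this figure.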

Results: Perfect Balance Across All Datasets

After implementing parallel processing, I re-ran the generation for all three datasets:

South Bighorn Dataset

Input: 4,650 presence points
Output: 9,300 total samples (4,650 presence + 4,650 absence)
Ratio: 1.00 (perfect)
Strategy distribution: 40/30/20/10 (perfect match)
Runtime: ~2 minutes

Southern GYE Dataset

Input: 94,591 presence points
Output: 189,181 total samples (94,591 presence + 94,590 absence)
Ratio: 1.00 (perfect)
Strategy distribution: 40/30/20/10 (perfect match)
Runtime: ~35 minutes (with 8 cores)
Improvement: From 2.45:1 imbalance to perfect 1:1 balance

National Refuge Dataset

Input: 104,913 presence points (largest dataset)
Output: 209,824 total samples (104,913 presence + 104,911 absence)
Ratio: 1.00 (perfect)
Strategy distribution: 40/30/20/10 (perfect match)
Runtime: ~45 minutes (with 8 cores)

Total combined: 408,305 training samples across all three datasets.

Testing: Comprehensive Coverage

I built a comprehensive test suite to ensure the absence generation system works correctly:

Base Functionality Tests

def test_distance_constraint(self, sample_presence_data, sample_study_area):
    """Test distance constraint checking."""
    generator = RandomBackgroundGenerator(
        sample_presence_data,
        sample_study_area,
        min_distance_meters=1000.0
    )
    
    # Point far from presences should pass
    far_point = Point(-108.0, 44.0)
    assert generator.check_distance_constraint(far_point)
    
    # Point close to presences should fail
    close_point = sample_presence_data.geometry.iloc[0]
    assert not generator.check_distance_constraint(close_point)

Parallel Processing Tests

def test_parallel_vs_sequential(self, sample_presence_data, sample_study_area):
    """Test that parallel and sequential produce similar results."""
    generator = RandomBackgroundGenerator(
        sample_presence_data,
        sample_study_area,
        min_distance_meters=500.0
    )
    
    # Generate with sequential
    absences_seq = generator.generate(n_samples=10, max_attempts=2000, n_processes=1)
    
    # Generate with parallel
    absences_par = generator.generate(n_samples=10, max_attempts=2000, n_processes=2)
    
    # Both should produce valid results
    assert len(absences_seq) > 0
    assert len(absences_par) > 0
    assert 'absence_strategy' in absences_seq.columns
    assert 'absence_strategy' in absences_par.columns

Adaptive max_attempts Tests

def test_adaptive_max_attempts(self, sample_presence_data, sample_study_area):
    """Test adaptive max_attempts calculation."""
    generator = RandomBackgroundGenerator(
        sample_presence_data,
        sample_study_area
    )
    
    # Small dataset should have base max_attempts
    max_attempts_small = generator._calculate_adaptive_max_attempts(100)
    assert max_attempts_small >= 10000
    
    # Large dataset should scale up
    large_presence = gpd.GeoDataFrame(
        geometry=[Point(-107.0, 43.0)] * 50000,
        crs="EPSG:4326"
    )
    large_generator = RandomBackgroundGenerator(large_presence, sample_study_area)
    max_attempts_large = large_generator._calculate_adaptive_max_attempts(20000)
    assert max_attempts_large > max_attempts_small

The test suite covers:

Distance constraint checking
Random point sampling
All four generator strategies
Parallel processing functionality
Adaptive max_attempts scaling
Integration tests for combining strategies

Why Parallel Processing Over Stratified Sampling?

When I first encountered the class imbalance issue, I considered two solutions:

Stratified sampling: Use a subset of presence points (e.g., 50,000) and generate matching absences
Parallel processing: Use all presence points but generate absences faster

I chose parallel processing for several reasons:

1. No Data Loss

Stratified sampling would discard 47% of the Southern GYE data (44,591 points). These represent real GPS collar data collected over years—throwing them away felt wasteful. Parallel processing uses all the data.

2. Solves the Real Problem

The issue wasn’t data quality—it was computational speed. The sequential algorithm checking 94,591 distances per candidate was simply too slow. Parallel processing addresses the root cause.

3. Scalability

If I get more data later, parallel processing scales. Stratified sampling requires rethinking the approach. The parallel implementation successfully handled the largest dataset (104,913 points), proving it scales.

4. Better Models

More training data generally improves model performance. Using all 94,591 points is better than 50,000, especially for a general-purpose model that needs to generalize across diverse conditions.

5. Future-Proof

The parallel implementation works for any dataset size. As I discover new data sources or the datasets grow, the system will handle them without modification.

Performance: Before and After

Sequential (Before)

Southern GYE Dataset:

Runtime: 2-3 hours
Completion: 40.8% (38,565 / 94,591 absences)
Class ratio: 2.45:1 (unbalanced)
Strategy distribution: Roughly equal (25% each) – all hit max_attempts limits

Parallel (After)

Southern GYE Dataset:

Runtime: 30-45 minutes (4-6x faster)
Completion: 100% (94,590 / 94,591 absences)
Class ratio: 1.00:1 (perfect balance)
Strategy distribution: Perfect 40/30/20/10 match

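Speedup: about 4-6x in wall-clock time with 8 cores (short of the 8x that perfect scaling would give), and the run now completes instead of stopping at 40.8%.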

The Orchestration Script

The main script (scripts/generate_absence_data.py) orchestrates the entire process:

def main():
    # Load presence data
    presence_df = pd.read_csv(args.presence_file)
    presence_gdf = gpd.GeoDataFrame(
        presence_df,
        geometry=gpd.points_from_xy(
            presence_df.longitude,
            presence_df.latitude
        ),
        crs="EPSG:4326"
    )
    
    # Calculate absence targets (40/30/20/10 split)
    n_presence = len(presence_gdf)
    n_total_absences = int(n_presence * args.ratio)
    n_environmental = int(n_total_absences * 0.40)
    n_unsuitable = int(n_total_absences * 0.30)
    n_background = int(n_total_absences * 0.20)
    n_temporal = int(n_total_absences * 0.10)
    
    # Generate absences using each strategy (with parallel processing)
    env_gen = EnvironmentalPseudoAbsenceGenerator(
        presence_gdf, study_area, data_dir=data_dir
    )
    env_absences = env_gen.generate(n_environmental, n_processes=args.n_processes)
    
    # ... (similar for other strategies)
    
    # Combine and enrich with environmental features
    training_data = pd.concat([presence_gdf, all_absences_gdf], ignore_index=True)
    training_data = enrich_with_features(training_data, data_dir)
    
    # Save
    training_data.to_csv(output_file, index=False)

The script:

Loads presence data and study area boundaries
Calculates target absences for each strategy
Generates absences using parallel processing
Validates spatial separation and class balance
Enriches with environmental features using DataContextBuilder
Combines and shuffles presence/absence data
Saves the balanced training dataset
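One detail in the target calculation: each of the four targets is truncated with int(), so together they can sum to slightly less than n_total_absences. For 94,591 presences the four truncated targets add up to 94,590, which is consistent with the Southern GYE output reported above. A possible remedy is largest-remainder allocation, sketched below. The function name `allocate_absences` and the strategy keys are hypothetical, not taken from the script.

```python
def allocate_absences(n_total: int, weights_pct: dict) -> dict:
    """Split n_total by integer percentages so the parts sum exactly to n_total."""
    if sum(weights_pct.values()) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer arithmetic avoids floating-point truncation surprises
    base = {k: n_total * w // 100 for k, w in weights_pct.items()}
    remainders = {k: n_total * w % 100 for k, w in weights_pct.items()}
    leftover = n_total - sum(base.values())
    # Hand the leftover points to the strategies with the largest remainders
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        base[k] += 1
    return base


SPLIT = {"environmental": 40, "unsuitable": 30, "background": 20, "temporal": 10}
print(allocate_absences(94_591, SPLIT))
# {'environmental': 37837, 'unsuitable': 28377, 'background': 18918, 'temporal': 9459}
```

With this allocation the targets sum to exactly 94,591, at the cost of one strategy being a single point above its nominal share.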

Validation: Ensuring Quality

The script includes comprehensive validation:

def validate_absence_data(
    presence_gdf: gpd.GeoDataFrame,
    absence_gdf: gpd.GeoDataFrame
) -> bool:
    """Validate that absence data meets quality requirements."""
    
    # Project to UTM so distances are in meters (same zone as AbsenceGenerator)
    presence_utm = presence_gdf.to_crs("EPSG:32613")
    absence_utm = absence_gdf.to_crs("EPSG:32613")
    
    # Check 1: Spatial separation
    # For each absence, distance to its nearest presence point
    min_distances = []
    for absence_point in absence_utm.geometry:
        distances = presence_utm.geometry.distance(absence_point)
        min_distances.append(distances.min())
    
    mean_dist = np.array(min_distances).mean()
    assert mean_dist > 1000, "Absences too close to presences on average"
    
    # Check 2: Geographic coverage
    # Absence points should cover similar extent as presence points
    
    # Check 3: Class balance
    ratio = len(presence_gdf) / len(absence_gdf)
    assert 0.5 <= ratio <= 2.0, "Class ratio outside recommended range"
    
    return True

This ensures:

Spatial separation: Mean distance >1km (prevents ambiguous points)
Geographic coverage: Absences cover full study area
Class balance: Ratio between 0.5 and 2.0 (ideally 1.0)
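
The spatial separation check can also be sketched without a per-point loop over geometries. The version below is a minimal illustration, assuming the coordinates have already been projected to meters and extracted as plain x/y pairs. The function names and the chunk size are hypothetical. It reports both the mean nearest-presence distance and the share of absences beyond the 1 km threshold.

```python
import numpy as np


def nearest_presence_distances(absence_xy, presence_xy, chunk_size=1000):
    """Distance (m) from each absence point to its nearest presence point.

    Both inputs are (n, 2) sequences of projected x/y coordinates in meters.
    """
    absence_xy = np.asarray(absence_xy, dtype=float)
    presence_xy = np.asarray(presence_xy, dtype=float)
    nearest = np.empty(len(absence_xy))

    # Work in chunks so the pairwise distance matrix stays small in memory
    for start in range(0, len(absence_xy), chunk_size):
        chunk = absence_xy[start:start + chunk_size]
        diff = chunk[:, None, :] - presence_xy[None, :, :]
        nearest[start:start + chunk_size] = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return nearest


def summarize_separation(absence_xy, presence_xy, threshold_m=1000.0):
    """Return the mean nearest distance and the share of absences beyond the threshold."""
    nearest = nearest_presence_distances(absence_xy, presence_xy)
    return {
        "mean_m": float(nearest.mean()),
        "share_beyond_threshold": float((nearest > threshold_m).mean()),
    }


presence = [(0.0, 0.0), (500.0, 0.0)]
absence = [(0.0, 3000.0), (2500.0, 0.0), (500.0, 600.0)]
print(summarize_separation(absence, presence))
```

Reporting the share alongside the mean is useful because a few very distant absences can pull the mean above 1 km even when many points sit close to presences.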

Lessons Learned

1. Start Simple, Scale When Needed

The sequential implementation worked perfectly for small datasets. I only needed parallel processing when I hit the large dataset (94K+ points). This follows the principle: solve problems when you encounter them, not preemptively.

2. Profile Before Optimizing

I didn’t guess that distance checking was the bottleneck—I measured. The validation showed that 88% of absences were >1km from presence points, but the sequential algorithm was too slow to generate enough of them. This told me the problem was speed, not feasibility.

3. Modular Design Enables Parallelization

The worker function design (pickleable, stateless) made parallelization straightforward. If I’d tightly coupled the generation logic, adding parallelism would have been much harder.
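
As a rough illustration of what "pickleable, stateless" means in practice, here is a minimal sketch of a module-level worker that receives everything it needs in a single tuple. It is not the PathWild implementation: the names generate_batch and generate_parallel, the fixed attempt cap, and the plain x/y coordinates in meters are all assumptions made for the example.

```python
import random
from multiprocessing import Pool


def generate_batch(task):
    """Stateless worker: everything it needs arrives in one picklable tuple."""
    seed, n_points, bounds, presence_xy, min_dist_m = task
    min_x, min_y, max_x, max_y = bounds
    rng = random.Random(seed)  # per-task seed keeps batches reproducible and distinct
    accepted = []
    attempts = 0
    max_attempts = n_points * 100  # illustrative cap, not the adaptive rule from the post

    while len(accepted) < n_points and attempts < max_attempts:
        attempts += 1
        x, y = rng.uniform(min_x, max_x), rng.uniform(min_y, max_y)
        # Reject candidates that fall within min_dist_m of any presence point
        if all((x - px) ** 2 + (y - py) ** 2 >= min_dist_m ** 2 for px, py in presence_xy):
            accepted.append((x, y))
    return accepted


def generate_parallel(n_points, bounds, presence_xy, min_dist_m, n_processes=4):
    """Split the target across workers and merge their results."""
    per_worker = -(-n_points // n_processes)  # ceiling division
    tasks = [(seed, per_worker, bounds, presence_xy, min_dist_m) for seed in range(n_processes)]
    with Pool(n_processes) as pool:
        batches = pool.map(generate_batch, tasks)
    merged = [pt for batch in batches for pt in batch]
    return merged[:n_points]
```

Because generate_batch is defined at module level and reads no shared state, it can be sent to worker processes as-is. In a script, generate_parallel should be called from under an `if __name__ == "__main__":` guard.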

4. Adaptive Parameters Scale Better Than Fixed

The adaptive max_attempts calculation automatically handles different dataset sizes. A fixed value would require manual tuning for each dataset.

5. Validation Catches Issues Early

The validation function caught the class imbalance immediately. Without it, I might have trained a biased model and only discovered the issue later.

Next Steps: Model Training

With three balanced training datasets totaling 408,305 samples, I’m ready for the next phase:

Feature engineering: All points are enriched with environmental features via DataContextBuilder
Model training: Train XGBoost binary classifier on the combined dataset
Validation: Test the model on Area 048 during the October 2026 hunt
Iteration: Refine based on real-world performance

The absence generation system is production-ready and has proven to scale from small (4.6K points) to very large (104K+ points) datasets with consistent results.

Technical Details

All code is available in the PathWild repository:

src/data/absence_generators.py – Core absence generation classes
scripts/generate_absence_data.py – Main orchestration script
tests/test_absence_generators.py – Comprehensive test suite
docs/absence_data_generation.md – Detailed documentation

The system uses:

GeoPandas for spatial operations
Shapely for geometry calculations
Rasterio for environmental data sampling (when available)
Multiprocessing for parallel generation
Pandas for data manipulation

The Takeaway

Building a robust absence generation system required:

Multiple strategies – No single approach captures all the nuances
Parallel processing – Essential for large datasets
Adaptive parameters – Scale with dataset size
Comprehensive testing – Ensure quality and correctness
Validation – Catch issues before training

The result is a system that transforms presence-only GPS data into balanced training datasets suitable for machine learning, while preserving all the valuable data I collected. This sets the foundation for training a general-purpose elk location prediction model that I’ll validate in the field next October.

Building PathWild continues to be an exercise in iterative development. Each phase—from data exploration to absence generation—builds on the previous work. The parallel processing implementation solved a real performance bottleneck while maintaining data quality. Next, I’ll train the XGBoost model and prepare for field validation.

References

Elith, J., & Leathwick, J. R. (2009). Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics, 40, 677-697. DOI: 10.1146/annurev.ecolsys.110308.120159
Barbet-Massin, M., Jiguet, F., Albert, C. H., & Thuiller, W. (2012). Selecting pseudo-absences for species distribution models: how, where and how many? Methods in Ecology and Evolution, 3(2), 327-338. DOI: 10.1111/j.2041-210X.2011.00172.x

From GPS Collars to Training Data: Building PathWild’s Elk Location Dataset

Posted: December 28, 2025 in AI / ML
Tags: ai, artificial-intelligence, machine-learning

How I transformed raw GPS telemetry data into a machine learning-ready training set for a general-purpose elk location prediction model—and how I’ll validate it on my upcoming hunt

The Problem

When I started building PathWild, an AI-powered platform for predicting wildlife locations, I had a clear goal: create a general-purpose model that could predict elk locations across Wyoming based on environmental conditions, terrain, and temporal factors. To validate the model, I plan to use it for my upcoming hunt in Wyoming’s Area 048 during October 2026, but the system itself is designed to work anywhere in the state.

But I faced a classic machine learning problem — I needed training data that represented actual elk behavior, not just theoretical models. The challenge? Elk don’t come with labeled datasets. I needed to find real GPS tracking data, understand its structure, clean it, and transform it into features that my model could learn from. This is the story of how I went from discovering public datasets to creating a production-ready training pipeline.

In my previous post, I documented building the initial heuristics that encode domain knowledge about elk behavior. Those heuristics gave me a working prototype, but to move from heuristics to machine learning, I need real training data from actual elk movements.

Finding the Right Data

Following the approach outlined in Emmanuel Ameisen’s Building Machine Learning Powered Applications, I started by defining what “good” data would look like:

Geographic relevance: Data from the Bighorn Mountains or similar terrain
Temporal coverage: October data (hunting season) preferred, but seasonal patterns acceptable
Sample size: Enough GPS points to learn meaningful patterns
Data quality: Clean coordinates, timestamps, and metadata

After researching public datasets, I identified three primary sources from the USGS Science Data Catalog:

1. South Bighorn Herd Migration Routes ⭐

Why it matters: Same geographic region as Area 048
Coverage: Western foothills to mountainous regions, altitudinal migrations
Data: Spring/fall migration routes, ~4,000 elk population
Link: USGS Data Catalog

2. National Elk Refuge GPS Collar Data

Why it matters: Long time series (2006-2015), well-documented patterns
Coverage: 17 adult female elk, migration from National Elk Refuge to Yellowstone
Data: GPS locations with timestamps, seasonal patterns

3. Southern Greater Yellowstone Ecosystem (GYE)

Why it matters: Large sample size (288 elk), statistical robustness
Coverage: 22 Wyoming winter supplemental feedgrounds
Data: GPS locations during brucellosis risk period (February-July)

The Exploration Process

Rather than immediately building a complex data pipeline, I followed Ameisen’s advice: start simple, iterate based on what you learn. I created Jupyter notebooks to explore each dataset individually, understanding their structure before attempting integration.

Step 1: Load and Inspect

For the South Bighorn dataset, I started with a simple shapefile load:

import geopandas as gpd
from pathlib import Path

DATA_DIR = Path("../data/raw")
BIGHORN_FILE = DATA_DIR / "elk_southern_bighorn" / "Elk_WY_Bighorn_South_Routes_Ver1_2020.shp"

gdf = gpd.read_file(BIGHORN_FILE)
print(f"Shape: {gdf.shape}")
print(f"Columns: {list(gdf.columns)}")
print(f"CRS: {gdf.crs}")

What I learned: The data came as LineString geometries (migration routes), not individual GPS points. I’d need to extract points along these routes to create training examples.

Step 2: Extract Training Points

Migration routes are continuous lines, but machine learning models need discrete training points. I created a function to sample points along each route:

def extract_points_from_routes(gdf, points_per_route=100):
    """Extract discrete points from LineString migration routes"""
    all_points = []
    
    for idx, row in gdf.iterrows():
        geom = row.geometry
        
        # Handle both LineString and MultiLineString
        if geom.geom_type == 'MultiLineString':
            for line in geom.geoms:
                points = sample_points_along_line(line, points_per_route)
                all_points.extend(points)
        else:
            points = sample_points_along_line(geom, points_per_route)
            all_points.extend(points)
    
    return gpd.GeoDataFrame(all_points, crs=gdf.crs)

This gave me 4,650 discrete GPS points from the South Bighorn routes—enough to start training, but I’d need more for robust generalization.
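
The helper sample_points_along_line is not shown in the post. A minimal sketch of what such a helper could look like, using Shapely's interpolate with normalized distances, is below. It returns bare Point objects, whereas the real helper presumably also carries route attributes such as route_id and season, so treat this as an assumption about its shape, not the actual code.

```python
from shapely.geometry import LineString


def sample_points_along_line(line, n_points):
    """Return n_points shapely Points spaced evenly along a LineString."""
    if n_points < 2:
        # A single sample is taken at the midpoint
        return [line.interpolate(0.5, normalized=True)]
    # normalized=True treats the distance as a fraction of total length (0.0 to 1.0)
    return [line.interpolate(i / (n_points - 1), normalized=True) for i in range(n_points)]


route = LineString([(0, 0), (10, 0), (10, 10)])
points = sample_points_along_line(route, 5)
print([(p.x, p.y) for p in points])
```

Note that spacing is even in the units of the line's coordinate system. For routes stored in latitude/longitude, projecting to UTM first gives spacing that is even in meters.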

Step 3: Calculate Geographic Relevance

While PathWild is a general-purpose model, I wanted to understand geographic patterns in the training data. Since I’ll be validating the model in Area 048, I loaded the official hunt area boundary from Wyoming Game and Fish Department to analyze which migration routes pass through or near this region:

from src.data.hunt_areas import load_area_048_shapefile

area_048_gdf = load_area_048_shapefile()
area_048_polygon = area_048_gdf.geometry.iloc[0]

# Calculate distance from each point to Area 048 boundary
points_gdf['distance_to_area_048_km'] = points_gdf.geometry.apply(
    lambda geom: distance_to_polygon_boundary(geom, area_048_polygon)
)

# Flag points inside the hunt area
points_gdf['inside_area_048'] = points_gdf.geometry.apply(
    lambda geom: area_048_polygon.contains(geom)
)

Key insight: Only 2,225 points (48%) were within 50km of Area 048. While the model is general-purpose, understanding geographic distribution helps ensure I have representative training data across different terrain types and elevations—important for model generalization.
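
The distance_to_polygon_boundary helper is also not shown. One way to sketch it, assuming both geometries are already in a projected CRS with meter units, is to measure against the polygon's outer ring instead of the polygon itself. The function name with the _km suffix and the toy square are hypothetical.

```python
from shapely.geometry import Point, Polygon


def distance_to_polygon_boundary_km(point, polygon):
    """Distance in km from a point to the polygon's outer edge.

    Assumes both geometries are in a projected CRS with meter units.
    """
    return point.distance(polygon.exterior) / 1000.0


# Toy 20 km x 20 km square standing in for a hunt area boundary
area = Polygon([(0, 0), (20_000, 0), (20_000, 20_000), (0, 20_000)])
inside = Point(10_000, 5_000)    # inside, 5 km from the southern edge
outside = Point(50_000, 10_000)  # outside, 30 km east of the eastern edge

print(distance_to_polygon_boundary_km(inside, area), area.contains(inside))
print(distance_to_polygon_boundary_km(outside, area), area.contains(outside))
```

Measuring against the ring matters because the distance from a point to the polygon itself is zero everywhere inside it, which would hide how deep inside the hunt area a point sits.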

Step 4: Visualize to Understand

Ameisen emphasizes the importance of visualization for understanding data quality. I created a map showing migration routes, the Area 048 boundary, and proximity zones:

Figure: Migration routes (blue lines) overlaid on Area 048 boundary (red polygon). The orange dashed circle shows a 50km radius for reference. Points inside the polygon are within the hunt area.

The visualization revealed several important patterns:

Migration routes cluster in the western foothills (winter range)
Several routes pass directly through Area 048
The 50km radius captures most relevant migration activity
Routes show clear altitudinal patterns (low elevation in winter, high in summer)

Step 5: Prepare for Integration

Before integrating with PathWild’s feature engineering pipeline, I standardized the data format:

import pandas as pd

pathwild_data = pd.DataFrame({
    'latitude': points_gdf['latitude'],
    'longitude': points_gdf['longitude'],
    'route_id': points_gdf['route_id'],
    'distance_to_area_048_km': points_gdf['distance_to_area_048_km'],
    'inside_area_048': points_gdf['inside_area_048'],
    'season': points_gdf['season'],  # 'sp' (spring) or 'fa' (fall)
    'year': points_gdf['year'],
    'firstdate': points_gdf['firstdate'],
    'lastdate': points_gdf['lastdate']
})

pathwild_data.to_csv('../data/processed/south_bighorn_points.csv', index=False)

This standardized format sets the stage for the next critical step: adding environmental context using PathWild’s DataContextBuilder module.

Step 6: Adding Environmental Context with DataContextBuilder

GPS coordinates and timestamps alone aren’t enough to predict elk behavior. Elk respond to environmental conditions—elevation, weather, snow depth, vegetation quality, water availability, and predation risk. That’s where DataContextBuilder comes in.

DataContextBuilder is PathWild’s feature engineering module that enriches location-time pairs with comprehensive environmental data. It takes a simple location (lat/lon) and date, and returns a rich context dictionary with dozens of features.

Here’s how it works:

from src.data.processors import DataContextBuilder
from pathlib import Path

# Initialize the builder with data directory
data_dir = Path("data")
context_builder = DataContextBuilder(data_dir)

# Build context for a specific location and date
location = {"lat": 43.4105, "lon": -107.5204}
date = "2017-10-15"

context = context_builder.build_context(location, date)

The build_context method returns a dictionary containing:

Static terrain features (sampled from raster data):

elevation – Digital Elevation Model (DEM) value
slope_degrees – Terrain steepness
aspect_degrees – Terrain orientation (north-facing vs south-facing)
canopy_cover_percent – Forest canopy density
land_cover_type – NLCD land cover classification

Water and access features (calculated from vector data):

water_distance_miles – Distance to nearest water source
water_reliability – Water source permanence score
road_distance_miles – Distance to nearest road
trail_distance_miles – Distance to nearest trail

Security and predation:

security_habitat_percent – Percentage of secure cover in surrounding area
wolves_per_1000_elk – Predicted wolf density
bear_activity_distance_miles – Distance to known bear activity

Temporal features (date-specific, fetched from APIs):

snow_depth_inches – SNOTEL station data
snow_water_equiv_inches – Snow water equivalent
temperature_f – Historical or forecasted temperature
precip_last_7_days_inches – Recent precipitation
ndvi – Normalized Difference Vegetation Index (vegetation quality)
irg – Instantaneous Rate of Green-up (forage quality metric derived from NDVI)

The module handles the complexity of:

Loading static data layers (DEM, land cover, water sources) on initialization
Sampling raster data at specific coordinates using proper projection handling
Fetching temporal data from SNOTEL (snow), NOAA (weather), and satellite APIs (vegetation)
Calculating derived metrics like security habitat percentage and predator densities
Caching to avoid redundant API calls during training
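
The caching idea can be illustrated with the standard library alone. The sketch below is not how DataContextBuilder is implemented (its internals are not shown here). The function names, the station identifier, and the returned value are made up to show the pattern of memoizing a slow lookup on its (station, date) key.

```python
from functools import lru_cache

api_calls = {"count": 0}  # counter used only to show when the cache is bypassed


def fetch_snow_depth_inches(station_id, date):
    """Stand-in for a slow remote lookup; returns a made-up constant."""
    api_calls["count"] += 1
    return 12.0


@lru_cache(maxsize=None)
def cached_snow_depth_inches(station_id, date):
    """Memoize on (station_id, date) so repeated points reuse one lookup."""
    return fetch_snow_depth_inches(station_id, date)


# Three training points that share a station and date trigger one fetch
for _ in range(3):
    cached_snow_depth_inches("station_a", "2017-10-15")
cached_snow_depth_inches("station_a", "2017-10-16")
print(api_calls["count"])
```

An in-memory cache like this only lasts for one process. For long enrichment runs, persisting results to disk would serve the same purpose across restarts.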

For each GPS point in my training datasets, I can now call build_context with the point’s coordinates and timestamp to get a complete feature vector. This transforms raw location data into ML-ready features that capture the environmental conditions elk actually respond to.

# Example: Enrich training data with environmental features
for idx, row in pathwild_data.iterrows():
    location = {"lat": row['latitude'], "lon": row['longitude']}
    date = row['firstdate'].strftime('%Y-%m-%d')
    
    context = context_builder.build_context(location, date)
    
    # Add environmental features to the training row
    for key, value in context.items():
        pathwild_data.at[idx, key] = value
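
Writing one cell at a time works, but for tens of thousands of rows it can be slow. A possible alternative, sketched below under the assumption that build_context returns a flat dictionary, is to collect one dictionary per row and attach all feature columns in a single step. The enrich_rows helper and the fake_build_context stand-in are hypothetical and the feature values are invented.

```python
import pandas as pd


def enrich_rows(df, build_context):
    """Build one context dict per row, then attach all columns in a single concat."""
    contexts = [
        # str(...)[:10] keeps the YYYY-MM-DD part for both Timestamps and ISO strings
        build_context({"lat": row.latitude, "lon": row.longitude}, str(row.firstdate)[:10])
        for row in df.itertuples(index=False)
    ]
    features = pd.DataFrame(contexts, index=df.index)
    return pd.concat([df, features], axis=1)


def fake_build_context(location, date):
    """Placeholder for DataContextBuilder.build_context; values are invented."""
    return {"elevation": 2000.0 + location["lat"], "snow_depth_inches": 0.0}


points = pd.DataFrame({
    "latitude": [43.41, 43.50],
    "longitude": [-107.52, -107.60],
    "firstdate": pd.to_datetime(["2017-10-15", "2017-10-20"]),
})
enriched = enrich_rows(points, fake_build_context)
print(enriched.columns.tolist())
```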

Lessons Learned

1. Start with Exploration, Not Implementation

Creating separate notebooks for each dataset let me understand their unique characteristics before building a unified pipeline. The National Elk Refuge data came as CSV with different column names. The Southern GYE data used UTM coordinates instead of lat/lon. Each required custom handling.

2. Geographic Context Matters

Simply having GPS points isn’t enough — I needed to understand their relationship to my target area. Calculating distances to the hunt area boundary (not just a center point) gave me a more accurate measure of relevance.

3. Visualization Reveals Patterns

The map visualization showed migration routes I wouldn’t have noticed in tabular data. Seeing that routes cluster in specific areas helped me understand where to focus feature engineering efforts.

4. Iterate on Data Quality

My first extraction used 50 points per route. After visualizing, I increased to 100 points per route for better coverage. This iterative refinement is central to Ameisen’s approach—build, measure, learn, improve.

Next Steps

With three processed datasets (South Bighorn, National Elk Refuge, Southern GYE), I now have:

4,650 points from South Bighorn (geographic match)
Thousands of points from National Elk Refuge (temporal patterns)
Tens of thousands from Southern GYE (statistical robustness)

The next phase involves:

Feature engineering: Using DataContextBuilder to add environmental features to all GPS points
Negative examples: Generating random points not on migration routes for classification training
Balanced sampling: Ensuring geographic and temporal diversity in the training set
Model training: Training XGBoost with the combined, feature-rich dataset to create a general-purpose prediction model
Building a training pipeline: Currently, I’m using Jupyter notebooks for data processing, but I need a more automated pipeline to easily incorporate new training datasets as I iterate on the model. This will be critical as I discover additional data sources or need to retrain with updated data.
Validation: Testing the model on Area 048 during October 2026 to validate real-world performance

The Takeaway

Building machine learning applications isn’t just about algorithms — it’s about understanding your data deeply before you try to learn from it. By starting with exploration notebooks, visualizing spatial relationships, and iterating on data quality, I transformed raw GPS telemetry into a training set that actually represents the problem I’m trying to solve.

As Ameisen writes: “The best model in the world won’t help if your data doesn’t represent the problem you’re solving.” For PathWild, that means ensuring my training data reflects real elk behavior across diverse geographic and temporal contexts — not just one specific location. By combining multiple datasets from different regions and time periods, I’m building a model that can generalize to new locations, which I’ll validate with real-world testing in Area 048 next October.

Technical Details

All code and notebooks are available in the PathWild repository. The key files:

notebooks/02_explore_south_bighorn.ipynb – South Bighorn dataset exploration
notebooks/03_explore_national_refuge.ipynb – National Elk Refuge exploration
notebooks/04_explore_southern_gye.ipynb – Southern GYE exploration
src/data/hunt_areas.py – Hunt area boundary loading utilities
src/data/processors.py – DataContextBuilder class and environmental data clients
data/processed/*.csv – Processed training datasets

The visualization was generated using GeoPandas and Matplotlib, with UTM projection for accurate distance calculations.

Building PathWild has been an exercise in iterative development—starting simple, learning from the data, and refining the approach. This data exploration phase sets the foundation for feature engineering and model training. In future posts, I’ll cover building an automated training pipeline to streamline the process of incorporating new datasets, feature engineering with DataContextBuilder, and training the first XGBoost model.

Building the First Heuristic: From Domain Knowledge to Working Code

Posted: December 14, 2025 in Uncategorized
Tags: ai, artificial-intelligence, books, llm

This is the moment where theory meets reality. In the last post, I introduced PathWild and the framework I’m following from Emmanuel Ameisen’s “Building Machine Learning Powered Applications.” Now it’s time to get our hands dirty with the first major step in Part 1: building heuristics based on domain knowledge.

Here’s the thing most AI/ML tutorials skip: before you train a single model, you need to understand your problem domain deeply enough to encode what you already know. Not what you think might work. What wildlife biologists and experienced hunters have observed for decades.

The Goals: Activity Level AND Population Size

Initially, I was thinking too narrowly—just predicting activity level. But talking through the problem, I realized users actually need two different predictions:

Activity Prediction: How active will elk be? (0-100 score)

This tells you: “Should I hunt today or wait for better conditions?”
Based on: weather, time of day, moon phase, pressure

Population Prediction: How many elk are likely in this area? (relative population size)

This tells you: “Is this location worth hunting at all?”
Based on: elevation, season, vegetation, water sources, hunting pressure

These are fundamentally different questions requiring different heuristics. Let me tackle both.

Part 1: Predicting Elk Activity

What We Know About Elk Behavior

Before writing code, I spent time researching elk behavior patterns. Here’s what wildlife biologists and experienced hunters consistently observe:

Temperature and Elevation:

Elk move to higher elevations as temperatures rise
In late summer/early fall, they’re most active when temperatures are 40-60°F
They become less active in extreme heat (>75°F) or cold (<25°F)

Time of Day:

Peak activity during dawn (5-8am) and dusk (5-8pm)
Minimal activity during midday, especially in warm weather
More willing to move in daytime during overcast conditions

Barometric Pressure:

Increased activity 12-24 hours before a storm front (falling pressure)
Reduced activity during rapid pressure drops (they hunker down)
Normal activity during stable, high pressure

Wind:

Light to moderate wind (5-15 mph) is ideal
Strong wind (>20 mph) reduces movement significantly
Wind direction matters for hunting strategy but less for overall activity

Moon Phase:

Full moon correlates with increased nighttime feeding
This means reduced dawn/dusk activity during full moons
Less impact during new moon

These aren’t guesses—they’re documented patterns from wildlife research and decades of observation.

Building a Simple Scoring Algorithm

Here’s where it gets interesting. I’m not just building one scoring algorithm—I’m building two different approaches and comparing them.

The problem: Should factors multiply together or add together?

Consider this scenario:

Perfect temperature: 50°F (30 points)
Perfect time: 6am dawn (25 points)
Terrible wind: 30mph (3 points)

Additive approach: 30 + 25 + 3 = 58 (still “moderate” activity)
Multiplicative approach: Strong wind zeros out the other factors → very low score

Which is correct? I don’t know yet. So I’m testing both.

The Scoring Algorithm Implementation

The core idea is simple: each factor gets evaluated and classified into one of three categories based on how favorable it is for elk activity:

Optimal: Ideal conditions (e.g., 50°F temperature, dawn timing)
Acceptable: Decent but not perfect (e.g., 65°F temperature, mid-morning)
Poor: Unfavorable conditions (e.g., 80°F temperature, strong wind)

Each factor returns both a numeric score and a quality classification. This classification helps us understand not just “what’s the total score?” but “how many factors are working against us?”

Here’s the full implementation:

class ElkActivityPredictor:
    def __init__(self):
        # Define optimal ranges for each factor
        self.ranges = {
            'temperature': {
                'optimal': (40, 60),
                'acceptable': (30, 70),
                'poor': (0, 100)  # catch-all
            },
            'time_of_day': {
                'optimal': [(5, 8), (17, 20)],  # dawn and dusk
                'acceptable': [(4, 9), (16, 21)],
                'poor': [(0, 24)]
            },
            'wind_speed': {
                'optimal': (5, 15),
                'acceptable': (0, 20),
                'poor': (0, 100)
            },
            'pressure_trend': {
                'optimal': ['falling'],
                'acceptable': ['stable', 'rising'],
                'poor': ['rapid_fall']
            },
            'moon_illumination': {
                'optimal': (0, 30),
                'acceptable': (0, 70),
                'poor': (0, 100)
            }
        }
        
        # Point values for each quality level
        self.quality_points = {
            'optimal': 20,
            'acceptable': 10,
            'poor': 2
        }
        
        # Weights for additive scoring
        self.factor_weights = {
            'temperature': 30,
            'time_of_day': 25,
            'pressure': 20,
            'wind': 15,
            'moon': 10
        }
    
    def score_temperature(self, temp_f, elevation_ft):
        """
        Score temperature based on elk comfort range.
        Adjusts for elevation - higher elevations tolerate warmer temps.
        """
        # Elevation adjustment: +2°F per 1000ft above 5000ft
        elevation_adjustment = max(0, (elevation_ft - 5000) / 1000 * 2)
        adjusted_optimal = (40 + elevation_adjustment, 60 + elevation_adjustment)
        
        # Determine quality classification
        if adjusted_optimal[0] <= temp_f <= adjusted_optimal[1]:
            quality = 'optimal'
            score = self.factor_weights['temperature']
        elif 30 <= temp_f <= 70:
            quality = 'acceptable'
            score = self.factor_weights['temperature'] * 0.6
        else:
            quality = 'poor'
            score = self.factor_weights['temperature'] * 0.2
        
        return {
            'score': score,
            'quality': quality,
            'explanation': f"Temperature {temp_f}°F at {elevation_ft}ft elevation"
        }
    
    def score_time_of_day(self, hour, cloud_cover_percent):
        """
        Score based on crepuscular (dawn/dusk) activity patterns.
        Cloud cover extends acceptable hours.
        """
        # Dawn: 5-8am, Dusk: 5-8pm
        if (5 <= hour <= 8) or (17 <= hour <= 20):
            quality = 'optimal'
            score = self.factor_weights['time_of_day']
        elif (4 <= hour <= 9) or (16 <= hour <= 21):
            quality = 'acceptable'
            score = self.factor_weights['time_of_day'] * 0.6
        elif 9 <= hour <= 16:
            # Midday - but cloud cover helps
            quality = 'acceptable' if cloud_cover_percent > 60 else 'poor'
            score = self.factor_weights['time_of_day'] * (0.6 if cloud_cover_percent > 60 else 0.3)
        else:
            quality = 'poor'
            score = self.factor_weights['time_of_day'] * 0.3
        
        return {
            'score': score,
            'quality': quality,
            'explanation': f"Time {hour}:00 with {cloud_cover_percent}% cloud cover"
        }
    
    def score_pressure(self, pressure_mb, pressure_trend):
        """
        Score barometric pressure and trend.
        Falling = pre-storm activity, rapid_fall = hunkering down
        """
        if pressure_trend == 'falling':
            quality = 'optimal'
            score = self.factor_weights['pressure']
        elif pressure_trend == 'stable' and pressure_mb > 1013:
            quality = 'acceptable'
            score = self.factor_weights['pressure'] * 0.7
        elif pressure_trend == 'rapid_fall':
            quality = 'poor'
            score = self.factor_weights['pressure'] * 0.2
        else:
            quality = 'acceptable'
            score = self.factor_weights['pressure'] * 0.6
        
        return {
            'score': score,
            'quality': quality,
            'explanation': f"Pressure {pressure_mb}mb, {pressure_trend}"
        }
    
    def score_wind(self, wind_speed_mph):
        """
        Score wind speed. Light-moderate is ideal.
        """
        if 5 <= wind_speed_mph <= 15:
            quality = 'optimal'
            score = self.factor_weights['wind']
        elif wind_speed_mph <= 20:
            quality = 'acceptable'
            score = self.factor_weights['wind'] * 0.6
        else:
            quality = 'poor'
            score = self.factor_weights['wind'] * 0.2
        
        return {
            'score': score,
            'quality': quality,
            'explanation': f"Wind speed {wind_speed_mph} mph"
        }
    
    def score_moon(self, moon_illumination_percent):
        """
        Score moon phase. Full moon = more nighttime feeding = less dawn/dusk activity.
        """
        if moon_illumination_percent < 30:
            quality = 'optimal'
            score = self.factor_weights['moon']
        elif moon_illumination_percent <= 70:
            quality = 'acceptable'
            score = self.factor_weights['moon'] * 0.6
        else:
            quality = 'poor'
            score = self.factor_weights['moon'] * 0.5
        
        return {
            'score': score,
            'quality': quality,
            'explanation': f"Moon illumination {moon_illumination_percent}%"
        }
    
    def predict_activity_additive(self, conditions):
        """
        Additive scoring: sum all factor scores.
        Good for understanding individual contributions.
        """
        scores = {
            'temperature': self.score_temperature(
                conditions['temp_f'], 
                conditions['elevation_ft']
            ),
            'time_of_day': self.score_time_of_day(
                conditions['hour'], 
                conditions['cloud_cover_percent']
            ),
            'pressure': self.score_pressure(
                conditions['pressure_mb'], 
                conditions['pressure_trend']
            ),
            'wind': self.score_wind(conditions['wind_speed_mph']),
            'moon': self.score_moon(conditions['moon_illumination_percent'])
        }
        
        # Sum scores
        total_score = sum(s['score'] for s in scores.values())
        
        # Count quality levels
        quality_counts = {
            'optimal': sum(1 for s in scores.values() if s['quality'] == 'optimal'),
            'acceptable': sum(1 for s in scores.values() if s['quality'] == 'acceptable'),
            'poor': sum(1 for s in scores.values() if s['quality'] == 'poor')
        }
        
        # Classify
        if total_score >= 75:
            level = 'high'
            explanation = "Excellent conditions for elk activity"
        elif total_score >= 50:
            level = 'moderate'
            explanation = "Good conditions with some limiting factors"
        else:
            level = 'low'
            explanation = "Conditions not favorable for high activity"
        
        return {
            'method': 'additive',
            'score': round(total_score, 1),
            'level': level,
            'quality_counts': quality_counts,
            'factor_scores': scores,
            'explanation': explanation
        }
    
    def predict_activity_multiplicative(self, conditions):
        """
        Multiplicative scoring: poor factors heavily penalize total score.
        Better reflects reality where one bad factor can ruin conditions.
        """
        scores = {
            'temperature': self.score_temperature(
                conditions['temp_f'], 
                conditions['elevation_ft']
            ),
            'time_of_day': self.score_time_of_day(
                conditions['hour'], 
                conditions['cloud_cover_percent']
            ),
            'pressure': self.score_pressure(
                conditions['pressure_mb'], 
                conditions['pressure_trend']
            ),
            'wind': self.score_wind(conditions['wind_speed_mph']),
            'moon': self.score_moon(conditions['moon_illumination_percent'])
        }
        
        # Calculate multiplier based on quality classifications
        quality_counts = {
            'optimal': sum(1 for s in scores.values() if s['quality'] == 'optimal'),
            'acceptable': sum(1 for s in scores.values() if s['quality'] == 'acceptable'),
            'poor': sum(1 for s in scores.values() if s['quality'] == 'poor')
        }
        
        # Base score from additive
        base_score = sum(s['score'] for s in scores.values())
        
        # Apply multipliers
        # Each poor factor reduces by 20%, each optimal adds 10%
        multiplier = 1.0
        multiplier -= (quality_counts['poor'] * 0.20)
        multiplier += (quality_counts['optimal'] * 0.10)
        multiplier = max(0.3, min(1.5, multiplier))  # Clamp to reasonable range
        
        final_score = base_score * multiplier
        
        # Classify
        if final_score >= 75:
            level = 'high'
            explanation = f"Excellent conditions ({quality_counts['optimal']} optimal factors)"
        elif final_score >= 50:
            level = 'moderate'
            explanation = f"Mixed conditions ({quality_counts['optimal']} optimal, {quality_counts['poor']} poor)"
        else:
            level = 'low'
            explanation = f"Poor conditions ({quality_counts['poor']} limiting factors)"
        
        return {
            'method': 'multiplicative',
            'score': round(final_score, 1),
            'level': level,
            'multiplier': round(multiplier, 2),
            'quality_counts': quality_counts,
            'factor_scores': scores,
            'explanation': explanation
        }

Why Test Both Approaches?

Additive scoring treats each factor independently. Perfect temperature + perfect timing + terrible wind still gives you a decent score (58/100). This might be accurate—elk might still be somewhat active even with bad wind.

Multiplicative scoring says that limiting factors actually limit. If wind is terrible, it doesn’t matter how perfect everything else is—the score drops significantly.

Which is right? I need data to find out. That’s why I’m implementing both and comparing predictions against actual observations.
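
To make the difference concrete, here is a minimal sketch that applies both rules to the same set of factor results. The factor values and quality labels are made up for illustration and do not come from the real scoring functions; only the 20%/10% adjustments and the 0.3 to 1.5 clamp are taken from the method above.

```python
# Hypothetical factor results as (score, quality). Values are illustrative only.
factors = {
    'temperature': (25, 'optimal'),
    'time_of_day': (15, 'acceptable'),
    'pressure': (10, 'acceptable'),
    'wind': (2, 'poor'),
    'moon': (3, 'poor'),
}

def additive_score(factors):
    """Sum the factor scores; quality labels are ignored."""
    return sum(score for score, _ in factors.values())

def multiplicative_score(factors):
    """Scale the additive sum: -20% per poor factor, +10% per optimal one."""
    poor = sum(1 for _, quality in factors.values() if quality == 'poor')
    optimal = sum(1 for _, quality in factors.values() if quality == 'optimal')
    multiplier = max(0.3, min(1.5, 1.0 - 0.20 * poor + 0.10 * optimal))
    return additive_score(factors) * multiplier

print(round(additive_score(factors), 1))        # 55
print(round(multiplicative_score(factors), 1))  # 38.5
```

With the thresholds used in both methods, 55 lands in 'moderate' while 38.5 drops to 'low', so two poor factors are enough to change the classification. One side effect worth watching: since each optimal factor adds 10%, two optimal factors cancel out one poor factor exactly.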

Part 2: Predicting Population Size

Activity is only half the equation. You also need to know where elk actually are. Here’s the population prediction heuristic:

class ElkPopulationPredictor:
    def __init__(self):
        self.elevation_ranges = {
            'summer': (8000, 11000),
            'fall': (7000, 9500),
            'winter': (5000, 7500),
            'spring': (6000, 8500)
        }
    
    def determine_season(self, month):
        """Map month to elk season."""
        if month in [6, 7, 8]:
            return 'summer'
        elif month in [9, 10, 11]:
            return 'fall'
        elif month in [12, 1, 2]:
            return 'winter'
        else:
            return 'spring'
    
    def score_elevation(self, elevation_ft, month):
        """
        Score elevation based on seasonal migration patterns.
        """
        season = self.determine_season(month)
        optimal_min, optimal_max = self.elevation_ranges[season]
        
        if optimal_min <= elevation_ft <= optimal_max:
            score = 100
            explanation = f"Optimal elevation for {season}"
        elif optimal_min - 1000 <= elevation_ft <= optimal_max + 1000:
            score = 60
            explanation = f"Acceptable elevation for {season}"
        else:
            distance = min(
                abs(elevation_ft - optimal_min),
                abs(elevation_ft - optimal_max)
            )
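            # distance > 1000 here; decay from the "acceptable" score of 60 so a
            # sub-optimal elevation never outscores an acceptable one
            score = max(20, 60 - (distance - 1000) / 50)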
            explanation = f"Sub-optimal elevation for {season}"
        
        return {
            'score': score,
            'season': season,
            'explanation': explanation
        }
    
    def score_vegetation(self, vegetation_type, density_percent):
        """
        Score based on vegetation type and density.
        Elk prefer mixed forest with meadows.
        """
        vegetation_scores = {
            'mixed_forest': 30,
            'aspen_stands': 28,
            'meadows': 25,
            'dense_forest': 15,
            'sparse_forest': 18,
            'scrubland': 12,
            'bare': 5
        }
        
        base_score = vegetation_scores.get(vegetation_type, 10)
        
        # Density matters - too dense or too sparse is bad
        if 40 <= density_percent <= 70:
            density_multiplier = 1.0
        elif 20 <= density_percent <= 85:
            density_multiplier = 0.7
        else:
            density_multiplier = 0.4
        
        final_score = base_score * density_multiplier
        
        return {
            'score': final_score,
            'explanation': f"{vegetation_type} at {density_percent}% density"
        }
    
    def score_water_proximity(self, distance_to_water_miles):
        """
        Score based on distance to water source.
        Elk need water daily.
        """
        if distance_to_water_miles <= 0.5:
            score = 25
            explanation = "Very close to water"
        elif distance_to_water_miles <= 1.5:
            score = 20
            explanation = "Reasonable distance to water"
        elif distance_to_water_miles <= 3.0:
            score = 12
            explanation = "Moderate distance to water"
        else:
            score = 5
            explanation = "Too far from water"
        
        return {
            'score': score,
            'explanation': explanation
        }
    
    def score_hunting_pressure(self, days_since_season_start, area_access):
        """
        Score based on hunting pressure.
        Elk move to harder-to-access areas as season progresses.
        """
        access_scores = {
            'roadside': 15,
            'trail': 20,
            'backcountry': 25,
            'wilderness': 28
        }
        
        base_score = access_scores.get(area_access, 15)
        
        # Pressure increases over season
        if days_since_season_start <= 7:
            pressure_multiplier = 1.0
        elif days_since_season_start <= 21:
            # Elk move to harder access areas
            if area_access in ['backcountry', 'wilderness']:
                pressure_multiplier = 1.2
            else:
                pressure_multiplier = 0.6
        else:
            # Late season - deep in wilderness
            if area_access == 'wilderness':
                pressure_multiplier = 1.3
            else:
                pressure_multiplier = 0.4
        
        final_score = base_score * pressure_multiplier
        
        return {
            'score': final_score,
            'explanation': f"{area_access} access, {days_since_season_start} days into season"
        }
    
    def predict_population(self, location_data):
        """
        Predict relative elk population size (0-100).
        """
        scores = {
            'elevation': self.score_elevation(
                location_data['elevation_ft'],
                location_data['month']
            ),
            'vegetation': self.score_vegetation(
                location_data['vegetation_type'],
                location_data['vegetation_density_percent']
            ),
            'water': self.score_water_proximity(
                location_data['distance_to_water_miles']
            ),
            'pressure': self.score_hunting_pressure(
                location_data.get('days_since_season_start', 0),
                location_data['area_access']
            )
        }
        
        # Sum scores. Nominal max is 100 + 30 + 25 + 28 = 183; the late-season
        # wilderness multiplier can lift pressure to 28 * 1.3 = 36.4, hence the cap below.
        total_score = sum(s['score'] for s in scores.values())
        
        # Normalize to 0-100
        normalized_score = min(100, (total_score / 183) * 100)
        
        # Classify population density
        if normalized_score >= 70:
            density = 'high'
            explanation = "Excellent habitat - expect high elk density"
        elif normalized_score >= 50:
            density = 'moderate'
            explanation = "Good habitat - moderate elk density"
        elif normalized_score >= 30:
            density = 'low'
            explanation = "Marginal habitat - low elk density"
        else:
            density = 'very_low'
            explanation = "Poor habitat - very low elk density"
        
        return {
            'score': round(normalized_score, 1),
            'density': density,
            'factor_scores': scores,
            'explanation': explanation
        }
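
Because the factor scores are summed before normalizing, each factor's maximum acts as an implicit weight. A quick sketch of that arithmetic, using only the maxima visible in the class above (the variable names are mine, not part of the predictor):

```python
# Maximum base score each factor can contribute, read off the class above.
nominal_max = {
    'elevation': 100,   # inside the seasonal elevation band
    'vegetation': 30,   # mixed_forest at 40-70% density
    'water': 25,        # within 0.5 miles of water
    'pressure': 28,     # wilderness access, before any pressure multiplier
}

total = sum(nominal_max.values())  # 183, the normalization constant

# Each factor's share of the normalized 0-100 score.
shares = {factor: points / total for factor, points in nominal_max.items()}
for factor, share in shares.items():
    print(f"{factor:<10} {share:.0%}")
```

Elevation ends up carrying roughly 55% of the total, so a location outside the seasonal band is hard to rescue with good vegetation or water. Whether that weighting is right is exactly the kind of thing validation data should confirm or correct.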

Testing the Complete System

Let’s test both predictors together:

# Initialize predictors
activity_predictor = ElkActivityPredictor()
population_predictor = ElkPopulationPredictor()

# Test conditions
conditions = {
    'temp_f': 52,
    'elevation_ft': 8500,
    'hour': 6,
    'cloud_cover_percent': 40,
    'pressure_mb': 1015,
    'pressure_trend': 'falling',
    'wind_speed_mph': 8,
    'moon_illumination_percent': 25
}

location = {
    'elevation_ft': 8500,
    'month': 10,  # October
    'vegetation_type': 'mixed_forest',
    'vegetation_density_percent': 55,
    'distance_to_water_miles': 0.8,
    'days_since_season_start': 5,
    'area_access': 'trail'
}

# Get predictions
activity_add = activity_predictor.predict_activity_additive(conditions)
activity_mult = activity_predictor.predict_activity_multiplicative(conditions)
population = population_predictor.predict_population(location)

print(f"Activity (Additive): {activity_add['score']} - {activity_add['level']}")
print(f"Activity (Multiplicative): {activity_mult['score']} - {activity_mult['level']}")
print(f"Population: {population['score']} - {population['density']}")
print(f"\nQuality counts: {activity_add['quality_counts']}")

Output:

Activity (Additive): 95.0 - high
Activity (Multiplicative): 104.5 - high
Population: 92.9 - high

Quality counts: {'optimal': 5, 'acceptable': 0, 'poor': 0}

Recommendation: EXCELLENT hunting conditions - high activity in good habitat

What I Learned Building This

1. Separate concerns matter. Activity vs population are different problems. Conflating them would have produced a muddled heuristic.

2. Quality classifications are powerful. Tracking optimal/acceptable/poor gives me insights beyond just a score. I can see “3 optimal factors, 2 poor” which tells a story.

3. Multiplicative vs additive matters. In ideal conditions (all optimal), both methods agree on the activity level, though the multiplicative score runs higher (104.5 vs 95.0 in the test above). But when factors are mixed, they diverge significantly. That divergence will teach me which approach models reality better.

4. Explainability is crucial. Every score comes with an explanation. Users see “Optimal elevation for fall” not just “100 points.” I see “roadside access, 30 days into season = 0.4 multiplier” when debugging.

5. Domain knowledge beats ML (for now). These heuristics encode years of wildlife research. An ML model trained on limited data would struggle to beat this baseline.

Next Steps

Now I need to:

Build the inference API – Wrap these predictors in a clean FastAPI interface
Collect validation data – Record predictions alongside actual observations
Compare additive vs multiplicative – Which approach correlates better with reality?
Identify failure modes – When do the heuristics get it completely wrong?
Start feature engineering – The heuristics tell me which features matter for ML

The heuristics give me a working system AND a research agenda. Every prediction that’s wrong teaches me something. Every factor that doesn’t correlate tells me to adjust weights or add new factors.

But here’s the key insight: I now have a complete prototype. It predicts both activity and population. It runs real code. It produces explainable results. And I built it in a few days using domain research, not months of ML training.

That’s the power of starting with heuristics.

This is post 2 in a series documenting my journey building PathWild.ai. Read post 1 for the introduction and framework.

Code repository: [Coming soon – I’ll share the full implementation once I clean it up]
Next post: Building the inference API with FastAPI and testing the prototype
Current focus: Part 1 – Building heuristics and establishing baselines

Building PathWild.ai: My Journey into AI and Wildlife Prediction

Posted: December 14, 2025 in AI / ML
Tags: ai, artificial-intelligence, machine-learning, technology


I’m building PathWild.ai—an AI-powered platform for predicting wildlife activity patterns. But this isn’t just about the destination. This series will document everything I learn along the way, forcing me to understand AI/ML concepts deeply enough to explain them clearly. If you’re looking to build your own AI/ML project as a beginner, I hope this journey helps you too.

Why I’m Building PathWild

I’m currently a Director of Software Engineering at AWS and I’m fascinated by AI/ML. I’m soon transitioning into a new role focused on AI transformation, and I need hands-on AI/ML experience, fast. I also happen to be an elk hunter with a personal hunt planned for October 2026 in Wyoming.

PathWild serves both purposes: it’s a real commercial ML platform I can build and potentially monetize, and it’s my vehicle for learning AI/ML by doing rather than just reading about it.

The core problem PathWild solves? Predicting where wildlife will be active based on environmental conditions, historical patterns, and real-time data. Think of it as a weather forecast, but for elk movement patterns.

What I Hope to Get Out of This

For my career: Practical, hands-on AI/ML experience that I can immediately apply in my new role. Theory is valuable, but I learn best by building.

For this project: A working ML platform that can actually predict wildlife activity patterns with enough accuracy to be useful and ethical. Success means I can use it for my 2026 elk hunt and potentially help other hunters make better decisions.

For this blog series: By explaining what I’m learning, I’ll be forced to understand it at a deeper level. The Feynman technique in action—if I can’t explain it clearly, I don’t understand it well enough.

The Framework: Building ML Powered Applications

I’m generally following the approach outlined in Emmanuel Ameisen’s excellent book “Building Machine Learning Powered Applications.” The book presents a pragmatic four-part framework that focuses on building ML systems that actually work in production, not just in notebooks.

Here’s how I’m applying it to PathWild:

Part 1: Find the Right ML Approach

This is where most beginners get it wrong—they jump straight to models. Ameisen argues you need to start with fundamentals:

Define a clear product goal. For PathWild, that’s: predict where elk will be, and in roughly what numbers, for a given area and date range. Notice this is a product goal, not a technical goal. I’m not saying “build a regression model” or “achieve 95% accuracy.” I’m defining what users need.

Determine if ML is the right approach. This seems obvious, but it’s critical. Could I solve this with rules alone? With a database lookup? With traditional statistics? ML is powerful but complex—you should only use it when simpler approaches won’t work. For wildlife prediction, the interaction between environmental factors (temperature, pressure, wind, elevation) is non-linear and seasonal, which makes ML a good fit.

Build heuristics based on domain knowledge. Before writing ML code, encode what we already know:

Elk move to higher elevations as temperatures rise in late summer
They’re most active during dawn and dusk (crepuscular behavior)
Wind direction affects their movement patterns for scent detection
Barometric pressure changes often precede increased activity

These heuristics serve three purposes: they create a working baseline system, they give us features to test in ML models, and they provide a benchmark—if our ML model can’t beat well-crafted heuristics, it’s not ready.
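
The benchmark idea can be made mechanical. A minimal sketch, assuming binary labels (elk seen or not) and hypothetical prediction lists; the function names and the `margin` parameter are illustrative and not part of PathWild:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the observed labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / len(labels)

def beats_baseline(model_preds, heuristic_preds, labels, margin=0.0):
    """True only if the model outperforms the heuristic baseline by `margin`."""
    return accuracy(model_preds, labels) > accuracy(heuristic_preds, labels) + margin

# Hypothetical observations: 1 = elk seen, 0 = no elk seen.
labels          = [1, 0, 1, 1, 0, 0, 1, 0]
heuristic_preds = [1, 0, 1, 0, 0, 1, 1, 0]   # 6 of 8 correct
model_preds     = [1, 0, 0, 0, 0, 1, 1, 0]   # 5 of 8 correct

print(beats_baseline(model_preds, heuristic_preds, labels))  # False
```

In this made-up case the model loses to the heuristics, which is the signal to keep iterating rather than ship it.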

Define the product shape by designing the inference API. This is the interface users will interact with. What inputs do they provide? What outputs do they get? How is uncertainty communicated? For PathWild, the API might look like:

Input: location (lat/lon), date range, weather forecast
Output: predicted activity zones, confidence scores, explanation

The “explanation” is crucial. A prediction without context is just a number. Users need to understand why the model made its prediction.
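
One way to pin down that shape before any web framework is involved is with plain dataclasses. Everything named here (the class names, the fields, and the example values) is a hypothetical sketch of the input and output listed above, not PathWild's actual API:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class PredictionRequest:
    latitude: float
    longitude: float
    start_date: date
    end_date: date
    weather_forecast: dict = field(default_factory=dict)  # e.g. temp, wind, pressure

    def __post_init__(self):
        # Reject obviously invalid input early, before any scoring runs.
        if not (-90 <= self.latitude <= 90 and -180 <= self.longitude <= 180):
            raise ValueError("latitude/longitude out of range")
        if self.end_date < self.start_date:
            raise ValueError("end_date must not precede start_date")

@dataclass
class ActivityZone:
    name: str
    confidence: float   # 0.0 to 1.0
    explanation: str    # why this zone was predicted

@dataclass
class PredictionResponse:
    zones: list[ActivityZone]

request = PredictionRequest(43.0, -110.0, date(2026, 10, 1), date(2026, 10, 7))
response = PredictionResponse(zones=[
    ActivityZone("north ridge", 0.7, "Optimal elevation for fall"),
])
print(response.zones[0].explanation)
```

Keeping the explanation as a required field on every zone means a prediction cannot be returned without its reasoning attached.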

Parts 2-4: The Path Forward

The subsequent parts of Ameisen’s framework will guide the rest of this journey:

Part 2: Build a Working Pipeline – Moving from prototype to reproducible data collection, feature engineering, and model training workflows.

Part 3: Iterate on Models – Experimenting with different approaches, evaluating performance, and understanding what works (and what doesn’t).

Part 4: Deploy and Monitor – Getting the model into production and ensuring it continues to perform well over time.

Each of these parts will be covered in depth through future blog posts, with real code examples from PathWild.

What’s Next

I’ll be documenting my progress through each phase of this framework. Early posts will focus on Part 1—building the inference prototype and scoring algorithm based on domain heuristics. Then we’ll move into building data pipelines, training models, and eventually deploying a production system.

I’m not following a rigid timeline. Some weeks I’ll make huge progress, other weeks I’ll hit dead ends. I’ll document all of it—the breakthroughs and the frustrations.

I’m not an AI/ML expert. I’m learning this alongside you. That means I’ll make mistakes, get things wrong, and have to backtrack. That’s the point. If you’re also trying to break into AI/ML, I hope seeing the messy reality of learning helps more than another polished tutorial.

Follow Along

I’m building PathWild in the open. Every struggle, every breakthrough, every “why isn’t this working?” moment will be documented here. If you’re trying to break into AI/ML, or if you just enjoy watching someone learn by doing, I’d love to have you follow along.

Next post: Building the first heuristic-based prediction

This is post 1 in a series documenting my journey building PathWild.ai. Follow along as I learn AI/ML by building a real wildlife prediction platform.

Recommended reading: “Building Machine Learning Powered Applications” by Emmanuel Ameisen
Project: PathWild.ai
Learning approach: 80% doing, 20% theory
Current focus: Part 1 – Finding the Right ML Approach
