
In my previous post, I documented how I transformed raw GPS telemetry data from three elk tracking studies into structured training datasets. I ended with 4,650 points from South Bighorn, 94,591 from Southern GYE, and 104,913 from National Elk Refuge—all representing locations where elk were actually present. But for a binary classification model, presence data alone isn’t enough. I needed absence data: locations where elk were NOT present.
This post details how I built a sophisticated absence generation system that creates high-quality negative examples using multiple complementary strategies, implemented parallel processing to handle large datasets, and validated the approach across all three datasets. The result? Three perfectly balanced training datasets totaling over 400,000 samples, ready for XGBoost training.
The Problem: Presence-Only Data
When I finished processing the GPS collar data, I had three CSV files full of presence points—locations where elk were definitively observed. But machine learning models need both positive and negative examples to learn what distinguishes elk habitat from non-habitat.
The challenge: Elk don’t come with labeled absence data. I can’t know for certain where elk were NOT present at any given time. I needed to generate plausible absence points that would help the model learn meaningful patterns.
This is a classic problem in species distribution modeling. Simply generating random points across Wyoming wouldn’t work: that would include lakes, urban areas, and other obviously unsuitable locations. I needed a more sophisticated approach that would create high-quality negative examples.
The Strategy: Four Complementary Approaches
After researching species distribution modeling literature (particularly Elith & Leathwick 2009 and Barbet-Massin et al. 2012), I designed a multi-strategy approach that combines different types of absence data. These papers emphasize that pseudo-absence selection is one of the most critical factors affecting model performance, and that no single strategy works best for all situations.
As Barbet-Massin et al. (2012) note: “The selection of pseudo-absences is a critical step in species distribution modeling, and the method used can significantly influence model predictions.” They recommend generating large numbers of pseudo-absences (10,000+ or at least 1,000 across multiple datasets) and using multiple sampling strategies to capture different aspects of the species-environment relationship.
Elith & Leathwick (2009) further emphasize that background points should represent the “available habitat” from which species select, not just random geographic space. This informed my approach of combining environmentally-constrained pseudo-absences with random background sampling.
Strategy 1: Environmental Pseudo-Absences (40%)
Concept: Sample from environmentally suitable but unused habitat.
These represent locations that are physically suitable for elk (elevation 6,000-13,500 ft, moderate slopes, water nearby) but where elk chose not to be. This helps the model learn subtle preferences beyond basic habitat requirements. Elk use high alpine areas up to 13,500+ ft in summer, so the suitable range extends well above 12,000 ft.
Criteria:
- ≥2km from any presence point (spatial separation)
- Elevation: 6,000-13,500 ft (suitable range; elk use high alpine areas in summer)
- Slope: <45° (not too steep)
- Water distance: <5 miles (accessible water)
- Within Wyoming study area
Pros:
- Most informative: Represents “available but unused” habitat, teaching the model subtle behavioral preferences
- High signal-to-noise: Clear distinction from presence points while maintaining environmental similarity
- Literature-supported: Aligns with Barbet-Massin et al.’s recommendation for environmentally-constrained pseudo-absences
- Model learning: Helps model distinguish between suitable habitat that elk use vs. suitable habitat they avoid
Cons:
- Computationally expensive: Requires checking multiple environmental constraints (elevation, slope, water) for each candidate
- May be incomplete: With dense presence data, finding enough suitable-but-unused locations can be challenging
- Requires environmental data: Needs DEM, slope, and water source data for best results (though defaults work)
- Spatial separation requirement: The 2km minimum distance can be difficult to satisfy with very dense presence data
Literature Alignment: This strategy aligns with Barbet-Massin et al.’s (2012) finding that environmentally-constrained pseudo-absences often outperform pure random sampling. They note that “pseudo-absences should be selected from areas environmentally similar to presences but where the species was not observed”—exactly what this strategy does. Elith & Leathwick (2009) also emphasize that background points should represent available habitat, not just geographic space.
Why 40%? This is the largest component because it represents the most informative type of absence—places elk could be but aren’t, suggesting behavioral preferences the model should learn. Barbet-Massin et al. found that environmentally-constrained pseudo-absences generally produce better model performance than random background points.
Strategy 2: Unsuitable Habitat Absences (30%)
Concept: Sample from areas elk physically cannot or will not inhabit.
These are high-confidence absences because elk simply can’t survive in these conditions. This helps the model learn hard boundaries and extreme conditions.
Criteria:
- Elevation <4,000 ft OR >14,000 ft (very low or extreme high elevations)
- Slope >60° (too steep)
- Urban areas, water bodies, barren land (NLCD codes: 11-12, 21-24, 31)
- Water distance >10 miles (too remote)
Note: Elk use elevations up to 13,500+ ft in summer, utilizing high alpine meadows and slopes for food and cooler temperatures. They drop lower in winter or when pressured by hunters. Only very extreme elevations (>14,000 ft) are considered unsuitable.
Pros:
- High confidence: These are true absences—elk physically cannot be in these conditions (very low elevations or extreme high elevations above 14,000 ft)
- Clear boundaries: Helps model learn hard limits (e.g., elk don’t use very low elevations or extreme alpine zones)
- Easier to generate: Fewer constraints mean faster generation, especially with parallel processing
- Reduces false positives: By explicitly including unsuitable habitat, we reduce the chance of the model predicting presence in impossible locations
Cons:
- Less informative: Model learns obvious boundaries rather than subtle preferences
- May oversimplify: Real habitat suitability is rarely binary (suitable/unsuitable)
- Requires land cover data: Best results need NLCD data to identify urban/water/barren areas
- Potential bias: If unsuitable habitat is overrepresented, model may be too conservative
Literature Alignment: While not explicitly recommended in the core papers, this strategy addresses a key concern raised by Elith & Leathwick (2009): ensuring that background points represent available habitat. By explicitly including unsuitable habitat as absences, we help the model learn what habitat is truly unavailable, not just unused. This is particularly important for mobile species like elk that can access most of the landscape.
Why 30%? These provide clear negative examples that help the model establish boundaries. They’re easier to generate (fewer constraints) but less informative than pseudo-absences. The 30% balance ensures the model learns hard limits without overemphasizing obvious absences.
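As a rough illustration of the criteria above, here is a minimal sketch of an unsuitability check. The function name, its arguments, and the decision to treat the criteria as alternatives (any single one is enough) are my assumptions for this example, not necessarily how the PathWild generator combines them.

```python
# NLCD classes listed in the criteria: 11-12 (water/ice), 21-24 (developed), 31 (barren).
UNSUITABLE_NLCD_CODES = {11, 12, 21, 22, 23, 24, 31}


def is_unsuitable_habitat(elevation_ft, slope_deg, nlcd_code, water_dist_mi):
    """Return True if ANY unsuitability criterion is met (OR logic is an assumption)."""
    if elevation_ft < 4000 or elevation_ft > 14000:
        return True  # very low or extreme high elevation
    if slope_deg > 60:
        return True  # too steep
    if nlcd_code in UNSUITABLE_NLCD_CODES:
        return True  # urban, water, or barren land cover
    if water_dist_mi > 10:
        return True  # too far from water
    return False


# Mid-elevation evergreen forest (NLCD 42) near water: not flagged as unsuitable.
print(is_unsuitable_habitat(8500, 20, 42, 1.0))
# Same terrain, but developed land (NLCD 23): flagged as unsuitable.
print(is_unsuitable_habitat(8500, 20, 23, 1.0))
```

A candidate point that passes this check would be kept as a Strategy 2 absence; everything else would be rejected and resampled.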
Strategy 3: Random Background Points (20%)
Concept: Pure random sampling of available habitat.
This represents “available habitat” vs “used habitat” (presence points). It’s the simplest approach but provides important baseline information.
Criteria:
- ≥500m from presence points (minimal separation)
- Within study area
- No other filters
Pros:
- Simple and fast: Minimal constraints mean rapid generation
- Geographic diversity: Samples the full range of available habitat
- Literature standard: Barbet-Massin et al. (2012) recommend random sampling as a baseline method
- Robust baseline: Provides a control against which other strategies can be compared
- No data requirements: Works without environmental data files
Cons:
- Less informative: Doesn’t distinguish between suitable and unsuitable habitat
- May include unsuitable areas: Random sampling can include locations elk can’t access
- Lower signal-to-noise: Less clear distinction from presence points compared to constrained methods
- Potential bias: If study area includes unsuitable habitat, random sampling will overrepresent it
Literature Alignment: This is the most commonly recommended approach in the literature. Barbet-Massin et al. (2012) found that “random sampling within the study area, excluding known presence points” is a reliable baseline method. They recommend generating large numbers (10,000+ or at least 1,000 across multiple datasets) of random pseudo-absences. Elith & Leathwick (2009) also emphasize that background points should represent the “available habitat” from which species make selections—random sampling within the study area achieves this.
Why 20%? Provides geographic diversity and helps the model understand the full range of available habitat, not just extremes. While less informative than constrained methods, it serves as an important baseline and ensures geographic coverage. Barbet-Massin et al. note that random sampling often performs well, especially when combined with other strategies.
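The random background strategy boils down to rejection sampling: draw a point, keep it if it is far enough from every presence point. The sketch below shows the idea in plain Python. It assumes coordinates are already in a projected CRS with meter units, and it uses a rectangle as a stand-in for the study area; both the function name and those simplifications are hypothetical.

```python
import math
import random


def sample_background_points(presence_xy, bounds, n_samples, min_dist_m=500.0,
                             max_attempts=10_000, seed=42):
    """Rejection-sample points in `bounds` at least `min_dist_m` from every presence point.

    presence_xy: list of (x, y) tuples in meters (e.g. UTM coordinates).
    bounds: (min_x, min_y, max_x, max_y) rectangle standing in for the study area.
    """
    rng = random.Random(seed)
    min_x, min_y, max_x, max_y = bounds
    accepted = []
    attempts = 0
    while len(accepted) < n_samples and attempts < max_attempts:
        attempts += 1
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        # Keep the candidate only if it clears every presence point.
        if all(math.hypot(x - px, y - py) >= min_dist_m for px, py in presence_xy):
            accepted.append((x, y))
    return accepted


presence = [(1_000.0, 1_000.0), (4_000.0, 2_500.0)]
background = sample_background_points(presence, (0.0, 0.0, 10_000.0, 10_000.0), n_samples=50)
print(len(background))
```

Because the only filter is distance, almost every candidate is accepted when presence points are sparse, which is why this strategy is the cheapest of the four.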
Strategy 4: Temporal Absences (10%)
Concept: Same locations as presence points, but different time periods.
This is particularly powerful for datasets with timestamps. If an elk was at a location in summer, that same location during winter represents an absence (elk migrate seasonally). This helps the model learn temporal patterns.
Criteria:
- Same coordinates as presence points
- Different season (summer presence → winter absence, etc.)
Pros:
- Temporal learning: Explicitly teaches the model that habitat suitability varies by season
- High confidence: Same location, different time = clear absence (for migratory species)
- No spatial constraints: Uses existing presence locations, so no distance checking needed
- Fast generation: No random sampling or constraint checking required
- Species-specific: Captures seasonal migration patterns unique to elk
Cons:
- Limited applicability: Only works for datasets with timestamps
- Species-dependent: Less useful for non-migratory species
- May confuse model: If temporal patterns aren’t strong, this adds noise
- Small proportion: Limited to 10% because not all datasets have temporal data
Literature Alignment: While not explicitly covered in the core papers, this strategy addresses temporal variation in habitat use—a key factor in species distribution modeling. Elith & Leathwick (2009) emphasize that “species distributions are dynamic, changing over time in response to environmental conditions”. By using temporal absences, we explicitly encode this temporal dimension into the training data. This is particularly relevant for migratory species like elk, where the same location can be suitable in one season but unsuitable in another.
Why 10%? Only applicable to datasets with timestamps, but provides valuable temporal learning signal. The 10% proportion ensures temporal patterns are represented without overwhelming the model with season-specific examples. For non-migratory species or datasets without timestamps, this strategy would be skipped entirely.
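To make the "same place, different season" idea concrete, here is a small sketch. The month-to-season table, the opposite-season pairing, and the `elk_present` and `season` field names are assumptions made for this example; the post only specifies that the absence uses the presence coordinates with a different season.

```python
from datetime import datetime

# Hypothetical season boundaries and pairings; adjust to the study's own definitions.
SEASON_BY_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "fall", 10: "fall", 11: "fall"}
OPPOSITE_SEASON = {"summer": "winter", "winter": "summer",
                   "spring": "fall", "fall": "spring"}


def temporal_absence(presence_record):
    """Build an absence record at the same coordinates but in the opposite season."""
    observed = SEASON_BY_MONTH[presence_record["timestamp"].month]
    return {
        "latitude": presence_record["latitude"],
        "longitude": presence_record["longitude"],
        "season": OPPOSITE_SEASON[observed],
        "elk_present": 0,
        "absence_strategy": "temporal",
    }


presence = {"latitude": 43.5, "longitude": -110.7, "timestamp": datetime(2020, 7, 15)}
print(temporal_absence(presence)["season"])  # winter
```

No distance check is needed here, since the coordinates are reused on purpose; only the time component changes.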
Literature Alignment: Why This Multi-Strategy Approach Works
The four-strategy approach I implemented aligns with key findings from the species distribution modeling literature:
Key Findings from Barbet-Massin et al. (2012)
Their comprehensive review of pseudo-absence selection methods found:
- Large numbers matter: They recommend generating 10,000+ pseudo-absences or at least 1,000 across multiple datasets. My implementation generates absences equal to presence points (1:1 ratio), which for large datasets like Southern GYE (94,591 points) far exceeds this recommendation.
- Multiple strategies outperform single methods: The paper notes that “combining different pseudo-absence selection strategies can improve model performance”. My 40/30/20/10 split combines four complementary approaches rather than relying on a single method.
- Environmentally-constrained pseudo-absences often perform best: The study found that pseudo-absences selected from environmentally suitable areas (similar to Strategy 1) generally outperform pure random sampling. This informed my decision to make environmental pseudo-absences the largest component (40%).
- Random sampling is a reliable baseline: While constrained methods often perform better, random sampling within the study area (Strategy 3) is consistently reliable and provides geographic diversity. This is why I include it at 20%.
Key Findings from Elith & Leathwick (2009)
Their review emphasizes several principles that informed my design:
- Background points should represent available habitat: The paper emphasizes that background points should represent the available habitat from which species make selections, not just random geographic space. My environmental pseudo-absences (Strategy 1) and random background points (Strategy 3) both sample from available habitat, while unsuitable habitat absences (Strategy 2) explicitly exclude unavailable areas.
- Spatial separation matters: They note that pseudo-absences should be spatially separated from presence points to avoid ambiguous cases. My implementation uses distance constraints (2km for environmental, 500m for background) to ensure clear spatial separation.
- Temporal variation is important: The paper emphasizes that “species distributions are dynamic, changing over time in response to environmental conditions”. My temporal absences (Strategy 4) explicitly encode this temporal dimension.
Why the 40/30/20/10 Split?
The proportions I chose balance several factors:
- 40% Environmental: Largest component because Barbet-Massin et al. found environmentally-constrained pseudo-absences generally perform best. This provides the most informative learning signal.
- 30% Unsuitable: Ensures the model learns hard boundaries without overemphasizing obvious absences. This addresses Elith & Leathwick’s concern about representing truly unavailable habitat.
- 20% Random: Provides geographic diversity and serves as a reliable baseline. Barbet-Massin et al. found random sampling often performs well, especially when combined with other methods.
- 10% Temporal: Captures seasonal patterns without overwhelming the model. Only applicable to datasets with timestamps, so kept small.
This multi-strategy approach addresses the core challenge identified in the literature: no single pseudo-absence selection method works best for all situations. By combining four complementary strategies, I create a robust training dataset that captures different aspects of the species-environment relationship.
Implementation: Building the Absence Generator System
I implemented this as a modular, extensible system in Python. The architecture follows object-oriented design principles with a base class and strategy-specific subclasses.
Base Class: AbsenceGenerator
The foundation is an abstract base class that handles common functionality:
from abc import ABC

import geopandas as gpd


class AbsenceGenerator(ABC):
    """Abstract base class for generating absence points."""

    def __init__(
        self,
        presence_data: gpd.GeoDataFrame,
        study_area: gpd.GeoDataFrame,
        min_distance_meters: float = 500.0,
        crs: str = "EPSG:4326"
    ):
        self.presence_data = presence_data.copy()
        self.study_area = study_area.copy()
        self.min_distance_meters = min_distance_meters
        self.crs = crs

        # Convert to UTM for accurate distance calculations
        self.utm_crs = "EPSG:32613"  # UTM Zone 13N for Wyoming
        self.presence_utm = self.presence_data.to_crs(self.utm_crs)
Key design decisions:
- UTM projection for distances: WGS84 (lat/lon) isn’t suitable for distance calculations. I convert to UTM Zone 13N (Wyoming’s zone) for accurate meter-based distances.
- Copying data: Each generator gets its own copy to avoid side effects during parallel processing.
- Flexible CRS: Supports different coordinate systems, though we default to WGS84 for compatibility.
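To see why raw latitude/longitude degrees make poor distance units, the sketch below compares how many meters 0.01° covers north-south versus east-west near central Wyoming. It uses the haversine formula with a spherical Earth, which is an approximation I chose for illustration; the generator itself relies on the UTM reprojection described above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


lat, lon = 43.0, -107.0  # roughly central Wyoming
north_south = haversine_m(lat, lon, lat + 0.01, lon)
east_west = haversine_m(lat, lon, lat, lon + 0.01)
print(f"0.01 deg of latitude  ~ {north_south:.0f} m")
print(f"0.01 deg of longitude ~ {east_west:.0f} m")
```

At this latitude a degree of longitude is only about three quarters as long as a degree of latitude, so a "500 m" threshold expressed in degrees would mean different things in different directions. Projecting to meters first avoids that.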
The base class also implements distance constraint checking:
def check_distance_constraint(
    self,
    candidate_point: Point,
    min_distance_meters: Optional[float] = None
) -> bool:
    """Check if candidate point is far enough from all presence points."""
    if min_distance_meters is None:
        min_distance_meters = self.min_distance_meters

    # Convert candidate to UTM for distance calculation
    candidate_gdf = gpd.GeoDataFrame(
        geometry=[candidate_point],
        crs=self.crs
    ).to_crs(self.utm_crs)
    candidate_utm = candidate_gdf.geometry.iloc[0]

    # Calculate distances to all presence points
    distances = self.presence_utm.geometry.distance(candidate_utm)
    min_distance = distances.min()

    return min_distance >= min_distance_meters
This is the computational bottleneck: for each candidate absence point, we check distance to ALL presence points. With 94,591 presence points, that’s 94,591 distance calculations per candidate. This is why parallel processing became essential.
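For comparison, a spatial index can answer the same "is anything within X meters?" question without touching every presence point. The sketch below is not what PathWild does; it is a plain-Python uniform grid (a swapped-in technique, with a hypothetical `GridIndex` class) that only inspects the cells near the candidate. It assumes points are already projected to meters.

```python
import math
from collections import defaultdict


class GridIndex:
    """Bucket projected (x, y) points into square cells of side `cell_m` meters."""

    def __init__(self, points, cell_m):
        self.cell_m = cell_m
        self.cells = defaultdict(list)
        for x, y in points:
            self.cells[self._key(x, y)].append((x, y))

    def _key(self, x, y):
        return (math.floor(x / self.cell_m), math.floor(y / self.cell_m))

    def is_far_enough(self, x, y, min_dist_m):
        """True if no indexed point lies within `min_dist_m` of (x, y)."""
        reach = math.ceil(min_dist_m / self.cell_m)  # number of cells to look out
        cx, cy = self._key(x, y)
        for ix in range(cx - reach, cx + reach + 1):
            for iy in range(cy - reach, cy + reach + 1):
                # .get avoids creating empty buckets in the defaultdict
                for px, py in self.cells.get((ix, iy), ()):
                    if math.hypot(x - px, y - py) < min_dist_m:
                        return False
        return True


presence = [(0.0, 0.0), (5_000.0, 5_000.0)]
index = GridIndex(presence, cell_m=2_000.0)
print(index.is_far_enough(1_000.0, 0.0, 2_000.0))  # False: 1 km from a presence point
print(index.is_far_enough(2_500.0, 0.0, 2_000.0))  # True: nearest presence is 2.5 km away
```

With cells roughly the size of the distance threshold, each check looks at a handful of buckets instead of the full presence set. Library-backed indexes (for example an R-tree or k-d tree) follow the same principle.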
Strategy Implementation: Environmental Pseudo-Absences
The environmental generator adds habitat suitability checks:
class EnvironmentalPseudoAbsenceGenerator(AbsenceGenerator):
    """Generate pseudo-absences from environmentally suitable but unused habitat."""

    def _is_environmentally_suitable(self, point: Point) -> bool:
        """Check if point meets environmental suitability criteria."""
        lon, lat = point.x, point.y

        # Check elevation (6,000-13,500 ft; elk use high alpine areas in summer)
        elevation_m = self._sample_raster(self.dem, lon, lat, default=2500.0)
        elevation_ft = elevation_m * 3.28084
        if not (6000 <= elevation_ft <= 13500):
            return False

        # Check slope (<45°)
        slope_deg = self._sample_raster(self.slope, lon, lat, default=15.0)
        if slope_deg >= 45.0:
            return False

        # Check water distance (<5 miles)
        water_dist_mi = self._calculate_water_distance(point)
        if water_dist_mi > 5.0:
            return False

        return True
The generator loads environmental data (DEM, slope, water sources) if available, but gracefully falls back to defaults if files aren’t present. This allows the system to work even without complete environmental datasets.
The Sequential Problem: Hitting Limits
My initial implementation worked perfectly for the small South Bighorn dataset (4,650 points). But when I tried the Southern GYE dataset (94,591 points), I hit a wall:
Generating 37,836 environmental pseudo-absences...
Generated 9,557/37,836 points...
⚠ Only generated 9,557/37,836 environmental absences after 10,000 attempts
The generator was hitting the max_attempts=10,000 limit and stopping early. The result? Only 38,565 absences generated instead of 94,591—a 2.45:1 class imbalance that would bias the model.
Why was this happening?
- Dense presence data: With 94,591 presence points, finding locations ≥2km from ANY presence point is computationally expensive
- Multiple constraints: Each candidate must pass distance, elevation, slope, and water checks
- Sequential processing: One candidate at a time, checking 94,591 distances each
The sequential algorithm was simply too slow. I needed to parallelize.
Parallel Processing: The Solution
I initially considered stratified sampling (using a subset of the data), but that felt wasteful—I’d be throwing away 47% of my carefully collected GPS data. Instead, I implemented parallel processing to speed up generation while using all the data.
Architecture: Worker-Based Parallelism
The parallel implementation uses Python’s multiprocessing.Pool to distribute work across CPU cores:
def _generate_parallel(
    self,
    n_samples: int,
    max_attempts: int,
    n_processes: Optional[int] = None,
    strategy_name: str = "absence"
) -> gpd.GeoDataFrame:
    """Generate absence points using parallel processing."""
    if n_processes is None:
        n_processes = min(cpu_count(), 8)  # Cap at 8 to avoid overhead

    if n_processes == 1:
        # Fall back to sequential
        points = self._generate_worker(n_samples, max_attempts, seed=42)
    else:
        # Split work across processes
        samples_per_process = max(1, n_samples // n_processes)
        remaining_samples = n_samples - (samples_per_process * n_processes)

        # Distribute remaining samples
        worker_args = []
        for i in range(n_processes):
            worker_n_samples = samples_per_process
            if i < remaining_samples:
                worker_n_samples += 1
            # Use different seeds for each worker
            seed = 42 + i
            worker_args.append((worker_n_samples, max_attempts, seed))

        # Generate in parallel
        with Pool(processes=n_processes) as pool:
            results = pool.starmap(self._generate_worker, worker_args)

        # Combine results
        points = []
        for result in results:
            points.extend(result)
Key design decisions:
- Auto-detect cores: Defaults to number of CPU cores (capped at 8 to avoid overhead)
- Even work distribution: Splits target samples across processes, handling remainders
- Reproducible: Each worker uses a different seed (42, 43, 44…) for deterministic results
- Graceful fallback: If n_processes=1, uses sequential processing
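The even split with remainder handling can be isolated into a few lines. This is a simplified sketch with a hypothetical helper name, `split_work`; it uses `divmod` rather than the `max(1, ...)` floor in the method above, so the per-worker counts always add up to exactly the requested total, even when there are fewer samples than processes.

```python
def split_work(n_samples, n_processes, base_seed=42):
    """Return one (worker_n_samples, seed) pair per process, summing to n_samples."""
    per_process, remainder = divmod(n_samples, n_processes)
    return [
        # The first `remainder` workers each take one extra sample.
        (per_process + (1 if i < remainder else 0), base_seed + i)
        for i in range(n_processes)
    ]


plan = split_work(37_836, 8)
print(plan)
print(sum(n for n, _ in plan))
```

For the 37,836 environmental absences targeted on Southern GYE, eight workers would each be asked for 4,729 or 4,730 points, with seeds 42 through 49.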
Worker Function: Pickleable and Stateless
The worker function must be pickleable (for multiprocessing) and stateless (each worker is independent):
def _generate_worker(
    self,
    n_samples: int,
    max_attempts: int,
    seed: Optional[int] = None
) -> list:
    """Worker function for parallel generation."""
    if seed is not None:
        np.random.seed(seed)

    absence_points = []
    attempts = 0

    while len(absence_points) < n_samples and attempts < max_attempts:
        attempts += 1

        # Sample random point
        point = self._sample_random_point_in_study_area()
        if point is None:
            continue

        # Check distance constraint
        if not self.check_distance_constraint(point):
            continue

        # Check additional constraints (subclass-specific)
        if hasattr(self, '_is_environmentally_suitable'):
            if not self._is_environmentally_suitable(point):
                continue

        absence_points.append(point)

    return absence_points
Each worker:
- Generates a subset of the total samples
- Uses its own random seed for reproducibility
- Checks all constraints independently
- Returns a list of valid points
The main process then combines results from all workers.
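The worker above seeds NumPy's global generator. A common alternative, sketched here with the standard library's `random.Random` (NumPy's `numpy.random.default_rng(seed)` plays the same role), is to give each worker its own generator object so that no shared state is involved. The function name and the rough Wyoming bounding box are placeholders for illustration only.

```python
import random


def sample_points(n_points, seed, bounds=(-111.0, 41.0, -104.0, 45.0)):
    """Draw n_points (lon, lat) pairs from a private, seeded generator.

    `bounds` is an approximate Wyoming bounding box used only for this example.
    """
    rng = random.Random(seed)  # local state: nothing global is touched
    min_lon, min_lat, max_lon, max_lat = bounds
    return [(rng.uniform(min_lon, max_lon), rng.uniform(min_lat, max_lat))
            for _ in range(n_points)]


worker_a = sample_points(3, seed=42)
worker_b = sample_points(3, seed=43)
print(worker_a == sample_points(3, seed=42))  # True: same seed, same points
print(worker_a == worker_b)                   # False: different seeds diverge
```

Either way, the property that matters is the one shown here: a given seed reproduces the same candidates, and different seeds keep workers from proposing identical points.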
Adaptive max_attempts: Scaling with Dataset Size
I also implemented adaptive max_attempts calculation that scales with dataset size:
def _calculate_adaptive_max_attempts(self, n_samples: int) -> int:
    """Calculate adaptive max_attempts based on dataset size."""
    n_presence = len(self.presence_data)

    # Base max_attempts
    base_max_attempts = 10000

    # Scale with dataset size
    if n_presence > 50000:
        # Very large dataset: scale aggressively
        scale_factor = max(3.0, n_samples / 5000.0)
    elif n_presence > 10000:
        # Large dataset: moderate scaling
        scale_factor = max(2.0, n_samples / 10000.0)
    else:
        # Small dataset: minimal scaling
        scale_factor = max(1.0, n_samples / 10000.0)

    max_attempts = int(base_max_attempts * scale_factor)
    max_attempts = min(max_attempts, 1000000)  # Cap at 1M

    return max_attempts
For the Southern GYE dataset (94,591 presence points, 37,836 target absences), this calculates:
scale_factor = max(3.0, 37836 / 5000) ≈ 7.57
max_attempts = 10000 * 7.57 ≈ 75,700
This gives the generator enough attempts to find valid points, even with dense presence data.
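As a back-of-envelope check, the scaling rule can be compared against the acceptance rate seen in the sequential log earlier (9,557 accepted out of 10,000 attempts). The sketch below copies the rule into a standalone function and estimates how many attempts the target would need if that rate held for the whole run. That is an optimistic assumption, since acceptance may drop as constraints bite, and `attempts_needed` is a hypothetical helper.

```python
import math


def adaptive_max_attempts(n_presence, n_samples, base=10_000, cap=1_000_000):
    """Standalone copy of the scaling rule from _calculate_adaptive_max_attempts."""
    if n_presence > 50_000:
        scale = max(3.0, n_samples / 5_000.0)
    elif n_presence > 10_000:
        scale = max(2.0, n_samples / 10_000.0)
    else:
        scale = max(1.0, n_samples / 10_000.0)
    return min(int(base * scale), cap)


def attempts_needed(n_samples, acceptance_rate):
    """Expected attempts to collect n_samples at a constant acceptance rate."""
    return math.ceil(n_samples / acceptance_rate)


observed_rate = 9_557 / 10_000  # from the sequential log
budget = adaptive_max_attempts(94_591, 37_836)
needed = attempts_needed(37_836, observed_rate)
print(budget, needed, budget >= needed)
```

Under that assumption the adaptive budget leaves a comfortable margin over the expected number of attempts, whereas the original fixed cap of 10,000 could never have reached 37,836 accepted points.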
Results: Perfect Balance Across All Datasets
After implementing parallel processing, I re-ran the generation for all three datasets:
South Bighorn Dataset
- Input: 4,650 presence points
- Output: 9,300 total samples (4,650 presence + 4,650 absence)
- Ratio: 1.00 (perfect)
- Strategy distribution: 40/30/20/10 (perfect match)
- Runtime: ~2 minutes
Southern GYE Dataset
- Input: 94,591 presence points
- Output: 189,181 total samples (94,591 presence + 94,590 absence)
- Ratio: 1.00 (perfect)
- Strategy distribution: 40/30/20/10 (perfect match)
- Runtime: ~35 minutes (with 8 cores)
- Improvement: From 2.45:1 imbalance to perfect 1:1 balance
National Refuge Dataset
- Input: 104,913 presence points (largest dataset)
- Output: 209,824 total samples (104,913 presence + 104,911 absence)
- Ratio: 1.00 (perfect)
- Strategy distribution: 40/30/20/10 (perfect match)
- Runtime: ~45 minutes (with 8 cores)
Total combined: 408,305 training samples across all three datasets.
Testing: Comprehensive Coverage
I built a comprehensive test suite to ensure the absence generation system works correctly:
Base Functionality Tests
def test_distance_constraint(self, sample_presence_data, sample_study_area):
    """Test distance constraint checking."""
    generator = RandomBackgroundGenerator(
        sample_presence_data,
        sample_study_area,
        min_distance_meters=1000.0
    )

    # Point far from presences should pass
    far_point = Point(-108.0, 44.0)
    assert generator.check_distance_constraint(far_point)

    # Point close to presences should fail
    close_point = sample_presence_data.geometry.iloc[0]
    assert not generator.check_distance_constraint(close_point)
Parallel Processing Tests
def test_parallel_vs_sequential(self, sample_presence_data, sample_study_area):
    """Test that parallel and sequential produce similar results."""
    generator = RandomBackgroundGenerator(
        sample_presence_data,
        sample_study_area,
        min_distance_meters=500.0
    )

    # Generate with sequential
    absences_seq = generator.generate(n_samples=10, max_attempts=2000, n_processes=1)

    # Generate with parallel
    absences_par = generator.generate(n_samples=10, max_attempts=2000, n_processes=2)

    # Both should produce valid results
    assert len(absences_seq) > 0
    assert len(absences_par) > 0
    assert 'absence_strategy' in absences_seq.columns
    assert 'absence_strategy' in absences_par.columns
Adaptive max_attempts Tests
def test_adaptive_max_attempts(self, sample_presence_data, sample_study_area):
    """Test adaptive max_attempts calculation."""
    generator = RandomBackgroundGenerator(
        sample_presence_data,
        sample_study_area
    )

    # Small dataset should have base max_attempts
    max_attempts_small = generator._calculate_adaptive_max_attempts(100)
    assert max_attempts_small >= 10000

    # Large dataset should scale up
    large_presence = gpd.GeoDataFrame(
        geometry=[Point(-107.0, 43.0)] * 50000,
        crs="EPSG:4326"
    )
    large_generator = RandomBackgroundGenerator(large_presence, sample_study_area)
    max_attempts_large = large_generator._calculate_adaptive_max_attempts(20000)
    assert max_attempts_large > max_attempts_small
The test suite covers:
- Distance constraint checking
- Random point sampling
- All four generator strategies
- Parallel processing functionality
- Adaptive max_attempts scaling
- Integration tests for combining strategies
Why Parallel Processing Over Stratified Sampling?
When I first encountered the class imbalance issue, I considered two solutions:
- Stratified sampling: Use a subset of presence points (e.g., 50,000) and generate matching absences
- Parallel processing: Use all presence points but generate absences faster
I chose parallel processing for several reasons:
1. No Data Loss
Stratified sampling would discard 47% of the Southern GYE data (44,591 points). These represent real GPS collar data collected over years—throwing them away felt wasteful. Parallel processing uses all the data.
2. Solves the Real Problem
The issue wasn’t data quality—it was computational speed. The sequential algorithm checking 94,591 distances per candidate was simply too slow. Parallel processing addresses the root cause.
3. Scalability
If I get more data later, parallel processing scales. Stratified sampling requires rethinking the approach. The parallel implementation successfully handled the largest dataset (104,913 points), proving it scales.
4. Better Models
More training data generally improves model performance. Using all 94,591 points is better than 50,000, especially for a general-purpose model that needs to generalize across diverse conditions.
5. Future-Proof
The parallel implementation works for any dataset size. As I discover new data sources or the datasets grow, the system will handle them without modification.
Performance: Before and After
Sequential (Before)
Southern GYE Dataset:
- Runtime: 2-3 hours
- Completion: 40.8% (38,565 / 94,591 absences)
- Class ratio: 2.45:1 (unbalanced)
- Strategy distribution: Roughly equal (25% each) – all hit max_attempts limits
Parallel (After)
Southern GYE Dataset:
- Runtime: 30-45 minutes (4-6x faster)
- Completion: 100% (94,590 / 94,591 absences)
- Class ratio: 1.00:1 (perfect balance)
- Strategy distribution: Perfect 40/30/20/10 match
Speedup: roughly 4-6x faster with 8 cores, with complete generation.
The Orchestration Script
The main script (scripts/generate_absence_data.py) orchestrates the entire process:
def main():
    # Load presence data
    presence_df = pd.read_csv(args.presence_file)
    presence_gdf = gpd.GeoDataFrame(
        presence_df,
        geometry=gpd.points_from_xy(
            presence_df.longitude,
            presence_df.latitude
        ),
        crs="EPSG:4326"
    )

    # Calculate absence targets (40/30/20/10 split)
    n_total_absences = int(n_presence * args.ratio)
    n_environmental = int(n_total_absences * 0.40)
    n_unsuitable = int(n_total_absences * 0.30)
    n_background = int(n_total_absences * 0.20)
    n_temporal = int(n_total_absences * 0.10)

    # Generate absences using each strategy (with parallel processing)
    env_gen = EnvironmentalPseudoAbsenceGenerator(
        presence_gdf, study_area, data_dir=data_dir
    )
    env_absences = env_gen.generate(n_environmental, n_processes=args.n_processes)
    # ... (similar for other strategies)

    # Combine and enrich with environmental features
    training_data = pd.concat([presence_gdf, all_absences_gdf], ignore_index=True)
    training_data = enrich_with_features(training_data, data_dir)

    # Save
    training_data.to_csv(output_file, index=False)
The script:
- Loads presence data and study area boundaries
- Calculates target absences for each strategy
- Generates absences using parallel processing
- Validates spatial separation and class balance
- Enriches with environmental features using DataContextBuilder
- Combines and shuffles presence/absence data
- Saves the balanced training dataset
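One detail worth spelling out: the `int()` calls in the 40/30/20/10 split round each share down, which is a plausible explanation for why the Southern GYE and National Refuge runs ended up with 94,590 and 104,911 absences, one and two short of their presence counts. The sketch below reproduces that arithmetic and shows a largest-remainder allocation as one possible way to hit the total exactly. The helper names are hypothetical, and this is an alternative I am suggesting, not what the script does.

```python
from math import floor

WEIGHTS = {"environmental": 0.40, "unsuitable": 0.30, "background": 0.20, "temporal": 0.10}


def truncated_split(n_total):
    """The int() split from the script excerpt: each share is rounded down."""
    return {name: int(n_total * w) for name, w in WEIGHTS.items()}


def largest_remainder_split(n_total):
    """Round down, then hand leftover points to the largest fractional parts."""
    exact = {name: n_total * w for name, w in WEIGHTS.items()}
    counts = {name: floor(value) for name, value in exact.items()}
    leftover = n_total - sum(counts.values())
    by_fraction = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


for n in (4_650, 94_591, 104_913):
    print(n, sum(truncated_split(n).values()), sum(largest_remainder_split(n).values()))
```

A one- or two-point shortfall is negligible for class balance at this scale, so this is a tidiness fix rather than a correctness issue.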
Validation: Ensuring Quality
The script includes comprehensive validation:
def validate_absence_data(
    presence_gdf: gpd.GeoDataFrame,
    absence_gdf: gpd.GeoDataFrame
) -> bool:
    """Validate that absence data meets quality requirements."""
    # Project both layers to UTM so distances are in meters
    utm_crs = "EPSG:32613"  # UTM Zone 13N, as in AbsenceGenerator
    presence_utm = presence_gdf.to_crs(utm_crs)
    absence_utm = absence_gdf.to_crs(utm_crs)

    # Check 1: Spatial separation
    min_distances = []
    for absence_point in absence_utm.geometry:
        distances = presence_utm.geometry.distance(absence_point)
        min_distances.append(distances.min())
    mean_dist = np.array(min_distances).mean()
    assert mean_dist > 1000, "Absences too close to presences on average"

    # Check 2: Geographic coverage
    # Absence points should cover similar extent as presence points

    # Check 3: Class balance
    ratio = len(presence_gdf) / len(absence_gdf)
    assert 0.5 <= ratio <= 2.0, "Class ratio outside recommended range"

    return True
This ensures:
- Spatial separation: Mean distance >1km (prevents ambiguous points)
- Geographic coverage: Absences cover full study area
- Class balance: Ratio between 0.5 and 2.0 (ideally 1.0)
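The geographic coverage check appears only as a comment in the excerpt above, so here is one simple way such a check could look: compare bounding boxes and measure how much of the presence extent the absences overlap. The function names and the bounding-box approach are assumptions for illustration, and a real check might use the study area polygon instead.

```python
def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) tuples."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def coverage_ratio(presence_xy, absence_xy):
    """Fraction of the presence bounding box that the absence bounding box overlaps."""
    p_min_x, p_min_y, p_max_x, p_max_y = bounding_box(presence_xy)
    a_min_x, a_min_y, a_max_x, a_max_y = bounding_box(absence_xy)
    overlap_w = max(0.0, min(p_max_x, a_max_x) - max(p_min_x, a_min_x))
    overlap_h = max(0.0, min(p_max_y, a_max_y) - max(p_min_y, a_min_y))
    presence_area = (p_max_x - p_min_x) * (p_max_y - p_min_y)
    return (overlap_w * overlap_h) / presence_area if presence_area > 0 else 0.0


presence = [(0.0, 0.0), (10.0, 10.0)]
print(coverage_ratio(presence, [(-5.0, -5.0), (20.0, 20.0)]))  # 1.0: absences span the whole extent
print(coverage_ratio(presence, [(5.0, 0.0), (20.0, 10.0)]))    # 0.5: only the eastern half is covered
```

A validation step could then assert that this ratio stays above some threshold chosen for the study area.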
Lessons Learned
1. Start Simple, Scale When Needed
The sequential implementation worked perfectly for small datasets. I only needed parallel processing when I hit the large dataset (94K+ points). This follows the principle: solve problems when you encounter them, not preemptively.
2. Profile Before Optimizing
I didn’t guess that distance checking was the bottleneck—I measured. The validation showed that 88% of absences were >1km from presence points, but the sequential algorithm was too slow to generate enough of them. This told me the problem was speed, not feasibility.
3. Modular Design Enables Parallelization
The worker function design (pickleable, stateless) made parallelization straightforward. If I’d tightly coupled the generation logic, adding parallelism would have been much harder.
4. Adaptive Parameters Scale Better Than Fixed
The adaptive max_attempts calculation automatically handles different dataset sizes. A fixed value would require manual tuning for each dataset.
5. Validation Catches Issues Early
The validation function caught the class imbalance immediately. Without it, I might have trained a biased model and only discovered the issue later.
Next Steps: Model Training
With three balanced training datasets totaling 408,305 samples, I’m ready for the next phase:
- Feature engineering: All points are enriched with environmental features via DataContextBuilder
- Model training: Train XGBoost binary classifier on the combined dataset
- Validation: Test the model on Area 048 during October 2026 hunt
- Iteration: Refine based on real-world performance
The absence generation system is production-ready and has proven to scale from small (4.6K points) to very large (104K+ points) datasets with consistent results.
Technical Details
All code is available in the PathWild repository:
- src/data/absence_generators.py – Core absence generation classes
- scripts/generate_absence_data.py – Main orchestration script
- tests/test_absence_generators.py – Comprehensive test suite
- docs/absence_data_generation.md – Detailed documentation
The system uses:
- GeoPandas for spatial operations
- Shapely for geometry calculations
- Rasterio for environmental data sampling (when available)
- Multiprocessing for parallel generation
- Pandas for data manipulation
The Takeaway
Building a robust absence generation system required:
- Multiple strategies – No single approach captures all the nuances
- Parallel processing – Essential for large datasets
- Adaptive parameters – Scale with dataset size
- Comprehensive testing – Ensure quality and correctness
- Validation – Catch issues before training
The result is a system that transforms presence-only GPS data into balanced training datasets suitable for machine learning, while preserving all the valuable data I collected. This sets the foundation for training a general-purpose elk location prediction model that I’ll validate in the field next October.
Building PathWild continues to be an exercise in iterative development. Each phase—from data exploration to absence generation—builds on the previous work. The parallel processing implementation solved a real performance bottleneck while maintaining data quality. Next, I’ll train the XGBoost model and prepare for field validation.
References
- Elith, J., & Leathwick, J. R. (2009). Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics, 40, 677-697. DOI: 10.1146/annurev.ecolsys.110308.120159
- Barbet-Massin, M., Jiguet, F., Albert, C. H., & Thuiller, W. (2012). Selecting pseudo-absences for species distribution models: how, where and how many? Methods in Ecology and Evolution, 3(2), 327-338. DOI: 10.1111/j.2041-210X.2011.00172.x