Machine Learning · Prototype

Crop Yield Prediction: Applying Machine Learning to Agricultural Productivity Data

A reproducible machine-learning project for predicting crop yield from agronomic and environmental features.

Data science and machine learning builder · 2024

  • Python
  • pandas
  • scikit-learn
  • Jupyter

Role: Data science and machine learning builder
Status: Prototype
Year: 2024
Type: Machine Learning
Access: Public

Overview

Crop Yield Prediction is a reproducible machine-learning project that estimates crop yield from agronomic and environmental features. It covers the full path from raw data to evaluated model, kept runnable end to end so results can be re-checked and extended.

Problem

Yield is shaped by many agronomic and environmental factors at once (soil, weather, inputs, and management), which makes it hard to anticipate from raw records alone. The goal was a model that turns those signals into a useful yield estimate and whose reasoning can be inspected rather than treated as a black box.

Data and preprocessing

Work started by cleaning and consolidating the source data, then shaping features the model could learn from.

  • Cleaning and consolidating raw records
  • Handling missing values and inconsistent units
  • Feature engineering from agronomic and environmental variables
  • Splitting into training and held-out evaluation sets
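
As a rough illustration of the steps above, the sketch below uses pandas and scikit-learn on a tiny invented table. The column names (`rainfall`, `rainfall_unit`, `fertilizer_kg_ha`, `area_ha`, `harvest_t`), the values, and the unit rule are all assumptions made for the example, not the project's actual schema. Missing values are filled with medians computed on the training split only, so the held-out rows do not influence preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw records; names and values are illustrative only.
raw = pd.DataFrame({
    "field_id": [1, 2, 3, 4, 5, 6, 6],
    "rainfall": [820.0, 0.79, 910.0, np.nan, 760.0, 880.0, 880.0],
    "rainfall_unit": ["mm", "m", "mm", "mm", "mm", "mm", "mm"],
    "fertilizer_kg_ha": [120.0, 95.0, np.nan, 110.0, 130.0, 100.0, 100.0],
    "area_ha": [2.0, 1.5, 3.0, 2.5, 1.0, 4.0, 4.0],
    "harvest_t": [7.0, 4.8, 11.1, 8.0, 3.4, 14.2, 14.2],
})


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, harmonize units, and derive the target column."""
    out = df.drop_duplicates().copy()
    # Inconsistent units: convert rainfall recorded in metres to millimetres.
    in_metres = out["rainfall_unit"].eq("m")
    out.loc[in_metres, "rainfall"] *= 1000.0
    out = out.rename(columns={"rainfall": "rainfall_mm"}).drop(columns="rainfall_unit")
    # Target: yield in tonnes per hectare.
    out["yield_t_ha"] = out["harvest_t"] / out["area_ha"]
    return out.drop(columns="harvest_t").reset_index(drop=True)


clean = prepare(raw)
features = ["rainfall_mm", "fertilizer_kg_ha", "area_ha"]

X_train, X_test, y_train, y_test = train_test_split(
    clean[features], clean["yield_t_ha"], test_size=0.33, random_state=0
)

# Impute with training-set medians only, then reuse them on the held-out set.
medians = X_train.median()
X_train = X_train.fillna(medians)
X_test = X_test.fillna(medians)

# Simple engineered feature, computed after imputation so it has no gaps.
for part in (X_train, X_test):
    part["fert_per_mm"] = part["fertilizer_kg_ha"] / part["rainfall_mm"]
```

Splitting before imputation is a deliberate choice here: statistics learned from the full table would quietly leak held-out information into training.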

Modelling approach

I compared classical models before reaching for anything heavier. The dataset is tabular and modest in size, so interpretable models that iterate quickly were the right starting point.

  • A baseline model for a reference point
  • Comparison across several scikit-learn regressors
  • Hyperparameter tuning on the training set
  • Selection by performance on held-out data
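
A minimal sketch of that comparison loop follows. It assumes synthetic stand-in data and an arbitrary pair of candidates (`Ridge` and `RandomForestRegressor`) with small illustrative grids; the regressors and hyperparameters actually used in the project are not listed here, so treat these as placeholders.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the tabular yield data (assumption for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}

# Baseline: always predict the training mean, as a reference point.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
results["baseline_mean"] = mean_absolute_error(y_test, baseline.predict(X_test))

# Candidate models with small, illustrative hyperparameter grids.
candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "random_forest": (
        RandomForestRegressor(n_estimators=50, random_state=0),
        {"max_depth": [3, None]},
    ),
}

for name, (model, grid) in candidates.items():
    # Tuning uses cross-validation inside the training set only.
    search = GridSearchCV(model, grid, cv=5, scoring="neg_mean_absolute_error")
    search.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, search.predict(X_test))

# Lower MAE on the held-out set is better.
best = min(results, key=results.get)
for name, mae in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:>14}: MAE = {mae:.3f}")
```

Any candidate that cannot beat `baseline_mean` is a sign that the features carry little signal or that something upstream is broken.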

Evaluation

Models were scored on held-out data with standard regression metrics. The emphasis was honest evaluation rather than a single headline number: an unusually high score is a prompt to check for data leakage or overfitting, not a result to celebrate. Exact figures live in the notebook rather than being quoted here.
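
The text does not name the metrics, so the sketch below assumes three common choices (MAE, RMSE, and R²) and computes them on made-up numbers. The helper names and the 0.98 review threshold are hypothetical, chosen only to show how a suspiciously high score could be flagged for a closer look.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for one set of held-out predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),
        "r2": r2_score(y_true, y_pred),
    }


def needs_leakage_review(r2, threshold=0.98):
    """Flag scores high enough to warrant a leakage or overfitting check.

    The threshold is an arbitrary illustrative cut-off, not a project value.
    """
    return r2 >= threshold


# Made-up held-out targets and predictions (tonnes per hectare).
y_true = np.array([3.0, 4.0, 5.0, 6.0])
y_pred = np.array([2.5, 4.5, 5.0, 7.0])

report = regression_report(y_true, y_pred)
print(report)
print("review for leakage:", needs_leakage_review(report["r2"]))
```

Reporting MAE and RMSE side by side is useful because a large gap between them points to a few big misses rather than uniformly small errors.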

Interpretation

Beyond the score, feature-importance analysis helped sanity-check which signals the model leaned on and whether that lined up with agronomic intuition. A model is more trustworthy when its reasoning is plausible to someone who knows the domain.
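
One way to run this kind of check is permutation importance on the held-out set, sketched below. The write-up does not say which importance method was used, so this is a swapped-in example; the data is synthetic and the feature names are invented, with two informative columns and two pure-noise columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two (hypothetically named) features matter.
rng = np.random.default_rng(0)
feature_names = ["rainfall_mm", "fertilizer_kg_ha", "noise_a", "noise_b"]
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.3, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:>18}: {score:.3f}")
```

If a column that should be agronomically irrelevant ranks near the top, that is worth investigating before trusting the score.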

Limitations

  • A model trained on this dataset may not generalize across regions
  • Model quality depends on feature quality and data coverage
  • High performance should be checked for leakage or overfitting
  • Real deployment would need validation with current local data

What it demonstrates

  • A reproducible, end-to-end ML workflow
  • Feature engineering and model comparison
  • Careful, leakage-aware evaluation
  • Connecting model behaviour to agricultural relevance

Stack

  • Python
  • pandas
  • scikit-learn
  • Jupyter

Proof assets

Some proof assets use dummy data or are shared as private walkthroughs to protect sensitive systems and records.

  • Notebook (coming soon): The full preparation, training, and evaluation flow.
  • Screenshot (planned): Model evaluation chart. Performance on held-out data.
  • Screenshot (planned): Feature importance plot. Which signals the model leaned on.
  • GitHub (coming soon): Source repository.

Availability

Public. Code and a demo can be shared freely.

Next steps

  • Validate with current local and regional data
  • Add cross-validation and explicit leakage checks
  • Publish the notebook and evaluation charts
  • Package the model behind a small API
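
For the cross-validation and leakage item, one possible shape is sketched below. It assumes synthetic data and an illustrative `Ridge` model; the point is only that wrapping imputation and scaling in a scikit-learn pipeline makes each fold fit its preprocessing on that fold's training portion alone.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with about 5% of feature values knocked out.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 * X[:, 0] - X[:, 2] + rng.normal(scale=0.4, size=200)
X[rng.random(X.shape) < 0.05] = np.nan

# Preprocessing lives inside the pipeline, so it is refit within every fold.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    Ridge(alpha=1.0),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The spread across folds is as informative as the mean: a large standard deviation suggests the estimate depends heavily on which rows land in the evaluation fold.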