Data Science .gitignore Lifecycle
Comprehensive .gitignore templates for ML/AI projects, from research and experimentation to production deployment and MLOps workflows.
EDA & Research
Jupyter notebooks & datasets
Model Training
Feature engineering & ML
Experiments
MLflow, W&B tracking
MLOps
Production deployment
Research & EDA
Exploratory data analysis and research experiments
# Data Science Research & EDA .gitignore
# Jupyter Notebook checkpoints (this pattern matches at any directory depth)
.ipynb_checkpoints/
.jupyter/
.ipython/
# Raw datasets (usually large and should not be in version control)
data/raw/
data/external/
datasets/
*.csv
*.tsv
*.json
*.parquet
*.hdf5
*.h5
*.pickle
*.pkl
*.feather
# Exploratory data analysis outputs
eda_output/
exploratory/
analysis_results/
plots/
figures/
*.png
*.jpg
*.jpeg
*.svg
*.pdf
# Research notebooks and temporary files
research_notebooks/
scratch/
temp_analysis/
prototype/
*.tmp
*.temp
# Data profiling reports
profiling_reports/
data_quality/
*.html
report_*.json
# Virtual environment
venv/
.venv/
env/
ENV/
conda-env/
.conda/
# Python bytecode caches and compiled extensions
__pycache__/
*.py[cod]
*$py.class
*.so
# Research databases and caches
cache/
.cache/
*.db
*.sqlite3
# Temporary model files
temp_models/
scratch_models/
*.joblib
*.model
# IDE and editor files
.vscode/
.idea/
*.swp
*.swo
# OS files
.DS_Store
Thumbs.db
# Environment variables
.env
.env.local
research_config.py
# Logs
*.log
# Statistical analysis outputs
stats_output/
correlation_matrices/
feature_importance/
# Documentation drafts
draft_docs/
research_notes/
*.md.backup
# Testing and validation
test_results/
validation_output/
cross_validation/
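The template above ignores every `*.csv`, `*.json`, `*.png`, and `*.html` file in the repository, which also hides small files you may want to track, such as sample data or JSON schemas. A minimal sketch of re-including them with negation patterns, assuming hypothetical `data/sample/`, `schemas/`, and `docs/` directories that are not part of the original template:

```gitignore
# Place these after the blanket rules: when several patterns match,
# the last matching one decides the outcome.
# All directory names below are illustrative assumptions.

# Keep small sample datasets under version control
!data/sample/*.csv
!data/sample/*.parquet

# Keep data schemas stored as JSON
!schemas/*.json

# Keep images referenced from documentation
!docs/**/*.png
!docs/**/*.svg
```

Negation cannot re-include a file whose parent directory is itself ignored, so anything under `data/raw/`, or under a directory named `figures/` or `plots/` at any depth, stays ignored regardless of these rules.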
ML Framework Integration
TensorFlow / Keras
Deep learning and neural networks
# TensorFlow/Keras
*.tfrecord
*.tfevents.*
tf_logs/
saved_model/
*.pb
*.h5
*.hdf5
checkpoints/
tensorboard_logs/
PyTorch
Dynamic neural networks and research
# PyTorch
*.pth
*.pt
*.ckpt
lightning_logs/
torch_cache/
.torch/
torchvision_cache/
scikit-learn
Traditional ML algorithms
# scikit-learn
*.joblib
sklearn_models/
pipeline_cache/
grid_search_results/
cross_val_cache/
MLflow
Experiment tracking and model registry
# MLflow
mlruns/
mlflow.db
mlartifacts/
model_registry/
mlflow_tracking/
Weights & Biases
Experiment tracking and collaboration
# Weights & Biases
wandb/
.wandb/
wandb-offline/
wandb-metadata.json
wandb-summary.json
Hugging Face
Transformers and NLP models
# Hugging Face
transformers_cache/
.cache/huggingface/
models--*/
tokenizers_cache/
*.safetensors
Data Management Strategy
Data Type | File Extensions | Typical Size | Version Control Strategy |
---|---|---|---|
Structured Data | .csv, .tsv, .parquet, .feather | Medium (MB - GB) | Sample data in repo, full data external |
Unstructured Data | .json, .txt, .pdf, .docx | Large (GB - TB) | Metadata only, data in cloud storage |
Image Data | .jpg, .png, .tiff, .dicom | Very Large (TB+) | DVC or cloud storage with versioning |
Audio/Video | .wav, .mp3, .mp4, .avi | Extremely Large (PB) | External storage with metadata tracking |
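For the DVC strategy in the table, one detail to watch is that `dvc add` writes a small `.dvc` pointer file next to the data it tracks, and that pointer is meant to be committed to Git. A directory rule such as `data/raw/` would hide it. A sketch of one way to adjust the rules, assuming files are tracked with DVC inside `data/raw/` and that these lines are used in place of the plain `data/raw/` entry:

```gitignore
# Ignore the contents of data/raw/ rather than the directory itself,
# so that individual entries can be re-included below.
data/raw/*

# Keep DVC pointer files and the .gitignore that `dvc add` maintains
!data/raw/*.dvc
!data/raw/.gitignore
```

If the whole directory is added as a single DVC target instead, the pointer file is created beside the directory (for example `data/raw.dvc`) and the original directory rule does not hide it.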
Data Science Best Practices
Data Handling
Keep out of version control: raw datasets, personal information, API keys, large model files
Commit to the repository: data schemas, sample data, preprocessing scripts, feature definitions
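For API keys in particular, a common convention is to ignore every local environment file while committing a placeholder template, so collaborators can see which variables they need to set. A sketch, assuming a hypothetical `.env.example` file that contains no real secrets:

```gitignore
# Ignore all local environment files
.env
.env.*

# Keep a template with placeholder values (file name is an assumption)
!.env.example
```

The negation has to come after `.env.*`, because that pattern would otherwise match `.env.example` as well.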
Model Management
Use a model registry (such as MLflow) instead of Git for model versioning
Keep experiment metadata in code and experiment results in tracking systems
Accelerate Your ML Development
Start your ML project with the right .gitignore configuration. From research notebooks to production deployment, maintain clean repositories and efficient workflows.