Data science involves repetitive tasks like data cleaning, analysis, and model training, which can be time-consuming and error-prone when done manually. Python automation scripts offer a solution, enabling data scientists to automate data workflows, save time, and focus on extracting insights. Python’s rich ecosystem of libraries, such as Pandas, Scikit-learn, and BeautifulSoup, makes it ideal for data science automation.
This article presents 10 powerful Python automation scripts to streamline common data science tasks, drawing from best practices and real-world applications.
Data science automation eliminates repetitive work, reduces errors, and ensures consistent results. Manual processes, such as cleaning datasets or generating reports, can take hours, draining productivity. According to Timothy Kimutai, data cleaning alone consumes 60–80% of a data scientist’s time. Python automation scripts turn these tasks into reusable workflows, and the benefits include saved time, fewer manual errors, reproducible results, and processes that scale with your data.
These scripts will enhance your productivity and streamline your data science automation efforts.
Data cleaning is a critical yet tedious task in data science automation. This script uses Pandas to handle duplicates, missing values, and outliers systematically.
import pandas as pd
import numpy as np

def automated_data_cleaning(df):
    """Cleans data by removing duplicates, handling missing values, and outliers."""
    # Remove exact duplicate rows
    df = df.drop_duplicates()
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    categorical_cols = df.select_dtypes(include=['object']).columns
    # Impute numeric columns with the median, categorical columns with the mode
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0] if not df[col].mode().empty else 'Unknown')
    # Drop rows outside the 1.5 * IQR range for each numeric column
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        df = df[~((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR)))]
    # Standardize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    return df
# Example usage
raw_df = pd.read_csv('data.csv')
df_clean = automated_data_cleaning(raw_df)
df_clean.to_csv('cleaned_data.csv', index=False)
Use Case: A retail company processes sales data with inconsistent formats and missing entries. This script ensures data quality for analysis.
Benefits:
Exploratory Data Analysis (EDA) is essential but repetitive. This script automates EDA with ydata-profiling, generating detailed reports instantly.
import pandas as pd
from ydata_profiling import ProfileReport

def generate_eda_report(df, title="Data Analysis Report"):
    """Generates an EDA report with visualizations and statistics."""
    profile = ProfileReport(
        df,
        title=title,
        explorative=True,
        correlations={'auto': {'calculate': True}},
        missing_diagrams={'heatmap': True}
    )
    # Save the interactive report as an HTML file
    profile.to_file(f"{title.replace(' ', '_').lower()}.html")
    # Collect a quick summary alongside the full report
    summary = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'missing_percentage': (df.isnull().sum().sum() / (len(df) * len(df.columns))) * 100,
        'duplicate_rows': df.duplicated().sum()
    }
    return profile, summary
# Example usage
df = pd.read_csv('customer_data.csv')
profile, summary = generate_eda_report(df, "Customer Dataset Analysis")
print(f"Dataset has {summary['missing_percentage']:.2f}% missing values")
Use Case: A marketing team needs quick insights into monthly customer data for campaign planning.
Benefits:
Creating dashboards manually is time-intensive. This script automates interactive dashboard creation for stakeholder insights.
import dash
from dash import dcc, html, Input, Output
import plotly.express as px
import pandas as pd

def create_automated_dashboard(df):
    """Creates an interactive dashboard with dynamic visualizations."""
    app = dash.Dash(__name__)
    numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    app.layout = html.Div([
        html.H1("Automated Data Dashboard"),
        html.Div([
            html.Label("Select X-axis:"),
            dcc.Dropdown(
                id='x-axis-dropdown',
                options=[{'label': col, 'value': col} for col in numeric_cols + categorical_cols],
                value=numeric_cols[0]
            ),
        ], style={'width': '48%', 'display': 'inline-block'}),
        html.Div([
            html.Label("Select Y-axis:"),
            dcc.Dropdown(
                id='y-axis-dropdown',
                options=[{'label': col, 'value': col} for col in numeric_cols],
                value=numeric_cols[1]
            ),
        ], style={'width': '48%', 'float': 'right'}),
        dcc.Graph(id='main-graph'),
        dcc.Graph(id='distribution-graph')
    ])

    @app.callback(
        [Output('main-graph', 'figure'), Output('distribution-graph', 'figure')],
        [Input('x-axis-dropdown', 'value'), Input('y-axis-dropdown', 'value')]
    )
    def update_graphs(x_axis, y_axis):
        scatter_fig = px.scatter(df, x=x_axis, y=y_axis, title=f'{y_axis} vs {x_axis}')
        if x_axis in numeric_cols:
            dist_fig = px.histogram(df, x=x_axis, title=f'Distribution of {x_axis}')
        else:
            # pandas >= 2.0 names the reset columns <col> and 'count'
            counts = df[x_axis].value_counts().reset_index()
            dist_fig = px.bar(counts, x=x_axis, y='count', title=f'Distribution of {x_axis}')
        return scatter_fig, dist_fig

    return app

# Example usage
df = pd.read_csv('sales_data.csv')
dashboard = create_automated_dashboard(df)
dashboard.run(debug=True)
Use Case: A sales team needs real-time performance dashboards updated with new data.
Benefits:
Manual data collection from websites is inefficient. This script automates web scraping with error handling.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
def automated_web_scraper(urls, delay_range=(1, 3)):
    """Scrapes data from multiple URLs with error handling."""
    scraped_data = []
    for i, url in enumerate(urls):
        try:
            # Random delay between requests to avoid overloading servers
            time.sleep(random.uniform(*delay_range))
            headers = {'User-Agent': 'Mozilla/5.0'}
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            data = {
                'url': url,
                'title': soup.find('title').text.strip() if soup.find('title') else 'N/A',
                'meta_description': soup.find('meta', attrs={'name': 'description'}).get('content', '')
                if soup.find('meta', attrs={'name': 'description'}) else '',
                'headings': [h.text.strip() for h in soup.find_all(['h1', 'h2', 'h3'])[:5]],
                'scraped_at': pd.Timestamp.now()
            }
            scraped_data.append(data)
            print(f"Scraped {i+1}/{len(urls)}: {url}")
        except Exception as e:
            # Record failures so the output DataFrame still covers every URL
            print(f"Error scraping {url}: {str(e)}")
            scraped_data.append({'url': url, 'error': str(e), 'scraped_at': pd.Timestamp.now()})
    return pd.DataFrame(scraped_data)
# Example usage
urls = ['https://example1.com', 'https://example2.com']
scraped_df = automated_web_scraper(urls)
scraped_df.to_csv('scraped_data.csv', index=False)
Use Case: A market research team monitors competitor pricing across multiple websites daily.
Benefits:
Model training involves repetitive preprocessing and tuning. This script automates the entire machine learning pipeline.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import pandas as pd
import joblib

def create_automated_ml_pipeline(df, target_column, model_type='classification'):
    """Creates a reusable ML pipeline for training and evaluation."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = X.select_dtypes(include=['object']).columns
    # Impute and scale numeric features; impute and one-hot encode categoricals
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
    model = (RandomForestClassifier(n_estimators=100, random_state=42)
             if model_type == 'classification'
             else RandomForestRegressor(n_estimators=100, random_state=42))
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipeline.fit(X_train, y_train)
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    # Persist the fitted pipeline for later reuse
    joblib.dump(pipeline, f'ml_pipeline_{target_column}.pkl')
    return {'pipeline': pipeline, 'cv_scores': cv_scores, 'test_score': pipeline.score(X_test, y_test)}
# Example usage
df = pd.read_csv('dataset.csv')
results = create_automated_ml_pipeline(df, 'target_column')
print(f"Cross-validation score: {results['cv_scores'].mean():.3f}")
Use Case: A bank retrains fraud detection models weekly with new transaction data.
Benefits:
Feature engineering requires repetitive coding. This script automates common feature creation tasks.
from feature_engine.creation import MathFeatures
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import RareLabelEncoder, OneHotEncoder
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures
from feature_engine.transformation import LogTransformer
import pandas as pd

def automated_feature_engineering(df, target_column=None):
    """Automates feature engineering with transformations and encoding."""
    X = df.drop(columns=[target_column]) if target_column else df.copy()
    numeric_vars = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_vars = X.select_dtypes(include=['object']).columns.tolist()
    # Drop uninformative columns
    X = DropConstantFeatures().fit_transform(X)
    X = DropDuplicateFeatures().fit_transform(X)
    # Group infrequent categories into a single label
    if categorical_vars:
        X = RareLabelEncoder(tol=0.01, n_categories=10).fit_transform(X)
    # Combine numeric columns into sum/mean features
    if len(numeric_vars) >= 2:
        X = MathFeatures(variables=numeric_vars[:5], func=['sum', 'mean']).fit_transform(X)
    # Log-transform strictly positive, highly skewed variables
    skewed_vars = [var for var in numeric_vars if X[var].min() > 0 and abs(X[var].skew()) > 1]
    if skewed_vars:
        X = LogTransformer(variables=skewed_vars).fit_transform(X)
    # Bin a few numeric columns into equal-frequency categories
    if numeric_vars:
        X = EqualFrequencyDiscretiser(variables=numeric_vars[:3], q=5, return_object=True).fit_transform(X)
    # One-hot encode all remaining categorical columns
    updated_categorical_vars = X.select_dtypes(include=['object']).columns.tolist()
    if updated_categorical_vars:
        X = OneHotEncoder(variables=updated_categorical_vars, drop_last=True).fit_transform(X)
    summary = {
        'original_features': len(df.columns) - (1 if target_column else 0),
        'final_features': len(X.columns)
    }
    return X, summary
# Example usage
df = pd.read_csv('data.csv')
X_engineered, summary = automated_feature_engineering(df, 'target')
print(f"Created {summary['final_features'] - summary['original_features']} new features")
Use Case: An e-commerce platform enhances its recommendation system with automated feature creation from user data.
Benefits:
Manual hyperparameter tuning is inefficient. This script automates the process using Optuna.
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
def automated_hyperparameter_tuning(X, y, model_type='random_forest', n_trials=50):
    """Optimizes model hyperparameters automatically."""
    def objective(trial):
        # Search space for the random forest ('auto' was removed from scikit-learn's max_features)
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 300),
            'max_depth': trial.suggest_int('max_depth', 3, 20),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
            'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
        }
        model = RandomForestClassifier(**params, random_state=42)
        return cross_val_score(model, X, y, cv=5).mean()

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=n_trials)
    # Refit the best configuration on the full dataset
    best_model = RandomForestClassifier(**study.best_params, random_state=42)
    best_model.fit(X, y)
    return {'best_model': best_model, 'best_params': study.best_params, 'best_score': study.best_value}
# Example usage
df = pd.read_csv('data.csv')
X, y = df.drop('target', axis=1), df['target']
results = automated_hyperparameter_tuning(X, y)
print(f"Best score: {results['best_score']:.4f}")
Use Case: A data science team optimizes models for various client projects efficiently.
Benefits:
Manual model evaluation is time-consuming. This script automates comprehensive evaluation reports.
from yellowbrick.classifier import ClassificationReport, ROCAUC, ConfusionMatrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd

def automated_model_evaluation(model, X, y, model_name="Model"):
    """Generates automated model evaluation reports."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    fig.suptitle(f'{model_name} Evaluation Report')
    # Draw each Yellowbrick visualizer on its own subplot
    for visualizer in (
        ClassificationReport(model, ax=axes[0, 0], support=True),
        ROCAUC(model, ax=axes[0, 1]),
        ConfusionMatrix(model, ax=axes[1, 0]),
    ):
        visualizer.fit(X_train, y_train)
        visualizer.score(X_test, y_test)
        visualizer.finalize()
    axes[1, 1].axis('off')  # Unused subplot
    plt.savefig(f'{model_name.lower().replace(" ", "_")}_report.png', dpi=300)
    plt.close()
    model.fit(X_train, y_train)
    return {'train_accuracy': model.score(X_train, y_train), 'test_accuracy': model.score(X_test, y_test)}
# Example usage
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
df = pd.read_csv('data.csv')
X, y = df.drop('target', axis=1), df['target']
summary = automated_model_evaluation(model, X, y, "Random Forest")
print(f"Test accuracy: {summary['test_accuracy']:.3f}")
Use Case: A consulting firm needs professional model performance reports for clients.
Benefits:
Data versioning ensures reproducibility but is complex. This script automates dataset tracking.
import pandas as pd
import os
import hashlib
import json
from datetime import datetime

class AutomatedDataVersioning:
    def __init__(self, project_path="."):
        self.project_path = project_path
        self.data_dir = os.path.join(project_path, "data")
        os.makedirs(self.data_dir, exist_ok=True)

    def add_dataset_version(self, dataframe, dataset_name, description=""):
        """Adds and tracks a new dataset version."""
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        # Short content hash so identical data produces the same suffix
        data_hash = hashlib.md5(dataframe.to_string().encode()).hexdigest()[:8]
        filename = f"{dataset_name}_{timestamp}_{data_hash}.csv"
        filepath = os.path.join(self.data_dir, filename)
        dataframe.to_csv(filepath, index=False)
        metadata = {
            'dataset_name': dataset_name,
            'timestamp': timestamp,
            'description': description,
            'shape': dataframe.shape,
            'data_hash': data_hash
        }
        # Store metadata alongside the CSV
        with open(filepath.replace('.csv', '_metadata.json'), 'w') as f:
            json.dump(metadata, f)
        # Track the file with the DVC command-line tool (requires an initialized DVC project)
        os.system(f"cd {self.project_path} && dvc add {filepath}")
        return filepath, metadata
# Example usage
df = pd.read_csv('data.csv')
versioning = AutomatedDataVersioning()
filepath, metadata = versioning.add_dataset_version(df, "customer_data", "Initial dataset")
print(f"Dataset saved: {filepath}")
Use Case: A team tracks versions of training data for a churn prediction model.
Benefits:
Scheduling repetitive tasks manually is inefficient. This script automates task scheduling and monitoring.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger
import pandas as pd
import logging
from datetime import datetime

class AutomatedDataPipeline:
    def __init__(self):
        self.scheduler = BackgroundScheduler()
        logging.basicConfig(level=logging.INFO, filename='pipeline.log')
        self.logger = logging.getLogger(__name__)

    def data_collection_job(self):
        """Automates data collection and logging."""
        self.logger.info("Starting data collection")
        # Placeholder collection step; replace with a real data source
        data = pd.DataFrame({'timestamp': [datetime.now()], 'records': [1000]})
        data.to_csv(f"data_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv", index=False)
        self.logger.info("Data collection completed")

    def setup_schedules(self):
        """Sets up automated schedules."""
        # Run the collection job every day at 02:00
        self.scheduler.add_job(self.data_collection_job, CronTrigger(hour=2, minute=0), id='data_collection')
        self.scheduler.start()
# Example usage
pipeline = AutomatedDataPipeline()
pipeline.setup_schedules()
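Note that BackgroundScheduler runs jobs in a background thread, so the example above exits immediately unless something keeps the process alive. A minimal sketch of one way to continue the example:

import time

try:
    # Keep the main thread alive so the background scheduler can fire its jobs
    while True:
        time.sleep(60)
except (KeyboardInterrupt, SystemExit):
    pipeline.scheduler.shutdown()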
Use Case: An e-commerce company schedules daily data updates and model retraining.
Benefits:
The true power of Python automation scripts lies in combining them into end-to-end data science automation pipelines. For example, the cleaning script can feed the EDA report generator, the feature engineering and ML pipeline scripts can build models from the cleaned data, and the scheduling script can run the whole chain on a recurring basis, as sketched below.
This integrated approach creates robust data science automation workflows.
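A minimal sketch of such a chain, assuming the functions defined earlier in this article are importable from a local module; the module name scripts, the file data.csv, and the column target are assumptions for illustration:

import pandas as pd

# Assumed local module collecting the functions defined earlier in this article
from scripts import (
    automated_data_cleaning,
    generate_eda_report,
    automated_feature_engineering,
    create_automated_ml_pipeline,
)

def run_end_to_end(path, target_column):
    """Chains cleaning, EDA, feature engineering, and model training into one workflow."""
    raw_df = pd.read_csv(path)
    clean_df = automated_data_cleaning(raw_df)
    _, eda_summary = generate_eda_report(clean_df, "Pipeline EDA")
    features, _ = automated_feature_engineering(clean_df, target_column)
    # Reattach the target before handing off to the ML pipeline script
    features[target_column] = clean_df[target_column].values
    results = create_automated_ml_pipeline(features, target_column)
    return eda_summary, results

# Example usage (hypothetical file and column names)
eda_summary, results = run_end_to_end('data.csv', 'target')
print(f"Test score: {results['test_score']:.3f}")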
To implement these Python automation scripts, install the required libraries (Pandas, ydata-profiling, Dash, Requests, BeautifulSoup, Scikit-learn, feature-engine, Optuna, Yellowbrick, DVC, and APScheduler), adapt each function to your own data schemas and file paths, and test the workflows on small samples before scheduling them in production.
These 10 Python automation scripts transform data science automation by streamlining repetitive tasks like data cleaning, EDA, visualization, and model training. By leveraging libraries like Pandas, ydata-profiling, and Optuna, data scientists can automate data workflows, saving time and reducing errors. Start implementing these scripts to boost productivity and focus on strategic insights, making your data science automation efforts more efficient and impactful.