Data Science Workflow: From Raw Data to Insights - Complete Guide

Introduction

Data science is an interdisciplinary field that combines scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It encompasses everything from data collection and preprocessing to model building, evaluation, and deployment.

This comprehensive guide walks you through the complete data science workflow, from raw data to actionable insights. Whether you're a beginner starting your data science journey or an experienced practitioner looking to refine your workflow, this guide provides practical examples, best practices, and real-world techniques using Python and popular data science libraries.

Understanding Data Science

Data science combines multiple disciplines to extract insights from data:

Core Components:

  • Statistics: Mathematical foundations for data analysis
  • Computer Science: Programming and algorithms
  • Domain Knowledge: Understanding the business problem
  • Communication: Presenting insights effectively

Data Science Process:

  1. Problem Definition: Understand the business problem
  2. Data Collection: Gather relevant data
  3. Data Preprocessing: Clean and prepare data
  4. Exploratory Data Analysis: Understand data patterns
  5. Feature Engineering: Create meaningful features
  6. Model Building: Build predictive or descriptive models
  7. Model Evaluation: Test and validate models
  8. Model Deployment: Implement models in production
  9. Monitoring: Track model performance
  10. Iteration: Continuously improve models
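
To make these steps concrete, here is a minimal sketch of steps 2 through 7 with scikit-learn. The file data.csv and its numeric target column are hypothetical, and all feature columns are assumed to be numeric; the rest of this guide covers each step in much more depth.

# Python example: minimal end-to-end sketch (hypothetical data.csv with a numeric 'target' column)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Collect and split the data
df = pd.read_csv('data.csv')
X = df.drop(columns=['target'])  # assumes all remaining columns are numeric
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess, train, and evaluate a baseline model in one pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
print('R^2 on held-out data:', r2_score(y_test, pipeline.predict(X_test)))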

Key Skills Required:

  • Programming: Python, R, SQL
  • Statistics: Statistical analysis and hypothesis testing
  • Machine Learning: Algorithms and model building
  • Data Visualization: Creating effective visualizations
  • Domain Knowledge: Understanding the business context
  • Communication: Presenting findings clearly

The Data Science Workflow

A structured workflow ensures reliable and reproducible results:

1. Problem Definition:

  • Understand the business problem
  • Define success metrics
  • Identify data requirements
  • Set project scope and timeline

2. Data Collection:

  • Identify data sources
  • Gather relevant data
  • Understand data structure
  • Document data sources

3. Data Preprocessing:

  • Handle missing values
  • Remove outliers
  • Fix inconsistencies
  • Transform data formats

4. Exploratory Data Analysis (EDA):

  • Understand data distributions
  • Identify patterns and relationships
  • Detect anomalies
  • Generate hypotheses

5. Feature Engineering:

  • Create new features
  • Transform existing features
  • Select relevant features
  • Encode categorical variables

6. Model Building:

  • Select appropriate algorithms
  • Train models
  • Tune hyperparameters
  • Compare model performance

7. Model Evaluation:

  • Test on unseen data
  • Calculate performance metrics
  • Validate model assumptions
  • Check for overfitting

8. Model Deployment:

  • Deploy to production
  • Create APIs
  • Monitor performance
  • Update models regularly

9. Monitoring and Maintenance:

  • Track model performance
  • Monitor data drift
  • Update models as needed
  • Retrain with new data

Data Collection and Acquisition

Collecting quality data is the foundation of successful data science:

1. Data Sources:

  • Databases: SQL and NoSQL databases
  • APIs: REST and GraphQL APIs
  • Files: CSV, JSON, Excel, Parquet
  • Web scraping: Extract data from websites
  • Streaming: Real-time data streams
  • Cloud storage: S3, Azure Blob Storage, GCS

2. Data Collection with Python:

# Python example: Reading data from various sources
import pandas as pd
import requests
import json
from sqlalchemy import create_engine

# Read from CSV
csv_data = pd.read_csv('data.csv')

# Read from JSON
with open('data.json', 'r') as f:
    json_data = json.load(f)
json_df = pd.DataFrame(json_data)

# Read from Excel
excel_data = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Read from SQL database
engine = create_engine('postgresql://user:password@localhost/db')
sql_data = pd.read_sql_query('SELECT * FROM table', engine)

# Read from API
response = requests.get('https://api.example.com/data')
api_data = pd.DataFrame(response.json())

# Read from Parquet (efficient for large datasets)
parquet_data = pd.read_parquet('data.parquet')

# Read from S3
import boto3
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='data.csv')
s3_data = pd.read_csv(obj['Body'])

# Python example: Web scraping with BeautifulSoup
from bs4 import BeautifulSoup
import requests
import pandas as pd

def scrape_website(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extract data
    data = []
    for item in soup.find_all('div', class_='item'):
        title = item.find('h2').text
        price = item.find('span', class_='price').text
        data.append({'title': title, 'price': price})
    
    return pd.DataFrame(data)

# Use Selenium for dynamic content
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_dynamic_website(url):
    driver = webdriver.Chrome()
    driver.get(url)
    
    # Wait for content to load
    driver.implicitly_wait(10)
    
    # Extract data
    elements = driver.find_elements(By.CLASS_NAME, 'item')
    data = [{'text': elem.text} for elem in elements]
    
    driver.quit()
    return pd.DataFrame(data)

# Assess data quality
import pandas as pd
import numpy as np

def assess_data_quality(df):
    quality_report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'duplicate_rows': df.duplicated().sum(),
        'data_types': df.dtypes.to_dict(),
        'memory_usage': df.memory_usage(deep=True).sum()
    }
    
    # Check for outliers
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        outliers = df[(df[col] < q1 - 1.5*iqr) | (df[col] > q3 + 1.5*iqr)]
        quality_report[f'{col}_outliers'] = len(outliers)
    
    return quality_report

Data Preprocessing and Cleaning

Data preprocessing is crucial for successful analysis:

# Python example: Handling missing values
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Check for missing values
def check_missing_values(df):
    missing = df.isnull().sum()
    missing_percent = (missing / len(df)) * 100
    return pd.DataFrame({
        'Missing Count': missing,
        'Missing Percentage': missing_percent
    })

# Strategy 1: Remove missing values
# Remove rows with any missing values
df_dropped = df.dropna()

# Remove rows where all values are missing
df_dropped = df.dropna(how='all')

# Remove columns with too many missing values
df_dropped = df.dropna(axis=1, thresh=int(len(df) * 0.5))  # Drop columns with more than 50% missing values

# Strategy 2: Fill missing values
# Fill with mean (for numeric columns)
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())

# Fill with median (more robust to outliers)
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())

# Fill with mode (for categorical columns)
df['categorical_col'] = df['categorical_col'].fillna(df['categorical_col'].mode()[0])

# Fill with forward fill or backward fill
df['col'] = df['col'].ffill()  # Forward fill
df['col'] = df['col'].bfill()  # Backward fill

# Strategy 3: Advanced imputation
# Using SimpleImputer
imputer = SimpleImputer(strategy='mean')
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Using KNN Imputer
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(knn_imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)

# Python example: Detecting and handling outliers
import pandas as pd
import numpy as np
from scipy import stats

def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers

def detect_outliers_zscore(df, column, threshold=3):
    z_scores = np.abs(stats.zscore(df[column]))
    outliers = df[z_scores > threshold]
    return outliers

# Remove outliers
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Cap outliers (winsorization)
def cap_outliers(df, column, lower_percentile=0.05, upper_percentile=0.95):
    lower_bound = df[column].quantile(lower_percentile)
    upper_bound = df[column].quantile(upper_percentile)
    
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)
    return df

# Python example: Data transformation
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Normalization (0-1 scaling)
scaler = MinMaxScaler()
df['normalized_col'] = scaler.fit_transform(df[['col']])

# Standardization (z-score normalization)
scaler = StandardScaler()
df['standardized_col'] = scaler.fit_transform(df[['col']])

# Robust scaling (uses median and IQR)
scaler = RobustScaler()
df['robust_scaled_col'] = scaler.fit_transform(df[['col']])

# Log transformation (for skewed data)
df['log_col'] = np.log1p(df['col'])  # log1p handles zeros

# Square root transformation
df['sqrt_col'] = np.sqrt(df['col'])

# Box-Cox transformation (requires strictly positive values)
from scipy import stats
df['boxcox_col'], fitted_lambda = stats.boxcox(df['col'])

# Python example: Encoding categorical variables
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Label Encoding (for ordinal data)
label_encoder = LabelEncoder()
df['encoded_col'] = label_encoder.fit_transform(df['categorical_col'])

# One-Hot Encoding (for nominal data)
df_encoded = pd.get_dummies(df, columns=['categorical_col'], prefix='cat')

# Using sklearn OneHotEncoder
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = onehot_encoder.fit_transform(df[['categorical_col']])
encoded_df = pd.DataFrame(encoded, columns=onehot_encoder.get_feature_names_out())

# Ordinal Encoding (for ordered categories)
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['ordinal_col'] = ordinal_encoder.fit_transform(df[['categorical_col']])

# Target Encoding (mean encoding)
def target_encode(df, categorical_col, target_col):
    target_mean = df.groupby(categorical_col)[target_col].mean()
    df[f'{categorical_col}_encoded'] = df[categorical_col].map(target_mean)
    return df

Exploratory Data Analysis (EDA)

EDA helps understand data characteristics and patterns:

# Python example: Descriptive statistics
import pandas as pd
import numpy as np

# Basic statistics
df.describe()  # Summary statistics for numeric columns

# Detailed statistics
def detailed_stats(df):
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    stats = pd.DataFrame({
        'mean': df[numeric_cols].mean(),
        'median': df[numeric_cols].median(),
        'std': df[numeric_cols].std(),
        'min': df[numeric_cols].min(),
        'max': df[numeric_cols].max(),
        'skewness': df[numeric_cols].skew(),
        'kurtosis': df[numeric_cols].kurtosis(),
        'missing': df[numeric_cols].isnull().sum()
    })
    return stats

# Categorical statistics
def categorical_stats(df, column):
    return df[column].value_counts()

# Python example: Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Set style
sns.set_style('darkgrid')
plt.style.use('seaborn-v0_8')

# Distribution plots
plt.figure(figsize=(10, 6))
sns.histplot(df['column'], kde=True)
plt.title('Distribution of Column')
plt.show()

# Box plots (for outlier detection)
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='column')
plt.title('Box Plot of Column')
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 8))
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()

# Scatter plots
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='x_col', y='y_col', hue='category_col')
plt.title('Scatter Plot')
plt.show()

# Pair plots (for multiple variables)
sns.pairplot(df.select_dtypes(include=[np.number]).iloc[:, :5])
plt.show()

# Categorical plots
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='categorical_col')
plt.title('Count Plot')
plt.xticks(rotation=45)
plt.show()

# Time series plots
plt.figure(figsize=(12, 6))
plt.plot(df['date'], df['value'])
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

# Python example: Advanced EDA
import pandas as pd
import numpy as np
from scipy import stats

# Statistical tests
def perform_statistical_tests(df, col1, col2):
    # T-test (for comparing means)
    t_stat, p_value = stats.ttest_ind(df[col1], df[col2])
    print(f'T-test: t={t_stat:.4f}, p={p_value:.4f}')
    
    # Chi-square test (for categorical data)
    contingency_table = pd.crosstab(df[col1], df[col2])
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
    print(f'Chi-square: chi2={chi2:.4f}, p={p_value:.4f}')
    
    # Correlation test
    correlation, p_value = stats.pearsonr(df[col1], df[col2])
    print(f'Correlation: r={correlation:.4f}, p={p_value:.4f}')

# Feature relationships
def analyze_feature_relationships(df):
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    
    # Correlation analysis
    correlation_matrix = df[numeric_cols].corr()
    
    # Find highly correlated features
    high_corr_pairs = []
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            if abs(correlation_matrix.iloc[i, j]) > 0.7:
                high_corr_pairs.append((
                    correlation_matrix.columns[i],
                    correlation_matrix.columns[j],
                    correlation_matrix.iloc[i, j]
                ))
    
    return high_corr_pairs

Feature Engineering

Feature engineering is crucial for model performance:

# Python example: Feature engineering
import pandas as pd
import numpy as np
from datetime import datetime

# Date features
def extract_date_features(df, date_column):
    df['year'] = pd.to_datetime(df[date_column]).dt.year
    df['month'] = pd.to_datetime(df[date_column]).dt.month
    df['day'] = pd.to_datetime(df[date_column]).dt.day
    df['day_of_week'] = pd.to_datetime(df[date_column]).dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    df['quarter'] = pd.to_datetime(df[date_column]).dt.quarter
    return df

# Mathematical transformations
def create_math_features(df):
    df['feature_sum'] = df['col1'] + df['col2']
    df['feature_product'] = df['col1'] * df['col2']
    df['feature_ratio'] = df['col1'] / (df['col2'] + 1)  # +1 to avoid division by zero
    df['feature_diff'] = df['col1'] - df['col2']
    df['feature_power'] = df['col1'] ** 2
    return df

# Binning (creating categorical features from numeric)
def create_bins(df, column, bins=5):
    df[f'{column}_binned'] = pd.cut(df[column], bins=bins, labels=False)
    return df

# Aggregation features
def create_aggregation_features(df, group_col, agg_col):
    grouped = df.groupby(group_col)[agg_col].agg(['mean', 'std', 'min', 'max'])
    grouped.columns = [f'{agg_col}_{stat}' for stat in grouped.columns]
    df = df.merge(grouped, left_on=group_col, right_index=True)
    return df

# Python example: Feature selection
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, chi2
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

# Univariate feature selection
def univariate_feature_selection(X, y, k=10):
    selector = SelectKBest(score_func=f_regression, k=k)
    X_selected = selector.fit_transform(X, y)
    selected_features = X.columns[selector.get_support()]
    return X_selected, selected_features

# Recursive Feature Elimination
def recursive_feature_elimination(X, y, n_features=10):
    model = RandomForestRegressor()
    rfe = RFE(estimator=model, n_features_to_select=n_features)
    X_selected = rfe.fit_transform(X, y)
    selected_features = X.columns[rfe.get_support()]
    return X_selected, selected_features

# Feature importance (using tree-based models)
def feature_importance_selection(X, y, n_features=10):
    model = RandomForestRegressor()
    model.fit(X, y)
    
    importance_df = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    selected_features = importance_df.head(n_features)['feature'].tolist()
    return X[selected_features], selected_features

# Correlation-based feature selection
def correlation_feature_selection(df, target_col, threshold=0.8):
    correlation = df.corr(numeric_only=True)[target_col].abs().sort_values(ascending=False)
    selected_features = correlation[correlation > threshold].index.tolist()
    selected_features.remove(target_col)
    return df[selected_features], selected_features

Model Building and Training

Building and training machine learning models:

# Python example: Data splitting
from sklearn.model_selection import train_test_split

# Basic train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train-validation-test split
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Time-based split (for time series)
def time_based_split(df, date_col, train_ratio=0.7):
    df_sorted = df.sort_values(date_col)
    split_idx = int(len(df_sorted) * train_ratio)
    train = df_sorted[:split_idx]
    test = df_sorted[split_idx:]
    return train, test

# Python example: Model training
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR, SVC
from sklearn.neural_network import MLPRegressor
import xgboost as xgb

# Regression models
def train_regression_models(X_train, y_train):
    models = {
        'linear_regression': LinearRegression(),
        'random_forest': RandomForestRegressor(n_estimators=100, random_state=42),
        'svr': SVR(kernel='rbf'),
        'xgboost': xgb.XGBRegressor(random_state=42)
    }
    
    trained_models = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        trained_models[name] = model
    
    return trained_models

# Classification models
def train_classification_models(X_train, y_train):
    models = {
        'logistic_regression': LogisticRegression(random_state=42),
        'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'svc': SVC(random_state=42),
        'xgboost': xgb.XGBClassifier(random_state=42)
    }
    
    trained_models = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        trained_models[name] = model
    
    return trained_models

# Python example: Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

# Grid Search
def grid_search_tuning(X_train, y_train):
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, 30, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    
    model = RandomForestRegressor(random_state=42)
    grid_search = GridSearchCV(
        model, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1
    )
    grid_search.fit(X_train, y_train)
    
    return grid_search.best_estimator_, grid_search.best_params_

# Randomized Search (faster for large parameter spaces)
def randomized_search_tuning(X_train, y_train):
    param_distributions = {
        'n_estimators': [100, 200, 300, 400, 500],
        'max_depth': [10, 20, 30, 40, None],
        'min_samples_split': [2, 5, 10, 15],
        'min_samples_leaf': [1, 2, 4, 8]
    }
    
    model = RandomForestRegressor(random_state=42)
    random_search = RandomizedSearchCV(
        model, param_distributions, n_iter=50, cv=5,
        scoring='neg_mean_squared_error', n_jobs=-1, random_state=42
    )
    random_search.fit(X_train, y_train)
    
    return random_search.best_estimator_, random_search.best_params_

Model Evaluation

Evaluating model performance is crucial for selecting the best model:

# Python example: Regression evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

def evaluate_regression_model(y_true, y_pred):
    metrics = {
        'MSE': mean_squared_error(y_true, y_pred),
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R2': r2_score(y_true, y_pred),
        'MAPE': np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    }
    return metrics

# Python example: Classification evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score
)

def evaluate_classification_model(y_true, y_pred, y_pred_proba=None):
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, average='weighted'),
        'recall': recall_score(y_true, y_pred, average='weighted'),
        'f1_score': f1_score(y_true, y_pred, average='weighted'),
        'confusion_matrix': confusion_matrix(y_true, y_pred)
    }
    
    if y_pred_proba is not None:
        metrics['roc_auc'] = roc_auc_score(y_true, y_pred_proba)
    
    return metrics

# Classification report
print(classification_report(y_true, y_pred))

# Python example: Cross-validation
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestRegressor

# K-Fold Cross-Validation
def kfold_cross_validation(X, y, model, k=5):
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
    return -scores.mean(), scores.std()

# Stratified K-Fold (for classification)
def stratified_kfold_cross_validation(X, y, model, k=5):
    skfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=skfold, scoring='accuracy')
    return scores.mean(), scores.std()

# Time Series Cross-Validation
def time_series_cross_validation(X, y, model, n_splits=5):
    from sklearn.model_selection import TimeSeriesSplit
    tscv = TimeSeriesSplit(n_splits=n_splits)
    scores = cross_val_score(model, X, y, cv=tscv, scoring='neg_mean_squared_error')
    return -scores.mean(), scores.std()

Model Deployment and Production

Deploying models to production requires careful planning:

# Python example: Model serialization
import pickle
import joblib
import json

# Using pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load model
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# Using joblib (better for scikit-learn models)
joblib.dump(model, 'model.joblib')
loaded_model = joblib.load('model.joblib')

# Save preprocessing pipeline
preprocessing_pipeline = {
    'scaler': scaler,
    'encoder': encoder,
    'imputer': imputer
}
joblib.dump(preprocessing_pipeline, 'preprocessing.joblib')

# Python example: FastAPI for model serving
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()

# Load model
model = joblib.load('model.joblib')
preprocessing = joblib.load('preprocessing.joblib')

# Define input schema
class PredictionRequest(BaseModel):
    feature1: float
    feature2: float
    feature3: float

@app.post('/predict')
def predict(request: PredictionRequest):
    try:
        # Preprocess input
        features = np.array([[request.feature1, request.feature2, request.feature3]])
        features_scaled = preprocessing['scaler'].transform(features)
        
        # Make prediction
        prediction = model.predict(features_scaled)[0]
        
        return {'prediction': float(prediction)}
    except Exception as e:
        raise HTTPException(status_code=400, detail=str(e))

@app.get('/health')
def health():
    return {'status': 'healthy'}

# Python example: Model monitoring
import pandas as pd
import numpy as np
from datetime import datetime

class ModelMonitor:
    def __init__(self):
        self.predictions = []
        self.actuals = []
        self.timestamps = []
    
    def log_prediction(self, prediction, actual=None):
        self.predictions.append(prediction)
        self.actuals.append(actual)
        self.timestamps.append(datetime.now())
    
    def calculate_drift(self, reference_data, current_data):
        # Calculate data drift using statistical tests
        from scipy import stats
        
        drift_scores = {}
        for col in reference_data.columns:
            stat, p_value = stats.ks_2samp(reference_data[col], current_data[col])
            drift_scores[col] = {'statistic': stat, 'p_value': p_value}
        
        return drift_scores
    
    def check_model_performance(self, threshold=1.0):
        # 'threshold' is an illustrative MSE cutoff; tune it for your use case
        if len(self.actuals) > 100:
            from sklearn.metrics import mean_squared_error
            mse = mean_squared_error(self.actuals[-100:], self.predictions[-100:])
            return {'mse': mse, 'status': 'healthy' if mse < threshold else 'degraded'}
        return {'status': 'insufficient_data'}

Data Science Tools and Libraries

Essential tools and libraries for data science:

1. Python Libraries:

Data Manipulation:

  • Pandas: Data manipulation and analysis
  • NumPy: Numerical computing
  • Polars: Fast DataFrame library

Machine Learning:

  • Scikit-learn: Machine learning algorithms
  • XGBoost: Gradient boosting framework
  • LightGBM: Fast gradient boosting
  • CatBoost: Categorical boosting

Deep Learning:

  • TensorFlow: Deep learning framework
  • PyTorch: Deep learning framework
  • Keras: High-level neural network API

Visualization:

  • Matplotlib: Basic plotting
  • Seaborn: Statistical visualization
  • Plotly: Interactive visualizations
  • Bokeh: Interactive web visualizations
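
As a quick illustration of interactive plotting, the sketch below uses Plotly Express; the column names are placeholders and df is assumed to be a DataFrame like those used in the EDA examples.

# Python example: interactive scatter plot with Plotly Express (illustrative)
import plotly.express as px

fig = px.scatter(df, x='x_col', y='y_col', color='category_col',
                 title='Interactive Scatter Plot')
fig.show()  # renders inline in a notebook or opens in the browser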

2. Jupyter Notebooks:

# Jupyter notebook best practices
# - Use markdown cells for documentation
# - Keep code cells focused and small
# - Use clear variable names
# - Document your analysis
# - Version control your notebooks
# - Use nbconvert to export to other formats

3. Data Science Workflow Tools:

  • MLflow: Model lifecycle management
  • Weights & Biases: Experiment tracking
  • DVC: Data version control
  • Kedro: Data science pipelines
  • Airflow: Workflow orchestration
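
As an illustration of experiment tracking, here is a minimal sketch using MLflow; it assumes MLflow is installed, the experiment name is hypothetical, and the train/test splits come from the earlier model-building section.

# Python example: logging a training run with MLflow (illustrative sketch)
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor

mlflow.set_experiment('my-experiment')  # hypothetical experiment name

with mlflow.start_run():
    params = {'n_estimators': 200, 'max_depth': 20, 'random_state': 42}
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)  # assumes splits from the model-building section

    mlflow.log_params(params)
    mlflow.log_metric('r2', model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, 'model')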

Best Practices and Tips

Best practices for successful data science projects:

1. Project Structure:

project/
  data/
    raw/
    processed/
    external/
  notebooks/
    exploration/
    modeling/
  src/
    data/
    features/
    models/
  models/
  reports/
  requirements.txt
  README.md

2. Code Quality:

  • Write clean, readable code
  • Use version control (Git)
  • Document your code
  • Write unit tests
  • Follow PEP 8 (Python style guide)
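
For example, a small unit test for the remove_outliers_iqr helper defined earlier could look like the sketch below; the module path src/data/cleaning.py is hypothetical and assumes pytest is installed.

# Python example: unit test for a preprocessing helper (pytest sketch)
import pandas as pd
from src.data.cleaning import remove_outliers_iqr  # hypothetical module path

def test_remove_outliers_iqr_drops_extreme_values():
    df = pd.DataFrame({'value': [1, 2, 3, 4, 5, 1000]})
    cleaned = remove_outliers_iqr(df, 'value')
    assert 1000 not in cleaned['value'].values
    assert len(cleaned) == 5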

3. Reproducibility:

  • Set random seeds
  • Document dependencies
  • Use virtual environments
  • Save preprocessing steps
  • Version your data
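
A small sketch of what this looks like in practice (the seed value and file paths are illustrative):

# Python example: reproducibility basics (illustrative)
import random
import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# Also pass random_state=SEED to train_test_split and scikit-learn estimators

# Persist fitted preprocessing objects alongside the model
scaler = StandardScaler().fit(X_train)  # assumes X_train from earlier sections
joblib.dump(scaler, 'models/scaler.joblib')

# Pin dependencies so the environment can be recreated, e.g.:
#   pip freeze > requirements.txt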

4. Communication:

  • Create clear visualizations
  • Write comprehensive reports
  • Present findings effectively
  • Document assumptions
  • Explain methodology

Conclusion

Data science is a powerful discipline that combines statistics, programming, and domain knowledge to extract insights from data. Success in data science requires a structured workflow, quality data, appropriate tools, and effective communication.

Key Takeaways:

  • Structured Workflow: Follow a systematic approach from problem definition to deployment
  • Data Quality: Invest time in data preprocessing and cleaning
  • Exploratory Analysis: Understand your data before modeling
  • Feature Engineering: Create meaningful features for better models
  • Model Evaluation: Use appropriate metrics and cross-validation
  • Deployment: Plan for production deployment from the start
  • Monitoring: Continuously monitor model performance
  • Iteration: Data science is an iterative process

Remember:

  • Start with simple models and iterate
  • Focus on understanding the problem
  • Quality data is more important than complex models
  • Communicate findings clearly
  • Keep learning and improving

By following the principles and practices outlined in this guide, you can build reliable, effective data science solutions that provide real value to your organization.
