Data Preparation Overview
Raw data contains errors, missing values, and outliers, and its features sit on different scales. You must prepare data before training models. Data preparation transforms raw data into clean features, and clean features improve model performance. Poor preparation produces poor models.
Data preparation includes collection, cleaning, transformation, and validation. You collect data from sources. You clean errors and inconsistencies. You transform features into usable formats. You validate data quality. Each step affects final model performance.
The workflow starts with raw data. You identify issues. You handle missing values. You remove or transform outliers. You normalize features. You validate results. The output is clean data ready for training.
Data Collection
You collect data from multiple sources. Databases store historical records. APIs provide real-time data. Files contain structured or unstructured data. Sensors capture measurements. Each source has different formats and quality levels.
Assess data quality early. Check completeness. Check accuracy. Check relevance. Identify missing values. Identify duplicates. Identify inconsistencies. Document data sources and collection methods. Track data lineage for reproducibility.
# Data Collection Example
import pandas as pd
import requests

# Collect from CSV file
df_csv = pd.read_csv('customer_data.csv')

# Collect from API
response = requests.get('https://api.example.com/data')
df_api = pd.DataFrame(response.json())

# Collect from database
import sqlite3
conn = sqlite3.connect('database.db')
df_db = pd.read_sql_query('SELECT * FROM customers', conn)

# Combine sources
df_combined = pd.concat([df_csv, df_api, df_db], ignore_index=True)
print("Total records: " + str(len(df_combined)))
# Result: Total records: 1500
-- NeuronDB: Data Collection
CREATE TABLE raw_customer_data AS
SELECT * FROM (
    SELECT id, name, email, age, income
    FROM csv_import('customers.csv')
    UNION ALL
    SELECT id, name, email, age, income
    FROM api_import('https://api.example.com/customers')
    UNION ALL
    SELECT id, name, email, age, income
    FROM external_db.customers
) AS combined_data;

SELECT COUNT(*) AS total_records FROM raw_customer_data;
-- Result:
--  total_records
-- ---------------
--           1500
-- (1 row)
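A quick quality check right after collection catches problems early. The sketch below is a minimal pandas example; it assumes the df_combined DataFrame from the Python example above and reports completeness, duplicates, and cardinality per column.

# Basic data quality assessment (assumes df_combined from the example above)
import pandas as pd

def assess_quality(df):
    report = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing_pct': df.isnull().mean() * 100,   # completeness per column
        'unique_values': df.nunique()              # helps spot inconsistent codes
    })
    duplicates = df.duplicated().sum()             # exact duplicate rows
    print("Duplicate rows: " + str(duplicates))
    return report

print(assess_quality(df_combined))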
Data collection requires planning. Define what data you need. Identify available sources. Assess access requirements. Plan collection schedules. Handle rate limits for APIs. Manage storage for large datasets. Ensure data privacy compliance.
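For APIs, a small collection loop with a pause between requests keeps you under rate limits. The sketch below is a minimal example; the endpoint, page parameter, and response format are assumptions for illustration.

# Sketch: paginated API collection with a simple rate-limit pause.
# The endpoint, page parameter, and JSON response format are illustrative.
import time
import requests
import pandas as pd

def collect_paginated(url, pages=5, pause_seconds=1.0):
    records = []
    for page in range(1, pages + 1):
        response = requests.get(url, params={'page': page})
        if response.status_code == 429:            # too many requests: back off and retry once
            time.sleep(10)
            response = requests.get(url, params={'page': page})
        response.raise_for_status()
        records.extend(response.json())
        time.sleep(pause_seconds)                  # stay under the API's rate limit
    return pd.DataFrame(records)

# df_api = collect_paginated('https://api.example.com/data')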
Handling Missing Values
Missing values appear as null, NaN, or empty fields. They occur from collection errors, optional fields, or data corruption. Missing values break many algorithms. You must handle them before training.
Three main approaches exist. Removal deletes rows or columns with missing values. Imputation fills missing values with estimates. Prediction uses models to predict missing values. Choose based on missing data amount and pattern.
The diagram shows different strategies. Complete case analysis removes all rows with any missing value. Mean imputation fills numeric missing values with column means. Mode imputation fills categorical missing values with most common values. Model-based imputation predicts missing values using other features.
# Missing Values Handling
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Sample data with missing values
data = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40, np.nan],
    'income': [50000, np.nan, 75000, 80000, np.nan, 90000],
    'city': ['NYC', 'SF', 'NYC', np.nan, 'LA', 'SF']
})

# Method 1: Remove rows with any missing value
df_removed = data.dropna()
print("After removal: " + str(len(df_removed)) + " rows")

# Keep a copy of the original numeric columns for the KNN method below
data_original_numeric = data[['age', 'income']].copy()

# Method 2: Mean imputation for numeric columns
numeric_imputer = SimpleImputer(strategy='mean')
data[['age', 'income']] = numeric_imputer.fit_transform(data[['age', 'income']])

# Method 3: Mode imputation for categorical columns
categorical_imputer = SimpleImputer(strategy='most_frequent')
data[['city']] = categorical_imputer.fit_transform(data[['city']])

# Method 4: KNN imputation (applied to the original numeric data,
# not the already mean-imputed copy)
knn_imputer = KNNImputer(n_neighbors=2)
data_imputed = pd.DataFrame(
    knn_imputer.fit_transform(data_original_numeric),
    columns=['age', 'income'])
print(data_imputed)
# Result:
# After removal: 1 rows
# (KNN fills each gap with the average of its two nearest neighbors,
#  measured on the features that are present in both rows.)
-- NeuronDB: Missing Values Handling
CREATE TABLE customer_data (
    id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    city VARCHAR(50)
);

INSERT INTO customer_data (age, income, city) VALUES
(25, 50000, 'NYC'),
(30, NULL, 'SF'),
(NULL, 75000, 'NYC'),
(35, 80000, NULL),
(40, NULL, 'LA'),
(NULL, 90000, 'SF');

-- Method 1: Remove rows with missing values
CREATE TABLE clean_data_removed AS
SELECT * FROM customer_data
WHERE age IS NOT NULL AND income IS NOT NULL AND city IS NOT NULL;

-- Method 2: Mean imputation
UPDATE customer_data
SET age = (SELECT ROUND(AVG(age)) FROM customer_data WHERE age IS NOT NULL)
WHERE age IS NULL;

UPDATE customer_data
SET income = (SELECT ROUND(AVG(income)) FROM customer_data WHERE income IS NOT NULL)
WHERE income IS NULL;

-- Method 3: Mode imputation for categorical (alphabetical tiebreak for determinism)
UPDATE customer_data
SET city = (SELECT city FROM customer_data
            WHERE city IS NOT NULL
            GROUP BY city
            ORDER BY COUNT(*) DESC, city
            LIMIT 1)
WHERE city IS NULL;

SELECT * FROM customer_data ORDER BY id;
-- Result:
--  id | age | income | city
-- ----+-----+--------+------
--   1 |  25 |  50000 | NYC
--   2 |  30 |  73750 | SF
--   3 |  33 |  75000 | NYC
--   4 |  35 |  80000 | NYC
--   5 |  40 |  73750 | LA
--   6 |  33 |  90000 | SF
-- (6 rows)
Missing value patterns matter. Missing completely at random means no pattern exists. Missing at random means pattern depends on observed data. Missing not at random means pattern depends on missing values themselves. Understanding patterns guides handling strategy.
Detailed Missing Value Analysis
Analyze missing value patterns before choosing a strategy. Check missing percentages per column. Identify correlations between missing values. Test if missingness depends on other features. Visualize missing patterns using heatmaps.
Missing completely at random occurs when probability of missing is independent of observed and unobserved data. Example: random data corruption. You can safely use deletion or simple imputation. Missing at random occurs when probability of missing depends only on observed data. Example: income missing more often for young people. You can use model-based imputation. Missing not at random occurs when probability of missing depends on missing values themselves. Example: high-income people less likely to report income. This requires specialized handling.
# Detailed Missing Value Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40, np.nan, 28, 32],
    'income': [50000, np.nan, 75000, 80000, np.nan, 90000, 55000, np.nan],
    'education': ['Bachelor', 'Master', 'Bachelor', np.nan, 'PhD', 'Master', np.nan, 'Bachelor'],
    'employed': [True, True, False, True, True, True, False, True]
})

# Calculate missing percentages
missing_pct = data.isnull().sum() / len(data) * 100
print("Missing percentages:")
print(missing_pct)

# Check if missing values correlate
missing_corr = data.isnull().corr()
print("Missing value correlations:")
print(missing_corr)

# Test if missingness depends on other features:
# check if income is missing more often for unemployed customers
income_missing_by_employment = data.groupby('employed')['income'].apply(lambda x: x.isnull().sum())
print("Income missing by employment:")
print(income_missing_by_employment)

# Visualize missing patterns
plt.figure(figsize=(10, 6))
sns.heatmap(data.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Value Patterns')
plt.show()
Advanced Imputation Techniques
Advanced imputation uses machine learning to predict missing values. Iterative imputation uses multiple models. Each feature with missing values becomes a target. Other features become inputs. Models predict missing values iteratively.
Multiple imputation creates several complete datasets. Each dataset has different imputed values. You train models on each dataset. You combine results to account for imputation uncertainty. This provides better uncertainty estimates.
# Advanced Imputation Techniques
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

# Numeric columns with missing values (same values as the previous example)
data_numeric = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40, np.nan, 28, 32],
    'income': [50000, np.nan, 75000, 80000, np.nan, 90000, 55000, np.nan]
})

# Iterative imputation
iterative_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42)
data_imputed = iterative_imputer.fit_transform(data_numeric)
print("Iteratively imputed data:")
print(data_imputed)

# Multiple imputation simulation
n_imputations = 5
imputed_datasets = []
for i in range(n_imputations):
    imputer = IterativeImputer(random_state=i)
    imputed = imputer.fit_transform(data_numeric)
    imputed_datasets.append(imputed)

# Average across imputations
final_imputed = np.mean(imputed_datasets, axis=0)
print("Multiple imputation average:")
print(final_imputed)
Missing Value Handling Best Practices
Choose strategy based on missing percentage. If less than 5% missing, deletion is acceptable. If 5-20% missing, use imputation. If more than 20% missing, consider if feature is necessary. Very high missing rates may indicate data quality issues.
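A small helper can apply these thresholds automatically. The sketch below is a minimal example; the cutoffs mirror the rule of thumb above and the sample columns are illustrative.

# Map per-column missing percentages to the rule-of-thumb strategies above.
import pandas as pd

def suggest_missing_strategy(df):
    missing_pct = df.isnull().mean() * 100
    def rule(pct):
        if pct == 0:
            return 'nothing to do'
        if pct < 5:
            return 'deletion acceptable'
        if pct <= 20:
            return 'impute'
        return 'review feature / data quality'
    return pd.DataFrame({'missing_pct': missing_pct,
                         'suggestion': missing_pct.map(rule)})

example = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, None],          # 10% missing -> impute
    'b': [1, None, None, None, 5, 6, 7, 8, 9, 10],   # 30% missing -> review
    'c': range(10)                                    # complete
})
print(suggest_missing_strategy(example))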
For numeric features, use mean or median for symmetric distributions. Use median for skewed distributions. Use KNN or iterative imputation for complex patterns. For categorical features, use mode for low cardinality. Use separate category for high cardinality. Consider if missingness itself is informative.
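When missingness is informative, keep it as a feature instead of hiding it. The sketch below is a minimal example: it adds an indicator column before imputing a numeric feature and uses an explicit 'Missing' category for a categorical feature; the sample data is illustrative.

# Keep the information that a value was missing: an indicator column for a
# numeric feature and an explicit 'Missing' category for a categorical feature.
import pandas as pd
import numpy as np

df = pd.DataFrame({'income': [50000, np.nan, 75000, np.nan],
                   'city': ['NYC', None, 'SF', None]})

df['income_was_missing'] = df['income'].isnull().astype(int)   # missingness flag
df['income'] = df['income'].fillna(df['income'].median())      # then impute
df['city'] = df['city'].fillna('Missing')                      # separate category

print(df)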
Always validate imputation quality. Compare distributions before and after. Check if imputed values are reasonable. Test model performance with imputed data. Document imputation methods for reproducibility.
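A minimal check along these lines: compare summary statistics and run a two-sample Kolmogorov-Smirnov test between the observed values and the imputed column. The sample values are illustrative.

# Compare a column's distribution before and after imputation.
import pandas as pd
import numpy as np
from scipy import stats

original = pd.Series([50000, np.nan, 75000, 80000, np.nan, 90000, 55000, np.nan])
imputed = original.fillna(original.median())

print("Before: mean=" + str(original.mean()) + ", std=" + str(original.std()))
print("After:  mean=" + str(imputed.mean()) + ", std=" + str(imputed.std()))

# Two-sample KS test between observed values and the imputed column; a very
# small p-value suggests the imputation changed the distribution noticeably.
ks_stat, p_value = stats.ks_2samp(original.dropna(), imputed)
print("KS statistic: " + str(ks_stat) + ", p-value: " + str(p_value))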
Outlier Detection and Treatment
Outliers are values far from the majority. They occur from measurement errors, data entry mistakes, or rare events. Outliers distort statistics and model training. You must identify and handle them appropriately.
Detection methods include statistical tests, distance measures, and visualization. Z-scores flag values beyond thresholds. Interquartile range identifies values outside quartile bounds. Isolation forests detect anomalies automatically. Visualization shows outliers in scatter plots.
Detailed Outlier Detection Methods
Z-score method calculates standardized scores. Z = (x - μ) / σ. Values with |Z| > 3 are outliers. This assumes normal distribution. It works well for symmetric data. It fails for skewed distributions.
Modified Z-score uses median and median absolute deviation. It is more robust to outliers. MAD = median(|x - median(x)|). Modified Z = 0.6745 × (x - median) / MAD. Values with |modified Z| > 3.5 are outliers.
Interquartile range method uses quartiles. Q1 is 25th percentile. Q3 is 75th percentile. IQR = Q3 - Q1. Lower bound = Q1 - 1.5×IQR. Upper bound = Q3 + 1.5×IQR. Values outside bounds are outliers. This method is distribution-free. It works for any distribution shape.
Isolation forest uses tree-based anomaly detection. It isolates outliers using random splits. Outliers require fewer splits to isolate. It works for high-dimensional data. It handles multiple outliers well.
# Detailed Outlier Detection
from sklearn.ensemble import IsolationForest
from scipy import stats
import numpy as np

data = np.array([10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13, 14, 15, 12])

# Method 1: Standard Z-score
z_scores = np.abs(stats.zscore(data))
outliers_z = np.where(z_scores > 3)[0]
print("Z-score outliers (indices): " + str(outliers_z))

# Method 2: Modified Z-score
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
outliers_modz = np.where(np.abs(modified_z) > 3.5)[0]
print("Modified Z-score outliers: " + str(outliers_modz))

# Method 3: IQR method
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers_iqr = np.where((data < lower) | (data > upper))[0]
print("IQR outliers: " + str(outliers_iqr))

# Method 4: Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outlier_labels = iso_forest.fit_predict(data.reshape(-1, 1))
outliers_iso = np.where(outlier_labels == -1)[0]
print("Isolation Forest outliers: " + str(outliers_iso))
Multivariate Outlier Detection
Multivariate outliers are unusual combinations of features. They may not be outliers in individual dimensions. Methods include Mahalanobis distance, local outlier factor, and DBSCAN clustering.
Mahalanobis distance measures distance from distribution center. It accounts for feature correlations. D = √((x - μ)ᵀ Σ⁻¹ (x - μ)). Large distances indicate outliers. Threshold typically uses chi-square distribution.
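The formula can be computed directly with NumPy. The sketch below builds 20 points on a noisy line plus one point far off it, computes squared Mahalanobis distances, and flags points beyond the 97.5% chi-square quantile for 2 degrees of freedom; the data is illustrative.

# Mahalanobis distance with a chi-square cutoff (2 features -> 2 degrees of freedom).
import numpy as np
from scipy import stats

# 20 inliers on a noisy line plus one point far off the line
x = np.linspace(-1, 1, 20)
y = 2 * x + 0.05 * np.tile([1, -1], 10)
X = np.column_stack([x, y])
X = np.vstack([X, [0.0, 10.0]])   # multivariate outlier: ordinary x, impossible y

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)   # squared distances

threshold = stats.chi2.ppf(0.975, df=X.shape[1])
print("Squared Mahalanobis distances: " + str(np.round(d2, 2)))
print("Outliers: " + str(np.where(d2 > threshold)[0]))
# Result: only the last point (index 20) exceeds the threshold.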
Local outlier factor compares local density. It identifies points whose density is lower than that of their neighbors. A score near 1 matches the neighbors' density; scores well above 1 indicate outliers, and higher values mean more anomalous. It works well for clusters with varying densities.
# Multivariate Outlier Detection
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
import numpy as np

# Sample 2D data
data_2d = np.array([
    [1, 2], [1.1, 2.1], [1.2, 2.2],
    [10, 20], [10.1, 20.1],
    [5, 100]  # Outlier
])

# Method 1: Elliptic Envelope (Mahalanobis-based)
envelope = EllipticEnvelope(contamination=0.1, random_state=42)
outlier_labels_env = envelope.fit_predict(data_2d)
outliers_env = np.where(outlier_labels_env == -1)[0]
print("Elliptic Envelope outliers: " + str(outliers_env))

# Method 2: Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=3, contamination=0.1)
outlier_labels_lof = lof.fit_predict(data_2d)
lof_scores = lof.negative_outlier_factor_
outliers_lof = np.where(outlier_labels_lof == -1)[0]
print("LOF outliers: " + str(outliers_lof))
print("LOF scores: " + str(lof_scores))
Outlier Treatment Strategies
Treatment depends on outlier cause and impact. Legitimate outliers represent rare events. They should be kept but handled carefully. Erroneous outliers should be removed or corrected.
Removal deletes outlier records. Use when outliers are errors. Use when outliers are few. Use when removal doesn't affect sample size significantly. Capping limits extreme values. Set values beyond thresholds to threshold values. Use when outliers are legitimate but extreme. Use when you want to preserve sample size.
Transformation reduces outlier impact. Log transformation compresses large values. Square root transformation moderates extremes. Box-Cox transformation normalizes distributions. Use when outliers are legitimate. Use when you want to preserve all data.
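These transformations are easy to apply in code. The sketch below is a minimal example with NumPy and SciPy; the sample values are illustrative, and the printed ratio is just one way to see the compression.

# Reduce the influence of extreme values with monotonic transformations.
import numpy as np
from scipy import stats

values = np.array([10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13], dtype=float)

log_values = np.log1p(values)                # log(1 + x) compresses large values
sqrt_values = np.sqrt(values)                # milder compression
boxcox_values, lam = stats.boxcox(values)    # data-driven power transform (positive values only)

print("Max/median ratio, raw: " + str(values.max() / np.median(values)))
print("Max/median ratio, log: " + str(log_values.max() / np.median(log_values)))
print("Box-Cox lambda: " + str(lam))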
Separate modeling treats outliers differently. Build models for normal and outlier cases. Use when outliers represent different populations. Use when outliers have different patterns.
The diagram shows outlier detection methods. Z-score method marks values beyond 3 standard deviations. IQR method marks values below Q1-1.5×IQR or above Q3+1.5×IQR. Isolation forest separates outliers using tree structures. Each method has different sensitivity and assumptions.
# Outlier Detection and Treatment
import pandas as pd
import numpy as np
from scipy import stats

# Sample data with outliers
data = pd.DataFrame({
    'value': [10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13]
})

# Method 1: Z-score detection
z_scores = np.abs(stats.zscore(data['value']))
outliers_z = data[z_scores > 3]
print("Z-score outliers: " + str(len(outliers_z)))

# Method 2: IQR method
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = data[(data['value'] < lower_bound) | (data['value'] > upper_bound)]
print("IQR outliers: " + str(len(outliers_iqr)))

# Method 3: Remove outliers
data_cleaned = data[(data['value'] >= lower_bound) & (data['value'] <= upper_bound)]

# Method 4: Cap outliers
data_capped = data.copy()
data_capped.loc[data_capped['value'] < lower_bound, 'value'] = lower_bound
data_capped.loc[data_capped['value'] > upper_bound, 'value'] = upper_bound

print("Original mean: " + str(data['value'].mean()))
print("Cleaned mean: " + str(data_cleaned['value'].mean()))
# Result:
# Z-score outliers: 0   (the two extreme values inflate the standard deviation,
#                        so neither crosses |z| > 3; this masking effect is a
#                        known weakness of the plain z-score on small samples)
# IQR outliers: 2
# Original mean: 35.5833...
# Cleaned mean: 12.7
-- NeuronDB: Outlier Detection
CREATE TABLE measurements (
    id SERIAL PRIMARY KEY,
    value NUMERIC
);

INSERT INTO measurements (value) VALUES
(10), (12), (11), (13), (15), (14), (100), (16), (12), (11), (200), (13);

-- Z-score detection
WITH stats AS (
    SELECT
        AVG(value) AS mean_val,
        STDDEV(value) AS std_val
    FROM measurements
)
SELECT m.id, m.value
FROM measurements m, stats s
WHERE ABS(m.value - s.mean_val) / NULLIF(s.std_val, 0) > 3;

-- IQR method
WITH quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM measurements
),
bounds AS (
    SELECT
        q1 - 1.5 * (q3 - q1) AS lower_bound,
        q3 + 1.5 * (q3 - q1) AS upper_bound
    FROM quartiles
)
SELECT m.id, m.value
FROM measurements m, bounds b
WHERE m.value < b.lower_bound OR m.value > b.upper_bound;

-- Remove outliers
CREATE TABLE cleaned_measurements AS
WITH quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM measurements
),
bounds AS (
    SELECT
        q1 - 1.5 * (q3 - q1) AS lower_bound,
        q3 + 1.5 * (q3 - q1) AS upper_bound
    FROM quartiles
)
SELECT m.*
FROM measurements m, bounds b
WHERE m.value >= b.lower_bound AND m.value <= b.upper_bound;
Outlier treatment depends on context. Removal works when outliers are errors. Capping limits extreme values to bounds. Transformation reduces outlier impact. Separate modeling handles legitimate rare cases. Domain knowledge guides appropriate treatment.
Normalization and Standardization
Features have different scales. Age ranges from 0 to 100 while income can range from 0 to 1,000,000. Distance-based algorithms treat features with larger numeric ranges as more important. Normalization and standardization make features comparable.
Normalization scales values to 0-1 range. Formula is (x - min) / (max - min). Standardization centers values around zero with unit variance. Formula is (x - mean) / std. Choose based on algorithm requirements.
The diagram shows scaling transformations. Original data has different ranges. Normalization maps all values to 0-1. Standardization centers at zero with unit spread. Both methods preserve relationships while making features comparable.
# Normalization and Standardization
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data with different scales
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 75000, 100000, 125000, 150000],
    'score': [0.5, 0.7, 0.8, 0.9, 1.0]
})

# Method 1: Min-Max Normalization (0-1 range)
scaler_minmax = MinMaxScaler()
data_normalized = pd.DataFrame(
    scaler_minmax.fit_transform(data),
    columns=data.columns)
print("Normalized data:")
print(data_normalized)

# Method 2: Standardization (zero mean, unit variance)
scaler_std = StandardScaler()
data_standardized = pd.DataFrame(
    scaler_std.fit_transform(data),
    columns=data.columns)
print("Standardized data:")
print(data_standardized.head())

# Result:
# Normalized data:
#     age  income  score
# 0  0.00    0.00    0.0
# 1  0.25    0.25    0.4
# 2  0.50    0.50    0.6
# 3  0.75    0.75    0.8
# 4  1.00    1.00    1.0
-- NeuronDB: Feature Scaling
CREATE TABLE customer_features (
    id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    score NUMERIC
);

INSERT INTO customer_features (age, income, score) VALUES
(25, 50000, 0.5),
(30, 75000, 0.7),
(35, 100000, 0.8),
(40, 125000, 0.9),
(45, 150000, 1.0);

-- Min-Max Normalization
SELECT
    id,
    (age - MIN(age) OVER ())::NUMERIC / NULLIF(MAX(age) OVER () - MIN(age) OVER (), 0) AS age_normalized,
    (income - MIN(income) OVER ())::NUMERIC / NULLIF(MAX(income) OVER () - MIN(income) OVER (), 0) AS income_normalized,
    (score - MIN(score) OVER ())::NUMERIC / NULLIF(MAX(score) OVER () - MIN(score) OVER (), 0) AS score_normalized
FROM customer_features;

-- Standardization
SELECT
    id,
    (age - AVG(age) OVER ())::NUMERIC / NULLIF(STDDEV(age) OVER (), 0) AS age_standardized,
    (income - AVG(income) OVER ())::NUMERIC / NULLIF(STDDEV(income) OVER (), 0) AS income_standardized,
    (score - AVG(score) OVER ())::NUMERIC / NULLIF(STDDEV(score) OVER (), 0) AS score_standardized
FROM customer_features;
Scaling requirements vary by algorithm. Distance-based algorithms need scaling. Neural networks require normalization. Tree-based algorithms are scale-invariant. Linear models benefit from standardization. Check algorithm documentation for requirements.
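A small illustration of why this matters for distance-based algorithms: the sketch below, using the same age and income ranges as the example above, measures how much each feature contributes to the Euclidean distance between the first and last rows, before and after standardization.

# Share of the squared Euclidean distance contributed by each feature,
# before and after standardization.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 50000], [30, 75000], [35, 100000], [40, 125000], [45, 150000]], dtype=float)

def feature_contribution(data):
    a, b = data[0], data[-1]      # compare the first and last rows
    sq = (a - b) ** 2
    return sq / sq.sum()          # fraction of squared distance per feature

print("Raw contribution [age, income]: " + str(feature_contribution(X)))
X_scaled = StandardScaler().fit_transform(X)
print("Scaled contribution [age, income]: " + str(feature_contribution(X_scaled)))
# Result: raw distance is almost 100% income; scaled distance is 50% age, 50% income.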
Feature Engineering
Feature engineering creates new features from existing data. It transforms raw inputs into useful representations. Good features improve model performance more than algorithm selection. Domain knowledge guides feature creation.
Common techniques include interaction features, polynomial features, time-based features, and text features. Interaction features combine multiple inputs. Polynomial features capture non-linear relationships. Time features extract temporal patterns. Text features convert words to numbers.
The diagram shows feature engineering transformations. Raw features include price and area. Interaction feature multiplies price and area. Polynomial feature squares area. Time feature extracts month from date. Each transformation captures different patterns.
# Feature Engineering
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Sample data
data = pd.DataFrame({
    'price': [100, 200, 300, 400],
    'area': [500, 1000, 1500, 2000],
    'date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05'])
})

# Interaction features
data['price_area'] = data['price'] * data['area']
data['price_per_area'] = data['price'] / data['area']

# Polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['price', 'area']])
data_poly = pd.DataFrame(
    poly_features,
    columns=['price', 'area', 'price^2', 'price×area', 'area^2'])

# Time-based features
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek
data['is_weekend'] = data['day_of_week'].isin([5, 6]).astype(int)

print("Engineered features:")
print(data[['price', 'area', 'price_area', 'price_per_area', 'month', 'is_weekend']])
# Result:
#    price  area  price_area  price_per_area  month  is_weekend
# 0    100   500       50000             0.2      1           0
# 1    200  1000      200000             0.2      2           0
# 2    300  1500      450000             0.2      3           1
# 3    400  2000      800000             0.2      4           0
-- NeuronDB: Feature Engineering
CREATE TABLE property_data (
    id SERIAL PRIMARY KEY,
    price NUMERIC,
    area NUMERIC,
    sale_date DATE
);

INSERT INTO property_data (price, area, sale_date) VALUES
(100000, 500, '2024-01-15'),
(200000, 1000, '2024-02-20'),
(300000, 1500, '2024-03-10'),
(400000, 2000, '2024-04-05');

-- Interaction features
SELECT
    id,
    price,
    area,
    price * area AS price_area,
    price / NULLIF(area, 0) AS price_per_area
FROM property_data;

-- Polynomial features
SELECT
    id,
    price,
    area,
    price * price AS price_squared,
    area * area AS area_squared,
    price * area AS price_area_interaction
FROM property_data;

-- Time-based features
SELECT
    id,
    sale_date,
    EXTRACT(YEAR FROM sale_date) AS year,
    EXTRACT(MONTH FROM sale_date) AS month,
    EXTRACT(DOW FROM sale_date) AS day_of_week,
    CASE WHEN EXTRACT(DOW FROM sale_date) IN (0, 6) THEN 1 ELSE 0 END AS is_weekend
FROM property_data;
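Text features, mentioned above, are not shown in the examples. The sketch below is a minimal scikit-learn version using bag-of-words counts and TF-IDF weights; the example sentences are illustrative.

# Text features: convert words to numbers with bag-of-words counts and TF-IDF.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = ["great product fast shipping",
           "terrible product never again",
           "fast shipping great price"]

counts = CountVectorizer().fit_transform(reviews)    # word counts per document
tfidf = TfidfVectorizer().fit_transform(reviews)     # counts reweighted by rarity

print("Vocabulary size: " + str(counts.shape[1]))
print("Count matrix shape: " + str(counts.shape))
print("TF-IDF matrix shape: " + str(tfidf.shape))
# Result: 8 unique words, so both matrices have shape (3, 8).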
Feature engineering requires iteration. Start with domain knowledge. Create candidate features. Test feature importance. Remove redundant features. Monitor model performance. Automated feature engineering tools exist but manual engineering often performs better.
Feature Selection
Feature selection reduces dimensionality. It removes irrelevant or redundant features. Fewer features mean faster training and less overfitting. Selection methods include filter, wrapper, and embedded approaches.
Filter methods use statistical tests. They rank features independently. Wrapper methods use model performance. They search feature subsets. Embedded methods use model internals. They select during training.
The diagram shows selection methods. Filter method scores each feature independently. Wrapper method tests feature subsets with models. Embedded method uses model weights or importance. Each method has different computational cost and effectiveness.
# Feature Selection
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Sample data (seeded so the run is reproducible)
np.random.seed(42)
X = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'feature3': np.random.randn(100),
    'feature4': np.random.randn(100),
    'feature5': np.random.randn(100)
})
y = (X['feature1'] + X['feature2'] > 0).astype(int)

# Method 1: Filter method (statistical test)
selector_filter = SelectKBest(score_func=f_classif, k=3)
X_selected_filter = selector_filter.fit_transform(X, y)
selected_features_filter = X.columns[selector_filter.get_support()]
print("Filter selected: " + str(list(selected_features_filter)))

# Method 2: Wrapper method (RFE)
estimator = LogisticRegression()
selector_rfe = RFE(estimator, n_features_to_select=3)
X_selected_rfe = selector_rfe.fit_transform(X, y)
selected_features_rfe = X.columns[selector_rfe.get_support()]
print("RFE selected: " + str(list(selected_features_rfe)))

# Method 3: Embedded method (feature importance)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
feature_importance = pd.Series(rf.feature_importances_, index=X.columns)
top_features = feature_importance.nlargest(3).index
print("Embedded selected: " + str(list(top_features)))

# Result: the informative feature1 and feature2 are selected by all three
# methods; the third selected feature is noise and varies between methods.
# Filter selected: ['feature1', 'feature2', ...]
# RFE selected: ['feature1', 'feature2', ...]
# Embedded selected: ['feature1', 'feature2', ...]
-- NeuronDB: Feature Selection
CREATE TABLE training_data (
    id SERIAL PRIMARY KEY,
    feature1 NUMERIC,
    feature2 NUMERIC,
    feature3 NUMERIC,
    feature4 NUMERIC,
    feature5 NUMERIC,
    target INTEGER
);

-- Statistical correlation for filter method
SELECT
    'feature1' AS feature,
    ABS(CORR(feature1, target)) AS correlation
FROM training_data
UNION ALL
SELECT 'feature2', ABS(CORR(feature2, target)) FROM training_data
UNION ALL
SELECT 'feature3', ABS(CORR(feature3, target)) FROM training_data
UNION ALL
SELECT 'feature4', ABS(CORR(feature4, target)) FROM training_data
UNION ALL
SELECT 'feature5', ABS(CORR(feature5, target)) FROM training_data
ORDER BY correlation DESC
LIMIT 3;

-- Feature importance from model
CREATE TEMP TABLE model_features AS
SELECT neurondb.train(
    'default',
    'random_forest',
    'training_data',
    'target',
    ARRAY['feature1', 'feature2', 'feature3', 'feature4', 'feature5'],
    '{"n_trees": 100}'::jsonb
)::integer AS model_id;

SELECT neurondb.feature_importance((SELECT model_id FROM model_features))
ORDER BY importance DESC
LIMIT 3;
Feature selection balances performance and complexity. More features can improve accuracy but increase overfitting risk. Fewer features reduce complexity but may miss important patterns. Use cross-validation to evaluate selection strategies.
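A minimal way to run this comparison, assuming the X and y from the Python feature selection example above: score a model with cross-validation on all features and on a selected subset.

# Compare all features against a selected subset with cross-validation.
# Assumes X and y from the feature selection example above.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
score_all = cross_val_score(model, X, y, cv=5).mean()
score_selected = cross_val_score(model, X[['feature1', 'feature2']], y, cv=5).mean()

print("CV accuracy, all features: " + str(round(score_all, 3)))
print("CV accuracy, selected features: " + str(round(score_selected, 3)))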
Data Validation
Data validation checks data quality after preparation. It verifies completeness, correctness, and consistency. Validation catches errors before training. It ensures data meets model requirements.
Validation checks include range validation, type validation, constraint validation, and relationship validation. Range validation ensures values fall within expected bounds. Type validation ensures correct data types. Constraint validation checks business rules. Relationship validation verifies referential integrity.
The diagram shows validation checks. Range check flags values outside 0-100. Type check flags non-numeric values. Constraint check flags invalid combinations. Relationship check flags orphaned records. Each check prevents different error types.
# Data Validation
import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({
    'age': [25, 30, 150, 35, -5],
    'income': [50000, 75000, 100000, 'invalid', 80000],
    'email': ['user@example.com', 'invalid-email', 'test@test.com', 'user@domain', 'valid@email.com']
})

# Range validation
def validate_range(df, column, min_val, max_val):
    invalid = df[(df[column] < min_val) | (df[column] > max_val)]
    return invalid

age_invalid = validate_range(data, 'age', 0, 120)
print("Invalid ages: " + str(len(age_invalid)))

# Type validation
def validate_type(df, column, expected_type):
    invalid = df[~df[column].apply(lambda x: isinstance(x, expected_type))]
    return invalid

income_invalid = validate_type(data, 'income', (int, float))
print("Invalid income types: " + str(len(income_invalid)))

# Pattern validation (email); note the escaped dot before the top-level domain
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
email_invalid = data[~data['email'].str.match(email_pattern)]
print("Invalid emails: " + str(len(email_invalid)))

# Result:
# Invalid ages: 2
# Invalid income types: 1
# Invalid emails: 2
-- NeuronDB: Data Validation
CREATE TABLE customer_data (
    id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    email VARCHAR(255)
);

INSERT INTO customer_data (age, income, email) VALUES
(25, 50000, 'user@example.com'),
(30, 75000, 'invalid-email'),
(150, 100000, 'test@test.com'),
(35, NULL, 'user@domain'),
(-5, 80000, 'valid@email.com');

-- Range validation
SELECT id, age, 'age out of range' AS validation_error
FROM customer_data
WHERE age < 0 OR age > 120;

-- Type validation
SELECT id, income, 'invalid income type' AS validation_error
FROM customer_data
WHERE income IS NULL OR income < 0;

-- Pattern validation (email)
SELECT id, email, 'invalid email format' AS validation_error
FROM customer_data
WHERE email !~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$';

-- Combined validation report
SELECT
    COUNT(*) AS total_records,
    COUNT(CASE WHEN age < 0 OR age > 120 THEN 1 END) AS invalid_age_count,
    COUNT(CASE WHEN income IS NULL OR income < 0 THEN 1 END) AS invalid_income_count,
    COUNT(CASE WHEN email !~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' THEN 1 END) AS invalid_email_count
FROM customer_data;
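The examples above cover single-table checks. Relationship validation compares tables. The sketch below is a minimal pandas version; the customers and orders tables and their keys are illustrative.

# Relationship validation: find orphaned records whose foreign key has no parent.
# The customers/orders tables and their keys are illustrative.
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3]})
orders = pd.DataFrame({'order_id': [10, 11, 12, 13],
                       'customer_id': [1, 2, 5, 3]})   # 5 has no matching customer

orphaned = orders[~orders['customer_id'].isin(customers['customer_id'])]
print("Orphaned orders: " + str(len(orphaned)))
print(orphaned)
# Result: Orphaned orders: 1 (the order referencing customer_id 5)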
Validation should be automated. Create validation rules early. Run validation after each preparation step. Document validation failures. Fix errors systematically. Re-validate after fixes. Maintain validation logs for auditing.
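A minimal sketch of such automation: a list of named rules, each returning the offending rows, run in a loop and collected into a log. The rule names, checks, and sample data are illustrative.

# Minimal automated validation runner with a failure log.
import pandas as pd

def run_validations(df, rules):
    log = []
    for name, check in rules:
        bad_rows = check(df)
        log.append({'rule': name, 'failures': len(bad_rows)})
    return pd.DataFrame(log)

rules = [
    ('age in 0-120', lambda d: d[(d['age'] < 0) | (d['age'] > 120)]),
    ('income non-negative', lambda d: d[d['income'] < 0]),
    ('no duplicate ids', lambda d: d[d.duplicated(subset='id')]),
]

df = pd.DataFrame({'id': [1, 2, 2, 4],
                   'age': [25, 150, 30, -5],
                   'income': [50000, 60000, -100, 70000]})
print(run_validations(df, rules))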
Complete Example: Customer Data Preparation
This example demonstrates complete data preparation workflow for customer data.
# Complete Data Preparation Example
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Step 1: Load raw data (synthetic, seeded for reproducibility)
np.random.seed(42)
raw_data = pd.DataFrame({
    'customer_id': range(1, 101),
    'age': np.where(np.random.random(100) < 0.1, np.nan, np.random.randint(18, 80, 100)),
    'income': np.where(np.random.random(100) < 0.15, np.nan, np.random.randint(20000, 150000, 100)),
    'purchase_amount': np.random.randn(100) * 500 + 1000,
    # use np.nan for missing cities so the imputer recognizes them as missing
    'city': np.random.choice(['NYC', 'SF', 'LA', 'Chicago', np.nan], 100, p=[0.3, 0.3, 0.2, 0.15, 0.05])
})

# Step 2: Handle missing values
numeric_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')
raw_data[['age', 'income']] = numeric_imputer.fit_transform(raw_data[['age', 'income']])
raw_data['city'] = categorical_imputer.fit_transform(raw_data[['city']]).ravel()

# Step 3: Remove outliers (IQR method)
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[column] >= lower) & (df[column] <= upper)].copy()

clean_data = remove_outliers_iqr(raw_data, 'purchase_amount')

# Step 4: Feature engineering
clean_data['age_group'] = pd.cut(clean_data['age'], bins=[0, 30, 50, 100], labels=['Young', 'Middle', 'Senior'])
clean_data['income_per_age'] = clean_data['income'] / clean_data['age']
clean_data['high_income'] = (clean_data['income'] > clean_data['income'].median()).astype(int)

# Step 5: Encode categorical variables
clean_data = pd.get_dummies(clean_data, columns=['city', 'age_group'], prefix=['city', 'age'])

# Step 6: Normalize numeric features
scaler = StandardScaler()
numeric_cols = ['age', 'income', 'purchase_amount', 'income_per_age']
clean_data[numeric_cols] = scaler.fit_transform(clean_data[numeric_cols])

# Step 7: Final validation
print("Final dataset shape: " + str(clean_data.shape))
print("Missing values: " + str(clean_data.isnull().sum().sum()))
print("Ready for training: " + str(clean_data.isnull().sum().sum() == 0))
# Result (row count depends on how many purchase amounts fall outside the
# IQR bounds; typically only a few rows are removed):
# Final dataset shape: (rows_kept, 13)
# Missing values: 0
# Ready for training: True
-- NeuronDB: Complete Data Preparation
CREATE TABLE raw_customers (
    customer_id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    purchase_amount NUMERIC,
    city VARCHAR(50)
);

-- Insert sample data with missing values and outliers
INSERT INTO raw_customers (age, income, purchase_amount, city)
SELECT
    CASE WHEN random() < 0.1 THEN NULL ELSE 18 + floor(random() * 62)::INTEGER END,
    CASE WHEN random() < 0.15 THEN NULL ELSE 20000 + floor(random() * 130000)::NUMERIC END,
    (random() * 1000 - 500)::NUMERIC,
    CASE WHEN random() < 0.05 THEN NULL
         ELSE (ARRAY['NYC', 'SF', 'LA', 'Chicago'])[1 + floor(random() * 4)::INTEGER] END
FROM generate_series(1, 100);

-- Step 1: Handle missing values
UPDATE raw_customers
SET age = COALESCE(age, (SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY age) FROM raw_customers WHERE age IS NOT NULL))
WHERE age IS NULL;

UPDATE raw_customers
SET income = COALESCE(income, (SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY income) FROM raw_customers WHERE income IS NOT NULL))
WHERE income IS NULL;

UPDATE raw_customers
SET city = COALESCE(city, (SELECT city FROM raw_customers
                           WHERE city IS NOT NULL
                           GROUP BY city
                           ORDER BY COUNT(*) DESC
                           LIMIT 1))
WHERE city IS NULL;

-- Step 2: Remove outliers
CREATE TABLE clean_customers AS
WITH quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY purchase_amount) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY purchase_amount) AS q3
    FROM raw_customers
),
bounds AS (
    SELECT q1 - 1.5 * (q3 - q1) AS lower, q3 + 1.5 * (q3 - q1) AS upper
    FROM quartiles
)
SELECT r.*
FROM raw_customers r, bounds b
WHERE r.purchase_amount >= b.lower AND r.purchase_amount <= b.upper;

-- Step 3: Feature engineering
ALTER TABLE clean_customers ADD COLUMN age_group VARCHAR(20);
ALTER TABLE clean_customers ADD COLUMN income_per_age NUMERIC;
ALTER TABLE clean_customers ADD COLUMN high_income INTEGER;

UPDATE clean_customers
SET age_group = CASE
        WHEN age < 30 THEN 'Young'
        WHEN age < 50 THEN 'Middle'
        ELSE 'Senior'
    END,
    income_per_age = income / NULLIF(age, 0),
    high_income = CASE WHEN income > (SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY income) FROM clean_customers) THEN 1 ELSE 0 END;

-- Step 4: One-hot encoding for categorical
CREATE TABLE prepared_customers AS
SELECT
    customer_id,
    age,
    income,
    purchase_amount,
    income_per_age,
    high_income,
    CASE WHEN city = 'NYC' THEN 1 ELSE 0 END AS city_nyc,
    CASE WHEN city = 'SF' THEN 1 ELSE 0 END AS city_sf,
    CASE WHEN city = 'LA' THEN 1 ELSE 0 END AS city_la,
    CASE WHEN city = 'Chicago' THEN 1 ELSE 0 END AS city_chicago,
    CASE WHEN age_group = 'Young' THEN 1 ELSE 0 END AS age_young,
    CASE WHEN age_group = 'Middle' THEN 1 ELSE 0 END AS age_middle,
    CASE WHEN age_group = 'Senior' THEN 1 ELSE 0 END AS age_senior
FROM clean_customers;

-- Step 5: Normalize numeric features
CREATE TABLE final_customers AS
SELECT
    customer_id,
    (age - AVG(age) OVER ()) / NULLIF(STDDEV(age) OVER (), 0) AS age_normalized,
    (income - AVG(income) OVER ()) / NULLIF(STDDEV(income) OVER (), 0) AS income_normalized,
    (purchase_amount - AVG(purchase_amount) OVER ()) / NULLIF(STDDEV(purchase_amount) OVER (), 0) AS purchase_normalized,
    (income_per_age - AVG(income_per_age) OVER ()) / NULLIF(STDDEV(income_per_age) OVER (), 0) AS income_per_age_normalized,
    high_income,
    city_nyc, city_sf, city_la, city_chicago,
    age_young, age_middle, age_senior
FROM prepared_customers;

SELECT COUNT(*) AS total_records,
       COUNT(*) FILTER (WHERE age_normalized IS NOT NULL) AS complete_records
FROM final_customers;
-- Result: total_records equals complete_records because no NULLs remain after
-- imputation; the exact count depends on how many rows the outlier filter removed.
Summary
Data preparation transforms raw data into clean features. You collect data from multiple sources. You handle missing values through removal, imputation, or prediction. You detect and treat outliers using statistical methods. You normalize or standardize features to comparable scales. You engineer new features from existing data. You select relevant features to reduce dimensionality. You validate data quality throughout the process. Each step improves model performance. Proper preparation requires domain knowledge and iterative refinement.