Data Preparation Overview
Raw data contains errors, missing values, and outliers, and its features sit on different scales. You must prepare data before training models. Data preparation transforms raw data into clean features, and clean features improve model performance. Poor preparation produces poor models.
Data preparation includes collection, cleaning, transformation, and validation. You collect data from sources. You clean errors and inconsistencies. You transform features into usable formats. You validate data quality. Each step affects final model performance.
The workflow starts with raw data. You identify issues. You handle missing values. You remove or transform outliers. You normalize features. You validate results. The output is clean data ready for training.
Data Collection
You collect data from multiple sources. Databases store historical records. APIs provide real-time data. Files contain structured or unstructured data. Sensors capture measurements. Each source has different formats and quality levels.
Assess data quality early. Check completeness. Check accuracy. Check relevance. Identify missing values. Identify duplicates. Identify inconsistencies. Document data sources and collection methods. Track data lineage for reproducibility.
# Data Collection Example
import pandas as pd
import requests

# Collect from CSV file
df_csv = pd.read_csv('customer_data.csv')

# Collect from API
response = requests.get('https://api.example.com/data')
df_api = pd.DataFrame(response.json())

# Collect from database
import sqlite3
conn = sqlite3.connect('database.db')
df_db = pd.read_sql_query('SELECT * FROM customers', conn)

# Combine sources
df_combined = pd.concat([df_csv, df_api, df_db], ignore_index=True)
print("Total records: " + str(len(df_combined)))
# Result: Total records: 1500
-- NeuronDB: Data Collection
CREATE TABLE raw_customer_data AS
SELECT * FROM (
    SELECT id, name, email, age, income
    FROM csv_import('customers.csv')
    UNION ALL
    SELECT id, name, email, age, income
    FROM api_import('https://api.example.com/customers')
    UNION ALL
    SELECT id, name, email, age, income
    FROM external_db.customers
) AS combined_data;

SELECT COUNT(*) AS total_records FROM raw_customer_data;
-- Result:
--  total_records
-- ---------------
--           1500
-- (1 row)
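A quick quality check right after collection catches problems early. The sketch below is a minimal pandas example; it assumes the df_combined DataFrame from the Python example above and reports completeness, duplicates, and cardinality per column.

# Basic data quality assessment (assumes df_combined from the example above)
import pandas as pd

def assess_quality(df):
    report = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing_pct': df.isnull().mean() * 100,   # completeness per column
        'unique_values': df.nunique()              # helps spot inconsistent codes
    })
    duplicates = df.duplicated().sum()             # exact duplicate rows
    print("Duplicate rows: " + str(duplicates))
    return report

print(assess_quality(df_combined))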
Data collection requires planning. Define what data you need. Identify available sources. Assess access requirements. Plan collection schedules. Handle rate limits for APIs. Manage storage for large datasets. Ensure data privacy compliance.
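For APIs, a small collection loop with a pause between requests keeps you under rate limits. The sketch below is a minimal example; the endpoint, page parameter, and response format are assumptions for illustration.

# Sketch: paginated API collection with a simple rate-limit pause.
# The endpoint, page parameter, and JSON response format are illustrative.
import time
import requests
import pandas as pd

def collect_paginated(url, pages=5, pause_seconds=1.0):
    records = []
    for page in range(1, pages + 1):
        response = requests.get(url, params={'page': page})
        if response.status_code == 429:            # too many requests: back off and retry once
            time.sleep(10)
            response = requests.get(url, params={'page': page})
        response.raise_for_status()
        records.extend(response.json())
        time.sleep(pause_seconds)                  # stay under the API's rate limit
    return pd.DataFrame(records)

# df_api = collect_paginated('https://api.example.com/data')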
Handling Missing Values
Missing values appear as null, NaN, or empty fields. They occur from collection errors, optional fields, or data corruption. Missing values break many algorithms. You must handle them before training.
Three main approaches exist. Removal deletes rows or columns with missing values. Imputation fills missing values with estimates. Prediction uses models to predict missing values. Choose based on missing data amount and pattern.
The diagram shows different strategies. Complete case analysis removes all rows with any missing value. Mean imputation fills numeric missing values with column means. Mode imputation fills categorical missing values with most common values. Model-based imputation predicts missing values using other features.
# Missing Values Handling
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Sample data with missing values
data = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40, np.nan],
    'income': [50000, np.nan, 75000, 80000, np.nan, 90000],
    'city': ['NYC', 'SF', 'NYC', np.nan, 'LA', 'SF']
})

# Method 1: Remove rows with any missing value
df_removed = data.dropna()
print("After removal: " + str(len(df_removed)) + " rows")

# Keep a copy of the original numeric columns for the KNN method below
data_original_numeric = data[['age', 'income']].copy()

# Method 2: Mean imputation for numeric columns
numeric_imputer = SimpleImputer(strategy='mean')
data[['age', 'income']] = numeric_imputer.fit_transform(data[['age', 'income']])

# Method 3: Mode imputation for categorical columns
categorical_imputer = SimpleImputer(strategy='most_frequent')
data[['city']] = categorical_imputer.fit_transform(data[['city']])

# Method 4: KNN imputation (applied to the original numeric data,
# not the already mean-imputed copy)
knn_imputer = KNNImputer(n_neighbors=2)
data_imputed = pd.DataFrame(
    knn_imputer.fit_transform(data_original_numeric),
    columns=['age', 'income'])
print(data_imputed)
# Result:
# After removal: 1 rows
# (KNN fills each gap with the average of its two nearest neighbors,
#  measured on the features that are present in both rows.)
-- NeuronDB: Missing Values Handling
CREATE TABLE customer_data (
    id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    city VARCHAR(50)
);

INSERT INTO customer_data (age, income, city) VALUES
(25, 50000, 'NYC'),
(30, NULL, 'SF'),
(NULL, 75000, 'NYC'),
(35, 80000, NULL),
(40, NULL, 'LA'),
(NULL, 90000, 'SF');

-- Method 1: Remove rows with missing values
CREATE TABLE clean_data_removed AS
SELECT * FROM customer_data
WHERE age IS NOT NULL AND income IS NOT NULL AND city IS NOT NULL;

-- Method 2: Mean imputation
UPDATE customer_data
SET age = (SELECT ROUND(AVG(age)) FROM customer_data WHERE age IS NOT NULL)
WHERE age IS NULL;

UPDATE customer_data
SET income = (SELECT ROUND(AVG(income)) FROM customer_data WHERE income IS NOT NULL)
WHERE income IS NULL;

-- Method 3: Mode imputation for categorical (alphabetical tiebreak for determinism)
UPDATE customer_data
SET city = (SELECT city FROM customer_data
            WHERE city IS NOT NULL
            GROUP BY city
            ORDER BY COUNT(*) DESC, city
            LIMIT 1)
WHERE city IS NULL;

SELECT * FROM customer_data ORDER BY id;
-- Result:
--  id | age | income | city
-- ----+-----+--------+------
--   1 |  25 |  50000 | NYC
--   2 |  30 |  73750 | SF
--   3 |  33 |  75000 | NYC
--   4 |  35 |  80000 | NYC
--   5 |  40 |  73750 | LA
--   6 |  33 |  90000 | SF
-- (6 rows)
Missing value patterns matter. Missing completely at random means no pattern exists. Missing at random means pattern depends on observed data. Missing not at random means pattern depends on missing values themselves. Understanding patterns guides handling strategy.
Detailed Missing Value Analysis
Analyze missing value patterns before choosing a strategy. Check missing percentages per column. Identify correlations between missing values. Test if missingness depends on other features. Visualize missing patterns using heatmaps.
Missing completely at random occurs when probability of missing is independent of observed and unobserved data. Example: random data corruption. You can safely use deletion or simple imputation. Missing at random occurs when probability of missing depends only on observed data. Example: income missing more often for young people. You can use model-based imputation. Missing not at random occurs when probability of missing depends on missing values themselves. Example: high-income people less likely to report income. This requires specialized handling.
# Detailed Missing Value Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40, np.nan, 28, 32],
    'income': [50000, np.nan, 75000, 80000, np.nan, 90000, 55000, np.nan],
    'education': ['Bachelor', 'Master', 'Bachelor', np.nan, 'PhD', 'Master', np.nan, 'Bachelor'],
    'employed': [True, True, False, True, True, True, False, True]
})

# Calculate missing percentages
missing_pct = data.isnull().sum() / len(data) * 100
print("Missing percentages:")
print(missing_pct)

# Check if missing values correlate
missing_corr = data.isnull().corr()
print("Missing value correlations:")
print(missing_corr)

# Test if missingness depends on other features:
# check if income is missing more often for unemployed customers
income_missing_by_employment = data.groupby('employed')['income'].apply(lambda x: x.isnull().sum())
print("Income missing by employment:")
print(income_missing_by_employment)

# Visualize missing patterns
plt.figure(figsize=(10, 6))
sns.heatmap(data.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Value Patterns')
plt.show()
Advanced Imputation Techniques
Advanced imputation uses machine learning to predict missing values. Iterative imputation uses multiple models. Each feature with missing values becomes a target. Other features become inputs. Models predict missing values iteratively.
Multiple imputation creates several complete datasets. Each dataset has different imputed values. You train models on each dataset. You combine results to account for imputation uncertainty. This provides better uncertainty estimates.
# Advanced Imputation Techniques
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

# Numeric columns with missing values (same values as the previous example)
data_numeric = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40, np.nan, 28, 32],
    'income': [50000, np.nan, 75000, 80000, np.nan, 90000, 55000, np.nan]
})

# Iterative imputation
iterative_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42)
data_imputed = iterative_imputer.fit_transform(data_numeric)
print("Iteratively imputed data:")
print(data_imputed)

# Multiple imputation simulation
n_imputations = 5
imputed_datasets = []
for i in range(n_imputations):
    imputer = IterativeImputer(random_state=i)
    imputed = imputer.fit_transform(data_numeric)
    imputed_datasets.append(imputed)

# Average across imputations
final_imputed = np.mean(imputed_datasets, axis=0)
print("Multiple imputation average:")
print(final_imputed)
Missing Value Handling Best Practices
Choose strategy based on missing percentage. If less than 5% missing, deletion is acceptable. If 5-20% missing, use imputation. If more than 20% missing, consider if feature is necessary. Very high missing rates may indicate data quality issues.
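A small helper can apply these thresholds automatically. The sketch below is a minimal example; the cutoffs mirror the rule of thumb above and the sample columns are illustrative.

# Map per-column missing percentages to the rule-of-thumb strategies above.
import pandas as pd

def suggest_missing_strategy(df):
    missing_pct = df.isnull().mean() * 100
    def rule(pct):
        if pct == 0:
            return 'nothing to do'
        if pct < 5:
            return 'deletion acceptable'
        if pct <= 20:
            return 'impute'
        return 'review feature / data quality'
    return pd.DataFrame({'missing_pct': missing_pct,
                         'suggestion': missing_pct.map(rule)})

example = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, None],          # 10% missing -> impute
    'b': [1, None, None, None, 5, 6, 7, 8, 9, 10],   # 30% missing -> review
    'c': range(10)                                    # complete
})
print(suggest_missing_strategy(example))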
For numeric features, use mean or median for symmetric distributions. Use median for skewed distributions. Use KNN or iterative imputation for complex patterns. For categorical features, use mode for low cardinality. Use separate category for high cardinality. Consider if missingness itself is informative.
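When missingness is informative, keep it as a feature instead of hiding it. The sketch below is a minimal example: it adds an indicator column before imputing a numeric feature and uses an explicit 'Missing' category for a categorical feature; the sample data is illustrative.

# Keep the information that a value was missing: an indicator column for a
# numeric feature and an explicit 'Missing' category for a categorical feature.
import pandas as pd
import numpy as np

df = pd.DataFrame({'income': [50000, np.nan, 75000, np.nan],
                   'city': ['NYC', None, 'SF', None]})

df['income_was_missing'] = df['income'].isnull().astype(int)   # missingness flag
df['income'] = df['income'].fillna(df['income'].median())      # then impute
df['city'] = df['city'].fillna('Missing')                      # separate category

print(df)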
Always validate imputation quality. Compare distributions before and after. Check if imputed values are reasonable. Test model performance with imputed data. Document imputation methods for reproducibility.
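A minimal check along these lines: compare summary statistics and run a two-sample Kolmogorov-Smirnov test between the observed values and the imputed column. The sample values are illustrative.

# Compare a column's distribution before and after imputation.
import pandas as pd
import numpy as np
from scipy import stats

original = pd.Series([50000, np.nan, 75000, 80000, np.nan, 90000, 55000, np.nan])
imputed = original.fillna(original.median())

print("Before: mean=" + str(original.mean()) + ", std=" + str(original.std()))
print("After:  mean=" + str(imputed.mean()) + ", std=" + str(imputed.std()))

# Two-sample KS test between observed values and the imputed column; a very
# small p-value suggests the imputation changed the distribution noticeably.
ks_stat, p_value = stats.ks_2samp(original.dropna(), imputed)
print("KS statistic: " + str(ks_stat) + ", p-value: " + str(p_value))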
Outlier Detection and Treatment
Outliers are values far from the majority. They occur from measurement errors, data entry mistakes, or rare events. Outliers distort statistics and model training. You must identify and handle them appropriately.
Detection methods include statistical tests, distance measures, and visualization. Z-scores flag values beyond thresholds. Interquartile range identifies values outside quartile bounds. Isolation forests detect anomalies automatically. Visualization shows outliers in scatter plots.
Detailed Outlier Detection Methods
Z-score method calculates standardized scores. Z = (x - μ) / σ. Values with |Z| > 3 are outliers. This assumes normal distribution. It works well for symmetric data. It fails for skewed distributions.
Modified Z-score uses median and median absolute deviation. It is more robust to outliers. MAD = median(|x - median(x)|). Modified Z = 0.6745 × (x - median) / MAD. Values with |modified Z| > 3.5 are outliers.
Interquartile range method uses quartiles. Q1 is 25th percentile. Q3 is 75th percentile. IQR = Q3 - Q1. Lower bound = Q1 - 1.5×IQR. Upper bound = Q3 + 1.5×IQR. Values outside bounds are outliers. This method is distribution-free. It works for any distribution shape.
Isolation forest uses tree-based anomaly detection. It isolates outliers using random splits. Outliers require fewer splits to isolate. It works for high-dimensional data. It handles multiple outliers well.
# Detailed Outlier Detection
from sklearn.ensemble import IsolationForest
from scipy import stats
import numpy as np

data = np.array([10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13, 14, 15, 12])

# Method 1: Standard Z-score
z_scores = np.abs(stats.zscore(data))
outliers_z = np.where(z_scores > 3)[0]
print("Z-score outliers (indices): " + str(outliers_z))

# Method 2: Modified Z-score
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
outliers_modz = np.where(np.abs(modified_z) > 3.5)[0]
print("Modified Z-score outliers: " + str(outliers_modz))

# Method 3: IQR method
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers_iqr = np.where((data < lower) | (data > upper))[0]
print("IQR outliers: " + str(outliers_iqr))

# Method 4: Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
outlier_labels = iso_forest.fit_predict(data.reshape(-1, 1))
outliers_iso = np.where(outlier_labels == -1)[0]
print("Isolation Forest outliers: " + str(outliers_iso))
Multivariate Outlier Detection
Multivariate outliers are unusual combinations of features. They may not be outliers in individual dimensions. Methods include Mahalanobis distance, local outlier factor, and DBSCAN clustering.
Mahalanobis distance measures distance from distribution center. It accounts for feature correlations. D = √((x - μ)ᵀ Σ⁻¹ (x - μ)). Large distances indicate outliers. Threshold typically uses chi-square distribution.
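The formula can be computed directly with NumPy. The sketch below builds 20 points on a noisy line plus one point far off it, computes squared Mahalanobis distances, and flags points beyond the 97.5% chi-square quantile for 2 degrees of freedom; the data is illustrative.

# Mahalanobis distance with a chi-square cutoff (2 features -> 2 degrees of freedom).
import numpy as np
from scipy import stats

# 20 inliers on a noisy line plus one point far off the line
x = np.linspace(-1, 1, 20)
y = 2 * x + 0.05 * np.tile([1, -1], 10)
X = np.column_stack([x, y])
X = np.vstack([X, [0.0, 10.0]])   # multivariate outlier: ordinary x, impossible y

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)   # squared distances

threshold = stats.chi2.ppf(0.975, df=X.shape[1])
print("Squared Mahalanobis distances: " + str(np.round(d2, 2)))
print("Outliers: " + str(np.where(d2 > threshold)[0]))
# Result: only the last point (index 20) exceeds the threshold.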
Local outlier factor compares local density. It identifies points whose density is lower than that of their neighbors. A score near 1 matches the neighbors' density; scores well above 1 indicate outliers, and higher values mean more anomalous. It works well for clusters with varying densities.
# Multivariate Outlier Detection
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
import numpy as np

# Sample 2D data
data_2d = np.array([
    [1, 2], [1.1, 2.1], [1.2, 2.2],
    [10, 20], [10.1, 20.1],
    [5, 100]  # Outlier
])

# Method 1: Elliptic Envelope (Mahalanobis-based)
envelope = EllipticEnvelope(contamination=0.1, random_state=42)
outlier_labels_env = envelope.fit_predict(data_2d)
outliers_env = np.where(outlier_labels_env == -1)[0]
print("Elliptic Envelope outliers: " + str(outliers_env))

# Method 2: Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=3, contamination=0.1)
outlier_labels_lof = lof.fit_predict(data_2d)
lof_scores = lof.negative_outlier_factor_
outliers_lof = np.where(outlier_labels_lof == -1)[0]
print("LOF outliers: " + str(outliers_lof))
print("LOF scores: " + str(lof_scores))
Outlier Treatment Strategies
Treatment depends on outlier cause and impact. Legitimate outliers represent rare events. They should be kept but handled carefully. Erroneous outliers should be removed or corrected.
Removal deletes outlier records. Use when outliers are errors. Use when outliers are few. Use when removal doesn't affect sample size significantly. Capping limits extreme values. Set values beyond thresholds to threshold values. Use when outliers are legitimate but extreme. Use when you want to preserve sample size.
Transformation reduces outlier impact. Log transformation compresses large values. Square root transformation moderates extremes. Box-Cox transformation normalizes distributions. Use when outliers are legitimate. Use when you want to preserve all data.
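These transformations are easy to apply in code. The sketch below is a minimal example with NumPy and SciPy; the sample values are illustrative, and the printed ratio is just one way to see the compression.

# Reduce the influence of extreme values with monotonic transformations.
import numpy as np
from scipy import stats

values = np.array([10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13], dtype=float)

log_values = np.log1p(values)                # log(1 + x) compresses large values
sqrt_values = np.sqrt(values)                # milder compression
boxcox_values, lam = stats.boxcox(values)    # data-driven power transform (positive values only)

print("Max/median ratio, raw: " + str(values.max() / np.median(values)))
print("Max/median ratio, log: " + str(log_values.max() / np.median(log_values)))
print("Box-Cox lambda: " + str(lam))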
Separate modeling treats outliers differently. Build models for normal and outlier cases. Use when outliers represent different populations. Use when outliers have different patterns.
The diagram shows outlier detection methods. Z-score method marks values beyond 3 standard deviations. IQR method marks values below Q1-1.5×IQR or above Q3+1.5×IQR. Isolation forest separates outliers using tree structures. Each method has different sensitivity and assumptions.
# Outlier Detection and Treatment
import pandas as pd
import numpy as np
from scipy import stats

# Sample data with outliers
data = pd.DataFrame({
    'value': [10, 12, 11, 13, 15, 14, 100, 16, 12, 11, 200, 13]
})

# Method 1: Z-score detection
z_scores = np.abs(stats.zscore(data['value']))
outliers_z = data[z_scores > 3]
print("Z-score outliers: " + str(len(outliers_z)))

# Method 2: IQR method
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = data[(data['value'] < lower_bound) | (data['value'] > upper_bound)]
print("IQR outliers: " + str(len(outliers_iqr)))

# Method 3: Remove outliers
data_cleaned = data[(data['value'] >= lower_bound) & (data['value'] <= upper_bound)]

# Method 4: Cap outliers
data_capped = data.copy()
data_capped.loc[data_capped['value'] < lower_bound, 'value'] = lower_bound
data_capped.loc[data_capped['value'] > upper_bound, 'value'] = upper_bound

print("Original mean: " + str(data['value'].mean()))
print("Cleaned mean: " + str(data_cleaned['value'].mean()))
# Result:
# Z-score outliers: 0   (the two extreme values inflate the standard deviation,
#                        so neither crosses |z| > 3; this masking effect is a
#                        known weakness of the plain z-score on small samples)
# IQR outliers: 2
# Original mean: 35.5833...
# Cleaned mean: 12.7
-- NeuronDB: Outlier Detection
CREATE TABLE measurements (
    id SERIAL PRIMARY KEY,
    value NUMERIC
);

INSERT INTO measurements (value) VALUES
(10), (12), (11), (13), (15), (14), (100), (16), (12), (11), (200), (13);

-- Z-score detection
WITH stats AS (
    SELECT
        AVG(value) AS mean_val,
        STDDEV(value) AS std_val
    FROM measurements
)
SELECT m.id, m.value
FROM measurements m, stats s
WHERE ABS(m.value - s.mean_val) / NULLIF(s.std_val, 0) > 3;

-- IQR method
WITH quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM measurements
),
bounds AS (
    SELECT
        q1 - 1.5 * (q3 - q1) AS lower_bound,
        q3 + 1.5 * (q3 - q1) AS upper_bound
    FROM quartiles
)
SELECT m.id, m.value
FROM measurements m, bounds b
WHERE m.value < b.lower_bound OR m.value > b.upper_bound;

-- Remove outliers
CREATE TABLE cleaned_measurements AS
WITH quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM measurements
),
bounds AS (
    SELECT
        q1 - 1.5 * (q3 - q1) AS lower_bound,
        q3 + 1.5 * (q3 - q1) AS upper_bound
    FROM quartiles
)
SELECT m.*
FROM measurements m, bounds b
WHERE m.value >= b.lower_bound AND m.value <= b.upper_bound;
Outlier treatment depends on context. Removal works when outliers are errors. Capping limits extreme values to bounds. Transformation reduces outlier impact. Separate modeling handles legitimate rare cases. Domain knowledge guides appropriate treatment.
Normalization and Standardization
Features have different scales. Age ranges from 0 to 100 while income can range from 0 to 1,000,000. Distance-based algorithms treat features with larger numeric ranges as more important. Normalization and standardization make features comparable.
Normalization scales values to 0-1 range. Formula is (x - min) / (max - min). Standardization centers values around zero with unit variance. Formula is (x - mean) / std. Choose based on algorithm requirements.
The diagram shows scaling transformations. Original data has different ranges. Normalization maps all values to 0-1. Standardization centers at zero with unit spread. Both methods preserve relationships while making features comparable.
# Normalization and Standardization
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data with different scales
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [50000, 75000, 100000, 125000, 150000],
    'score': [0.5, 0.7, 0.8, 0.9, 1.0]
})

# Method 1: Min-Max Normalization (0-1 range)
scaler_minmax = MinMaxScaler()
data_normalized = pd.DataFrame(
    scaler_minmax.fit_transform(data),
    columns=data.columns)
print("Normalized data:")
print(data_normalized)

# Method 2: Standardization (zero mean, unit variance)
scaler_std = StandardScaler()
data_standardized = pd.DataFrame(
    scaler_std.fit_transform(data),
    columns=data.columns)
print("Standardized data:")
print(data_standardized.head())

# Result:
# Normalized data:
#     age  income  score
# 0  0.00    0.00    0.0
# 1  0.25    0.25    0.4
# 2  0.50    0.50    0.6
# 3  0.75    0.75    0.8
# 4  1.00    1.00    1.0
-- NeuronDB: Feature Scaling
CREATE TABLE customer_features (
    id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    score NUMERIC
);

INSERT INTO customer_features (age, income, score) VALUES
(25, 50000, 0.5),
(30, 75000, 0.7),
(35, 100000, 0.8),
(40, 125000, 0.9),
(45, 150000, 1.0);

-- Min-Max Normalization
SELECT
    id,
    (age - MIN(age) OVER ())::NUMERIC / NULLIF(MAX(age) OVER () - MIN(age) OVER (), 0) AS age_normalized,
    (income - MIN(income) OVER ())::NUMERIC / NULLIF(MAX(income) OVER () - MIN(income) OVER (), 0) AS income_normalized,
    (score - MIN(score) OVER ())::NUMERIC / NULLIF(MAX(score) OVER () - MIN(score) OVER (), 0) AS score_normalized
FROM customer_features;

-- Standardization
SELECT
    id,
    (age - AVG(age) OVER ())::NUMERIC / NULLIF(STDDEV(age) OVER (), 0) AS age_standardized,
    (income - AVG(income) OVER ())::NUMERIC / NULLIF(STDDEV(income) OVER (), 0) AS income_standardized,
    (score - AVG(score) OVER ())::NUMERIC / NULLIF(STDDEV(score) OVER (), 0) AS score_standardized
FROM customer_features;
Scaling requirements vary by algorithm. Distance-based algorithms need scaling. Neural networks require normalization. Tree-based algorithms are scale-invariant. Linear models benefit from standardization. Check algorithm documentation for requirements.
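A small illustration of why this matters for distance-based algorithms: the sketch below, using the same age and income ranges as the example above, measures how much each feature contributes to the Euclidean distance between the first and last rows, before and after standardization.

# Share of the squared Euclidean distance contributed by each feature,
# before and after standardization.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 50000], [30, 75000], [35, 100000], [40, 125000], [45, 150000]], dtype=float)

def feature_contribution(data):
    a, b = data[0], data[-1]      # compare the first and last rows
    sq = (a - b) ** 2
    return sq / sq.sum()          # fraction of squared distance per feature

print("Raw contribution [age, income]: " + str(feature_contribution(X)))
X_scaled = StandardScaler().fit_transform(X)
print("Scaled contribution [age, income]: " + str(feature_contribution(X_scaled)))
# Result: raw distance is almost 100% income; scaled distance is 50% age, 50% income.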
Feature Engineering
Feature engineering creates new features from existing data. It transforms raw inputs into useful representations. Good features improve model performance more than algorithm selection. Domain knowledge guides feature creation.
Common techniques include interaction features, polynomial features, time-based features, and text features. Interaction features combine multiple inputs. Polynomial features capture non-linear relationships. Time features extract temporal patterns. Text features convert words to numbers.
The diagram shows feature engineering transformations. Raw features include price and area. Interaction feature multiplies price and area. Polynomial feature squares area. Time feature extracts month from date. Each transformation captures different patterns.
# Feature Engineering
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Sample data
data = pd.DataFrame({
    'price': [100, 200, 300, 400],
    'area': [500, 1000, 1500, 2000],
    'date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05'])
})

# Interaction features
data['price_area'] = data['price'] * data['area']
data['price_per_area'] = data['price'] / data['area']

# Polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data[['price', 'area']])
data_poly = pd.DataFrame(
    poly_features,
    columns=['price', 'area', 'price^2', 'price×area', 'area^2'])

# Time-based features
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek
data['is_weekend'] = data['day_of_week'].isin([5, 6]).astype(int)

print("Engineered features:")
print(data[['price', 'area', 'price_area', 'price_per_area', 'month', 'is_weekend']])
# Result:
#    price  area  price_area  price_per_area  month  is_weekend
# 0    100   500       50000             0.2      1           0
# 1    200  1000      200000             0.2      2           0
# 2    300  1500      450000             0.2      3           1
# 3    400  2000      800000             0.2      4           0
-- NeuronDB: Feature Engineering
CREATE TABLE property_data (
    id SERIAL PRIMARY KEY,
    price NUMERIC,
    area NUMERIC,
    sale_date DATE
);

INSERT INTO property_data (price, area, sale_date) VALUES
(100000, 500, '2024-01-15'),
(200000, 1000, '2024-02-20'),
(300000, 1500, '2024-03-10'),
(400000, 2000, '2024-04-05');

-- Interaction features
SELECT
    id,
    price,
    area,
    price * area AS price_area,
    price / NULLIF(area, 0) AS price_per_area
FROM property_data;

-- Polynomial features
SELECT
    id,
    price,
    area,
    price * price AS price_squared,
    area * area AS area_squared,
    price * area AS price_area_interaction
FROM property_data;

-- Time-based features
SELECT
    id,
    sale_date,
    EXTRACT(YEAR FROM sale_date) AS year,
    EXTRACT(MONTH FROM sale_date) AS month,
    EXTRACT(DOW FROM sale_date) AS day_of_week,
    CASE WHEN EXTRACT(DOW FROM sale_date) IN (0, 6) THEN 1 ELSE 0 END AS is_weekend
FROM property_data;
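Text features, mentioned above, are not shown in the examples. The sketch below is a minimal scikit-learn version using bag-of-words counts and TF-IDF weights; the example sentences are illustrative.

# Text features: convert words to numbers with bag-of-words counts and TF-IDF.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = ["great product fast shipping",
           "terrible product never again",
           "fast shipping great price"]

counts = CountVectorizer().fit_transform(reviews)    # word counts per document
tfidf = TfidfVectorizer().fit_transform(reviews)     # counts reweighted by rarity

print("Vocabulary size: " + str(counts.shape[1]))
print("Count matrix shape: " + str(counts.shape))
print("TF-IDF matrix shape: " + str(tfidf.shape))
# Result: 8 unique words, so both matrices have shape (3, 8).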
Feature engineering requires iteration. Start with domain knowledge. Create candidate features. Test feature importance. Remove redundant features. Monitor model performance. Automated feature engineering tools exist but manual engineering often performs better.
Feature Selection
Feature selection reduces dimensionality. It removes irrelevant or redundant features. Fewer features mean faster training and less overfitting. Selection methods include filter, wrapper, and embedded approaches.
Filter methods use statistical tests. They rank features independently. Wrapper methods use model performance. They search feature subsets. Embedded methods use model internals. They select during training.
The diagram shows selection methods. Filter method scores each feature independently. Wrapper method tests feature subsets with models. Embedded method uses model weights or importance. Each method has different computational cost and effectiveness.
# Feature Selection
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Sample data (seeded so the run is reproducible)
np.random.seed(42)
X = pd.DataFrame({
    'feature1': np.random.randn(100),
    'feature2': np.random.randn(100),
    'feature3': np.random.randn(100),
    'feature4': np.random.randn(100),
    'feature5': np.random.randn(100)
})
y = (X['feature1'] + X['feature2'] > 0).astype(int)

# Method 1: Filter method (statistical test)
selector_filter = SelectKBest(score_func=f_classif, k=3)
X_selected_filter = selector_filter.fit_transform(X, y)
selected_features_filter = X.columns[selector_filter.get_support()]
print("Filter selected: " + str(list(selected_features_filter)))

# Method 2: Wrapper method (RFE)
estimator = LogisticRegression()
selector_rfe = RFE(estimator, n_features_to_select=3)
X_selected_rfe = selector_rfe.fit_transform(X, y)
selected_features_rfe = X.columns[selector_rfe.get_support()]
print("RFE selected: " + str(list(selected_features_rfe)))

# Method 3: Embedded method (feature importance)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
feature_importance = pd.Series(rf.feature_importances_, index=X.columns)
top_features = feature_importance.nlargest(3).index
print("Embedded selected: " + str(list(top_features)))

# Result: the informative feature1 and feature2 are selected by all three
# methods; the third selected feature is noise and varies between methods.
# Filter selected: ['feature1', 'feature2', ...]
# RFE selected: ['feature1', 'feature2', ...]
# Embedded selected: ['feature1', 'feature2', ...]
-- NeuronDB: Feature Selection
CREATE TABLE training_data (
    id SERIAL PRIMARY KEY,
    feature1 NUMERIC,
    feature2 NUMERIC,
    feature3 NUMERIC,
    feature4 NUMERIC,
    feature5 NUMERIC,
    target INTEGER
);

-- Statistical correlation for filter method
SELECT
    'feature1' AS feature,
    ABS(CORR(feature1, target)) AS correlation
FROM training_data
UNION ALL
SELECT 'feature2', ABS(CORR(feature2, target)) FROM training_data
UNION ALL
SELECT 'feature3', ABS(CORR(feature3, target)) FROM training_data
UNION ALL
SELECT 'feature4', ABS(CORR(feature4, target)) FROM training_data
UNION ALL
SELECT 'feature5', ABS(CORR(feature5, target)) FROM training_data
ORDER BY correlation DESC
LIMIT 3;

-- Feature importance from model
CREATE TEMP TABLE model_features AS
SELECT neurondb.train(
    'default',
    'random_forest',
    'training_data',
    'target',
    ARRAY['feature1', 'feature2', 'feature3', 'feature4', 'feature5'],
    '{"n_trees": 100}'::jsonb
)::integer AS model_id;

SELECT neurondb.feature_importance((SELECT model_id FROM model_features))
ORDER BY importance DESC
LIMIT 3;
Feature selection balances performance and complexity. More features can improve accuracy but increase overfitting risk. Fewer features reduce complexity but may miss important patterns. Use cross-validation to evaluate selection strategies.
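A minimal way to run this comparison, assuming the X and y from the Python feature selection example above: score a model with cross-validation on all features and on a selected subset.

# Compare all features against a selected subset with cross-validation.
# Assumes X and y from the feature selection example above.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
score_all = cross_val_score(model, X, y, cv=5).mean()
score_selected = cross_val_score(model, X[['feature1', 'feature2']], y, cv=5).mean()

print("CV accuracy, all features: " + str(round(score_all, 3)))
print("CV accuracy, selected features: " + str(round(score_selected, 3)))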
Data Validation
Data validation checks data quality after preparation. It verifies completeness, correctness, and consistency. Validation catches errors before training. It ensures data meets model requirements.
Validation checks include range validation, type validation, constraint validation, and relationship validation. Range validation ensures values fall within expected bounds. Type validation ensures correct data types. Constraint validation checks business rules. Relationship validation verifies referential integrity.
The diagram shows validation checks. Range check flags values outside 0-100. Type check flags non-numeric values. Constraint check flags invalid combinations. Relationship check flags orphaned records. Each check prevents different error types.
# Data Validation
import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({
    'age': [25, 30, 150, 35, -5],
    'income': [50000, 75000, 100000, 'invalid', 80000],
    'email': ['user@example.com', 'invalid-email', 'test@test.com', 'user@domain', 'valid@email.com']
})

# Range validation
def validate_range(df, column, min_val, max_val):
    invalid = df[(df[column] < min_val) | (df[column] > max_val)]
    return invalid

age_invalid = validate_range(data, 'age', 0, 120)
print("Invalid ages: " + str(len(age_invalid)))

# Type validation
def validate_type(df, column, expected_type):
    invalid = df[~df[column].apply(lambda x: isinstance(x, expected_type))]
    return invalid

income_invalid = validate_type(data, 'income', (int, float))
print("Invalid income types: " + str(len(income_invalid)))

# Pattern validation (email); note the escaped dot before the top-level domain
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
email_invalid = data[~data['email'].str.match(email_pattern)]
print("Invalid emails: " + str(len(email_invalid)))

# Result:
# Invalid ages: 2
# Invalid income types: 1
# Invalid emails: 2
-- NeuronDB: Data Validation
CREATE TABLE customer_data (
    id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    email VARCHAR(255)
);

INSERT INTO customer_data (age, income, email) VALUES
(25, 50000, 'user@example.com'),
(30, 75000, 'invalid-email'),
(150, 100000, 'test@test.com'),
(35, NULL, 'user@domain'),
(-5, 80000, 'valid@email.com');

-- Range validation
SELECT id, age, 'age out of range' AS validation_error
FROM customer_data
WHERE age < 0 OR age > 120;

-- Type validation
SELECT id, income, 'invalid income type' AS validation_error
FROM customer_data
WHERE income IS NULL OR income < 0;

-- Pattern validation (email)
SELECT id, email, 'invalid email format' AS validation_error
FROM customer_data
WHERE email !~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$';

-- Combined validation report
SELECT
    COUNT(*) AS total_records,
    COUNT(CASE WHEN age < 0 OR age > 120 THEN 1 END) AS invalid_age_count,
    COUNT(CASE WHEN income IS NULL OR income < 0 THEN 1 END) AS invalid_income_count,
    COUNT(CASE WHEN email !~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$' THEN 1 END) AS invalid_email_count
FROM customer_data;
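The examples above cover single-table checks. Relationship validation compares tables. The sketch below is a minimal pandas version; the customers and orders tables and their keys are illustrative.

# Relationship validation: find orphaned records whose foreign key has no parent.
# The customers/orders tables and their keys are illustrative.
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3]})
orders = pd.DataFrame({'order_id': [10, 11, 12, 13],
                       'customer_id': [1, 2, 5, 3]})   # 5 has no matching customer

orphaned = orders[~orders['customer_id'].isin(customers['customer_id'])]
print("Orphaned orders: " + str(len(orphaned)))
print(orphaned)
# Result: Orphaned orders: 1 (the order referencing customer_id 5)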
Validation should be automated. Create validation rules early. Run validation after each preparation step. Document validation failures. Fix errors systematically. Re-validate after fixes. Maintain validation logs for auditing.
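A minimal sketch of such automation: a list of named rules, each returning the offending rows, run in a loop and collected into a log. The rule names, checks, and sample data are illustrative.

# Minimal automated validation runner with a failure log.
import pandas as pd

def run_validations(df, rules):
    log = []
    for name, check in rules:
        bad_rows = check(df)
        log.append({'rule': name, 'failures': len(bad_rows)})
    return pd.DataFrame(log)

rules = [
    ('age in 0-120', lambda d: d[(d['age'] < 0) | (d['age'] > 120)]),
    ('income non-negative', lambda d: d[d['income'] < 0]),
    ('no duplicate ids', lambda d: d[d.duplicated(subset='id')]),
]

df = pd.DataFrame({'id': [1, 2, 2, 4],
                   'age': [25, 150, 30, -5],
                   'income': [50000, 60000, -100, 70000]})
print(run_validations(df, rules))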
Complete Example: Customer Data Preparation
This example demonstrates complete data preparation workflow for customer data.
# Complete Data Preparation Example
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Step 1: Load raw data (synthetic, seeded for reproducibility)
np.random.seed(42)
raw_data = pd.DataFrame({
    'customer_id': range(1, 101),
    'age': np.where(np.random.random(100) < 0.1, np.nan, np.random.randint(18, 80, 100)),
    'income': np.where(np.random.random(100) < 0.15, np.nan, np.random.randint(20000, 150000, 100)),
    'purchase_amount': np.random.randn(100) * 500 + 1000,
    # use np.nan for missing cities so the imputer recognizes them as missing
    'city': np.random.choice(['NYC', 'SF', 'LA', 'Chicago', np.nan], 100, p=[0.3, 0.3, 0.2, 0.15, 0.05])
})

# Step 2: Handle missing values
numeric_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')
raw_data[['age', 'income']] = numeric_imputer.fit_transform(raw_data[['age', 'income']])
raw_data['city'] = categorical_imputer.fit_transform(raw_data[['city']]).ravel()

# Step 3: Remove outliers (IQR method)
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[column] >= lower) & (df[column] <= upper)].copy()

clean_data = remove_outliers_iqr(raw_data, 'purchase_amount')

# Step 4: Feature engineering
clean_data['age_group'] = pd.cut(clean_data['age'], bins=[0, 30, 50, 100], labels=['Young', 'Middle', 'Senior'])
clean_data['income_per_age'] = clean_data['income'] / clean_data['age']
clean_data['high_income'] = (clean_data['income'] > clean_data['income'].median()).astype(int)

# Step 5: Encode categorical variables
clean_data = pd.get_dummies(clean_data, columns=['city', 'age_group'], prefix=['city', 'age'])

# Step 6: Normalize numeric features
scaler = StandardScaler()
numeric_cols = ['age', 'income', 'purchase_amount', 'income_per_age']
clean_data[numeric_cols] = scaler.fit_transform(clean_data[numeric_cols])

# Step 7: Final validation
print("Final dataset shape: " + str(clean_data.shape))
print("Missing values: " + str(clean_data.isnull().sum().sum()))
print("Ready for training: " + str(clean_data.isnull().sum().sum() == 0))
# Result (row count depends on how many purchase amounts fall outside the
# IQR bounds; typically only a few rows are removed):
# Final dataset shape: (rows_kept, 13)
# Missing values: 0
# Ready for training: True
-- NeuronDB: Complete Data Preparation
CREATE TABLE raw_customers (
    customer_id SERIAL PRIMARY KEY,
    age INTEGER,
    income NUMERIC,
    purchase_amount NUMERIC,
    city VARCHAR(50)
);

-- Insert sample data with missing values and outliers
INSERT INTO raw_customers (age, income, purchase_amount, city)
SELECT
    CASE WHEN random() < 0.1 THEN NULL ELSE 18 + floor(random() * 62)::INTEGER END,
    CASE WHEN random() < 0.15 THEN NULL ELSE 20000 + floor(random() * 130000)::NUMERIC END,
    (random() * 1000 - 500)::NUMERIC,
    CASE WHEN random() < 0.05 THEN NULL
         ELSE (ARRAY['NYC', 'SF', 'LA', 'Chicago'])[1 + floor(random() * 4)::INTEGER] END
FROM generate_series(1, 100);

-- Step 1: Handle missing values
UPDATE raw_customers
SET age = COALESCE(age, (SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY age) FROM raw_customers WHERE age IS NOT NULL))
WHERE age IS NULL;

UPDATE raw_customers
SET income = COALESCE(income, (SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY income) FROM raw_customers WHERE income IS NOT NULL))
WHERE income IS NULL;

UPDATE raw_customers
SET city = COALESCE(city, (SELECT city FROM raw_customers
                           WHERE city IS NOT NULL
                           GROUP BY city
                           ORDER BY COUNT(*) DESC
                           LIMIT 1))
WHERE city IS NULL;

-- Step 2: Remove outliers
CREATE TABLE clean_customers AS
WITH quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY purchase_amount) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY purchase_amount) AS q3
    FROM raw_customers
),
bounds AS (
    SELECT q1 - 1.5 * (q3 - q1) AS lower, q3 + 1.5 * (q3 - q1) AS upper
    FROM quartiles
)
SELECT r.*
FROM raw_customers r, bounds b
WHERE r.purchase_amount >= b.lower AND r.purchase_amount <= b.upper;

-- Step 3: Feature engineering
ALTER TABLE clean_customers ADD COLUMN age_group VARCHAR(20);
ALTER TABLE clean_customers ADD COLUMN income_per_age NUMERIC;
ALTER TABLE clean_customers ADD COLUMN high_income INTEGER;

UPDATE clean_customers
SET age_group = CASE
        WHEN age < 30 THEN 'Young'
        WHEN age < 50 THEN 'Middle'
        ELSE 'Senior'
    END,
    income_per_age = income / NULLIF(age, 0),
    high_income = CASE WHEN income > (SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY income) FROM clean_customers) THEN 1 ELSE 0 END;

-- Step 4: One-hot encoding for categorical
CREATE TABLE prepared_customers AS
SELECT
    customer_id,
    age,
    income,
    purchase_amount,
    income_per_age,
    high_income,
    CASE WHEN city = 'NYC' THEN 1 ELSE 0 END AS city_nyc,
    CASE WHEN city = 'SF' THEN 1 ELSE 0 END AS city_sf,
    CASE WHEN city = 'LA' THEN 1 ELSE 0 END AS city_la,
    CASE WHEN city = 'Chicago' THEN 1 ELSE 0 END AS city_chicago,
    CASE WHEN age_group = 'Young' THEN 1 ELSE 0 END AS age_young,
    CASE WHEN age_group = 'Middle' THEN 1 ELSE 0 END AS age_middle,
    CASE WHEN age_group = 'Senior' THEN 1 ELSE 0 END AS age_senior
FROM clean_customers;

-- Step 5: Normalize numeric features
CREATE TABLE final_customers AS
SELECT
    customer_id,
    (age - AVG(age) OVER ()) / NULLIF(STDDEV(age) OVER (), 0) AS age_normalized,
    (income - AVG(income) OVER ()) / NULLIF(STDDEV(income) OVER (), 0) AS income_normalized,
    (purchase_amount - AVG(purchase_amount) OVER ()) / NULLIF(STDDEV(purchase_amount) OVER (), 0) AS purchase_normalized,
    (income_per_age - AVG(income_per_age) OVER ()) / NULLIF(STDDEV(income_per_age) OVER (), 0) AS income_per_age_normalized,
    high_income,
    city_nyc, city_sf, city_la, city_chicago,
    age_young, age_middle, age_senior
FROM prepared_customers;

SELECT COUNT(*) AS total_records,
       COUNT(*) FILTER (WHERE age_normalized IS NOT NULL) AS complete_records
FROM final_customers;
-- Result: total_records equals complete_records because no NULLs remain after
-- imputation; the exact count depends on how many rows the outlier filter removed.
Summary
Data preparation transforms raw data into clean features. You collect data from multiple sources. You handle missing values through removal, imputation, or prediction. You detect and treat outliers using statistical methods. You normalize or standardize features to comparable scales. You engineer new features from existing data. You select relevant features to reduce dimensionality. You validate data quality throughout the process. Each step improves model performance. Proper preparation requires domain knowledge and iterative refinement.