What is Data Transformation?
Data transformation is the process of converting data from one format or structure into another. It's a crucial step in data analysis and machine learning, as raw data is rarely in a format that is immediately suitable for use.
The goal of data transformation is to prepare data for analysis by cleaning, restructuring, and enriching it.
Common Data Transformation Techniques
- Normalization (min-max scaling): Rescaling numerical features to a fixed range (e.g., 0 to 1).
- Standardization: Transforming features to have a mean of 0 and a standard deviation of 1.
- Discretization/Binning: Dividing a continuous variable into discrete intervals.
- Aggregation: Summarizing data (e.g., calculating averages, sums, counts).
- Data Type Conversion: Converting data from one data type to another (e.g., string to integer).
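The other techniques in this list can each be sketched in a line or two of pandas. The snippet below is a minimal illustration, not a prescribed recipe: the sample data, the column names (`city`, `age`, `income`), the bin edges, and the group labels are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "age": ["23", "35", "47", "62"],  # stored as strings on purpose
    "income": [30000.0, 45000.0, 60000.0, 75000.0],
})

# Data type conversion: string -> integer
df["age"] = df["age"].astype(int)

# Standardization (z-score): mean 0, standard deviation 1.
# ddof=0 uses the population standard deviation; pandas defaults to ddof=1.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std(ddof=0)

# Discretization/binning: split age into labelled intervals
# (0, 30], (30, 50], (50, 100]; the edges are arbitrary choices.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])

# Aggregation: average income and row count per city
summary = df.groupby("city")["income"].agg(["mean", "count"])

print(df)
print(summary)
```

Note that the choice of `ddof` matters for small samples: with the pandas default (`ddof=1`) the standardized column would have a sample standard deviation of 1 rather than a population standard deviation of 1.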
Example: Normalization using Python
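import pandas as pd

# Small example frame with two numeric features
data = {'feature1': [10, 20, 30],
        'feature2': [5, 10, 15]}
df = pd.DataFrame(data)

# Min-max normalization: rescale feature1 to the range [0, 1]; feature2 is left unchanged
col = df['feature1']
df['feature1'] = (col - col.min()) / (col.max() - col.min())
print(df)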
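The example above rescales only feature1, and the formula divides by zero when a column holds a single repeated value. As a rough sketch, the same min-max formula can be applied to every numeric column with a guard for that case. The helper name `min_max_normalize` and the decision to map constant columns to 0.0 are assumptions made for this illustration, not an established convention.

```python
import pandas as pd

def min_max_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every numeric column rescaled to [0, 1]."""
    result = df.copy()
    for name in result.select_dtypes(include="number").columns:
        col = result[name]
        value_range = col.max() - col.min()
        if value_range == 0:
            # Constant column: avoid 0/0 and map every value to 0.0 (an assumption)
            result[name] = 0.0
        else:
            result[name] = (col - col.min()) / value_range
    return result

df = pd.DataFrame({"feature1": [10, 20, 30],
                   "feature2": [5, 10, 15],
                   "constant": [7, 7, 7]})
normalized = min_max_normalize(df)
print(normalized)
```

Returning a copy leaves the original frame untouched, which makes it easy to compare raw and normalized values side by side.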