Data Transformation¶
We'll use the customers_df
and orders_df
DataFrames. For realistic examples, especially for date parsing and missing value handling, ensure these DataFrames reflect the structure and "messiness" of the original CSV files.
Prerequisite
# Prerequisites: Ensure Polars is imported and selectors if used.
import polars as pl
import polars.selectors as cs # For using column selectors like cs.numeric()
# --- Placeholder DataFrames (mimicking loaded CSVs for standalone module execution) ---
# These should reflect the structure from your customers.csv and orders.csv,
# including mixed date formats, missing values, etc.
customers_df = pl.DataFrame({
"customer_id": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115],
"name": ["Alice Wonderland", "Bob The Builder", "Charlie Brown", "Diana Prince", "Evan Almighty", "Fiona Gallagher", "George Jetson", "Hannah Montana", "Ian Malcolm", "Jane Doe", "Kevin McCallister", "Laura Palmer", "Michael Scott", "Nancy Drew", "Oscar Grouch"],
"registration_date_str": ["2022-01-15", "2022-03-22", "2022-05-10", "2022-07-01", "2022-08-19", "2023-01-20", None, "2023-04-05", "2023-06-12", "2023-07-21", "2023-09-01", "2023-10-15", "2024-02-10", "03/15/2024", "2024-05-01"],
"city": ["New York", "London", "Paris", "New York", "London", "New York", "Paris", "Berlin", "London", "New York", "Chicago", "Twin Peaks", "Scranton", "River Heights", "New York"],
"age": [28, 35, 45, 3000, 42, 29, 50, 22, 55, None, 12, 17, 48, 18, 60]
}).with_columns(pl.col("age").cast(pl.Int64, strict=False)) # Cast age to Int64, allowing nulls
orders_df = pl.DataFrame({
"order_id": [201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218],
"customer_id": [101, 102, 101, 103, 104, 102, 105, 101, 106, 108, 103, 107, 110, 111, 102, 113, 101, 115],
"order_date_str": ["2023-01-20 10:30:00", "2023-02-15 11:05:30", "2023-02-28 14:12:55", "2023-03-10 09:00:15", "2023-03-12 17:45:00", "2023-04-05 12:00:00", "2023-04-22 16:20:30", "2023-05-01 10:00:00", "2023-05-15 08:55:10", "2023-06-01 11:30:00", "2023-06-20 13:40:00", "2023-07-01 00:00:00", "2023-07-25 10:10:10", "2023-09-10 14:20:30", "2023-11-05 19:00:00", "2024-02-15 09:30:00", "2024-03-01 10:00:00", "2024-05-05 12:12:12"],
"product_category": ["Books", "Tools", "Electronics", "Home Goods", "Antiques", "Books", "Electronics", "Books", "Beauty", "Music", "Home Goods", "Electronics", "Clothing", "Toys", "Tools", "Office Supplies", "Electronics", "Home Goods"],
"quantity": [2, 1, 1, 3, 1, 1, None, 3, 2, 5, 1, 1, 2, 3, 1, 10, 1, 1],
"unit_price": [15.99, 199.50, 799.00, 25.00, 2500.00, 12.50, 99.99, 10.00, 45.75, 9.99, 150.00, 499.50, 75.00, 29.99, 75.00, 4.99, 1200.00, 12.00],
"discount_applied": [0.05, 0.10, None, 0.0, 0.15, 0.0, 0.05, None, 0.10, 0.0, 0.05, 0.10, None, 0.05, 0.0, 0.0, 0.15, 0.0]
}).with_columns([
pl.col("quantity").cast(pl.Int64, strict=False), # Ensure numeric types, allowing nulls
pl.col("unit_price").cast(pl.Float64, strict=False),
pl.col("discount_applied").cast(pl.Float64, strict=False)
])
Transforming Data - Selection, Filtering, and Modification
This module covers the primary operations for transforming data: selecting specific columns, creating or modifying columns, filtering rows, changing data types, handling missing values, and sorting data. These operations are fundamental to preparing data for analysis.
1. Introduction to Data Transformation Contexts¶
As previewed in Module 4, Polars expressions are executed within specific DataFrame methods (contexts):
select()
: For choosing, renaming, or deriving a specific set of columns.with_columns()
: For adding new columns or modifying existing ones, keeping all other columns.filter()
: For selecting a subset of rows based on conditions.
2. Column Selection and Manipulation with select
¶
The select()
method is used when you want to create a new DataFrame with a specific subset or transformation of columns from an existing DataFrame.
-
Selecting Specific Columns: Provide a list of column names (as strings) or
pl.col()
expressions. -
Renaming Columns using
alias()
: The.alias()
method is used within an expression to rename a column in the output. -
Using Polars Selectors: Polars offers powerful column selectors via the
cs
module (imported asimport polars.selectors as cs
). These allow selection based on patterns or data types.# Select all string columns from orders_df string_cols_orders_df = orders_df.select(cs.string()) print("\nString columns from orders_df:") print(string_cols_orders_df.head()) # Select all numeric columns from customers_df numeric_cols_customers_df = customers_df.select(cs.numeric()) print("\nNumeric columns from customers_df:") print(numeric_cols_customers_df.head()) # Select columns starting with 'order' order_related_cols_df = orders_df.select(cs.starts_with("order_")) print("\nColumns starting with 'order_' from orders_df:") print(order_related_cols_df.head())
Selectors are very useful for DataFrames with many columns.
3. Adding or Modifying Columns with with_columns
¶
When you want to add new columns or change existing ones while keeping all other columns, with_columns()
is the appropriate method. It takes a list of expressions.
-
Creating New Columns: Define an expression and assign it a name using
.alias()
.# Add a 'total_price' column to orders_df (quantity * unit_price) # Also, add a column indicating if a discount was applied. orders_enhanced_df = orders_df.with_columns([ (pl.col("quantity") * pl.col("unit_price")).alias("total_price"), pl.col("discount_applied").is_not_null().alias("has_discount") # True if discount_applied is not null ]) print("\nOrders DataFrame with 'total_price' and 'has_discount' columns:") print(orders_enhanced_df.head())
-
Modifying Existing Columns: If an expression produces a column with an existing name, it replaces that column.
# Convert customer names to uppercase customers_upper_name_df = customers_df.with_columns( pl.col("name").str.to_uppercase().alias("name") # .alias("name") ensures it replaces the 'name' column ) # Alternatively, if alias matches existing name, it overwrites: # customers_upper_name_df = customers_df.with_columns( # pl.col("name").str.to_uppercase() # ) print("\nCustomers DataFrame with names in uppercase:") print(customers_upper_name_df.head())
4. Filtering Rows with filter
¶
The filter()
method is used to select rows that meet certain criteria, defined by one or more boolean expressions.
-
Filtering Based on Single Conditions:
# Filter orders for 'Electronics' category electronics_orders_df = orders_df.filter( pl.col("product_category") == pl.lit("Electronics") ) print("\nElectronics orders:") print(electronics_orders_df.head()) # Filter customers older than 50 (excluding Diana Prince due to outlier age) # First, let's handle the outlier age for a more realistic filter example customers_cleaned_age_df = customers_df.with_columns( pl.when(pl.col("age") > 100).then(None).otherwise(pl.col("age")).alias("age_cleaned") ) older_customers_df = customers_cleaned_age_df.filter( pl.col("age_cleaned") > pl.lit(50) ) print("\nCustomers older than 50 (age cleaned):") print(older_customers_df.select(["name", "age_cleaned", "city"]))
-
Combining Conditions using
&
(AND) and|
(OR): Wrap individual conditions in parentheses when combining them.# Filter orders for 'Books' category with quantity > 1 specific_book_orders_df = orders_df.filter( (pl.col("product_category") == pl.lit("Books")) & (pl.col("quantity") > pl.lit(1)) ) print("\nBook orders with quantity > 1:") print(specific_book_orders_df) # Filter customers from 'New York' OR 'Paris' ny_paris_customers_df = customers_df.filter( (pl.col("city") == pl.lit("New York")) | (pl.col("city") == pl.lit("Paris")) ) print("\nCustomers from New York or Paris:") print(ny_paris_customers_df.select(["name", "city"]).head())
-
Using
is_in()
for Multiple Values: To filter rows where a column's value is one of several specified values.# Filter customers from 'London', 'Berlin', or 'Chicago' selected_cities = ["London", "Berlin", "Chicago"] customers_from_selected_cities_df = customers_df.filter( pl.col("city").is_in(selected_cities) ) print(f"\nCustomers from {selected_cities}:") print(customers_from_selected_cities_df.select(["name", "city"]))
5. Data Type Conversion (Casting)¶
Often, data is not in the correct type (e.g., numbers read as strings, dates as strings). Use pl.col().cast(DataType)
within with_columns
to convert types.
-
Common Numeric and Boolean Types:
# Example: Suppose 'customer_id' was read as string and we want it as integer # For our placeholder, it's already int. Let's imagine converting age to Float for some reason. customers_age_float_df = customers_df.with_columns( pl.col("age").cast(pl.Float64).alias("age_float") ) print("\nCustomers DataFrame with age as Float64:") print(customers_age_float_df.select(["name", "age_float"]).head())
-
String to
pl.Categorical
: Useful for columns with a limited number of unique string values. Can improve performance and memory usage.customers_city_categorical_df = customers_df.with_columns( pl.col("city").cast(pl.Categorical).alias("city_categorical") ) print("\nCustomers DataFrame with city as Categorical:") print(customers_city_categorical_df.select(["name", "city_categorical"]).head()) print(f"Data type of 'city_categorical': {customers_city_categorical_df.get_column('city_categorical').dtype}")
-
String to
pl.Enum
(for fixed, known categories): Enums are stricter than Categoricals. You define the set of allowed values. This is useful for data validation.# Define the allowed product categories for an Enum # Let's say these are the only valid primary categories we expect for a report allowed_categories = ["Books", "Electronics", "Home Goods", "Tools", "Clothing"] ProductEnum = pl.Enum(allowed_categories) # Attempt to cast 'product_category' to this Enum. # Values not in ProductEnum will become null if strict=False (default), or error if strict=True in cast. # For demonstration, let's use with_columns and handle potential nulls or filter them. orders_enum_df = orders_df.with_columns( pl.col("product_category").cast(ProductEnum, strict=False).alias("category_enum") ) print("\nOrders DataFrame with product_category as Enum (non-matching become null):") print(orders_enum_df.select(["product_category", "category_enum"]).head(10)) print(f"Data type of 'category_enum': {orders_enum_df.get_column('category_enum').dtype}") # You might filter out nulls if they represent unexpected categories: # valid_category_orders = orders_enum_df.filter(pl.col("category_enum").is_not_null())
-
String to Datetime/Date: Polars provides powerful string parsing capabilities.
-
Standard Format (
.str.to_datetime()
): Fororders_df.order_date_str
which is inYYYY-MM-DD HH:MM:SS
.orders_parsed_dates_df = orders_df.with_columns( pl.col("order_date_str").str.to_datetime(format="%Y-%m-%d %H:%M:%S").alias("order_datetime") ) print("\nOrders DataFrame with parsed order_datetime:") print(orders_parsed_dates_df.select(["order_id", "order_datetime"]).head()) print(f"Data type of 'order_datetime': {orders_parsed_dates_df.get_column('order_datetime').dtype}")
-
Handling Mixed Date Formats (
.str.to_date()
andpl.coalesce()
): Thecustomers_df.registration_date_str
has mixed formats (YYYY-MM-DD
andMM/DD/YYYY
) and nulls. We can try parsing with multiple formats and usepl.coalesce
to pick the first successful parse.customers_parsed_reg_dates_df = customers_df.with_columns( pl.coalesce([ pl.col("registration_date_str").str.to_date(format="%Y-%m-%d", strict=False), pl.col("registration_date_str").str.to_date(format="%m/%d/%Y", strict=False) ]).alias("registration_date") ) print("\nCustomers DataFrame with parsed registration_date (handling mixed formats):") print(customers_parsed_reg_dates_df.select(["name", "registration_date_str", "registration_date"])) print(f"Data type of 'registration_date': {customers_parsed_reg_dates_df.get_column('registration_date').dtype}")
strict=False
allows parsing to returnnull
on failure for a given format, lettingcoalesce
try the next one.
-
6. Handling Missing Values¶
-
Filling Nulls with
fill_null()
: You can fill missing values using a literal value or a defined strategy.# Fill missing 'age' in customers_df with the mean age (after cleaning the outlier) # For simplicity here, let's first ensure 'age_cleaned' exists as from filter example if "age_cleaned" not in customers_cleaned_age_df.columns: customers_cleaned_age_df = customers_cleaned_age_df.with_columns( pl.when(pl.col("age") > 100).then(None).otherwise(pl.col("age")).cast(pl.Float64).alias("age_cleaned") # Cast to float for mean ) mean_age_val = customers_cleaned_age_df.select(pl.col("age_cleaned").mean()).item() # Get the actual mean value customers_filled_age_df = customers_cleaned_age_df.with_columns( pl.col("age_cleaned").fill_null(mean_age_val).alias("age_filled_mean") ) # Alternative using strategy string (Polars will compute the mean internally) # customers_filled_age_df = customers_cleaned_age_df.with_columns( # pl.col("age_cleaned").fill_null(strategy="mean").alias("age_filled_mean_strategy") # ) print(f"\nCustomers DataFrame with 'age_cleaned' nulls filled with mean ({mean_age_val:.2f}):") print(customers_filled_age_df.filter(customers_df["age"].is_null()).select(["name", "age", "age_cleaned", "age_filled_mean"])) # Fill missing 'discount_applied' in orders_df with 0 orders_filled_discount_df = orders_df.with_columns( pl.col("discount_applied").fill_null(0.0).alias("discount_applied_filled") ) print("\nOrders DataFrame with 'discount_applied' nulls filled with 0:") print(orders_filled_discount_df.filter(pl.col("discount_applied").is_null()).select(["order_id", "discount_applied", "discount_applied_filled"]).head())
-
Dropping Rows with Nulls using
drop_nulls()
: Use with caution as it involves data loss. Thesubset
parameter can specify columns to consider for nulls.# Drop rows from orders_df where 'quantity' is null orders_dropped_null_quantity_df = orders_df.drop_nulls(subset=["quantity"]) print(f"\nOriginal orders count: {orders_df.height}, After dropping null quantity: {orders_dropped_null_quantity_df.height}") # Drop rows from customers_df if 'registration_date_str' OR 'age' is null # customers_dropped_any_df = customers_df.drop_nulls(subset=["registration_date_str", "age"]) # print(f"\nOriginal customers count: {customers_df.height}, After dropping if reg_date or age is null: {customers_dropped_any_df.height}")
Business Context: Discuss implications: Dropping an order because quantity is missing might be acceptable if quantity is essential for all analyses. Dropping a customer due to a missing registration date might be undesirable if other customer information is valuable.
7. Sorting Data with sort()
¶
The sort()
method orders the DataFrame by one or more columns.
# Sort customers_df by age in descending order
sorted_customers_by_age_df = customers_df.sort("age", descending=True)
print("\nCustomers sorted by age (descending):")
print(sorted_customers_by_age_df.select(["name", "age", "city"]).head())
# Sort orders_df by 'product_category' (ascending) then by 'total_price' (descending)
# We need 'total_price' from a previous example
if "total_price" not in orders_enhanced_df.columns: # ensure it exists
orders_enhanced_df = orders_enhanced_df.with_columns(
(pl.col("quantity") * pl.col("unit_price")).alias("total_price")
)
sorted_orders_df = orders_enhanced_df.sort(
by=["product_category", "total_price"],
descending=[False, True] # False for product_category (asc), True for total_price (desc)
)
print("\nOrders sorted by product_category (asc) then total_price (desc):")
print(sorted_orders_df.select(["product_category", "order_id", "total_price"]).head(10))