Troubleshooting
Common Issues and Solutions
Issue: "No active SparkSession found"
Solution: Ensure you're running the code in a Databricks notebook with an active Spark session. If using a Python script, create a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("UDTF Registration").getOrCreate()
Issue: "PySpark is required for session-scoped UDTF registration"
Solution: Install PySpark or ensure you're running in a Databricks environment where PySpark is available:
%pip install pyspark
Issue: UDTF returns no results
Possible Causes:
1. Incorrect credentials: Verify that SECRET() values are correct
2. No matching data: Check that filters match existing data in CDF
3. Time series doesn't exist: Verify the time series external_id exists in CDF
Debug Steps:
# Test credentials
from cognite.pygen import load_cognite_client_from_toml
client = load_cognite_client_from_toml("config.toml")
# Test data model query
# Test data model query (views.list() does not take a DataModelId;
# retrieve the data model with its views inlined instead)
from cognite.client.data_classes.data_modeling.ids import DataModelId
data_model_id = DataModelId(space="sailboat", external_id="sailboat", version="v1")
data_models = client.data_modeling.data_models.retrieve(data_model_id, inline_views=True)
views = data_models[0].views
print(f"Found {len(views)} views")
Issue: "Module not found" errors
Solution: Restart the Python kernel after installing packages with %pip:
- Run %pip install cognite-sdk cognite-databricks
- When prompted, click "Restart" to restart the kernel
- Re-run your registration code
Issue: UDTF registration succeeds but SQL query fails
Possible Causes:
1. Function name mismatch: Verify the registered function name matches what you're calling in SQL
2. Parameter mismatch: Check that all required parameters are provided
3. Type errors: Ensure parameter types match the UDTF's expected types
Debug Steps:
# Check registered functions
registered = generator.register_session_scoped_udtfs()
print("Registered functions:", registered)
# Verify function name in SQL matches
# If registered as "small_boat_udtf", use:
# SELECT * FROM small_boat_udtf(
# client_id => SECRET('cdf_sailboat_sailboat', 'client_id'),
# client_secret => SECRET('cdf_sailboat_sailboat', 'client_secret'),
# tenant_id => SECRET('cdf_sailboat_sailboat', 'tenant_id'),
# cdf_cluster => SECRET('cdf_sailboat_sailboat', 'cdf_cluster'),
# project => SECRET('cdf_sailboat_sailboat', 'project'),
# name => NULL,
# description => NULL
# ) LIMIT 10;
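When the error message itself is unhelpful, a quick name check can rule out the most common mismatch. The sketch below is pure Python and hypothetical: the registered list is hard-coded for illustration (in practice it would come from register_session_scoped_udtfs()), and the regex only covers the two invocation forms shown in this guide.

```python
import re

# Hypothetical example values: in practice, use the names returned by
# register_session_scoped_udtfs()
registered = ["small_boat_udtf", "time_series_datapoints_detailed_udtf"]

sql = "SELECT * FROM small_boat_udtf(name => NULL, description => NULL) LIMIT 10"

# Extract the table-function name from the FROM clause; the optional
# TABLE( ... ) wrapper covers the SQL Warehouse form as well
match = re.search(r"\bFROM\s+(?:TABLE\s*\(\s*)?(\w+)\s*\(", sql, re.IGNORECASE)
called = match.group(1) if match else None
print(called, called in registered)  # small_boat_udtf True
```

If the printed name is not in the registered list, the SQL is calling a function that was never registered in this session, which typically means a typo or a stale kernel.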
Issue: "[NOT_A_SCALAR_FUNCTION] ... appears as a scalar expression here"
Cause: UDTFs are table functions and must be invoked in the FROM clause, not as scalar expressions. The invocation syntax also differs between notebook %sql and SQL Warehouse.
Solution: Use the table-function syntax for your environment:
Notebook %sql (cluster-backed):
SELECT *
FROM time_series_datapoints_detailed_udtf(
client_id => SECRET('cdf_sailboat_sailboat', 'client_id'),
client_secret => SECRET('cdf_sailboat_sailboat', 'client_secret'),
tenant_id => SECRET('cdf_sailboat_sailboat', 'tenant_id'),
cdf_cluster => SECRET('cdf_sailboat_sailboat', 'cdf_cluster'),
project => SECRET('cdf_sailboat_sailboat', 'project'),
instance_ids => 'space1:ts1,space1:ts2',
start => current_timestamp() - INTERVAL 52 WEEKS,
end => current_timestamp() - INTERVAL 51 WEEKS,
aggregates => 'average',
granularity => '2h'
) AS t;
SQL Warehouse (Databricks SQL):
SELECT *
FROM TABLE(
time_series_datapoints_detailed_udtf(
client_id => SECRET('cdf_sailboat_sailboat', 'client_id'),
client_secret => SECRET('cdf_sailboat_sailboat', 'client_secret'),
tenant_id => SECRET('cdf_sailboat_sailboat', 'tenant_id'),
cdf_cluster => SECRET('cdf_sailboat_sailboat', 'cdf_cluster'),
project => SECRET('cdf_sailboat_sailboat', 'project'),
instance_ids => 'space1:ts1,space1:ts2',
start => current_timestamp() - INTERVAL 52 WEEKS,
end => current_timestamp() - INTERVAL 51 WEEKS,
aggregates => 'average',
granularity => '2h'
)
);
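The instance_ids parameter in both variants packs one or more space:external_id pairs into a single comma-separated string. A small parser (hypothetical, shown here only to make the expected format explicit) can catch malformed ids before they silently match nothing:

```python
def parse_instance_ids(instance_ids: str) -> list[tuple[str, str]]:
    """Split a comma-separated 'space:external_id' string into (space, external_id) pairs."""
    pairs = []
    for item in instance_ids.split(","):
        # partition() splits on the first ':' only, so external ids
        # containing ':' later in the string are preserved
        space, sep, external_id = item.strip().partition(":")
        if not sep or not space or not external_id:
            raise ValueError(f"Bad instance id {item!r}; expected 'space:external_id'")
        pairs.append((space, external_id))
    return pairs

print(parse_instance_ids("space1:ts1,space1:ts2"))  # [('space1', 'ts1'), ('space1', 'ts2')]
```

A missing space prefix or a stray trailing comma is a common reason a syntactically valid query returns no rows.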
Issue: Time Series UDTF returns NULL values
Possible Causes:
1. Time series doesn't exist: Verify the time series external_id exists
2. No datapoints in range: Check that the time range (start/end) contains data
3. Incorrect instance_id format: Ensure space and external_id are correct
Debug Steps:
# Test time series existence (retrieve by instance id, since the time
# series is identified by space + external_id in the data model;
# instance_id lookups require a recent cognite-sdk version)
from cognite.client.data_classes.data_modeling.ids import NodeId
instance_id = NodeId(space="sailboat", external_id="vessel.speed")
ts = client.time_series.retrieve(instance_id=instance_id)
print(f"Time series exists: {ts is not None}")
# Test datapoints retrieval over the last day
datapoints = client.time_series.data.retrieve(
    instance_id=instance_id,
    start="1d-ago",
    end="now",
)
print(f"Found {len(datapoints)} datapoints")
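When every value comes back NULL, it is also worth confirming the start/end window itself before suspecting the data: an inverted or zero-length window is easy to produce with relative INTERVAL arithmetic. The helper below is a hypothetical sketch that converts a window to the epoch-millisecond form CDF timestamps use and rejects inverted ranges:

```python
from datetime import datetime, timedelta, timezone

def window_ms(start: datetime, end: datetime) -> tuple[int, int]:
    """Convert a [start, end) window to epoch milliseconds, rejecting inverted ranges."""
    if start >= end:
        raise ValueError("start must be strictly before end")
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

# Example: a one-week window, analogous to the 52-weeks-ago .. 51-weeks-ago
# range used in the SQL above
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = start + timedelta(weeks=1)
start_ms, end_ms = window_ms(start, end)
print(end_ms - start_ms)  # one week: 604800000 ms
```

Printing the resolved window alongside the UDTF call makes it obvious when the query is asking for a range that predates (or postdates) the available datapoints.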
Getting Help
If you encounter issues not covered here:
- Check the logs: Look for error messages in the notebook output or Databricks logs
- Verify credentials: Ensure CDF credentials are correct and have proper permissions
- Test with simple queries: Start with basic queries before adding complex filters or joins
- Review the Technical Plan: See the Technical Plan document for detailed architecture and implementation details
Next Steps
After successfully using session-scoped UDTFs, consider:
- Unity Catalog Registration: Register UDTFs and Views in Unity Catalog for production use (see Catalog-Based UDTF Registration)
- View Creation: Create SQL Views that wrap UDTFs for easier querying
- Governance: Set up Unity Catalog permissions for production deployments
For more information, see:
- Catalog-Based UDTF Registration
- Technical Plan: CDF Databricks Integration (UDTF-Based)