Troubleshooting
Common Issues and Solutions
Issue: "No active SparkSession found"
Solution: Ensure you're running the code in a Databricks notebook with an active Spark session. If using a Python script, create a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("UDTF Registration").getOrCreate()
Issue: "PySpark is required for session-scoped UDTF registration"
Solution: Install PySpark or ensure you're running in a Databricks environment where PySpark is available:
%pip install pyspark
Issue: UDTF returns no results
Possible Causes:
1. Incorrect credentials: Verify that SECRET() values are correct
2. No matching data: Check that filters match existing data in CDF
3. Time series doesn't exist: Verify the time series external_id exists in CDF
Debug Steps:
# Test credentials
from cognite.pygen import load_cognite_client_from_toml
client = load_cognite_client_from_toml("config.toml")
# Test data model query
# Test data model query (views.list() does not take a DataModelId;
# retrieve the data model with its views inlined instead)
from cognite.client.data_classes.data_modeling.ids import DataModelId
data_model_id = DataModelId(space="sailboat", external_id="sailboat", version="v1")
data_models = client.data_modeling.data_models.retrieve(data_model_id, inline_views=True)
views = data_models[0].views
print(f"Found {len(views)} views")
Issue: "Module not found" errors
Solution: Restart the Python kernel after installing packages with %pip:
- Run %pip install cognite-sdk cognite-databricks
- When prompted, click "Restart" to restart the kernel
- Re-run your registration code
Issue: UDTF registration succeeds but SQL query fails
Possible Causes:
1. Function name mismatch: Verify the registered function name matches what you're calling in SQL
2. Parameter mismatch: Check that all required parameters are provided
3. Type errors: Ensure parameter types match the UDTF's expected types
Debug Steps:
# Check registered functions
registered = generator.register_session_scoped_udtfs()
print("Registered functions:", registered)
# Verify function name in SQL matches
# If registered as "small_boat_udtf", use:
# SELECT * FROM small_boat_udtf(
# client_id => SECRET('cdf_sailboat_sailboat', 'client_id'),
# client_secret => SECRET('cdf_sailboat_sailboat', 'client_secret'),
# tenant_id => SECRET('cdf_sailboat_sailboat', 'tenant_id'),
# cdf_cluster => SECRET('cdf_sailboat_sailboat', 'cdf_cluster'),
# project => SECRET('cdf_sailboat_sailboat', 'project'),
# name => NULL,
# description => NULL
# ) LIMIT 10;
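When the error message itself is unhelpful, a quick name check can rule out the most common mismatch. The sketch below is pure Python and hypothetical: the registered list is hard-coded for illustration (in practice it would come from register_session_scoped_udtfs()), and the regex only covers the two invocation forms shown in this guide.

```python
import re

# Hypothetical example values: in practice, use the names returned by
# register_session_scoped_udtfs()
registered = ["small_boat_udtf", "time_series_datapoints_detailed_udtf"]

sql = "SELECT * FROM small_boat_udtf(name => NULL, description => NULL) LIMIT 10"

# Extract the table-function name from the FROM clause; the optional
# TABLE( ... ) wrapper covers the SQL Warehouse form as well
match = re.search(r"\bFROM\s+(?:TABLE\s*\(\s*)?(\w+)\s*\(", sql, re.IGNORECASE)
called = match.group(1) if match else None
print(called, called in registered)  # small_boat_udtf True
```

If the printed name is not in the registered list, the SQL is calling a function that was never registered in this session, which typically means a typo or a stale kernel.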
Issue: "[NOT_A_SCALAR_FUNCTION] ... appears as a scalar expression here"
Cause: UDTFs are table functions and must be invoked in the FROM clause, not as scalar expressions. The invocation syntax also differs between notebook %sql and SQL Warehouse.
Solution: Use the table-function syntax for your environment:
Notebook %sql (cluster-backed):
SELECT *
FROM time_series_datapoints_detailed_udtf(
client_id => SECRET('cdf_sailboat_sailboat', 'client_id'),
client_secret => SECRET('cdf_sailboat_sailboat', 'client_secret'),
tenant_id => SECRET('cdf_sailboat_sailboat', 'tenant_id'),
cdf_cluster => SECRET('cdf_sailboat_sailboat', 'cdf_cluster'),
project => SECRET('cdf_sailboat_sailboat', 'project'),
instance_ids => 'space1:ts1,space1:ts2',
start => current_timestamp() - INTERVAL 52 WEEKS,
end => current_timestamp() - INTERVAL 51 WEEKS,
aggregates => 'average',
granularity => '2h'
) AS t;
SQL Warehouse (Databricks SQL):
SELECT *
FROM TABLE(
time_series_datapoints_detailed_udtf(
client_id => SECRET('cdf_sailboat_sailboat', 'client_id'),
client_secret => SECRET('cdf_sailboat_sailboat', 'client_secret'),
tenant_id => SECRET('cdf_sailboat_sailboat', 'tenant_id'),
cdf_cluster => SECRET('cdf_sailboat_sailboat', 'cdf_cluster'),
project => SECRET('cdf_sailboat_sailboat', 'project'),
instance_ids => 'space1:ts1,space1:ts2',
start => current_timestamp() - INTERVAL 52 WEEKS,
end => current_timestamp() - INTERVAL 51 WEEKS,
aggregates => 'average',
granularity => '2h'
)
);
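The instance_ids parameter in both variants packs one or more space:external_id pairs into a single comma-separated string. A small parser (hypothetical, shown here only to make the expected format explicit) can catch malformed ids before they silently match nothing:

```python
def parse_instance_ids(instance_ids: str) -> list[tuple[str, str]]:
    """Split a comma-separated 'space:external_id' string into (space, external_id) pairs."""
    pairs = []
    for item in instance_ids.split(","):
        # partition() splits on the first ':' only, so external ids
        # containing ':' later in the string are preserved
        space, sep, external_id = item.strip().partition(":")
        if not sep or not space or not external_id:
            raise ValueError(f"Bad instance id {item!r}; expected 'space:external_id'")
        pairs.append((space, external_id))
    return pairs

print(parse_instance_ids("space1:ts1,space1:ts2"))  # [('space1', 'ts1'), ('space1', 'ts2')]
```

A missing space prefix or a stray trailing comma is a common reason a syntactically valid query returns no rows.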
Issue: Time Series UDTF returns NULL values
Possible Causes:
1. Time series doesn't exist: Verify the time series external_id exists
2. No datapoints in range: Check that the time range (start/end) contains data
3. Incorrect instance_id format: Ensure space and external_id are correct
Debug Steps:
# Test time series existence (retrieve by instance id, since the time
# series is identified by space + external_id in the data model;
# instance_id lookups require a recent cognite-sdk version)
from cognite.client.data_classes.data_modeling.ids import NodeId
instance_id = NodeId(space="sailboat", external_id="vessel.speed")
ts = client.time_series.retrieve(instance_id=instance_id)
print(f"Time series exists: {ts is not None}")
# Test datapoints retrieval over the last day
datapoints = client.time_series.data.retrieve(
    instance_id=instance_id,
    start="1d-ago",
    end="now",
)
print(f"Found {len(datapoints)} datapoints")
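When every value comes back NULL, it is also worth confirming the start/end window itself before suspecting the data: an inverted or zero-length window is easy to produce with relative INTERVAL arithmetic. The helper below is a hypothetical sketch that converts a window to the epoch-millisecond form CDF timestamps use and rejects inverted ranges:

```python
from datetime import datetime, timedelta, timezone

def window_ms(start: datetime, end: datetime) -> tuple[int, int]:
    """Convert a [start, end) window to epoch milliseconds, rejecting inverted ranges."""
    if start >= end:
        raise ValueError("start must be strictly before end")
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

# Example: a one-week window, analogous to the 52-weeks-ago .. 51-weeks-ago
# range used in the SQL above
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = start + timedelta(weeks=1)
start_ms, end_ms = window_ms(start, end)
print(end_ms - start_ms)  # one week: 604800000 ms
```

Printing the resolved window alongside the UDTF call makes it obvious when the query is asking for a range that predates (or postdates) the available datapoints.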
Getting Help
If you encounter issues not covered here:
- Check the logs: Look for error messages in the notebook output or Databricks logs
- Verify credentials: Ensure CDF credentials are correct and have proper permissions
- Test with simple queries: Start with basic queries before adding complex filters or joins
- Review the Technical Plan: See the Technical Plan document for detailed architecture and implementation details
Next Steps
After successfully using session-scoped UDTFs, consider:
- Unity Catalog Registration: Register UDTFs and Views in Unity Catalog for production use (see Catalog-Based UDTF Registration)
- View Creation: Create SQL Views that wrap UDTFs for easier querying
- Governance: Set up Unity Catalog permissions for production deployments
For more information, see:
- Catalog-Based UDTF Registration
- Technical Plan: CDF Databricks Integration (UDTF-Based)