Skip to content

fix(bendpy): register_csv/register_tsv fails with column position error#19444

Draft
bohutang wants to merge 13 commits intomainfrom
fix/bendpy-register-csv-column-positions
Draft

fix(bendpy): register_csv/register_tsv fails with column position error#19444
bohutang wants to merge 13 commits intomainfrom
fix/bendpy-register-csv-column-positions

Conversation

@bohutang
Copy link
Copy Markdown
Member

@bohutang bohutang commented Feb 11, 2026

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

register_csv() and register_tsv() generate SELECT * in the underlying CREATE VIEW, which fails because CSV/TSV files require explicit column positions ($1, $2, ...).

Fix: call infer_schema() first to detect column names, then generate SELECT $1 AS col1, $2 AS col2, ... instead of SELECT *.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@github-actions github-actions bot added the pr-bugfix this PR patches a bug in codebase label Feb 11, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed9758317f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@bohutang
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5cd10f5373

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +214 to +216
let select_clause = match file_format {
"csv" | "tsv" => self.build_column_select(&file_path, file_format, connection, py)?,
_ => "*".to_string(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Skip TSV schema inference until infer_schema supports TSV

register_tsv() now routes through build_column_select, which emits infer_schema(..., file_format => 'TSV'); however the infer-schema table function currently rejects TSV formats (it only accepts Parquet/CSV/NDJSON in src/query/service/src/table_functions/infer_schema/infer_schema_table.rs), so TSV registration still fails on every call. This means the patch does not actually fix the TSV path and users still cannot register TSV files.

Useful? React with 👍 / 👎.

@bohutang bohutang marked this pull request as draft February 12, 2026 01:51
@KKould KKould force-pushed the fix/bendpy-register-csv-column-positions branch from 3e7a038 to 62966a3 Compare March 17, 2026 10:22
@KKould KKould force-pushed the fix/bendpy-register-csv-column-positions branch 2 times, most recently from 9e0390a to b3c5792 Compare March 18, 2026 11:29
bohutang and others added 13 commits March 18, 2026 19:39
- Extract resolve_file_path() and extract_string_column() as standalone helpers
- Replace imperative loop with functional iterator chain
- Rename infer_column_names to build_column_select for clarity
- Deduplicate mock logic in test_connections.py via _register_delimited()
…tion

The test_bendpy job runs on a bare self-hosted ARM64 runner without
clang/build-essential installed. Add a setup step that runs dev_setup.sh
to install build dependencies and sets JEMALLOC env vars for Linux
development builds, matching what the macOS path already does.
infer_schema only supports Parquet, CSV, and NDJSON formats. Routing
TSV through build_column_select would fail at runtime. Fall back to
SELECT * for TSV and remove the TSV integration test accordingly.
…nux CI

- Remove braces from single-expression match arm to satisfy cargo fmt
- Replace dev_setup.sh with direct apt-get install to avoid perl errors
  on the self-hosted ARM64 runner
The .cargo/config.toml uses -fuse-ld=mold on Linux targets, so mold
must be installed on the self-hosted runner alongside clang.
infer_schema requires the stage system (system.stage table) which is
not available in bendpy's embedded context. Replace with direct file
reading of the CSV header line to extract column names for building
$1 AS col1, $2 AS col2, ... select clauses.
The embedded bendpy context does not have the system.stage table, so
file-based queries (SELECT FROM 'fs://...') cannot work. Remove the
integration test; SQL generation is already covered by mock tests in
test_connections.py.
@KKould KKould force-pushed the fix/bendpy-register-csv-column-positions branch from b3c5792 to 776a4cd Compare March 18, 2026 11:41
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 18, 2026

🤖 CI Job Analysis

Workflow: 23242956094

📊 Summary

  • Total Jobs: 86
  • Failed Jobs: 1
  • Retryable: 0
  • Code Issues: 1

NO RETRY NEEDED

All failures appear to be code/test issues requiring manual fixes.

🔍 Job Details

  • test_bendpy: Not retryable (Code/Test)

🤖 About

Automated analysis using job annotations to distinguish infrastructure issues (auto-retried) from code/test issues (manual fixes needed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix this PR patches a bug in codebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bendpy: register_csv() fails with 'Query from CSV file lacks column positions'

3 participants