Open
Labels
bug: incorrect result (Something isn't working)
high priority (Your PR will be reviewed very quickly if you address this)
Description
Looks like we have some discrepancies in how null keys are matched in joins:
import narwhals as nw
import polars as pl
ldf = pl.DataFrame({'a': [1, 2, None]})
rdf = pl.DataFrame({'a': [1, 2, None], 'b': [5,6,7]})
# Compare 'inner' (the default) and 'left' joins on each backend
print('polars')
print(ldf.join(rdf, on='a'))
print(ldf.join(rdf, on='a', how='left'))
engine = 'pandas'
print(engine)
print(nw.from_native(ldf).lazy().collect(engine).join(nw.from_native(rdf).lazy().collect(engine), on='a'))
print(nw.from_native(ldf).lazy().collect(engine).join(nw.from_native(rdf).lazy().collect(engine), on='a', how='left'))
# pandas backed by pyarrow extension arrays
print('pandas[pyarrow]')
print(nw.from_native(ldf.to_pandas(use_pyarrow_extension_array=True)).join(nw.from_native(rdf.to_pandas(use_pyarrow_extension_array=True)), on='a'))
print(nw.from_native(ldf.to_pandas(use_pyarrow_extension_array=True)).join(nw.from_native(rdf.to_pandas(use_pyarrow_extension_array=True)), on='a', how='left'))
engine = 'pyarrow'
print(engine)
print(nw.from_native(ldf).lazy().collect(engine).join(nw.from_native(rdf).lazy().collect(engine), on='a'))
print(nw.from_native(ldf).lazy().collect(engine).join(nw.from_native(rdf).lazy().collect(engine), on='a', how='left'))
engine = 'duckdb'
print(engine)
print(nw.from_native(ldf).lazy(backend=engine).join(nw.from_native(rdf).lazy(backend=engine), on='a'))
print(nw.from_native(ldf).lazy(backend=engine).join(nw.from_native(rdf).lazy(backend=engine), on='a', how='left'))
Output:
polars
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 2   ┆ 6   │
└─────┴─────┘
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 5    │
│ 2    ┆ 6    │
│ null ┆ null │
└──────┴──────┘
pandas
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| a b |
| 0 1.0 5 |
| 1 2.0 6 |
| 2 NaN 7 |
└──────────────────┘
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| a b |
| 0 1.0 5 |
| 1 2.0 6 |
| 2 NaN 7 |
└──────────────────┘
pandas[pyarrow]
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| a b |
| 0 1 5 |
| 1 2 6 |
| 2 <NA> 7 |
└──────────────────┘
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| a b |
| 0 1 5 |
| 1 2 6 |
| 2 <NA> 7 |
└──────────────────┘
pyarrow
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| pyarrow.Table |
| a: int64 |
| b: int64 |
| ---- |
| a: [[1,2]] |
| b: [[5,6]] |
└──────────────────┘
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| pyarrow.Table |
| a: int64 |
| b: int64 |
| ---- |
| a: [[1,2,null]] |
| b: [[5,6,null]] |
└──────────────────┘
duckdb
┌──────────────────┐
|Narwhals LazyFrame|
|------------------|
|┌───────┬───────┐ |
|│   a   │   b   │ |
|│ int64 │ int64 │ |
|├───────┼───────┤ |
|│     1 │     5 │ |
|│     2 │     6 │ |
|└───────┴───────┘ |
└──────────────────┘
┌──────────────────┐
|Narwhals LazyFrame|
|------------------|
|┌───────┬───────┐ |
|│   a   │   b   │ |
|│ int64 │ int64 │ |
|├───────┼───────┤ |
|│     1 │     5 │ |
|│     2 │     6 │ |
|│  NULL │  NULL │ |
|└───────┴───────┘ |
└──────────────────┘
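Looks like pandas is the odd one out here: both the default and the pyarrow-backed pandas paths match the null key on the left with the null key on the right (giving b = 7) in inner and left joins, whereas Polars, PyArrow and DuckDB never match null keys.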
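For reference, the same behaviour can be reproduced with pandas alone, which suggests it comes from `merge` itself rather than from the Narwhals wrapper. The pandas documentation for `merge` notes that rows with null keys on both sides are matched against each other, unlike usual SQL join behaviour. A minimal check, rebuilding the two frames from the example directly in pandas:

```python
import pandas as pd

# Same data as the reproducer, built natively in pandas.
ldf = pd.DataFrame({"a": [1, 2, None]})
rdf = pd.DataFrame({"a": [1, 2, None], "b": [5, 6, 7]})

# Inner and left merges on a key column that contains a null on both sides.
inner = ldf.merge(rdf, on="a")
left = ldf.merge(rdf, on="a", how="left")

# In both results the null key is paired with b == 7.
print(inner)
print(left)
```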
We should align the outputs or, if that's not feasible, raise for now if the join key(s) contain null values.
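Both options could be sketched roughly as follows for the pandas case. This is only an illustration, not the actual Narwhals implementation: the helper names `check_no_null_keys` and `join_ignoring_null_keys` are hypothetical, and only `inner` and `left` joins are covered. The idea is to drop null-key rows before merging so they can never match, while still keeping left rows with null keys in a left join.

```python
import pandas as pd


def check_no_null_keys(frame, keys):
    """Fallback option (hypothetical helper): raise if any join key has nulls."""
    null_keys = [key for key in keys if frame[key].isna().any()]
    if null_keys:
        raise ValueError(f"Join keys contain null values: {null_keys}")


def join_ignoring_null_keys(left, right, on, how="inner"):
    """Alignment option (hypothetical helper): never match null keys."""
    if how not in {"inner", "left"}:
        raise ValueError("this sketch only covers 'inner' and 'left' joins")
    keys = [on] if isinstance(on, str) else list(on)
    # A right-hand row with a null in any key column can never match.
    right = right[right[keys].notna().all(axis=1)]
    if how == "inner":
        # For an inner join, left-hand rows with null keys are dropped too.
        left = left[left[keys].notna().all(axis=1)]
    # For a left join, null-key rows stay on the left and get nulls from the right.
    return left.merge(right, on=keys, how=how)


ldf = pd.DataFrame({"a": [1, 2, None]})
rdf = pd.DataFrame({"a": [1, 2, None], "b": [5, 6, 7]})

inner_result = join_ignoring_null_keys(ldf, rdf, on="a")
left_result = join_ignoring_null_keys(ldf, rdf, on="a", how="left")
print(inner_result)
print(left_result)
```

With the frames from the example, the inner join keeps only the two non-null keys and the left join keeps all three left rows with a null `b` for the null key, in line with the Polars, PyArrow and DuckDB results above. One side effect of this approach is that `b` becomes a float column in the left join for default pandas dtypes, since the unmatched row introduces a missing value.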