[BUG] Requested types ignored if prune_schema is enabled for JSON reading #16797

Open
revans2 opened this issue Sep 11, 2024 · 1 comment
Labels
bug Something isn't working Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Sep 11, 2024

Describe the bug
I noticed this in some unit tests for the Java APIs when I tried to enable schema pruning in cuDF by default for the Java JSON read APIs that explicitly do column pruning.

  • @Test
    void testReadJSONNestedTypes() {
      Schema.Builder root = Schema.builder();
      Schema.Builder a = root.addColumn(DType.STRUCT, "a");
      a.addColumn(DType.STRING, "b");
      a.addColumn(DType.STRING, "c");
      a.addColumn(DType.STRING, "missing");
      Schema.Builder d = root.addColumn(DType.LIST, "d");
      d.addColumn(DType.INT64, "ignored");
      root.addColumn(DType.INT64, "also_missing");
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "f");
      eChild.addColumn(DType.STRING, "missing_in_list");
      eChild.addColumn(DType.INT64, "g");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType aStruct = new StructType(true,
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING));
      ListType dList = new ListType(true, new BasicType(true, DType.INT64));
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
              .column(aStruct,
                  new StructData(null, "C1", null),
                  new StructData("B2", "C2", null),
                  null,
                  null)
              .column(dList,
                  null,
                  null,
                  Arrays.asList(1L,2L,3L),
                  new ArrayList<Long>())
              .column((Long)null, null, null, null) // also_missing
              .column(eList,
                  null,
                  null,
                  null,
                  Arrays.asList(new StructData(null, null, 1L), new StructData(2L, null, null), new StructData(3L, null, 4L)))
              .build();
           Table table = Table.readJSON(schema, opts, NESTED_JSON_DATA_BUFFER)) {
        assertTablesAreEqual(expected, table);
      }
    }
    This fails because column d is returned as LIST<INT8> instead of the requested LIST<INT64>, which is what is returned for column d when pruning is disabled.
  • @Test
    void testReadJSONNestedTypesDataSource() {
      Schema.Builder root = Schema.builder();
      Schema.Builder a = root.addColumn(DType.STRUCT, "a");
      a.addColumn(DType.STRING, "b");
      a.addColumn(DType.STRING, "c");
      a.addColumn(DType.STRING, "missing");
      Schema.Builder d = root.addColumn(DType.LIST, "d");
      d.addColumn(DType.INT64, "ignored");
      root.addColumn(DType.INT64, "also_missing");
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "g");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType aStruct = new StructType(true,
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING));
      ListType dList = new ListType(true, new BasicType(true, DType.INT64));
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
              .column(aStruct,
                  new StructData(null, "C1", null),
                  new StructData("B2", "C2", null),
                  null,
                  null)
              .column(dList,
                  null,
                  null,
                  Arrays.asList(1L,2L,3L),
                  new ArrayList<Long>())
              .column((Long)null, null, null, null) // also_missing
              .column(eList,
                  null,
                  null,
                  null,
                  Arrays.asList(new StructData(1L), new StructData((Long)null), new StructData(4L)))
              .build();
           MultiBufferDataSource source = sourceFrom(NESTED_JSON_DATA_BUFFER);
           Table table = Table.readJSON(schema, opts, source)) {
        assertTablesAreEqual(expected, table);
      }
    }
    This fails for the same reason as the test above: column d has the wrong type.
  • @Test
    void testReadJSONNestedTypesVerySmallChanges() {
      Schema.Builder root = Schema.builder();
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "g");
      eChild.addColumn(DType.INT64, "f");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64),
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
              .column(eList,
                  null,
                  null,
                  null,
                  Arrays.asList(new StructData(1L, null), new StructData(null, 2L), new StructData(4L, 3L)))
              .build();
           Table table = Table.readJSON(schema, opts, NESTED_JSON_DATA_BUFFER)) {
        assertTablesAreEqual(expected, table);
      }
    }
    This fails because column e was requested as a LIST<STRUCT> but was returned as a LIST<INT8> column.

Steps/Code to reproduce bug
To reproduce this, take #16796 and enable column pruning for the tests listed above as failing. The third test is the scariest one: it appears to return totally invalid results, where the data column is empty despite there being offsets pointing into it.

If I need to create a C++ repro case, I am happy to do it.
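
The invalid result described for the third test (an empty data column with offsets still pointing into it) can be illustrated with a small consistency check. This is only a sketch: `ListOffsetsCheck` and `validate` are hypothetical names, not part of the cuDF Java API, and the offsets array is an assumed layout for four rows where only the last row holds three elements.

```java
/** Hypothetical helper, not part of cuDF: checks list offsets against the child length. */
public class ListOffsetsCheck {
    /**
     * Returns null when the offsets are consistent with the child length,
     * otherwise a short description of the first problem found.
     */
    static String validate(int[] offsets, int childLength) {
        if (offsets.length == 0) {
            return "offsets must contain at least one entry";
        }
        if (offsets[0] < 0) {
            return "first offset is negative";
        }
        // Offsets must never decrease from one row to the next.
        for (int i = 1; i < offsets.length; i++) {
            if (offsets[i] < offsets[i - 1]) {
                return "offsets decrease at index " + i;
            }
        }
        // The final offset must not point past the end of the child column.
        int last = offsets[offsets.length - 1];
        if (last > childLength) {
            return "last offset " + last + " exceeds child length " + childLength;
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed layout: four rows, only the last row has three elements.
        int[] offsets = {0, 0, 0, 0, 3};
        // Child column with three rows: offsets are consistent.
        System.out.println(validate(offsets, 3));
        // Empty child column: the last offset points past the end of the data.
        System.out.println(validate(offsets, 0));
    }
}
```

A check along these lines would flag the third test's result as structurally invalid, independent of the type mismatch.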

Expected behavior
I would expect the types in the schema to be honored, at least in the same way that they are in the non-pruning use case.
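
As a rough illustration of what "honored" means here, the sketch below compares a requested nested type against a returned one and reports where they diverge. `SchemaTypeCheck`, `TypeNode`, and `findMismatches` are hypothetical stand-ins invented for this example; they are not cuDF classes, and the type names are plain strings rather than `DType` values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for comparing nested column types; not part of the cuDF Java API. */
public class SchemaTypeCheck {
    /** A type name such as "LIST" or "INT64" plus its child types. */
    static final class TypeNode {
        final String name;
        final List<TypeNode> children;

        TypeNode(String name, TypeNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }

        @Override
        public String toString() {
            if (children.isEmpty()) {
                return name;
            }
            List<String> parts = new ArrayList<>();
            for (TypeNode c : children) {
                parts.add(c.toString());
            }
            return name + "<" + String.join(",", parts) + ">";
        }
    }

    /** Collects one message per position where the returned type differs from the requested one. */
    static List<String> findMismatches(String path, TypeNode requested, TypeNode returned) {
        List<String> out = new ArrayList<>();
        collect(path, requested, returned, out);
        return out;
    }

    static void collect(String path, TypeNode requested, TypeNode returned, List<String> out) {
        if (!requested.name.equals(returned.name)) {
            out.add(path + ": requested " + requested + " but got " + returned);
            return; // children are not comparable once the parent type differs
        }
        if (requested.children.size() != returned.children.size()) {
            out.add(path + ": requested " + requested.children.size()
                + " children but got " + returned.children.size());
            return;
        }
        for (int i = 0; i < requested.children.size(); i++) {
            collect(path + "[" + i + "]", requested.children.get(i), returned.children.get(i), out);
        }
    }

    public static void main(String[] args) {
        // Column d from the first test: LIST<INT64> requested, LIST<INT8> returned.
        TypeNode requested = new TypeNode("LIST", new TypeNode("INT64"));
        TypeNode returned = new TypeNode("LIST", new TypeNode("INT8"));
        for (String m : findMismatches("d", requested, returned)) {
            System.out.println(m);
        }
    }
}
```

For column d this reports a mismatch at the list's child, which is the same discrepancy the failing tests surface through `assertTablesAreEqual`.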

@revans2 revans2 added bug Something isn't working Spark Functionality that helps Spark RAPIDS labels Sep 11, 2024
rapids-bot bot pushed a commit that referenced this issue Sep 24, 2024
This adds the options to enable column_pruning when reading JSON using the Java APIs.

This is still in draft because there are test failures if this is turned on for those tests.

#16797

That said, the performance impact from enabling column pruning on some queries is huge. For one query in particular, the current code takes 161.5 seconds, and with cuDF column pruning it takes just 16.5 seconds. That is roughly a 10x speedup for something that is fairly real-world.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Nghia Truong (https://github.com/ttnghia)

URL: #16796
@revans2
Contributor Author

revans2 commented Oct 7, 2024

I think this is fixed if the experimental feature is enabled.
