Implement String/Binary View Support #340

Merged
merged 2 commits into from
Sep 30, 2024

Conversation

WillAyd
Collaborator

@WillAyd WillAyd commented Sep 24, 2024

closes #333 and #316

@WillAyd
Collaborator Author

WillAyd commented Sep 24, 2024

@skyth540 if you get the chance to test this out on your end that would be very helpful

@WillAyd
Collaborator Author

WillAyd commented Sep 24, 2024

Hmm the polars behavior looks strange - opened pola-rs/polars#18909 upstream to take a closer look

@skyth540

> @skyth540 if you get the chance to test this out on your end that would be very helpful

How do I get the changes? Last time I built from main, nothing was different from the normal pip release

@WillAyd
Collaborator Author

WillAyd commented Sep 24, 2024

You can pip install directly from the branch:

pip install git+https://github.com/innobi/pantab.git@string-view

However, probably worth tracking the aforementioned issue before diving too far in. I think there are some bugs upstream in polars that might need to be fixed first
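
One way to confirm that the branch build, rather than the regular PyPI release, is what actually got installed: pip records the origin of a direct-URL install in a `direct_url.json` file inside the package metadata (PEP 610). The sketch below reads that file with the standard library. The helper name `install_origin` is made up for illustration, and the check assumes the package was installed with pip.

```python
import json
from importlib import metadata


def install_origin(package: str):
    """Return the parsed direct_url.json for `package`, or None.

    None means the package is not installed, or it was installed from an
    index such as PyPI (pip only writes the file for direct-URL installs).
    """
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    raw = dist.read_text("direct_url.json")
    return json.loads(raw) if raw else None


origin = install_origin("pantab")
if origin is None:
    print("pantab is not installed, or it came from a package index")
else:
    # For a git install, vcs_info carries the requested revision and commit id.
    print("installed from:", origin.get("url"), origin.get("vcs_info"))
```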

@skyth540

It is erroring

Building wheels for collected packages: pantab
Building wheel for pantab (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for pantab (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [287 lines of output]
*** scikit-build-core 0.10.7 using CMake 3.30.3 (wheel)
*** Configuring CMake...
2024-09-25 07:47:07,530 - scikit_build_core - WARNING - Can't find a Python library, got libdir=None, ldlibrary=None, multiarch=None, masd=None
loading initial cache file E:\TEMP\tmp5jmcmbvn\build\CMakeInit.txt
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.19041.
-- The C compiler identification is MSVC 19.41.34120.0
-- The CXX compiler identification is MSVC 19.41.34120.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual Studio/2022/BuildTools/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python: C:\Users\nicho\anaconda3\python.exe (found version "3.12.4") found components: Interpreter Development.Module
-- Building using CMake version: 3.30.3
Could not find clang-tidy installation - checks disabled
-- Configuring done (37.7s)
-- Generating done (0.2s)
-- Build files have been written to: E:/TEMP/tmp5jmcmbvn/build
*** Building project with Visual Studio 17 2022...
MSBuild version 17.11.9+a69bbaaf5 for .NET Framework

  C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets(541,5): warning MSB8029: The Intermediate directory or Output directory cannot reside under the Temporary directory as it could lead to issues with incremental build. [E:\TEMP\tmp5jmcmbvn\build\ZERO_CHECK.vcxproj]
    1>Checking Build System
  C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets(541,5): warning MSB8029: The Intermediate directory or Output directory cannot reside under the Temporary directory as it could lead to issues with incremental build. [E:\TEMP\tmp5jmcmbvn\build\src\pantab\copy-python-src.vcxproj]
    Generating Release/__init__.py
    Generating Release/_reader.py
    Generating Release/_types.py
    Generating Release/_writer.py
    Building Custom Rule E:/TEMP/pip-req-build-irkx6iwh/src/pantab/CMakeLists.txt
  C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets(541,5): warning MSB8029: The Intermediate directory or Output directory cannot reside under the Temporary directory as it could lead to issues with incremental build. [E:\TEMP\tmp5jmcmbvn\build\_deps\nanoarrow-project-build\nanoarrow.vcxproj]
    Building Custom Rule E:/TEMP/tmp5jmcmbvn/build/_deps/nanoarrow-project-src/CMakeLists.txt
    array.c
    schema.c
    array_stream.c
    utils.c
    Generating Code...
    nanoarrow.vcxproj -> E:\TEMP\tmp5jmcmbvn\build\_deps\nanoarrow-project-build\Release\nanoarrow.lib
  C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets(541,5): warning MSB8029: The Intermediate directory or Output directory cannot reside under the Temporary directory as it could lead to issues with incremental build. [E:\TEMP\tmp5jmcmbvn\build\src\pantab\nanobind-static.vcxproj]
    Building Custom Rule E:/TEMP/pip-req-build-irkx6iwh/src/pantab/CMakeLists.txt
    nb_internals.cpp
    nb_func.cpp
    nb_type.cpp
    nb_enum.cpp
    nb_ndarray.cpp
    nb_static_property.cpp
    common.cpp
    error.cpp
    trampoline.cpp
    implicit.cpp
    Generating Code...
    nanobind-static.vcxproj -> E:\TEMP\tmp5jmcmbvn\build\src\pantab\Release\nanobind-static.lib
  C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets(541,5): warning MSB8029: The Intermediate directory or Output directory cannot reside under the Temporary directory as it could lead to issues with incremental build. [E:\TEMP\tmp5jmcmbvn\build\src\pantab\libpantab.vcxproj]
    Building Custom Rule E:/TEMP/pip-req-build-irkx6iwh/src/pantab/CMakeLists.txt
    libpantab.cpp
    reader.cpp
    writer.cpp
  E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31): error C2665: 'std::basic_string<char,std::char_traits<char>,std::allocator<char>>::basic_string': no overloaded function could convert all the argument types [E:\TEMP\tmp5jmcmbvn\build\src\pantab\libpantab.vcxproj]
        C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(1327,5):
        could be 'std::basic_string<char,std::char_traits<char>,std::allocator<char>>::basic_string(std::initializer_list<_Elem>,const _Alloc &)'
            with
            [
                _Elem=char,
                _Alloc=std::allocator<char>
            ]
            E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
            'std::basic_string<char,std::char_traits<char>,std::allocator<char>>::basic_string(std::initializer_list<_Elem>,const _Alloc &)': cannot convert argument 1 from 'ArrowBufferViewData' to 'std::initializer_list<_Elem>'
            with
            [
                _Elem=char,
                _Alloc=std::allocator<char>
            ]
            and
            [
                _Elem=char
            ]
                E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
                No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
        C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(1148,5):
        or       'std::basic_string<char,std::char_traits<char>,std::allocator<char>>::basic_string(std::_String_constructor_rvalue_allocator_tag,_Alloc &&)'
            with
            [
                _Alloc=std::allocator<char>
            ]
            E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
            'std::basic_string<char,std::char_traits<char>,std::allocator<char>>::basic_string(std::_String_constructor_rvalue_allocator_tag,_Alloc &&)': cannot convert argument 1 from 'ArrowBufferViewData' to 'std::_String_constructor_rvalue_allocator_tag'
            with
            [
                _Alloc=std::allocator<char>
            ]
                E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
                No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
        C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(1014,5):
        or       'std::basic_string<char,std::char_traits<char>,std::allocator<char>>::basic_string(std::basic_string<char,std::char_traits<char>,std::allocator<char>> &&,const _Alloc &) noexcept(<expr>)'
            with
            [
                _Alloc=std::allocator<char>
            ]
            E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
            'std::basic_string<char,std::char_traits<char>,std::allocator<char>>::basic_string(std::basic_string<char,std::char_traits<char>,std::allocator<char>> &&,const _Alloc &) noexcept(<expr>)': cannot convert argument 1 from 'ArrowBufferViewData' to 'std::basic_string<char,std::char_traits<char>,std::allocator<char>> &&'
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
Reason: cannot convert from 'ArrowBufferViewData' to 'std::basic_string<char,std::char_traits,std::allocator>'
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(765,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const unsigned __int64,const _Elem)'
with
[
_Elem=char
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const unsigned __int64,const _Elem)': cannot convert argument 1 from 'ArrowBufferViewData' to 'const unsigned __int64'
with
[
_Elem=char
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(735,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _Elem *const ,const unsigned __int64)'
with
[
_Elem=char
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _Elem *const ,const unsigned __int64)': cannot convert argument 1 from 'ArrowBufferViewData' to 'const _Elem *const '
with
[
_Elem=char
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(707,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const std::basic_string<char,std::char_traits,std::allocator> &,const unsigned __int64,const _Alloc &)'
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const std::basic_string<char,std::char_traits,std::allocator> &,const unsigned __int64,const _Alloc &)': cannot convert argument 1 from 'ArrowBufferViewData' to 'const std::basic_string<char,std::char_traits,std::allocator> &'
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
Reason: cannot convert from 'ArrowBufferViewData' to 'const std::basic_string<char,std::char_traits,std::allocator>'
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(702,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const std::basic_string<char,std::char_traits,std::allocator> &,const _Alloc &)'
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const std::basic_string<char,std::char_traits,std::allocator> &,const _Alloc &)': cannot convert argument 1 from 'ArrowBufferViewData' to 'const std::basic_string<char,std::char_traits,std::allocator> &'
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
Reason: cannot convert from 'ArrowBufferViewData' to 'const std::basic_string<char,std::char_traits,std::allocator>'
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(1138,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _Ty &,const unsigned __int64,const unsigned __int64,const _Alloc &)'
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _Ty &,const unsigned __int64,const unsigned __int64,const _Alloc &)': expects 4 arguments - 2 provided
with
[
_Alloc=std::allocator
]
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(1131,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _StringViewIsh &,const _Alloc &)'
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _StringViewIsh &,const _Alloc &)': could not deduce template argument for '__formal'
with
[
_Alloc=std::allocator
]
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(599,9):
'std::enable_if_t<false,int>' : Failed to specialize alias template
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(779,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(_Iter,_Iter,const _Alloc &)'
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(_Iter,_Iter,const _Alloc &)': template parameter '_Iter' is ambiguous
with
[
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
could be 'int64_t'
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
or 'ArrowBufferViewData'
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(_Iter,_Iter,const _Alloc &)': could not deduce template argument for '_Iter' from 'int64_t'
with
[
_Alloc=std::allocator
]
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(773,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const unsigned __int64,const _Elem,const _Alloc &)'
with
[
_Elem=char,
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const unsigned __int64,const _Elem,const _Alloc &)': expects 3 arguments - 2 provided
with
[
_Elem=char,
_Alloc=std::allocator
]
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(756,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _Elem *const ,const _Alloc &)'
with
[
_Elem=char,
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
'initializing': cannot convert from 'ArrowBufferViewData' to 'const _Elem *const '
with
[
_Elem=char
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,40):
No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\xstring(743,5):
or 'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _Elem *const ,const unsigned __int64,const _Alloc &)'
with
[
_Elem=char,
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
'std::basic_string<char,std::char_traits,std::allocator>::basic_string(const _Elem *const ,const unsigned __int64,const _Alloc &)': expects 3 arguments - 2 provided
with
[
_Elem=char,
_Alloc=std::allocator
]
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
while trying to match the argument list '(ArrowBufferViewData, int64_t)'
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(231,31):
the template instantiation context (the oldest one first) is
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(565,5):
while compiling class template member function 'std::unique_ptr<InsertHelper,std::default_delete>::unique_ptr(std::unique_ptr<_Ux,_Dx> &&) noexcept'
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(565,5):
while processing the default template argument of 'std::unique_ptr<InsertHelper,std::default_delete>::unique_ptr(std::unique_ptr<_Ux,_Dx> &&) noexcept'
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\memory(3377,20):
see reference to variable template 'const bool conjunction_v<std::negation<std::is_array<BinaryViewInsertHelper<1> > >,std::is_convertible<BinaryViewInsertHelper<1> *,InsertHelper *>,std::is_convertible<std::default_delete<BinaryViewInsertHelper<1> >,std::default_delete > >' being compiled
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\memory(3377,20):
see reference to class template instantiation 'std::is_convertible<BinaryViewInsertHelper *,_Ty *>' being compiled
with
[
_Ty=InsertHelper
]
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.41.34120\include\type_traits(322,39):
see reference to class template instantiation 'BinaryViewInsertHelper' being compiled
E:\TEMP\pip-req-build-irkx6iwh\src\pantab\writer.cpp(210,8):
while compiling class template member function 'void BinaryViewInsertHelper::InsertValueAtIndex(size_t)'

  *** CMake build failed
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pantab
Failed to build pantab
ERROR: Could not build wheels for pantab, which is required to install pyproject.toml-based projects

@WillAyd
Collaborator Author

WillAyd commented Sep 27, 2024

Sorry about that. Should be working now if you'd like to give it another go

@skyth540

Looks like something is still breaking... I successfully installed it from the branch, but using frame_to_hyper is crashing my kernel.

Here are the Jupyter logs:

Visual Studio Code (1.93.1, undefined, desktop)
Jupyter Extension Version: 2024.8.1.
Python Extension Version: 2024.14.1.
Pylance Extension Version: 2024.9.2.
Platform: win32 (x64).
Temp Storage folder ~\AppData\Roaming\Code\User\globalStorage\ms-toolsai.jupyter\version-2024.8.1
Workspace folder \OneDrive\Desktop\Python Projects, Home = c:\Users\nicho
12:54:31.319 [warn] No interpreter with path c:\Python_Virtual_Environments\Snowpark_venv\Snowpark_venv\Scripts\python.exe found in Python API, will convert Uri path to string as Id c:\Python_Virtual_Environments\Snowpark_venv\Snowpark_venv\Scripts\python.exe
12:54:31.900 [info] Starting Kernel (Python Path: \anaconda3\python.exe, Conda, 3.12.4) for '\OneDrive\Desktop\project.ipynb' (disableUI=true)
12:54:34.238 [warn] Kernel Spec for 'Snowpark_venv' (
\AppData\Roaming\jupyter\kernels\snowpark_venv\kernel.json) hidden, as we cannot find a matching interpreter argv = 'C:\Python_Virtual_Environments\Snowpark_venv\Snowpark_venv\Scripts\python.exe'. To resolve this, please change 'C:\Python_Virtual_Environments\Snowpark_venv\Snowpark_venv\Scripts\python.exe' to point to the fully qualified Python executable.
12:54:39.272 [info] Process Execution: ~\anaconda3\python.exe -m pip list
12:54:39.385 [info] Process Execution: ~\anaconda3\python.exe -c "import ipykernel; print(ipykernel.version); print("5dc3a68c-e34e-4080-9c3e-2a532b2ccb4d"); print(ipykernel.file)"
12:54:39.393 [info] Process Execution: ~\anaconda3\python.exe c:\Users~.vscode\extensions\ms-toolsai.jupyter-2024.8.1-win32-x64\pythonFiles\vscode_datascience_helpers\kernel_interrupt_daemon.py --ppid 12380
> cwd: ~.vscode\extensions\ms-toolsai.jupyter-2024.8.1-win32-x64\pythonFiles\vscode_datascience_helpers
12:54:39.609 [info] Process Execution: ~\anaconda3\python.exe -m ipykernel_launcher --f=c:\Users~\AppData\Roaming\jupyter\runtime\kernel-v3de2eaa0630cf1896009153504f8d8bfe905a8879.json
> cwd: ~\OneDrive\Desktop
12:54:41.498 [info] Kernel successfully started
12:54:41.512 [info] Process Execution: ~\anaconda3\python.exe c:\Users~.vscode\extensions\ms-toolsai.jupyter-2024.8.1-win32-x64\pythonFiles\printJupyterDataDir.py
12:54:59.849 [info] Restart requested ~\OneDrive\Desktop\project.ipynb
12:54:59.859 [info] Process Execution: c:\WINDOWS\System32\taskkill.exe /F /T /PID 5356
12:54:59.867 [info] Process Execution: ~\anaconda3\python.exe -c "import ipykernel; print(ipykernel.version); print("5dc3a68c-e34e-4080-9c3e-2a532b2ccb4d"); print(ipykernel.file)"
12:54:59.951 [info] Process Execution: ~\anaconda3\python.exe -m ipykernel_launcher --f=c:\Users~\AppData\Roaming\jupyter\runtime\kernel-v3132f99d4e1795d35c61c383f16963df44be4c249.json
> cwd: ~\OneDrive\Desktop
12:55:02.042 [info] Restarted 5c241309-3938-4860-8e6e-472c1eaf2746
13:19:39.110 [error] Disposing session as kernel process died ExitCode: 3221225477, Reason:
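
A note on the exit code in the last log line: Windows reports process exit statuses as unsigned 32-bit values, and they are easier to recognize in hexadecimal. A small sketch of the conversion is below. The value `0xC0000005` is the Windows `STATUS_ACCESS_VIOLATION` code, which points to a native crash (an invalid memory access) rather than a Python exception. The log alone does not show where in the native code the crash occurred.

```python
exit_code = 3221225477  # value reported in the Jupyter log above

# Hexadecimal form, the way NTSTATUS codes are usually documented.
print(hex(exit_code))  # 0xc0000005

STATUS_ACCESS_VIOLATION = 0xC0000005
is_access_violation = exit_code == STATUS_ACCESS_VIOLATION
print("access violation:", is_access_violation)

# Some tools print the same status as a signed 32-bit integer instead.
signed = exit_code - (1 << 32) if exit_code >= (1 << 31) else exit_code
print(signed)  # -1073741819
```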

@WillAyd
Collaborator Author

WillAyd commented Sep 27, 2024

Do you have a small code sample that is crashing?

@skyth540

skyth540 commented Sep 27, 2024

pl.enable_string_cache()

df_schema ={...}   # mix of pl.String, pl.Categorical, pl.Int16, pl.Date, and pl.Float32

df = pl.scan_csv(r'G:\Duda\Data Projects\Kroger 8451\Raw Data New\*.csv', low_memory = True, schema = df_schema)

# several more steps manipulating the lazyframe, including filters, joins (same data types as above), column renaming, and a concat



path = r"G:\path\test.hyper"

pt.frame_to_hyper(step_3.collect(streaming = True), path, table = 'test')

step_3.collect(streaming = True) works fine on its own and loads the dataframe just fine

@WillAyd
Collaborator Author

WillAyd commented Sep 27, 2024

Thanks, but I can't do much with that, especially not having access to a windows machine.

Can you try and identify a subset of the data that can be used as a fully reproducible code sample?

@skyth540

Here is basically my issue:

import polars as pl
import pantab as pt

schema = {
    'string_col':pl.String,
    'cat_col':pl.Categorical,
    'int_col':pl.Int64,
    'float_col':pl.Float32
}

data = {
    'string_col':['ID129120','ID8923879','ID89231987','ID126735817'],
    'cat_col':['Apple', 'Orange', 'Pear', 'Peach'],
    'int_col':[44,23,6,88],
    'float_col':[12.25,4.56,12.645,12.098]
}

df = pl.DataFrame(data, schema = schema)

path = 'test.hyper'

pt.frame_to_hyper(df, path, table = 'test')

@WillAyd
Collaborator Author

WillAyd commented Sep 30, 2024

Great thanks. I'm surprised you aren't getting a better error message, but I am getting ValueError: Unsupported Arrow type: dictionary

So there must be some kind of issue with error handling on Windows, but more generally the issue is that we don't support dictionary types, which I assume polars is using for storage on the categorical type. If you cast that to string or remove from the dataset it should work.

Can you open a separate issue to request dictionary support? It ultimately will get you to the same place as strings (since Hyper does not support such a feature), but I can see that as a nice usability perk to write that data type (assuming strings are being held)

@skyth540

That's a bummer... I switched from strings to categoricals and got a huge performance increase. It also made some of what I wanted to do possible within my RAM limitations (categoricals are more efficient). But if Hyper doesn't support them then I'll switch back to strings

@WillAyd
Collaborator Author

WillAyd commented Sep 30, 2024

> But if hyper doesn't support them then I'll switch back to strings

To be clear we can theoretically write them and keep the memory savings there. However, Hyper cannot store strings encoded in that manner (at least not today), so when you read the same data back it will come back as a string
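
To make the point about encoding concrete, here is a minimal pure-Python sketch of dictionary encoding, the general idea behind categorical and Arrow dictionary types: each distinct string is stored once, and the column holds small integer indices into that list. The function names are hypothetical and this is not how polars, Arrow, or pantab implement it. It only illustrates why a writer could expand the values to plain strings on the way out, and why a reader that has no dictionary type hands back plain strings.

```python
def dictionary_encode(values):
    """Split values into (indices, dictionary) with each distinct value stored once."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index into `dictionary`
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def dictionary_decode(indices, dictionary):
    """Expand the indices back into plain values, as a string column would hold them."""
    return [dictionary[i] for i in indices]


fruit = ["Apple", "Orange", "Apple", "Pear", "Orange", "Apple"]
indices, dictionary = dictionary_encode(fruit)
restored = dictionary_decode(indices, dictionary)

print(indices)     # [0, 1, 0, 2, 1, 0]
print(dictionary)  # ['Apple', 'Orange', 'Pear']
print(restored == fruit)  # True
```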

@skyth540

😅

Switching to strings still errors with the code sample I have above

schema = {
    'string_col':pl.String,
    'cat_col':pl.String,
    'int_col':pl.Int64,
    'float_col':pl.Float32
}

@WillAyd
Collaborator Author

WillAyd commented Sep 30, 2024

I think this PR in its current state does what we need to do for string / binary views, so merging as is.
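
For readers unfamiliar with the string / binary view types this PR targets, the sketch below illustrates the 16-byte view layout described in the Arrow columnar format: a 4-byte length, followed either by up to 12 bytes of inline data, or by a 4-byte prefix, a 4-byte buffer index, and a 4-byte offset into a separate data buffer. The helper names `make_view` and `read_view` are made up for illustration. This is not pantab's or nanoarrow's code, just the layout expressed with Python's `struct` module.

```python
import struct

INLINE_LIMIT = 12  # values up to this many bytes are stored inside the view itself


def make_view(value: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Build one 16-byte view for `value` (little-endian)."""
    length = len(value)
    if length <= INLINE_LIMIT:
        # "12s" pads short values with zero bytes up to 12 bytes.
        return struct.pack("<i12s", length, value)
    # Long value: keep a 4-byte prefix plus a pointer (buffer index, offset).
    return struct.pack("<i4sii", length, value[:4], buffer_index, offset)


def read_view(view: bytes, buffers) -> bytes:
    """Recover the value a view refers to, given the variadic data buffers."""
    (length,) = struct.unpack_from("<i", view)
    if length <= INLINE_LIMIT:
        return view[4:4 + length]
    _, _prefix, buffer_index, offset = struct.unpack("<i4sii", view)
    return buffers[buffer_index][offset:offset + length]


data_buffer = b"a string longer than twelve bytes"
short_view = make_view(b"ID129120")  # fits inline
long_view = make_view(data_buffer, buffer_index=0, offset=0)

print(len(short_view), len(long_view))        # 16 16
print(read_view(short_view, [data_buffer]))   # b'ID129120'
print(read_view(long_view, [data_buffer]) == data_buffer)  # True
```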

@WillAyd WillAyd merged commit 7cf2421 into main Sep 30, 2024
5 checks passed
@WillAyd WillAyd deleted the string-view branch September 30, 2024 15:25