
Decimal results incorrect depending on locale #753

Open
gv-collibris opened this issue Apr 30, 2020 · 18 comments

Comments

@gv-collibris

gv-collibris commented Apr 30, 2020

Environment

  • Python: 3.6.3
  • pyodbc: 4.0.30
  • OS: WSL, Windows 10, Debian 10
  • DB: SQL Server
  • driver: ODBC Driver 17 for SQL Server

Issue

Numbers (e.g. 3656.880000) from the database are turned into different numbers in Python, depending on the locale: e.g. 3656.88 for an English locale, but 3656880000 for a French one.

Expected behavior

I would expect to always obtain the correct number in the Python code.

@gordthompson
Collaborator

Please provide a minimal reproducible example.

@gv-collibris
Author

Hi,
Thanks for your reply.
Here is a minimal reproducible example.
Cheers

@gordthompson
Collaborator

Okay, that's weird. I was able to reproduce the issue using your example, but it works fine when I use a simple stand-alone script:

import locale

import pyodbc

if input("English or French? (e/f): ").startswith("f"):
    locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')
print(locale.getlocale())

connection_string = (
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=192.168.0.179,49242;"
    "DATABASE=myDb;"
    "UID=sa;PWD=_whatever_"
)
cnxn = pyodbc.connect(connection_string)
crsr = cnxn.cursor()
print(crsr.execute("SELECT 1890 * 1.0 / 100").fetchone())

""" console output
English or French? (e/f): e
('en_CA', 'UTF-8')
(Decimal('18.900000'), )

English or French? (e/f): f
('fr_FR', 'UTF-8')
(Decimal('18.900000'), )
"""

@gv-collibris
Author

gv-collibris commented May 4, 2020

Yeah, I've seen that too.
I struggled a bit to reproduce the issue.
I have no idea how using a package can behave differently from a script...

@gordthompson
Collaborator

gordthompson commented May 4, 2020

Even more mysterious: With your MRE and my script the ODBC driver returns the same string value ...

[ODBC][6457][1588599816.070747][SQLGetData.c][237]
        Entry:
            Statement = 0x2467540
            Column Number = 1
            Target Type = -8 SQL_WCHAR
            Buffer Length = 4096
            Target Value = 0x24593a0
            StrLen Or Ind = 0x7ffe67f0bd10
[ODBC][6457][1588599816.070863][SQLGetData.c][545]
        Exit:[SQL_SUCCESS]                
            Buffer = [18.900000](unicode)                
            Strlen Or Ind = 0x7ffe67f0bd10 -> 18

... but when it pops out of fetchall your MRE has [(Decimal('18900000'), )] while my script has [(Decimal('18.900000'), )], and that holds true whether we use

SELECT 1890 * 1.0 / 100

or

SELECT CAST(1890 * 1.0 / 100 AS decimal(10,6))

However, if I change your MRE to use

SELECT CAST(1890 * 1.0 / 100 AS float)

then the ODBC driver returns ...

[ODBC][6501][1588600241.966768][SQLGetData.c][237]
        Entry:
            Statement = 0x15a7d90
            Column Number = 1
            Target Type = 8 SQL_DOUBLE
            Buffer Length = 8
            Target Value = 0x7ffc7e61ca58
            StrLen Or Ind = 0x7ffc7e61ca60
[ODBC][6501][1588600241.966842][SQLGetData.c][545]
        Exit:[SQL_SUCCESS]                
            Buffer = [18,9]                
            Strlen Or Ind = 0x7ffc7e61ca60 -> 8

... and the (correct) resulting value is [(18.9, )]

@gv-collibris
Author

gv-collibris commented May 4, 2020

My (uneducated) guess is that the creation of a Decimal somehow uses the Python locale.
As we receive 18.900000 from the DB, and as "," is the decimal delimiter in French, the dot is ignored altogether.
The use of a float does not present this issue, as it's a different data structure, which (I guess) does not use the locale.

Initially, I thought I had made a mistake in my own code when formatting the numbers; that's where this guess comes from...

@keitherskine
Collaborator

Just for clarity, because it might be significant here: in T-SQL the "1.0" literal is, somewhat counter-intuitively, not a float but a decimal, so multiplying by 1.0 does not automatically generate a float; it typically generates a decimal:

SELECT
SQL_VARIANT_PROPERTY(1.0,'BaseType') AS 'BaseType',
SQL_VARIANT_PROPERTY(1.0,'Precision') AS 'Precision',
SQL_VARIANT_PROPERTY(1.0,'Scale') AS 'Scale'
UNION
SELECT
SQL_VARIANT_PROPERTY(1890 * 1.0,'BaseType') AS 'BaseType',
SQL_VARIANT_PROPERTY(1890 * 1.0,'Precision') AS 'Precision',
SQL_VARIANT_PROPERTY(1890 * 1.0,'Scale') AS 'Scale';

results in:

BaseType    Precision    Scale
numeric        2            1
numeric        7            1

Apologies if you knew this already, but I know it was kind of a surprise to me.

@gordthompson
Collaborator

@gv-collibris - You're probably right; we're just trying to give @v-makouz as much information as we can.

I also tried adding ;Regional=Yes to the end of the connection string in the MRE but that resulted in

pyodbc.Error: ('07006', '[07006] [Microsoft][ODBC Driver 17 for SQL Server]Restricted data type attribute violation (0) (SQLGetData)')

@keitherskine
Collaborator

I think the difference in behavior between Gord's script and the package might be explained by exactly when pyodbc is being imported. In pyodbc, it looks like the numeric locale information is read from the environment when pyodbc is initialized (i.e. imported):
https://github.com/mkleehammer/pyodbc/blob/master/src/pyodbcmodule.cpp#L1226
and appears to be fixed after that.

Hence, if you add an import pyodbc to the very beginning of test.py, i.e. before setting the locale (which then emulates Gord's script), you appear to get a different result. Not sure if this solves anything, but I thought I'd mention it.
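
For reference, a minimal sketch of that import-order workaround (the connection string is a placeholder):

import pyodbc          # imported first, so pyodbc captures the default locale's separator
import locale

locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')   # French locale set only afterwards

cnxn = pyodbc.connect(
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=myServer;DATABASE=myDb;UID=sa;PWD=_whatever_"
)
crsr = cnxn.cursor()
print(crsr.execute("SELECT CAST(1890 * 1.0 / 100 AS decimal(10,6))").fetchone())
# expected: (Decimal('18.900000'), )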

@keitherskine
Collaborator

One other thing: if you call pyodbc.setDecimalSeparator('.') at the start of test_function() (i.e. set the decimal point as English-style, despite the French locale), it appears you get the right result (18.9).
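
A minimal sketch of that workaround (the separator is forced before any queries run):

import locale
import pyodbc

locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')
pyodbc.setDecimalSeparator('.')   # override the locale-derived separator before querying
# ... run queries as usual; Decimal('18.900000') now comes back intact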

@keitherskine
Collaborator

@gv-collibris Just for the record, what is the database instance language you are using, and the database collation? If you run the following:

USE <your database>;
SELECT @@language AS instance_lang;
SELECT CONVERT(varchar(256),SERVERPROPERTY('collation')) AS database_collation;

...what do you get?

@gv-collibris
Author

Instance language: us_english
Database collation: SQL_Latin1_General_CP1_CI_AS

@keitherskine
Collaborator

Thanks, @gv-collibris . I thought perhaps you might be using a French-collated database instance but that does not appear to be the case.

Currently, it appears pyodbc reads the decimal point character from the current locale when it is imported (and only once). Personally, I'm not sure whether it should be doing that at all because the decimal point character in the result set is probably added by the SQL Server database engine rather than any C libraries in the server itself (although I'm speculating there).

In the scenario you describe, the workarounds appear to be either:

  • import pyodbc before setting the locale, causing pyodbc to use the default locale (i.e. US English).
  • use pyodbc.setDecimalSeparator('.') to explicitly set the decimal point character to a period before running SQL queries.

I hope that helps.

@gv-collibris
Author

I used another workaround: I round and cast to integer in SQL, then divide by 100 in the Python code (roughly as sketched below).
It's not a particular issue for me, but it is a weird surprise I wanted to spare other people ;)
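
A rough sketch of that approach, assuming a hypothetical two-decimal "amount" column and placeholder connection details:

import pyodbc

cnxn = pyodbc.connect("DSN=myDsn")   # placeholder connection
crsr = cnxn.cursor()
# "amount" and "myTable" are hypothetical; the rounding happens server-side, as an integer.
row = crsr.execute("SELECT CAST(ROUND(amount * 100, 0) AS int) FROM myTable").fetchone()
value = row[0] / 100   # scale back down in Python; no Decimal parsing, so no locale involved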

@keitherskine
Collaborator

Glad to hear you're not being held up by this. This curious locale behavior was not something I was aware of, so it's good you raised this. Many thanks.

@v-makouz
Contributor

v-makouz commented May 5, 2020

The reason for this is in the GetDataDecimal function in getdata.cpp. There it loops through all the characters trying to detect the decimal mark, but it uses the locale-based one for comparison, while the ODBC driver always uses "."
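
To illustrate the mechanism, here is a rough Python sketch of that comparison (not the actual C++ code):

from decimal import Decimal

def parse_like_getdata(text, chDecimal):
    # Keep digits; map only the locale's decimal mark to "."; ignore everything else.
    # The driver always sends ".", so under fr_FR (chDecimal == ",") the "." is dropped.
    out = []
    for ch in text:
        if ch.isdigit():
            out.append(ch)
        elif ch == chDecimal:
            out.append(".")
    return Decimal("".join(out))

print(parse_like_getdata("18.900000", "."))   # Decimal('18.900000')
print(parse_like_getdata("18.900000", ","))   # Decimal('18900000') -- the bug reported here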

One solution I can think of is to replace
if (*pch == chDecimal)
with
if (*pch == chDecimal || *pch == '.')

It looks a little hacky, but it should work for all drivers, whether they use the locale-based separator or the default one.

What do you guys think?

@keitherskine
Collaborator

I can kind of understand why the decimal separator from the Python locale was chosen to be used by pyodbc when parsing decimal values, but it still seems something of an odd choice. First off, I'm making a big assumption that decimal values are generated on the database server as essentially strings, and are sent over the wire more-or-less untouched. So a German-language database might generate a decimal value like "12.345.678,99". This is my big assumption and I'm happy to be corrected on that.

On SQL Server, changing the language of a database (e.g. SET LANGUAGE German) doesn't appear to make any difference to the decimal values in the ODBC trace, they are for example "12345678.99" regardless of the language. Perhaps German-style decimals can be generated by setting the language on the server itself, or by using a different RDBMS like Oracle.

My concern about using the Python locale is that there's no guarantee the app server (running Python) is going to have the same locale as the database server. The database server might be "German" but the app server could be "British". The locales might match, but it's not guaranteed. Hence, it seems something of a stretch to use the app server's locale to figure out what the decimal point character is going to be. By the way, @v-makouz , that's also an argument not to use '.' all the time as per your suggestion. In my German example above, we wouldn't want to use the period character when parsing "12.345.678,99". The trouble is, it's difficult to figure out the decimal point character any other way, hence I can understand why the Python locale is being used.

I appreciate I'm not coming up with any answers here. The big question seems to be whether decimal values can be received in a whole variety of formats (including monetary values), as per the comment on the GetDataDecimal function. If so, then this is not an easy problem to solve. If decimal values can truly be "12,345.678000", "12.345.678,99", "$876.55", or "-345678.77€", then this is very tricky. For example, how would you distinguish "1,234"? Is that one thousand two hundred and thirty four, or one point two three four? Short of making a test query like SELECT CAST(1.5 AS DECIMAL(2,1)) and grabbing the middle character, or perhaps parsing from the right and assuming the first period or comma is the decimal separator, I don't know how the decimal point character can be determined in a watertight manner.

Ref: https://docs.microsoft.com/en-us/globalization/locale/number-formatting
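
A quick sketch of that "parse from the right" heuristic (illustrative only; as noted above, "1,234" remains ambiguous):

def guess_decimal_separator(text):
    # Scan from the right; assume the first "." or "," found is the decimal separator.
    for ch in reversed(text):
        if ch in ".,":
            return ch
    return None

print(guess_decimal_separator("12.345.678,99"))   # ','
print(guess_decimal_separator("12,345.678000"))   # '.'
print(guess_decimal_separator("1,234"))           # ',' -- possibly wrong, hence the ambiguity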

@mkleehammer
Owner

The comments indicated that the "proper" way of reading using a binary format wasn't working for all databases. Maybe the 5.0 version should default to binary but have a function to go back to strings and configure the separator.

I'd love to be able to write a script that would print out the configuration you should set. The difficulty is that you'd have to be tricky getting the information into the DB to read back: if you don't know whether to use "." or ",", you have to avoid using either in your inputs. That is, instead of "SELECT 123.45" you need something like "SELECT CAST(12345 AS decimal(19,2)) / 100".

This might be a good idea even if just provided as a script alongside pyodbc.
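
A rough sketch of such a check, assuming SQL Server and a placeholder connection (not a script pyodbc currently ships):

from decimal import Decimal
import pyodbc

cnxn = pyodbc.connect("DSN=myDsn")   # placeholder connection details
crsr = cnxn.cursor()

# No "." or "," appears in the query text itself, so the probe works
# before we know which separator the driver will send back.
probe = crsr.execute("SELECT CAST(12345 AS decimal(19,2)) / 100").fetchone()[0]
if probe != Decimal("123.45"):
    # Mis-parsed: the locale-derived separator does not match what the driver sends.
    # SQL Server's ODBC drivers send ".", so force that and check again.
    pyodbc.setDecimalSeparator(".")
    assert crsr.execute("SELECT CAST(12345 AS decimal(19,2)) / 100").fetchone()[0] == Decimal("123.45")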
