TraceQL: support mixed-type attribute querying (int/float) #4391
Conversation
I apologize for taking so long to get to this. Your analysis is correct! We do generate predicates per column and, since we store integers and floats independently, we only scan one of the columns. Given how small int and float columns tend to be (compared to string columns), I think the performance hit of doing this is likely acceptable in exchange for the nicer behavior. What is the behavior in this case? I'm pretty sure this will work b/c the engine will request all values for the two attributes and do the work itself. I believe the engine layer will compare ints and floats correctly but I'm not 100% sure.
Tests should also be added here for the new behavior. These tests build a block and then search for a known trace using a large range of TraceQL queries. If you add tests here and they pass, it means that your changes work from the parquet file all the way up through the engine. This will also break the "allConditions" optimization if the user types any query with a number comparison: I would like to preserve the allConditions behavior in this case b/c it's such a nice optimization and number queries are common. I'm not quite sure why the
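For reference, a minimal, self-contained sketch of the int/float promotion the engine layer would need when one value comes back as an int and the other as a float. This is illustrative only and not Tempo's actual traceql implementation; the helper names are made up.

package main

import "fmt"

// numericEqual compares two numeric values, promoting ints to float64 so that
// an int stored in one column and a float operand in the query can still match.
// Purely illustrative; not Tempo's engine code.
func numericEqual(a, b any) bool {
	toFloat := func(v any) (float64, bool) {
		switch n := v.(type) {
		case int64:
			return float64(n), true
		case float64:
			return n, true
		}
		return 0, false
	}
	af, aok := toFloat(a)
	bf, bok := toFloat(b)
	return aok && bok && af == bf
}

func main() {
	fmt.Println(numericEqual(int64(500), 500.0)) // true: int column value vs. float operand
	fmt.Println(numericEqual(int64(500), 500.3)) // false
}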
Force-pushed from 3d8f31d to 812d768.
Thank you for confirming the approach and pointing out the
I verified that
Done. Let me know if I missed something.
Regarding the
Given my limited exposure to Tempo’s internals, I’d appreciate any guidance on whether these routes are viable or if there’s a simpler approach to preserve allConditions.
P.S. Do we care about comparisons with negative values? Should those also be covered?
Force-pushed from 812d768 to 171afab.
Force-pushed from 4970fbd to 50f5ae5.
This is a really cool change. I ran benchmarks and found no major regressions. Nice tests added in ./tempodb. We try to keep those as comprehensive as possible given the complexity of the language.
This case is covered in the ./pkg/traceql tests, so I wouldn't worry about it. It occurred to me that this case sends two "OpNone" conditions to the fetch layer and the condition itself is evaluated in the engine, so your changes will not impact it.
Nice improvements here. I like falling back to integer comparison (or nothing) based on whether the float has a fractional part.
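As an aside, the fractional-part fallback mentioned above reduces to a tiny check. This is only a sketch of the idea with invented names, not the PR's code.

package main

import (
	"fmt"
	"math"
)

// hasFraction reports whether f has a non-zero fractional part. If it does, no
// stored integer can ever equal it, so an equality predicate against the int
// column can be skipped entirely (and for "!=", every integer trivially matches).
func hasFraction(f float64) bool {
	return f != math.Trunc(f)
}

func main() {
	fmt.Println(hasFraction(500))   // false -> also build an int-column predicate
	fmt.Println(hasFraction(500.3)) // true  -> skip the int column for "="
}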
The right choice would be a
Also, if you're interested, plug your queries into this test and run it. It will dump the iterator structure and you can see how your changes have impacted the hierarchy.
Yes, are they not already? Reviewing your code, I think they would work fine. I think my primary ask at this point would be to keep the int and float switch cases symmetrical. Even though it's trivial, can you create a
I'm a bit impressed you're taking this on. I wouldn't have guessed someone outside of Grafana would have had the time and patience to find this.
benches
Force-pushed from b0ffcde to 10b04c5.
I'm not sure if it's worth it. I'd rather rely on your opinion here.
Actually, it turned out they didn't work correctly with negative values. I've updated the shifting logic to fix this. Also, another edge case raises questions: what happens if a float hits MaxInt/MinInt? In some cases, it might cause jumps between MaxInt and MinInt.
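One way to avoid the wrap-around at the extremes is to clamp before converting. The following is only a sketch of that idea, not the actual shifting logic in this PR.

package main

import (
	"fmt"
	"math"
)

// clampFloatToInt converts a float operand to int64 for the integer-column
// predicate, clamping at the extremes so values beyond the int64 range don't
// wrap between MaxInt and MinInt. Sketch only.
func clampFloatToInt(f float64) int64 {
	switch {
	case f >= float64(math.MaxInt64): // float64(MaxInt64) rounds up to 2^63, so >= catches all overflows
		return math.MaxInt64
	case f <= float64(math.MinInt64):
		return math.MinInt64
	default:
		return int64(f)
	}
}

func main() {
	fmt.Println(clampFloatToInt(1e300))  // 9223372036854775807
	fmt.Println(clampFloatToInt(-1e300)) // -9223372036854775808
	fmt.Println(clampFloatToInt(-500.7)) // -500 (truncates toward zero)
}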
Done! Let me know if this aligns with what you had in mind. Plus, I've added some tests in a separate commit. Feel free to let me know if they look odd or need adjustments.
Haha, thanks! Honestly, it's just curiosity. Tempo is a fascinating system, and I've wanted to dive into something challenging like this. It's fun to learn from real-world systems and see how they tackle performance and scalability. :)
Force-pushed from c05b608 to a679803.
We could try to get tricky here. Like if you do
Yup, I think this communicates better to a future reader what's going on. Thanks for the change. Ok, I was running your branch on Friday to test and we do have one final thing to figure out. This query does not work:
The reason is b/c we handle this special column here: tempo/tempodb/encoding/vparquet4/block_traceql.go, lines 1969 to 1986 at 14efba0.
All well known and dedicated columns are strings ... except this one, unfortunately. To do this correctly we have to scan both the well known column as well as the general float attribute column if the static value being compared against http status code is a float. To do this performantly I think we will need to build a
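To make the shape of that concrete, here is a rough sketch of scanning both places, reusing the iterator helpers that appear later in this thread (makeIter, columnPathSpanAttrKey, columnPathSpanAttrDouble, DefinitionLevelResourceSpansILSSpanAttrs). The columnPathSpanHTTPStatusCode constant and the intPred/floatPred predicates are assumptions, and the two iterators would still need to be OR'd together; this is not the PR's code.

// Sketch only: when the operand is a float, scan both the dedicated
// http.status_code column and the generic float attribute column.
// columnPathSpanHTTPStatusCode, intPred and floatPred are assumed names.
statusCodeIters := []parquetquery.Iterator{
	// well-known (int) column for span.http.status_code
	makeIter(columnPathSpanHTTPStatusCode, intPred, columnPathSpanHTTPStatusCode),
	// generic float attribute column, joined on the attribute key
	parquetquery.NewJoinIterator(
		DefinitionLevelResourceSpansILSSpanAttrs,
		[]parquetquery.Iterator{
			makeIter(columnPathSpanAttrKey, parquetquery.NewStringInPredicate([]string{"http.status_code"}), "key"),
			makeIter(columnPathSpanAttrDouble, floatPred, "float"),
		},
		&attributeCollector{},
		parquetquery.WithPool(pqAttrPool),
	),
}
// The two iterators would then be combined so a span matches if either column matches.
_ = statusCodeIters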
Sounds like a plan. Will do it later. :)
Oh, that's a nice catch! But before rushing into handling this case, I want to address one quick concern. If a user specifies
Anyway, if you see real value in covering this edge case, I'm happy to implement it. Let me know what you think!
P.S. I found out that I should convert
It's funny because I really want this PR in, but the only thing blocking it is handling http status code correctly. However, I'd really like to cut a vparquet5 that removes all well known columns (and other cleanup), which would unblock this PR.
I believe I've finally learned how to use
Force-pushed from 7222fa2 to 9ac7d86.
I wish we had time to work on this. It's an undefined cleanup pass on vParquet with a focus on reducing complexity, number of columns, and footer size. One of the things I'd like accomplished is removing the well known columns and instead relying on dedicated columns.
This comment was marked as outdated.
Force-pushed from 63aa82c to 69aa605.
Tested and it works! But there's definitely some cleanup to do.
but we don't need to scan the generic attribute column for an int. Int values are guaranteed to be stored in the dedicated column for this attribute name, so we only need to scan the generic column for a float. This should simplify the iterators to something like:
Unsure why you're seeing nils. I can dig into that a bit. We shouldn't need the filter-nil thing.
Force-pushed from 69aa605 to 7e2974d.
Oh my gosh! This is what happens when a review lasts too long. I started forgetting what I've been doing. :D Fixed.
If it looks good, there's one more step remaining - need to update
Awesome! It looks like we were able to get rid of those nil-filter shenanigans. I think this is very, very close. All functionality is accounted for. I did run some benchmarks and found a regression we should spend some time understanding. I do expect a bit of overhead due to this change, but one particular query is showing a 20% increase in CPU.
I can help dig into this.
These are the queries used in the benches. The regression occurred on traceOrMatch, which you can see below. As you can tell, they are crafted for internal data, but they can be rewritten for any block where they get some matches.
statuscode: { span.http.status_code = 200 }
traceOrMatch: { rootServiceName = `tempo-gateway` && (status = error || span.http.status_code = 500)}
complex: {resource.cluster=~"prod.*" && resource.namespace = "tempo-prod" && resource.container="query-frontend" && name = "HTTP GET - tempo_api_v2_search_tags" && span.http.status_code = 200 && duration > 1s}
benches
> benchstat before.txt after.txt
goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/tempodb/encoding/vparquet4
cpu: Apple M3 Pro
│ before.txt │ after.txt │
│ sec/op │ sec/op vs base │
BackendBlockTraceQL/statuscode-11 64.28m ± 2% 64.81m ± 1% +0.83% (p=0.043 n=10)
BackendBlockTraceQL/traceOrMatch-11 249.8m ± 8% 303.7m ± 9% +21.59% (p=0.000 n=10)
BackendBlockTraceQL/complex-11 4.944m ± 5% 4.880m ± 1% ~ (p=0.190 n=10)
geomean 42.98m 45.80m +6.56%
│ before.txt │ after.txt │
│ B/s │ B/s vs base │
BackendBlockTraceQL/statuscode-11 341.4Mi ± 2% 338.8Mi ± 1% ~ (p=0.052 n=10)
BackendBlockTraceQL/traceOrMatch-11 6.695Mi ± 8% 5.541Mi ± 9% -17.24% (p=0.000 n=10)
BackendBlockTraceQL/complex-11 182.0Mi ± 5% 184.4Mi ± 1% ~ (p=0.190 n=10)
geomean 74.66Mi 70.22Mi -5.94%
│ before.txt │ after.txt │
│ MB_io/op │ MB_io/op vs base │
BackendBlockTraceQL/statuscode-11 23.01 ± 0% 23.02 ± 0% +0.04% (p=0.000 n=10)
BackendBlockTraceQL/traceOrMatch-11 1.753 ± 0% 1.766 ± 0% +0.74% (p=0.000 n=10)
BackendBlockTraceQL/complex-11 943.7m ± 0% 943.7m ± 0% ~ (p=1.000 n=10) ¹
geomean 3.364 3.373 +0.26%
¹ all samples are equal
│ before.txt │ after.txt │
│ B/op │ B/op vs base │
BackendBlockTraceQL/statuscode-11 31.19Mi ± 1% 31.29Mi ± 1% ~ (p=0.436 n=10)
BackendBlockTraceQL/traceOrMatch-11 10.597Mi ± 17% 9.867Mi ± 35% ~ (p=0.579 n=10)
BackendBlockTraceQL/complex-11 5.387Mi ± 4% 5.413Mi ± 2% ~ (p=0.631 n=10)
geomean 12.12Mi 11.87Mi -2.09%
│ before.txt │ after.txt │
│ allocs/op │ allocs/op vs base │
BackendBlockTraceQL/statuscode-11 378.4k ± 0% 378.6k ± 0% +0.04% (p=0.000 n=10)
BackendBlockTraceQL/traceOrMatch-11 86.49k ± 1% 86.57k ± 1% ~ (p=0.218 n=10)
BackendBlockTraceQL/complex-11 79.81k ± 0% 79.83k ± 0% +0.02% (p=0.000 n=10)
geomean 137.7k 137.8k +0.05%
I cannot reproduce it. Could you tell me how you generated traces? I scribbled such a Frankenstein monsterpackage vparquet4
import (
"bytes"
"context"
"io"
"math/rand"
"os"
"sort"
"testing"
"time"
"github.com/google/uuid"
"github.com/stretchr/testify/require"
"github.com/grafana/tempo/pkg/tempopb"
"github.com/grafana/tempo/pkg/traceql"
"github.com/grafana/tempo/pkg/util/test"
"github.com/grafana/tempo/tempodb/backend"
"github.com/grafana/tempo/tempodb/backend/local"
"github.com/grafana/tempo/tempodb/encoding/common"
v1_common "github.com/grafana/tempo/pkg/tempopb/common/v1"
v1_resource "github.com/grafana/tempo/pkg/tempopb/resource/v1"
v1_trace "github.com/grafana/tempo/pkg/tempopb/trace/v1"
)
type testTrace struct {
traceID common.ID
trace *tempopb.Trace
}
type testIterator2 struct {
traces []testTrace
}
func (i *testIterator2) Next(context.Context) (common.ID, *tempopb.Trace, error) {
if len(i.traces) == 0 {
return nil, nil, io.EOF
}
tr := i.traces[0]
i.traces = i.traces[1:]
return tr.traceID, tr.trace, nil
}
func (i *testIterator2) Close() {
}
func newTestTraces(traceCount int) []testTrace {
traces := make([]testTrace, 0, traceCount)
for i := 0; i < traceCount; i++ {
traceID := test.ValidTraceID(nil)
if i%2 == 0 {
trace := MakeTraceWithCustomTags(traceID, "tempo-gateway", int64(i), true, true)
traces = append(traces, testTrace{traceID: traceID, trace: trace})
} else {
trace := MakeTraceWithCustomTags(traceID, "megaservice", int64(i), false, false)
traces = append(traces, testTrace{traceID: traceID, trace: trace})
}
}
sort.Slice(traces, func(i, j int) bool {
return bytes.Compare(traces[i].traceID, traces[j].traceID) == -1
})
return traces
}
var (
blockID = uuid.MustParse("6757b4d9-8d6b-4984-a2d7-8ef6294ca503")
)
func TestGenerateBlocks(t *testing.T) {
const (
traceCount = 10000
)
blockDir, ok := os.LookupEnv("TRACEQL_BLOCKDIR")
require.True(t, ok, "TRACEQL_BLOCKDIR env var must be set")
rawR, rawW, _, err := local.New(&local.Config{
Path: blockDir,
})
require.NoError(t, err)
r := backend.NewReader(rawR)
w := backend.NewWriter(rawW)
ctx := context.Background()
cfg := &common.BlockConfig{
BloomFP: 0.01,
BloomShardSizeBytes: 100 * 1024,
}
traces := newTestTraces(traceCount)
iter := &testIterator2{traces: traces}
meta := backend.NewBlockMeta(tenantID, blockID, VersionString, backend.EncNone, "")
meta.TotalObjects = int64(len(iter.traces))
_, err = CreateBlock(ctx, cfg, meta, iter, r, w)
require.NoError(t, err)
}
func MakeTraceWithCustomTags(traceID []byte, service string, intValue int64, isError bool, setHTTP500 bool) *tempopb.Trace {
now := time.Now()
traceID = test.ValidTraceID(traceID)
trace := &tempopb.Trace{
ResourceSpans: make([]*v1_trace.ResourceSpans, 0),
}
var attributes []*v1_common.KeyValue
attributes = append(attributes,
&v1_common.KeyValue{
Key: "stringTag",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_StringValue{StringValue: "value1"},
},
},
&v1_common.KeyValue{
Key: "intTag",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_IntValue{IntValue: intValue},
},
},
)
if setHTTP500 {
attributes = append(attributes,
&v1_common.KeyValue{
Key: "http.status_code",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_IntValue{IntValue: 500},
},
},
)
}
statusCode := v1_trace.Status_STATUS_CODE_OK
statusMsg := "OK"
if isError {
statusCode = v1_trace.Status_STATUS_CODE_ERROR
statusMsg = "Internal Error"
}
trace.ResourceSpans = append(trace.ResourceSpans, &v1_trace.ResourceSpans{
Resource: &v1_resource.Resource{
Attributes: []*v1_common.KeyValue{
{
Key: "service.name",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_StringValue{
StringValue: service,
},
},
},
{
Key: "other",
Value: &v1_common.AnyValue{
Value: &v1_common.AnyValue_StringValue{
StringValue: "other-value",
},
},
},
},
},
ScopeSpans: []*v1_trace.ScopeSpans{
{
Spans: []*v1_trace.Span{
{
Name: "test",
TraceId: traceID,
SpanId: make([]byte, 8),
ParentSpanId: make([]byte, 8),
Kind: v1_trace.Span_SPAN_KIND_CLIENT,
Status: &v1_trace.Status{
Code: statusCode,
Message: statusMsg,
},
StartTimeUnixNano: uint64(now.UnixNano()),
EndTimeUnixNano: uint64(now.Add(time.Second).UnixNano()),
Attributes: attributes,
DroppedLinksCount: rand.Uint32(),
DroppedAttributesCount: rand.Uint32(),
},
},
},
},
})
return trace
}
func BenchmarkMixTraceQL(b *testing.B) {
const query = "{ rootServiceName = `tempo-gateway` && (status = error || span.http.status_code = 500)}"
blockDir, ok := os.LookupEnv("TRACEQL_BLOCKDIR")
require.True(b, ok, "TRACEQL_BLOCKDIR env var must be set")
ctx := context.TODO()
r, _, _, err := local.New(&local.Config{Path: blockDir})
require.NoError(b, err)
rr := backend.NewReader(r)
meta, err := rr.BlockMeta(ctx, blockID, tenantID)
require.NoError(b, err)
opts := common.DefaultSearchOptions()
opts.StartPage = 3
opts.TotalPages = 2
block := newBackendBlock(meta, rr)
_, _, err = block.openForSearch(ctx, opts)
require.NoError(b, err)
b.ResetTimer()
bytesRead := 0
for i := 0; i < b.N; i++ {
e := traceql.NewEngine()
resp, err := e.ExecuteSearch(ctx, &tempopb.SearchRequest{Query: query}, traceql.NewSpansetFetcherWrapper(func(ctx context.Context, req traceql.FetchSpansRequest) (traceql.FetchSpansResponse, error) {
return block.Fetch(ctx, req, opts)
}))
require.NoError(b, err)
require.NotNil(b, resp)
// Read first 20 results (if any)
bytesRead += int(resp.Metrics.InspectedBytes)
}
b.SetBytes(int64(bytesRead) / int64(b.N))
b.ReportMetric(float64(bytesRead)/float64(b.N)/1000.0/1000.0, "MB_io/op")
}
UPD: I'm wondering how to check if a block has dedicated columns at all.
Force-pushed from 8e1feb4 to 495c234.
we generally pull a block generated from internal tracing data which is why the benchmarks contain references to loki and tempo. these blocks generally cover a large range of organically created trace data. nice work generating a large block. likely some pattern of data internally at Grafana is causing the regression. maybe you should write some float value http status codes and see what happens?
the meta.json will list all dedicated columns in a block. i do think there's a bug with the current implementation. fixing it may also resolve the regression. not sure. the query
I believe it should be this:
I'm looking into the regression now.
Yeah, I have the same hypothesis. I've just been looking into how to correctly attach filtering by key. UPD: Done.
Force-pushed from 495c234 to ff2bc15.
Yep, I also think so. Roaming around the code base, I got the impression that dedicated columns aren't something that exists by default. I feel I'm missing something.
Dedicated columns need to be configured manually. They allow us to move data from the main attribute columns into their own "dedicated" column. The primary thing we use this for is to move very large attributes, like SQL queries or large JSON objects, out of the main columns. The secondary use is to isolate important columns for querying. The docs have some details on how we pick which columns to configure.

So I reran benchmarks and the regression is fairly intense and likely not something we can accept. I am still working on determining what is causing the regression and seeing if we can improve performance, but my heads-down time is limited. This is really cool work and it's so close to mergeable, but the
benches
I removed the previous comments because I discovered something promising. |
Force-pushed from 11f11e5 to e7828db.
Force-pushed from e7828db to fde9918.
Ok, I'm back! :) Since last time, I've learned a lot more about how Tempo stores traces in Parquet and how it fetches data. Here's a concise summary of what I found.

TL;DR
The dictionary load time is the bottleneck, even when the key isn't found. I'd love to see if your dataset behaves similarly (tons of NULL, a small set of real keys, yet a large dictionary overhead).

The dataset
I reproduced a very similar performance regression where scanning for a key (e.g.
Below is an example of my attribute-keys column (
Column 59
It would be really helpful if you could share a similar dump of your dataset. For example:

Getting parquet metadata
parquet-reader --only-metadata /Users/joe/testblock/1/030c8c4f-9d47-4916-aadc-26b90b1d2bc4/data.parquet > metadata.txt
parquet-reader --columns=59 --dump /Users/joe/testblock/1/030c8c4f-9d47-4916-aadc-26b90b1d2bc4/data.parquet > column_59.txt
parquet-reader --columns=63 --dump /Users/joe/testblock/1/030c8c4f-9d47-4916-aadc-26b90b1d2bc4/data.parquet > column_63.txt
parquet-reader --columns=92 --dump /Users/joe/testblock/1/030c8c4f-9d47-4916-aadc-26b90b1d2bc4/data.parquet > column_92.txt

That way we can see if you're also dealing with a large dictionary block for just a handful of real values (and lots of NULL).

The regression
In my case, I create an iterator to scan for

Looking up float `http.status_code`
subIters = append(subIters,
parquetquery.NewJoinIterator(
DefinitionLevelResourceSpansILSSpanAttrs,
[]parquetquery.Iterator{
makeIter(columnPathSpanAttrKey, parquetquery.NewStringInPredicate([]string{cond.Attribute.Name}), "key"),
makeIter(columnPathSpanAttrDouble, pred, "float"),
},
&attributeCollector{},
parquetquery.WithPool(pqAttrPool),
),
)

If we replace the real predicates with callback predicates that always return false, the slowdown goes away:

Both predicates are disabled
subIters = append(subIters,
parquetquery.NewJoinIterator(
DefinitionLevelResourceSpansILSSpanAttrs,
[]parquetquery.Iterator{
makeIter(columnPathSpanAttrKey, parquetquery.NewCallbackPredicate(func() bool { return false }), "key"),
makeIter(columnPathSpanAttrDouble, parquetquery.NewCallbackPredicate(func() bool { return false }), "float"),
},
&attributeCollector{},
parquetquery.WithPool(pqAttrPool),
),
)

But if we keep the key scanning, the slowdown remains:

Key's predicate is enabled, value's predicate is disabled
subIters = append(subIters,
parquetquery.NewJoinIterator(
DefinitionLevelResourceSpansILSSpanAttrs,
[]parquetquery.Iterator{
makeIter(columnPathSpanAttrKey, parquetquery.NewStringInPredicate([]string{cond.Attribute.Name}), "key"),
makeIter(columnPathSpanAttrDouble, parquetquery.NewCallbackPredicate(func() bool { return false }), "float"),
},
&attributeCollector{},
parquetquery.WithPool(pqAttrPool),
),
)

Even though the dataset doesn't have the key at all, the key predicate still has to inspect each column chunk's dictionary:

func (p *StringInPredicate) KeepColumnChunk(cc *ColumnChunkHelper) bool {
if d := cc.Dictionary(); d != nil {
return keepDictionary(d, p.KeepValue)
}
ci, err := cc.ColumnIndex()

Benchmarking indicates dictionary-loading overhead. Even though there's no matching key, opening and parsing the dictionary itself can be quite expensive. For example:

bench
Yes, dictionaries can be expensive. For low/medium cardinality columns we have found that they can be a massive performance improvement. Especially when searching for rarely occurring data. They also provide very nice compression.
In vParquet5 I am proposing a set of non-dictionary encoded dedicated columns to put high cardinality data to combat the first issue.
The key not being there is actually one of the fastest things that can occur. It allows us to skip the entire row group.
I believe these stats are for the entire column in that row group and not just the dictionary. Distinct values 0 is weird. I wonder if that's a parquet-go bug. I'm seeing quite similar values for our internal datasets. The reason for the large number of nulls is b/c of the way parquet nests values. Every time the structure iterates at a lower level than the column you are currently iterating, there is a "null" value for this column b/c it didn't have a value. These values are not actually encoded into the column's pages. They are encoded into the repetition and definition levels and reinserted by parquet-go when you read the values. I have been digging deeper into this b/c of the PR you submitted. I have tried a few approaches to not evaluating these nulls at all, but none of them come back with the expected perf improvements. While thinking about this I stumbled on this perf improvement, which also improves our benchmarks. Once this is merged I'm going to try again b/c I think I'm close to a nice improvement for any situation in which heavy null iteration occurs (like yours!).
Yes! But these are also not doing anything :). Your example is returning false from all these functions, which is basically telling the iterator to skip everything.
The dictionary must be opened in order to read the column. Even if we didn't have code that dealt with it explicitly, it would still happen behind the scenes in parquet-go. Great work here. You are pushing me into details of parquet-go I hadn't reviewed in a while, and I'm finding some nice improvements.
I've regenerated my dataset to make it more realistic. Then I ran a series of tests comparing different scenarios. Overall, it looks like your improvements not only speed up queries in general but also mitigate (somewhat) the regression introduced by my code.

Dataset Generation
k6 script
Probably you'll need this one to be able to generate a massive number of traces: grafana/xk6-client-tracing#32

ENDPOINT=127.0.0.1:4317 ./k6 run --iterations=100000 --vus=1000 ./template.js
import { randomIntBetween } from 'https://jslib.k6.io/k6-utils/1.2.0/index.js';
import { sleep } from 'k6';
import tracing from 'k6/x/tracing';
export const options = {
vus: 1,
duration: "20m",
};
const endpoint = __ENV.ENDPOINT || "otel-collector:4317"
const orgid = __ENV.TEMPO_X_SCOPE_ORGID || "k6-test"
const client = new tracing.Client({
endpoint,
exporter: tracing.EXPORTER_OTLP,
tls: {
insecure: true,
},
headers: {
"X-Scope-Orgid": orgid
}
});
const traceDefaults = {
attributeSemantics: tracing.SEMANTICS_HTTP,
attributes: { "one": "three", "intAttr": 123, "floatAttr": 123.4 },
randomAttributes: { count: 2, cardinality: 5 },
randomEvents: { count: 0.1, exceptionCount: 0.2, randomAttributes: { count: 6, cardinality: 20 } },
}
const traceTemplates = [
{
defaults: traceDefaults,
spans: [
{ service: "shop-backend", name: "list-articles", duration: { min: 200, max: 900 }, attributes: { "http.status_code": 403 } },
{ service: "shop-backend", name: "authenticate", duration: { min: 50, max: 100 }, attributes: { "http.status_code": 412.0, "prettyFloat": 214.0 } },
{ service: "auth-service", name: "authenticate", attributes: { "http.status_code": 500 } },
{ service: "shop-backend", name: "fetch-articles", parentIdx: 0, attributes: { "http.status_code": 500.3, "zerovalue": 0.0 } },
{
service: "article-service",
name: "list-articles",
attributes: { "http.status_code": 200 },
links: [{ attributes: { "link-type": "parent-child" }, randomAttributes: { count: 2, cardinality: 5 } }]
},
{ service: "article-service", name: "select-articles", attributeSemantics: tracing.SEMANTICS_DB },
{ service: "postgres", name: "query-articles", attributeSemantics: tracing.SEMANTICS_DB, randomAttributes: { count: 5 } },
]
},
{
defaults: {
attributes: { "numbers": ["one", "two", "three"] },
attributeSemantics: tracing.SEMANTICS_HTTP,
randomEvents: { count: 2, randomAttributes: { count: 3, cardinality: 10 } },
},
spans: [
{ service: "shop-backend", name: "article-to-cart", duration: { min: 400, max: 1200 } },
{ service: "shop-backend", name: "authenticate", duration: { min: 70, max: 200 } },
{ service: "auth-service", name: "authenticate" },
{ service: "shop-backend", name: "get-article", parentIdx: 0 },
{ service: "article-service", name: "get-article" },
{ service: "article-service", name: "select-articles", attributeSemantics: tracing.SEMANTICS_DB },
{ service: "postgres", name: "query-articles", attributeSemantics: tracing.SEMANTICS_DB, randomAttributes: { count: 2 } },
{ service: "shop-backend", name: "place-articles", parentIdx: 0 },
{ service: "cart-service", name: "place-articles", attributes: { "article.count": 1, "http.status_code": 201 } },
{ service: "cart-service", name: "persist-cart" }
]
},
{
defaults: traceDefaults,
spans: [
{ service: "shop-backend", attributes: { "http.status_code": 403 } },
{ service: "shop-backend", name: "authenticate", attributes: { "http.request.header.accept": ["application/json"] } },
{
service: "auth-service",
name: "authenticate",
attributes: { "http.status_code": 403 },
randomEvents: { count: 0.5, exceptionCount: 2, randomAttributes: { count: 5, cardinality: 5 } }
},
]
},
{
defaults: traceDefaults,
spans: [
{ service: "shop-backend" },
{ service: "shop-backend", name: "authenticate", attributes: { "http.request.header.accept": ["application/json"] } },
{ service: "auth-service", name: "authenticate" },
{
service: "cart-service",
name: "checkout",
randomEvents: { count: 0.5, exceptionCount: 2, exceptionOnError: true, randomAttributes: { count: 5, cardinality: 5 } }
},
{
service: "billing-service",
name: "payment",
randomLinks: { count: 0.5, randomAttributes: { count: 3, cardinality: 10 } },
randomEvents: { exceptionOnError: true, randomAttributes: { count: 4 } }
}
]
},
]
export default function () {
const templateIndex = randomIntBetween(0, traceTemplates.length - 1)
const gen = new tracing.TemplatedGenerator(traceTemplates[templateIndex])
client.push(gen.traces())
sleep(randomIntBetween(1, 5));
}
export function teardown() {
client.shutdown();
}

This yields a dataset that looks a bit more diverse. The new columns are less “weird”:

Key column (#59)
Doubles column (#63)
Benchmark and Test
Here's the (truncated) test I used.

Benchmark code
I run it like:

go test -benchmem -count=10 -run=^$ -bench ^BenchmarkMix$ github.com/grafana/tempo/tempodb/encoding/vparquet4

func newQueryExecuter(t require.TestingT) func(query string) (*tempopb.SearchResponse, error) {
blockID := uuid.MustParse("ffd6c4e3-b711-4a4a-a46b-ea395dc54bf3")
ctx := context.TODO()
r, _, _, err := local.New(&local.Config{
Path: path.Join("/home/nd/develop/01/tempo-data/blocks"),
})
require.NoError(t, err)
rr := backend.NewReader(r)
meta, err := rr.BlockMeta(ctx, blockID, tenantID)
require.NoError(t, err)
opts := common.DefaultSearchOptions()
opts.StartPage = 0
opts.TotalPages = 1
block := newBackendBlock(meta, rr)
_, _, err = block.openForSearch(ctx, opts)
require.NoError(t, err)
e := traceql.NewEngine()
return func(query string) (*tempopb.SearchResponse, error) {
return e.ExecuteSearch(ctx, &tempopb.SearchRequest{Query: query}, traceql.NewSpansetFetcherWrapper(func(ctx context.Context, req traceql.FetchSpansRequest) (traceql.FetchSpansResponse, error) {
return block.Fetch(ctx, req, opts)
}))
}
}
func TestMix(t *testing.T) {
execute := newQueryExecuter(t)
{
resp, err := execute(`{span.http.status_code != 500}`)
require.NoError(t, err)
require.NotNil(t, resp)
require.Len(t, resp.Traces, 75392)
}
{
resp, err := execute(`{span.http.status_code = 500}`)
require.NoError(t, err)
require.NotNil(t, resp)
require.Len(t, resp.Traces, 18871)
}
}
func BenchmarkMix(b *testing.B) {
b.ResetTimer()
bytesRead := 0
b.StopTimer()
for i := 0; i < b.N; i++ {
execute := newQueryExecuter(b)
b.StartTimer()
resp, err := execute("{ span.http.status_code = 500 }")
b.StopTimer()
require.NoError(b, err)
require.NotNil(b, resp)
bytesRead += int(resp.Metrics.InspectedBytes)
}
b.SetBytes(int64(bytesRead) / int64(b.N))
b.ReportMetric(float64(bytesRead)/float64(b.N)/1000.0/1000.0, "MB_io/op")
}

Results
It's somewhat speculative, but from these numbers, your improvements consistently lower overall query time and also reduce the additional regression my code introduces by ~4-5m.
What this PR does:
Below is my understanding of the current limitations. Please feel free to correct me if I’ve misunderstood or overlooked something.
Attributes of the same type are stored in the same column. For example, integers are stored in one column and floats in another.
Querying operates in two stages:
The issue arises because predicates are generated based on the operand type. If an attribute is stored as a float but the operand is an integer, the predicate evaluates against the integers column instead of the floats column. This results in incorrect behavior.
Proposed Solution
The idea is to generate predicates for both integers and floats, allowing both columns to be scanned for the queried attribute.
In this PR, I’ve created a proof-of-concept by copying the existing createAttributeIterator function to createAttributeIterator2. This duplication is intentional, as the original function is used in multiple places, and I want to avoid introducing unintended side effects until the approach is validated.

WDYT? :)
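As a rough illustration of the proposed solution only: the helper names createIntPredicate/createFloatPredicate and the exact wiring below are assumptions, not necessarily what createAttributeIterator2 does.

// Sketch: for a numeric comparison, build predicates for both storage columns
// so the attribute is found whether it was written as an int or a float.
// Helper names are assumed for illustration.
func createNumericPredicates(cond traceql.Condition) (intPred, floatPred parquetquery.Predicate, err error) {
	// Integer-column predicate: still useful when the operand is an int, or a
	// float without a fractional part.
	intPred, err = createIntPredicate(cond.Op, cond.Operands)
	if err != nil {
		return nil, nil, err
	}
	// Float-column predicate: ensures values stored as floats are not skipped
	// when the operand happens to be an int.
	floatPred, err = createFloatPredicate(cond.Op, cond.Operands)
	if err != nil {
		return nil, nil, err
	}
	return intPred, floatPred, nil
}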
Which issue(s) this PR fixes:
Fixes #4332
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]