Review ETH #4

m90 · 2021-08-02T13:10:03Z

Review feedback from @cyrill-k (thank you):

Comments IETF draft analytics.txt

https://datatracker.ietf.org/doc/draft-ring-analyticstxt/

Technical Comments

1. Does the analytics.txt follow the Internet Message Format? If yes, I would make this more clear/explicit in the beginning of this Section. If no, it might be useful to state the differences to this document.
1. Is there an RFC that defines the chaining of field values?
1. Use the same notation as RFC 8615: "well-known location" instead of ".well-known path/location"

Clarifications

1.1) is there no standard way to access information about what software is being used?
1.1) Hiding presence seems always possible for websites
- Can a client or auditor prove that a specific policy is not violated (i.e., is there an overlap with the automated audit approaches)?
1.2) What are the regulations in particular? What is the overlap with analytics.txt?
1.3) "Analytics or user tracking as referred to in this document does not refer to the identification of users in order to deliver customized advertising or content across websites of any kind." seems confusing to me. What is the distinction here? I.e., if a website uses a specific tracking software, then it should be able to track users using this information, right?
3.4.3.1.5) Meaning of cache? Reference to ETag headers?
4.3) Could the mapping to desktop or mobile applications be specified more clearly?

Content

3.4.1) Could there be an automatic way to fetch additional information (e.g., an automated email reply)?
3.4.2) how about the time of visit?
3.4.2) are all signals disjoint? Or are some overlapping?
3.4.2.1.4) in particular, if the geo location is derived from the IP address, would it still need to be mentioned in addition to the ip address?
3.4.4.1.3) Nit: at the application layer
3.4.4.1.4) Why the distinction between server-side and logs?
3.4.5) Is a binary decision between opt-in and opt-out realistic? Isn't this decision more fine-grained in practice (I'm thinking about the typical "required-cookies" only setting that many websites have)?
3.4.6, 3.4.8, and 3.4.10) Are single-value fields expressive enough for these properties?
3.4.9) Could this property actually be verified, i.e., by fetching the website from different vantage points?
3.4.11 and 3.4.12) This statement seems quite vague: "This field SHOULD only be added if it makes the setup described by the file easier to understand for human users."

Personal Thoughts

1.1) One of the motivation statements is that "... creating incentives for software to find workarounds, thus allowing them to hide their presence from users". Wouldn't analytics.txt have the same issue (i.e., websites simply state that they do not track, collect, etc)?
3.4.2.1.10) The system seems to be based on trust, but websites that clients trust to provide correct analytics.txt information are typically not a concern for the client in terms of privacy. Maybe elaborate on this point?

m90 · 2021-08-05T08:09:56Z

Thanks again for your review @cyrill-k. I'm currently working through these points and updating the draft accordingly.

Some follow-ups on the points you raised:

Meaning of cache? Reference to ETag headers?

There is a slightly exotic way of identifying users by leveraging HTTP caches, sometimes called "super cookies", which is what this is referring to. It works as follows:

1. The client requests https://example.com/supercookie.gif.
2. The server sends the resource along with a unique ETag header (that is not connected to the actual content).
3. The next time the client requests the supercookie.gif resource, it sends back the ETag value, which is actually a user identifier.
4. The server can now identify the user and collect data that is tied to the user identifier given in the ETag header.
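
The steps above can be sketched as a small in-memory simulation. This is only an illustration under assumptions: `SupercookieServer` and its `handle` method are hypothetical names, no real HTTP traffic is involved, and the one thing taken from HTTP itself is that a client revalidating a cached resource sends the stored ETag back in the `If-None-Match` request header.

```python
import uuid


class SupercookieServer:
    """Toy server that misuses ETag revalidation as a user identifier (hypothetical)."""

    def __init__(self):
        # Maps each issued identifier to the number of requests seen for it.
        self.visits = {}

    def handle(self, request_headers):
        # A client revalidating its cache echoes the stored ETag in If-None-Match.
        user_id = request_headers.get("If-None-Match")
        if user_id is None or user_id not in self.visits:
            # First visit: mint an identifier that is unrelated to the content.
            user_id = '"' + uuid.uuid4().hex + '"'
            self.visits[user_id] = 1
            return 200, {"ETag": user_id}
        # Returning visit: the echoed ETag identifies the user.
        self.visits[user_id] += 1
        return 304, {"ETag": user_id}


server = SupercookieServer()

# First request: nothing cached yet, so no If-None-Match header is sent.
status, headers = server.handle({})
cached_etag = headers["ETag"]

# Second request: the client revalidates and thereby reveals its identifier.
status_again, _ = server.handle({"If-None-Match": cached_etag})
```

No cookie is set at any point here; the identifier lives only in the client's HTTP cache.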

Is this clear enough in the spec or do you think we should be more explicit here?

Could there be an automatic way to fetch additional information (e.g., an automated email reply)?

Could you elaborate on this and what kind of additional information a user could be requesting that is not part of the analytics.txt file?

Why the distinction between server-side and logs?

Server-side analytics does not necessarily mean using logs. In web applications you can use "analytics middleware" at the application layer that records every request along with some metadata and also identifies users.
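
As a rough sketch of what such middleware can look like, here is a minimal WSGI wrapper. The names `AnalyticsMiddleware`, `hello_app`, and `records` are made up for illustration, and the recorded fields are assumptions rather than anything the draft prescribes.

```python
import time


class AnalyticsMiddleware:
    """Hypothetical WSGI middleware that records metadata for every request."""

    def __init__(self, app, store):
        self.app = app      # the wrapped WSGI application
        self.store = store  # any list-like sink for collected records

    def __call__(self, environ, start_response):
        # Collect request metadata in the application itself, without any log file.
        self.store.append({
            "time": time.time(),
            "path": environ.get("PATH_INFO", ""),
            "ip": environ.get("REMOTE_ADDR", ""),
            "user_agent": environ.get("HTTP_USER_AGENT", ""),
        })
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    # Minimal application standing in for a real website.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


records = []
app = AnalyticsMiddleware(hello_app, records)

# Simulate a single request with a hand-built WSGI environ.
fake_environ = {
    "PATH_INFO": "/about",
    "REMOTE_ADDR": "203.0.113.7",
    "HTTP_USER_AGENT": "demo",
}
body = app(fake_environ, lambda status, headers: None)
```

The data ends up in application storage rather than in server logs, so it can exist even when logging is disabled.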

Is a binary decision between opt-in and opt-out realistic? Isn't this decision more fine-grained in practice (I'm thinking about the typical "required-cookies" only setting that many websites have)?

This decision is not really binary: if you don't allow opt-in or opt-out, you have to specify none.

- opt-in means no data is collected before the user gives consent (you might have to set a cookie to store that preference, yes, but that's not what we are looking at)
- opt-out means the user can opt out of everything at any time
- everything else is classified as none

We aren't concerned about software setting cookies, but we're looking at the consequences of this.
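
The three-way classification described above can be written as a short decision function. This is a sketch only: `classify_consent` and its two boolean parameters are hypothetical, and reducing a site's behavior to these two booleans is a simplifying assumption.

```python
def classify_consent(collects_before_consent, can_opt_out_of_everything):
    """Return "opt-in", "opt-out", or "none" following the rules above (hypothetical helper)."""
    if not collects_before_consent:
        # No data is collected until the user has agreed.
        return "opt-in"
    if can_opt_out_of_everything:
        # Data is collected by default, but the user can stop all of it at any time.
        return "opt-out"
    # Anything else, including partial opt-outs, falls through to "none".
    return "none"


# A site with a "required only" setting that still collects some data
# regardless of the user's choice offers neither a full opt-in nor a full opt-out.
required_only_site = classify_consent(
    collects_before_consent=True,
    can_opt_out_of_everything=False,
)
```

Under these assumptions, the "required-cookies only" case from the review maps to `none`.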

3.4.6, 3.4.8, and 3.4.10) Are single-value fields expressive enough for these properties?

Not entirely sure, but in draft version 01 these are multi-value fields.

cyrill-k · 2021-08-27T17:54:35Z

Hi,

I'm really sorry for the late response, I was quite busy but I hope I can reply more quickly in the future ;)

Is this clear enough in the spec or do you think we should be more explicit here?

If this way of identifying users is a common approach, then I think the current draft is clear enough. Otherwise you could add a short explanation. I just didn't know about ETag before. Maybe you could add a reference to RFC 7232 for readers who don't know the term?

Could you elaborate on this and what kind of additional information a user could be requesting that is not part of the analytics.txt file?

I was wondering whether there is (or could be) a way for people (or tools) interested in more detailed privacy information to fetch it (e.g., via email). But this would probably be out of scope for this draft.

Server-side analytics does not necessarily mean using logs. In web applications you can use "analytics middleware" at the application layer that records every request along with some metadata and also identifies users.

I still don't really see how this differentiation impacts the privacy assessment of a user. Maybe it would be good to briefly elaborate on the difference for a user.

This decision is not really binary: if you don't allow opt-in or opt-out, you have to specify none.

opt-in means no data is collected before giving consent (you might have to set a cookie to store that preference, yes, but that's not what we are looking at)

opt-out means the user can opt out of everything at any time

everything else is classified as none

We aren't concerned about software setting cookies, but we're looking at the consequences of this.

Ok, that makes sense. Maybe it would be good to clarify this in the report as well, using the cookie consent example.

Not entirely sure, but in draft version 01 these are multi-value fields.

I was thinking of the scenario where a website retains different collected values for different lengths of time (e.g., retain visited URLs for one year, but delete IP addresses after one month). But I'm not sure how common and/or how useful such fine granularity would be.
