Impossible to parse cyrillic Uri #86

Open
akka-ci opened this issue Sep 8, 2016 · 14 comments
Labels
1 - triaged: Tickets that are safe to pick up for contributing in terms of likeliness of being accepted
help wanted: Identifies issues that the core team will likely not have time to work on
t:core: Issues related to the akka-http-core module
t:model: Issues around the model classes and its functionality

Comments

akka-ci commented Sep 8, 2016

Issue by RomanIakovlev
Tuesday Feb 02, 2016 at 19:17 GMT
Originally opened as akka/akka#19677


Consider this:

"com.typesafe.akka" %% "akka-http-experimental" % "2.4.2-RC1"

import akka.http.scaladsl.model.Uri

scala> Uri("http://президент.рф/", Uri.ParsingMode.Relaxed)
akka.http.scaladsl.model.IllegalUriException: Illegal URI reference: Invalid input 'п', expected 'EOI', '#', '?', path-abempty or authority (line 1, column 8): http://президент.рф/
       ^
  at akka.http.scaladsl.model.IllegalUriException$.apply(ErrorInfo.scala:40)
  at akka.http.scaladsl.model.Uri$.fail(Uri.scala:741)
  at akka.http.impl.model.parser.UriParser.fail(UriParser.scala:62)
  at akka.http.impl.model.parser.UriParser.parseUriReference(UriParser.scala:33)
  at akka.http.scaladsl.model.Uri$.apply(Uri.scala:209)
  at akka.http.scaladsl.model.Uri$.apply(Uri.scala:199)
  ... 43 elided

Any insights on how to tackle this?

@akka-ci added this to the 2.4.x milestone Sep 8, 2016
@akka-ci added the t:http label Sep 8, 2016

akka-ci commented Sep 8, 2016

Comment by johanandren
Wednesday Feb 03, 2016 at 07:03 GMT


If you read the docs of Uri.apply they say

"Parses a valid URI string into a normalized URI reference as defined by http://tools.ietf.org/html/rfc3986#section-4.1. Percent-encoded octets are decoded using the given charset (where specified by the RFC). If strict is false, accepts unencoded visible 7-bit ASCII characters in addition to the RFC"

So this means you will need to transform the Unicode hostname into ASCII before passing it to apply, or construct your Uri instance manually instead of parsing it, like this: Uri(scheme = "http", authority = Uri.Authority(Uri.NamedHost("президент.рф"))). NamedHost will perform the required Punycode encoding of the IDN (https://en.wikipedia.org/wiki/Internationalized_domain_name).
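
A minimal sketch of the first option (converting the hostname to ASCII before parsing), assuming the JDK's java.net.IDN is acceptable for the conversion. The object and method names are made up for illustration and are not part of akka-http:

```scala
import java.net.IDN

// Hypothetical helper, not an akka-http API.
object IdnHostSketch {
  // Converts a Unicode host name to its ASCII (Punycode) form using the JDK.
  def asciiHost(host: String): String = IDN.toASCII(host)

  def main(args: Array[String]): Unit = {
    val host = asciiHost("президент.рф")
    // The resulting string contains only ASCII characters, so it should be
    // acceptable input for Uri.apply (assumption: not verified here).
    println(s"http://$host/")
  }
}
```

This only covers the host part; non-ASCII characters in the path or query would still have to be percent-encoded separately.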


akka-ci commented Sep 8, 2016

Comment by johanandren
Wednesday Feb 03, 2016 at 08:03 GMT


We should discuss if we can improve this when @ktoso is back.


akka-ci commented Sep 8, 2016

Comment by RomanIakovlev
Wednesday Feb 03, 2016 at 11:14 GMT


Thanks, understood. While it works, it is a workaround of sorts, because akka.http.scaladsl.model.Uri#apply(input: ParserInput): Uri automatically distinguishes between absolute and relative URIs, functionality on which I rely. I'm writing a web crawler, and it's nice to just pass the results of org.jsoup.nodes.Element#select("a[href^=\"/\"]") to the aforementioned Uri#apply method and let it figure out whether each one is absolute or relative.


akka-ci commented Sep 8, 2016

Comment by rkuhn
Wednesday Feb 03, 2016 at 11:30 GMT


RFC 3986 is very strict on what is allowed within a URI, so I would conclude that the attribute values extracted by JSoup need to be sanitized before they can be used in this context. Adding that code to Uri.apply() does not seem right to me.


akka-ci commented Sep 8, 2016

Comment by johanandren
Wednesday Feb 03, 2016 at 11:34 GMT


I agree. My thought was that there will be people who want to parse URIs that contain IDN host parts, and maybe we should/could provide a separate way to do that easily.


akka-ci commented Sep 8, 2016

Comment by RomanIakovlev
Wednesday Feb 03, 2016 at 11:36 GMT


@johanandren my thoughts exactly. And it's not only about hosts: in the wild there are non-ASCII characters in other URI components, such as paths.


akka-ci commented Sep 8, 2016

Comment by rkuhn
Wednesday Feb 03, 2016 at 11:36 GMT


Yes, true. What makes me wonder (in general) is why this punycode thing was even invented, given percent encoding.


akka-ci commented Sep 8, 2016

Comment by drewhk
Wednesday Feb 03, 2016 at 11:59 GMT


Because it is for DNS, and % was a no-go for backwards compatibility (if I understand correctly).

@ktoso added the 1 - triaged label and removed the t:http label Sep 8, 2016
@ktoso removed this from the 2.4.x milestone Sep 12, 2016

eiennohito commented Oct 2, 2016

The actual spec for URLs is: https://url.spec.whatwg.org/
It allows unescaped UTF-8 Unicode code points, at least in fragments. In the wild there are completely unescaped URLs as well.

@RustedBones

I also ran into the same issue.
As mentioned above, the URI RFC 3986 is pretty strict. The IRI RFC 3987, though, provides better internationalization support. Would it be possible to migrate the model to the latter standard?

I've made an attempt in the capturl library to create an IRI model, heavily inspired by the akka-http Uri and also using the parboiled2 parser.

If this is considered relevant, I can try to contribute it into the akka-http project.

@jrudolph
Contributor

@RustedBones, thanks for sharing. I wonder how you would use that model in the context of akka-http? The HTTP spec is also pretty strict about what URIs used in the protocol have to look like. How are IRIs used in the HTTP protocol?

@RustedBones

On the HTTP layer, we don't have to use IRIs directly.
The conversion can be done internally by akka-http like this:

http://президент.рф/пре: IRI -> http://xn--d1abbgf6aiiy.xn--p1ai/%D0%BF%D1%80%D0%B5: URI

At the moment this conversion must be done by users. For better usability, it would be nice if the akka-http client accepted IRIs, which are more user-friendly.
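
To make the conversion above concrete, here is a rough sketch of what users currently have to do by hand, using only the JDK. Everything in it (the object name, the regex, the helper names) is hypothetical and not akka-http or capturl code. It assumes a simple scheme://host/rest shape with no userinfo or port, and it only percent-encodes non-ASCII characters, so it is not a full RFC 3987 mapping:

```scala
import java.net.IDN
import java.nio.charset.StandardCharsets

// Hypothetical helper, for illustration only.
object IriToUriSketch {
  // scheme "://" host, then everything else (path, query, fragment).
  private val Simple = """^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]*)(.*)$""".r

  // Percent-encodes every non-ASCII character as its UTF-8 bytes;
  // ASCII characters are passed through unchanged.
  def percentEncodeNonAscii(s: String): String =
    s.getBytes(StandardCharsets.UTF_8).map { b =>
      if (b >= 0) b.toChar.toString // single-byte (ASCII) character
      else f"%%${b & 0xff}%02X"     // byte of a multi-byte sequence
    }.mkString

  // Returns None when the input does not have the assumed shape.
  def iriToUri(iri: String): Option[String] = iri match {
    case Simple(scheme, host, rest) =>
      Some(s"$scheme://${IDN.toASCII(host)}${percentEncodeNonAscii(rest)}")
    case _ => None
  }

  def main(args: Array[String]): Unit =
    println(iriToUri("http://президент.рф/пре"))
}
```

The host goes through Punycode while the rest of the IRI goes through percent-encoding, which is why the two halves of the example above look so different.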

@jrudolph
Contributor

I see. Indeed that would be nice. Could one solution be to "just" offer a new constructor for Uri that can parse IRIs and then immediately converts them to URIs? An alternative could be that the Uri itself holds the content of the IRI (but what to do about the naming then?) and only converts to a URI when rendering (or when specifically asked to do that).


gaeljw commented Dec 2, 2020

Coming here in 2020, I'm wondering if there is any way using pure Akka to sort this out now, or if we still need another library with IRI support?

@jrudolph added the help wanted, t:core and t:model labels Dec 3, 2020
6 participants