Setters can cause failure to roundtrip #651

TimothyGu · 2021-09-24T23:46:14Z

Generally, we assume setters will keep the URL valid and roundtrippable. However, this turns out not to be true in a specific case:

u = new URL('data:text/html,hello #hash');
u.hash = '';
console.log(JSON.stringify(u.href)); // "data:text/html,hello "

u2 = new URL(u); // constructor strips leading/trailing spaces as a first step
console.log(JSON.stringify(u2.href)); // "data:text/html,hello" without the trailing space

This is only an issue because we allow spaces to appear verbatim in cannot-be-a-base URL paths.

(first discovered in https://crbug.com/291747)

TimothyGu · 2021-09-24T23:56:59Z

A few options here:

Concede that setters may make URLs non-roundtrippable, and be okay with that.
In a setter, if the particular component is completely removed, check if the serialization ends with a space, and trim the path if needed.
Don't allow trailing spaces in cannot-be-a-base URL paths in the initial parse, so parsing data:text/html,hello #hash would result in data:text/html,hello#hash.
Disallow hash and search setters on cannot-be-a-base URLs, like how path is already a no-op.

Option 1 is probably the most web-compatible for the most content, but leaves a weird edge case and a sour taste in my mouth. Option 2 can be a bit surprising to the user (the pathname can change even when only hash is being set), but probably also web-compatible too. Option 3 is the most theoretically pleasing, but I would not be surprised if some web content depends on data:text/html,hello ? world being rendered as hello ? world. (After the change, it would be rendered instead as hello? world.) Option 4 is perhaps too radical to fix such an edge case.

I'm leaning towards option 2.

annevk · 2021-09-27T08:19:36Z

We could also say that in such cases the mapping of empty string to null does not happen. I.e., we would preserve # or ? in the serialized URL.
We percent-encode the final code point of the path or query as needed.

I think I like these a bit better than 2. 4 would be problematic for fragment navigation, which is definitely a thing in data: URLs. 3 seems a little surprising. Agreed on 1.

TimothyGu · 2021-09-28T00:41:04Z

I think I prefer 2 over 5. With 2, users have more options: if they want to keep the space and the delimiter, then they can explicitly opt into it by setting .hash = '#'. With 5, there's no way to remove the # through setters alone.

6 seems a little surprising – we never only percent-encode one character. Chrome is also already running into some trouble trying to percent-encode more characters in non-special URLs (https://crbug.com/1252739), so ideally I'd like to keep the path the way it is.

karwa · 2021-09-28T01:18:57Z

Start percent-encoding spaces in the paths of cannot-be-a-base URLs

I can't think of anywhere else where naked spaces are allowed (not in the scheme, query, fragment, userinfo, not in domains, not in opaque hostnames, etc) and RFC-3986 also doesn't allow them. Are there use-cases which absolutely depend on non-encoded spaces being preserved verbatim in CBAB URL paths?

TimothyGu · 2021-09-28T01:37:01Z

@karwa The ability to percent-encode unusual characters in CBAB paths would be the ideal case. Sadly, I doubt that will be web-compatible. In fact, I think it'll be difficult just to encode spaces in hierarchical non-special URL paths (in Chrome and Firefox that is – Safari already does that).

annevk · 2021-10-21T07:51:41Z

@achristensen07 @valenting thoughts? I could live with 2.

valenting · 2021-10-21T10:59:58Z

I think that no.2 is the least likely to cause breakage, so that's a ➕ for me.

alwinb · 2021-11-22T11:13:29Z

Generally, we assume setters will keep the URL valid

Not particularly relevant to the discussion here, but be aware that setters do not preserve validity.
You can get a \ into a special URL's path or a # into a fragment, to name a few. See #379.

TimothyGu · 2021-11-22T20:06:47Z

@alwinb I think we are using different definitions of "valid" here. When I say "setters will keep the URL valid" I don't mean "valid URL string" as defined in the spec, but rather "roundtrippable". If you have other examples of where setters make the URL non-roundtrippable, I'd certainly like to hear it.

karwa · 2021-11-22T20:27:32Z

Setters being round-trippable is nice. I'd like to add that I think setters should preserve the structure of the URL component, so that urlA.component = urlB.component should strive to not change the component's meaning.

One issue I've been wondering about bringing up is that non-special paths can have unescaped backslashes, but if you go through the setter, the URL parser will then interpret those backslashes as path separators and normalise them to forwardslashes. This violates the urlA.component = urlB.component idea if the path is set on a special URL (urlA is special, urlB is not special). An example from my library:

// Backslashes are not percent-encoded in non-special URLs,
// so copying the path to a special URL can cause it to be interpreted differently.

var specialURL = WebURL("https://example.com/")!
specialURL.serialized()  // "https://example.com/"
specialURL.path,         // "/"
specialURL.pathComponents.count // 1

let nonSpecialURL = WebURL(#"foo://example.com/baz\qux"#)!
nonSpecialURL.serialized()  // "foo://example.com/baz\qux"
nonSpecialURL.path          // "/baz\qux"
nonSpecialURL.pathComponents.count. // 1

specialURL.path = nonSpecialURL.path

specialURL.serialized()  // "https://example.com/baz/qux" (!)
specialURL.path          // "/baz/qux" (!)
specialURL.pathComponents.count  // 2 (!)

We could get around this if the URL parser percent-encoded each component in list-style paths with a "path component" encode-set, which would consist of the path set + backslash. Currently the path set does not include path separators themselves.

TimothyGu · 2021-11-23T05:06:04Z

@karwa Escaping \ in non-special paths sounds like a reasonable idea, but I'm not sure about the web compatibility. Browsers are consistent in not escaping \, even though RFC 3986 disallows \ in paths and several non-browser implementations (e.g., Go) do escape it.

Either way it should probably be in a separate ticket. (Edit: created #675 to track this.)

achristensen07 · 2021-11-23T16:21:34Z

I like 2, which is effectively what Safari already does and we haven't seen any compatibility issues from it.

TimothyGu · 2021-11-23T16:41:26Z

@achristensen07 Oh interesting! Didn't realize Safari already fixed this in some way. Sounds like we have alignment on 2 then.

As opaque paths can end in U+0020, those trailing U+0020 code points need to be removed from the path when both query and fragment become null. Tests: ... Fixes #651.

annevk · 2022-12-16T15:30:32Z

I got around to creating tests for this as well as a PR: #728. Safari does indeed seem to do this correctly (although there are some unrelated pathname setter issues I found while testing).

I noted Gecko down as supportive per @valenting's comment above.

annevk · 2022-12-22T07:48:22Z

Those pathname setter issues were not quite unrelated as @domenic noticed during review (thanks!). I mistakenly thought url.pathname = "blah " would already result in "blah" (trailing space removed), but that is not the case.

It turns out that you cannot use pathname with opaque paths. And when the path is not opaque percent-encoding of U+0020 happens, though that is not what WebKit currently implements. (And other browsers are generally poor for non-special URLs. Hopefully that'll improve this year!)

Tests will be updated to reflect all this.

As opaque paths can end in U+0020, those trailing U+0020 code points need to be removed from the path when both query and fragment become null. Tests: web-platform-tests/wpt#37556. Fixes #651.

annevk added the topic: api label Sep 27, 2021

TimothyGu mentioned this issue Nov 30, 2021

Should we escape \ in non-special non-opaque paths? #675

Open

annevk added a commit that referenced this issue Dec 16, 2022

Ensure the API roundtrips for opaque paths

6577017

As opaque paths can end in U+0020, those trailing U+0020 code points need to be removed from the path when both query and fragment become null. Tests: ... Fixes #651.

annevk mentioned this issue Dec 16, 2022

Ensure the API roundtrips for opaque paths #728

Merged

4 tasks

annevk closed this as completed in #728 Jan 2, 2023

karwa mentioned this issue Sep 5, 2023

URL serialiser does not produce idempotent strings if an opaque path has a trailing space and "exclude fragment" is true #784

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Setters can cause failure to roundtrip #651

Setters can cause failure to roundtrip #651

Setters can cause failure to roundtrip #651

Setters can cause failure to roundtrip #651

Comments