The menu "Download" of the Wikidata Query Service (WDQS) UI lets users export the results of their queries... in an unknown encoding.
This encoding should be UTF-8.
A few days ago, this used to be in UTF-8.
abian, May 13 2017, 6:15 PM
F8135585: current-results-test.csv (May 20 2017, 7:59 PM)
F8135195: query.csv (May 20 2017, 6:50 PM)
F8124816: Capture du 2017-05-19 22-57-49.png (May 19 2017, 8:59 PM)
F8062414: wdqs-download.png (May 13 2017, 6:15 PM)
I've tried several encodings, including all the ISO-8859 (from ISO-8859-1 to ISO-8859-15) but none seems to match the encoding used...
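The trial-and-error approach described above can be sketched as a small script. This is only a minimal sketch: the function name `matching_encodings` and the candidate list are hypothetical, and the sample text is borrowed from a name that appears later in this task. It decodes the raw bytes with each candidate encoding and reports which ones reproduce the expected text.

```python
def matching_encodings(data: bytes, expected: str) -> list:
    """Return the candidate encodings under which `data` decodes to `expected`."""
    # Candidate list is an assumption: UTF-8, Windows-1252 and the ISO-8859 family.
    candidates = ["utf-8", "cp1252"]
    candidates += [f"iso8859_{n}" for n in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15)]
    matches = []
    for name in candidates:
        try:
            decoded = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Bytes are invalid in this encoding, or the codec is unavailable.
            continue
        if decoded == expected:
            matches.append(name)
    return matches


if __name__ == "__main__":
    expected = "Jan Mlčoch"
    print(matching_encodings(expected.encode("utf-8"), expected))
```

If the list comes back empty for a downloaded file, the bytes were probably damaged rather than merely written in a different encoding.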
@abian could you please add:
@Smalyshev in my case, I tried several queries on Ubuntu and Windows, with Chrome and Firefox, and with every option (JSON, TSV, CSV, verbose or not), always opening the result with LibreOffice Calc (version 5.1.6.2). The problem is always the same.
The original query (on Wikidata:Bistro) was this one (instance of family name with writing system Latin script)
I can reproduce it with the link Ash_Crow shared. I ran that query and chose download -> CSV in the menu. I've attached the resulting file. When I open the file in Notepad++ it already looks strange, with line breaks within values.
I'm using Firefox 53.0.2 (64-bit) on Windows 10. Doing the same in Chrome 58.0.3029.110 (64-bit) on the same computer gives the same result: values with line breaks in them.
None of the provided formats (verbose or non-verbose JSON file, verbose or non-verbose TSV file, CSV file) is correct. This problem doesn't seem to depend on the web browser or on the operating system.
You can also use this query for testing. You should be able to download and properly read all the test characters from the results using UTF-8 encoding and without any line break between them.
However, these are my current results.
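The two conditions above (the file decodes as UTF-8, and no value is split by a line break) can be checked mechanically. The following is a minimal sketch for the TSV format, assuming the expected column count is known. The function name `check_tsv_download` is hypothetical, not part of WDQS.

```python
import re


def check_tsv_download(data: bytes, expected_columns: int) -> list:
    """Return a list of problems found in a downloaded TSV file (empty if none)."""
    try:
        text = data.decode("utf-8")  # strict decoding: any invalid byte raises
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc.reason} at byte {exc.start}"]

    # Split on CRLF, LF, or a bare CR; a stray CR inside a value breaks the row.
    rows = re.split(r"\r\n|\n|\r", text)
    if rows and rows[-1] == "":
        rows.pop()  # ignore the empty string after a trailing newline

    problems = []
    for number, row in enumerate(rows, start=1):
        found = row.count("\t") + 1
        if found != expected_columns:
            problems.append(
                f"line {number}: expected {expected_columns} columns, found {found}"
            )
    return problems


if __name__ == "__main__":
    good = "item\tlabel\nQ1\tJan Mlčoch\n".encode("utf-8")
    print(check_tsv_download(good, expected_columns=2))
```

A row with too few columns followed by another short row is the typical signature of a line break inserted in the middle of a value.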
I suspect the wrong version of download.js was deployed in the last GUI deployment. I'll redeploy the GUI and see if it fixes things.
Mentioned in SAL (#wikimedia-operations) [2017-05-21T09:06:45Z] <smalyshev@tin> Started deploy [wdqs/wdqs@227ab25]: Redeploy GUI due to breakage in T165228
Mentioned in SAL (#wikimedia-operations) [2017-05-21T09:07:04Z] <smalyshev@tin> Finished deploy [wdqs/wdqs@227ab25]: Redeploy GUI due to breakage in T165228 (duration: 00m 19s)
Seems to be fixed now after the redeploy. Please reload the GUI (clear the cache, etc.) and try again. If it still happens, please reopen.
I continue getting the same results with any computer. Should we wait a few hours or days?
Mentioned in SAL (#wikimedia-operations) [2017-05-22T06:00:29Z] <smalyshev@tin> Started deploy [wdqs/wdqs@e4301da]: Redeploy GUI due to breakage in T165228
Mentioned in SAL (#wikimedia-operations) [2017-05-22T06:02:19Z] <smalyshev@tin> Finished deploy [wdqs/wdqs@e4301da]: Redeploy GUI due to breakage in T165228 (duration: 01m 50s)
It seems that this bug is back... or at least a very similar one (maybe not the same cause, but clearly the same effect).
Today I ran this query: http://tinyurl.com/yad7ah6w and the result is apparently not UTF-8.
Looks like there was some breakage between 1.4.4 (which worked) and 1.4.7 (which doesn't) in download.js. I'll try to figure out where it was broken and downgrade the build to a fixed working version.
1.4.4 seems to work fine, 1.4.6 is broken.
Reported it as: https://github.com/rndme/download/issues/56
Change 364832 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/gui@master] Fix downloader.js to 1.4.4 to resolve bad non-ASCII downloads
Change 364832 merged by jenkins-bot:
[wikidata/query/gui@master] Fix downloader.js to 1.4.4 to resolve bad non-ASCII downloads
Mentioned in SAL (#wikimedia-operations) [2017-07-13T19:59:16Z] <smalyshev@tin> Started deploy [wdqs/wdqs@a32dbeb]: Redeploy GUI due to breakage in T165228
Mentioned in SAL (#wikimedia-operations) [2017-07-13T20:01:35Z] <smalyshev@tin> Finished deploy [wdqs/wdqs@a32dbeb]: Redeploy GUI due to breakage in T165228 (duration: 02m 19s)
Results are downloading in non-UTF-8 encoding. Please see output of this example query.
@Lucas_Werkmeister_WMDE TSV download format. Opened in Notepad++, the encoding is reported as ANSI, whereas it has always been UTF-8. Some of the encoding issues insert new lines into the download, corrupting both the entities and the file, for example (sample taken from the example query linked previously):
http://www.wikidata.org/entity/Q3157864 Jacques-Antoine-Marie Lemoine 3 Jacques-antoine-marie lemoine
http://www.wikidata.org/entity/Q3157864 Jacques-Antoine-Marie Lemoine 3 Jacques-Antoine-Marie Lemoyne
http://www.wikidata.org/entity/Q3161723 Jan Ml
och 3 Jan Mlcoch
http://www.wikidata.org/entity/Q1964408 Nan Hoover 6 Nancy Dodge Browne
This is even prior to converting back to UTF-8.
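One explanation that is consistent with the sample above, offered here as an assumption and not confirmed anywhere in this task, is that each character is reduced to a single byte by keeping only the low 8 bits of its code unit. Under that assumption, "č" (U+010D) becomes byte 0x0D, a carriage return, which would produce exactly the split seen in "Jan Ml" / "och". The helper `truncate_to_bytes` below is a hypothetical model of such a bug, not code from download.js.

```python
def truncate_to_bytes(text: str) -> bytes:
    """Hypothetical model of the bug: keep the low byte of each UTF-16 code unit."""
    units = text.encode("utf-16-le")  # two bytes per code unit, low byte first
    return units[0::2]                # drop every high byte


if __name__ == "__main__":
    sample = "Jan Mlčoch"
    print(sample.encode("utf-8"))     # correct UTF-8 bytes for comparison
    print(truncate_to_bytes(sample))  # truncated bytes contain a carriage return
```

If this is what happens, the original characters cannot be recovered by re-decoding the file in another encoding, because the high bits are already lost.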
This is still occurring. I just queried and downloaded (in Chrome, TSV) 4 different results. 2 of the outputs were encoded in UTF-8, 2 in ANSI.
Example queries:
Hmm, I think this may be related to the version of downloadjs being bumped to 1.4.7 in 56f9d9aea62e2b4100ea3be3fd728c5fd2116082. I think 1.4.7 is buggy, see https://github.com/rndme/download/issues/56.
Change 395591 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/gui@master] DownloadJS back to 1.4.4
Hm, that might also be why T178564 ("SVG Image query result downloads use incorrect encoding") still seems to be broken. I’ll try it out tomorrow.
Change 395595 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/gui-deploy@production] Merging from 36c776f28febfa6e837c099a5f479f63c35ff225:
Change 395596 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/gui@master] Add comment about downloadjs bug
Change 395595 merged by Smalyshev:
[wikidata/query/gui-deploy@production] Merging from 36c776f28febfa6e837c099a5f479f63c35ff225:
Change 395591 merged by jenkins-bot:
[wikidata/query/gui@master] DownloadJS back to 1.4.4
Change 395596 merged by jenkins-bot:
[wikidata/query/gui@master] Add comment about downloadjs bug
Okay, switching between downloadjs 1.4.4 and 1.4.7 fixes and breaks T178564 ("SVG Image query result downloads use incorrect encoding") locally, respectively. But it’s still broken on query.wikidata.org – I take it the version change isn’t deployed yet?
Suggestion for the future: would it be possible to add an automated test to catch possible encoding issues before they are discovered by users?
Change 396325 had a related patch set uploaded (by Jonas Kress (WMDE); owner: Jonas Kress (WMDE)):
[wikidata/query/gui@master] Add test for DownloadJS utf-8
Change 396325 merged by jenkins-bot:
[wikidata/query/gui@master] Add test for DownloadJS utf-8