Barnaby Walters • Notes

Here’s a python snippet for analysing an iNaturalist export file and exporting an HTML-formatted list of species which only have observations from a single person (e.g. this list for the CNC Wien 2021)

# coding: utf-8

import argparse
import pandas as pd

"""
Find which species in an iNaturalist export only have observations from a single observer.

Get an export from here: https://www.inaturalist.org/observations/export with a query such
as quality_grade=research&identifications=any&rank=species&projects[]=92926 and at least the
following columns: taxon_id, scientific_name, common_name, user_login

Download it, extract the CSV, then run this script with the file name as its argument. It will
output basic stats formatted as HTML.

The only external module required is pandas.

Example usage:

		py uniquely_observed_species.py wien_cnc_2021.csv > wien_cnc_2021_results.html

If you provide the --project-id (-p) argument, the taxa links in the output list will link to 
a list of observations of that taxon within that project. Otherwise, they default to linking
to the taxon page.

If a quality_grade column is included, non-research-grade observations will be included in the
analysis. Uniquely observed species with no research-grade observations will be marked. Species
which were observed by multiple people, only one of which has research-grade observation(s) will
also be marked.

By Barnaby Walters waterpigs.co.uk
"""

if __name__ == "__main__":
	parser = argparse.ArgumentParser(description='Given an iNaturalist observation export, find species which were only observed by a single person.')
	parser.add_argument('export_file')
	parser.add_argument('-p', '--project-id', dest='project_id', default=None)

	args = parser.parse_args()

	uniquely_observed_species = {}

	df = pd.read_csv(args.export_file)

	# If quality_grade isn’t given, assume that the export contains only RG observations.
	if 'quality_grade' not in df.columns:
		df.loc[:, 'quality_grade'] = 'research'

	# Filter out casual observations.
	df = df.query('quality_grade != "casual"')

	# Create a local species reference from the dataframe.
	species = df.loc[:, ('taxon_id', 'scientific_name', 'common_name')].drop_duplicates()
	species = species.set_index(species.loc[:, 'taxon_id'])
	
	for tid in species.index:
		observers = df.query('taxon_id == @tid').loc[:, 'user_login'].drop_duplicates()
		research_grade_observers = df.query('taxon_id == @tid and quality_grade == "research"').loc[:, 'user_login'].drop_duplicates()

		if observers.shape[0] == 1:
			# Only one person made any observations of this species.
			observer = observers.squeeze()
			if observer not in uniquely_observed_species:
				uniquely_observed_species[observer] = []

			uniquely_observed_species[observer].append({
				'id': tid,
				'has_research_grade': (not research_grade_observers.empty),
				'num_other_observers': 0
			})
		elif research_grade_observers.shape[0] == 1:
			# Multiple people observed the species, but only one person has research-grade observation(s).
			rg_observer = research_grade_observers.squeeze()
			if rg_observer not in uniquely_observed_species:
				uniquely_observed_species[rg_observer] = []
			
			uniquely_observed_species[rg_observer].append({
				'id': tid,
				'has_research_grade': True,
				'num_other_observers': observers.shape[0] - 1
			})
	
	# Sort observers by number of unique species.
	sorted_observations = sorted(uniquely_observed_species.items(), key=lambda t: len(t[1]), reverse=True)

	print(f"<p>{sum([len(t) for _, t in sorted_observations])} taxa uniquely observed by {len(sorted_observations)} observers.</p>")

	print('<p>')
	for observer, _ in sorted_observations:
		print(f"@{observer} ", end='')
	print('</p>')

	print('<p><b>bold</b> species are ones for which the given observer has one or more research-grade observations.</p>')
	print('<p>If only one person has RG observations of a species, but other people have observations which need ID, the number of needs-ID observers is indicated in parentheses.</p>')

	for observer, taxa in sorted_observations:
		print(f"""\n\n<p><a href="https://www.inaturalist.org/people/{observer}">@{observer}</a> ({len(taxa)} taxa):</p><ul>""")
		for tobv in sorted(taxa, key=lambda t: species.loc[t['id']]['scientific_name']):
			tid = tobv['id']
			t = species.loc[tid]

			if args.project_id:
				taxa_url = f"https://www.inaturalist.org/observations?taxon_id={tid}&amp;project_id={args.project_id}"
			else:
				taxa_url = f'https://www.inaturalist.org/taxa/{tid}'
			
			rgb, rge = ('<b>', '</b>') if tobv.get('has_research_grade') else ('', '')
			others = f" ({tobv.get('num_other_observers', 0)})" if tobv.get('num_other_observers', 0) > 0 else ''

			if not pd.isnull(t['common_name']):
				print(f"""<li><a href="{taxa_url}">{rgb}<i>{t['scientific_name']}</i> ({t['common_name']}){rge}{others}</a></li>""")
			else:
				print(f"""<li><a href="{taxa_url}">{rgb}<i>{t['scientific_name']}</i>{rge}{others}</a></li>""")
		print("</ul>")
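
The core grouping step can also be sketched without pandas. This is only a minimal illustration of the idea, not the script above: the rows are made up, single_observer_taxa is a hypothetical helper name, and quality grades are ignored entirely.

```python
from collections import defaultdict


def single_observer_taxa(rows):
    """Map each observer to the taxon IDs that only they observed."""
    observers_by_taxon = defaultdict(set)
    for row in rows:
        observers_by_taxon[row["taxon_id"]].add(row["user_login"])

    unique = defaultdict(list)
    for taxon_id, observers in observers_by_taxon.items():
        if len(observers) == 1:
            (observer,) = observers
            unique[observer].append(taxon_id)
    return dict(unique)


# Made-up rows standing in for what csv.DictReader would yield from an export.
rows = [
    {"taxon_id": 1, "user_login": "alice"},
    {"taxon_id": 1, "user_login": "bob"},
    {"taxon_id": 2, "user_login": "alice"},
    {"taxon_id": 2, "user_login": "alice"},
    {"taxon_id": 3, "user_login": "bob"},
]
result = single_observer_taxa(rows)
print(result)
```

Taxon 1 is dropped because two people observed it; repeated observations by the same person still count as a single observer.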

#protip for anyone using QtMultimedia QAudioInput with python’s wave module to write PCM data to a wave file: QAudioFormat’s sampleSize() is in bits and wave’s sample width is in bytes, so to convert between them, divide by 8 (using integer division, as setsampwidth() expects an int), e.g.:

wave_file_to_write.setsampwidth(audio_format.sampleSize() // 8)

QAudioFormat’s sampleRate() number works as it is.
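
A rough, Qt-free sketch of the conversion, using only the standard library. The sample_size_bits and sample_rate values are stand-ins for what audio_format.sampleSize() and audio_format.sampleRate() would return, and bits_to_sample_width is a made-up helper name.

```python
import io
import wave


def bits_to_sample_width(sample_size_bits):
    """Convert a sample size in bits (e.g. 16) to wave's width in bytes (2)."""
    if sample_size_bits % 8 != 0:
        raise ValueError("sample size must be a whole number of bytes")
    return sample_size_bits // 8


# Stand-in values; in a Qt app these would come from the QAudioFormat.
sample_size_bits = 16
sample_rate = 44100

buffer = io.BytesIO()
with wave.open(buffer, "wb") as wave_file:
    wave_file.setnchannels(1)
    wave_file.setsampwidth(bits_to_sample_width(sample_size_bits))
    wave_file.setframerate(sample_rate)  # the sample rate is passed through as-is
    wave_file.writeframes(b"\x00\x00" * sample_rate)  # one second of silence
```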

More #SPSS fun: use the apparently undocumented /QUALIFIER = '"' (that’s a double quote inside two single quotes) option in a GET DATA to make CSV lines like

1,5,3,4,"A text value with a comma, which is still only one cell"

work correctly.
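
For comparison only (this is python, not SPSS syntax): the csv module’s quotechar option plays the same role as the qualifier, so the line above comes out as five cells rather than six.

```python
import csv
import io

line = '1,5,3,4,"A text value with a comma, which is still only one cell"\n'

# quotechar='"' tells the reader that commas inside double quotes are data.
rows = list(csv.reader(io.StringIO(line), delimiter=",", quotechar='"'))
cells = rows[0]
print(len(cells), cells[-1])
```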

Finally solved a long-standing problem getting Icelandic characters to work properly in files being downloaded onto Windows machines for use as SPSS syntax. Turns out the solution is to explicitly set the download charset to UTF-8, and to prepend an unnecessary BOM (yuk) to the beginning of the file like so (context: Django view):

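import codecs

from django.http import HttpResponse

def export_spss(request):
    # generate_spss_syntax() is a placeholder for whatever builds the syntax text.
    response = HttpResponse(generate_spss_syntax(), status=200, content_type="application/x-spss; charset=utf-8")
    response['Content-Disposition'] = 'attachment; filename=syntax.sps'
    # Prepend the BOM so Windows software recognises the file as UTF-8.
    response.content = codecs.BOM_UTF8 + response.content
    return response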

Why is a BOM, which should be completely unnecessary in a UTF-8 file (it has no variable byte order after all) apparently required by some Windows software in order to tell it that the file is UTF-8 encoded, despite Unicode mode being on? Sigh.
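
Outside of Django, the same bytes can be produced with python’s built-in utf-8-sig codec. A small sketch, with a made-up Icelandic sample string:

```python
import codecs

text = "Þetta er íslenska: ð æ ö"  # made-up sample text

# Prepending the BOM by hand and using the utf-8-sig codec give identical bytes.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
same = text.encode("utf-8-sig")

# Decoding with utf-8-sig strips the BOM again.
decoded = with_bom.decode("utf-8-sig")
print(with_bom[:3], decoded == text)
```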

↪

Tantek Çelik: new home page * 100 posts via flat bim files * <64KB HTML * <1s page load no DB XHR ∞scroll needed beat that, silos :)

@t excellent minimal Like implementation! Whilst your homepage performance is admirable, I don’t think you can meaningfully compare it to silo infinite scroll until there’s some sort of pagination :) Currently, without rel-prev[ious] links, there’s no way for crawlers (e.g. readers like Shrewdness, semi-hypothetical “your year in indieweb” tools) to find your old posts other than fetching each one individually, which for many cases takes too long to provide a good experience. For example, crawling your year’s worth of content takes ≈162s, verifiable with the following bash+PHP code:

curl -Ss https://getcomposer.org/installer | php

./composer.phar require taproot/subscriptions

php -a  # Start an interactive shell, paste in following code (alternatively save into file):

@(require 'vendor/autoload.php'); $start = microtime(1); echo "Starting crawl…\n"; Taproot\Subscriptions\crawl('http://tantek.com/2014/365/t1/indieweb-like-posts-2015-commitment-done', function ($r) { echo "."; if (substr($r['mf2']['items'][0]['properties']['published'][0], 0, 4) == '2013') { return false; } else { return true; } }); $total = microtime(1) - $start; echo "\nYear crawl for 2014 took {$total}s";

The Heroku python client library is horribly out of date, and many simple things which should work don’t, throwing confusing errors. Here’s my version:

# coding: utf-8

import requests

HEROKU_URL = 'https://api.heroku.com'


class Client():
    def __init__(self, api_key, heroku_url=None):
        self.heroku_url = HEROKU_URL if heroku_url is None else heroku_url
        self.session = requests.Session()
        self.session.headers.update({'Accept': 'application/vnd.heroku+json; version=3'})
        self.session.auth = ('', api_key)

    def get(self, path):
        r = self.session.get('%s/%s' % (self.heroku_url.rstrip('/'), path.lstrip('/')))
        r.raise_for_status()
        return r.json()

    def post(self, path, data=None):
        r = self.session.post('%s/%s' % (self.heroku_url.rstrip('/'), path.lstrip('/')), data=data)
        r.raise_for_status()
        return r.json()

Add a similar method for DELETE if you find you need one. I haven’t yet, as the idea of programmatically being able to delete apps is much more worrying than the ability to create them.

How to query for all highways in #OSM Overpass API (demo: overpass-turbo.eu/s/3sB):

<osm-script output="json">
  <query type="way" into="highways">
    <bbox-query {{bbox}}/>
    <has-kv k="highway" />
  </query>
 
  <union>
    <item set="highways"/>
    <recurse from="highways" type="down"/>
  </union>
  
  <print mode="body" order="quadtile"/>
</osm-script>

This returns all highways without filtering. Check out the OSM Highway docs for the different possible types of highway, and add a v="" attribute to the has-kv element to filter.
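
If the query is being generated from python, the filtered has-kv element could be built roughly like this. It is only a sketch: has_kv is a made-up helper, and "primary" is used as an example highway value.

```python
import xml.etree.ElementTree as ET


def has_kv(key, value=None):
    """Build a has-kv element, adding a v attribute only when filtering."""
    attrs = {"k": key}
    if value is not None:
        attrs["v"] = value
    return ET.tostring(ET.Element("has-kv", attrs), encoding="unicode")


unfiltered = has_kv("highway")
filtered = has_kv("highway", "primary")
print(unfiltered)
print(filtered)
```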

Javascript has no real Set or Dictionary implementation, which for someone spoiled by python’s set and dicts is rather frustrating. However, in lieu of the poorly supported js Set type, plain old objects can be massaged into acting as both sets and dicts:


// Python: d = dict()
var d = {};

// d['key'] = 'value'
d['key'] = 'value';

// d.get('nonexistent', 'fallback')
d.hasOwnProperty('nonexistent') ? d['nonexistent'] : 'fallback';

// d.keys()
Object.keys(d);

// s = set()
var s = {};

// s.add(1)
s[1] = true;

// 1 in s
s.hasOwnProperty(1);

// Accessing all values in set:
Object.keys(s);

Notes: the in operator can be used to test membership, but will incorrectly return true for __proto__, as well as all properties up the prototype chain, i.e. all properties and methods of Object. hasOwnProperty is much safer to use.

Similarly, the use of the ternary operator for get-item-with-fallback could in theory be replaced with d['item'] || 'fallback', unless of course the value stored was falsey, in which case the or will incorrectly return a truthier fallback.
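
The same falsy-value trap exists on the python side if or is used instead of the default argument to get. A tiny illustration with a made-up dict:

```python
d = {"count": 0}

# get() with a default only falls back when the key is missing.
with_get = d.get("count", "fallback")

# `or` also falls back when the stored value is falsy, which is rarely wanted.
with_or = d.get("count") or "fallback"

missing = d.get("nonexistent", "fallback")
print(with_get, with_or, missing)
```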

Considering the possibility that preferring to use spaces over tabs for code indentation is indicative of excess complacency with the status quo of using complex text formats to express behaviour, and of the viewpoint that code is a static, inflexible material (think raster images vs vector)

I am inordinately proud of this tiny #namecoin progress bash snippet:

watch 'echo "$(namecoind getblockcount) / $(curl -s http://explorer.dot-bit.org/stats/block_count.txt) * 100" | bc -l'

↪

Sandeep Shetty: @BarnabyWalters Any chance you might share the "~210 lines of code" with me? If not publicly, maybe a private gist?

@sandeepshetty sure! gist.github.com/barnabywalters/7863676 — included the basic functions plus the convenience class I use and a little demo. Very specific to chronological post storage/indexing, and very much in flux. I’d be interested to hear your thoughts about it.

#microformats2 bookmarklet: drag this to your bookmarks bar for one-click mf2 parsing:

View microformats2

Code (readable):

javascript:(function() {
	if (document.location.hostname == 'pin13.net' && document.location.pathname == '/mf2/') {
		document.location.href = decodeURIComponent(document.location.search.slice(5));
	} else {
		document.location.href = 'http://pin13.net/mf2?url=' + encodeURIComponent(document.location.href);
	}
}())

Github’s new-issue UI supports URL query param auto-filling of name, content and labels (as `title`, `body` and multiple `labels[]` parameters):

https://github.com/barnabywalters/weave/issues/new
?title=Name Of Issue
&body=Blah blah blah problems
&labels[]=bug
&labels[]=enhancement
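
When generating such links from code, the values need URL-encoding. A sketch of one way to do it in python; new_issue_url is a made-up helper, and the parameter names are the ones listed above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def new_issue_url(repo, title, body, labels):
    """Build a pre-filled new-issue URL; repeated labels[] keys carry the labels."""
    params = [("title", title), ("body", body)]
    params += [("labels[]", label) for label in labels]
    return f"https://github.com/{repo}/issues/new?{urlencode(params)}"


url = new_issue_url(
    "barnabywalters/weave",
    "Name Of Issue",
    "Blah blah blah problems",
    ["bug", "enhancement"],
)
print(url)
```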

How to emulate standard #php front-controller behaviour of routing static assets statically, otherwise calling index.php using the PHP 5.4 built-in server:

// file: index.php
// Route static assets from CLI server
if (PHP_SAPI === 'cli-server') {
    if (file_exists(__DIR__ . $_SERVER['REQUEST_URI']) and !is_dir(__DIR__ . $_SERVER['REQUEST_URI'])) {
        return false;
    }
}

// do usual front-controller stuff

Watch out for #python dict-based string interpolation examples which look like this:

'Hello, %(name)s' % {'name': 'Otter'}

That s after the brackets isn’t pluralising one adorable aquatic mammal into a whole bunch of them, it’s actually part of the interpolation placeholder — the equivalent of

'Hello, %s' % 'Otter'

Note also that python lets you put a space between the closing bracket and the type-signifying character (the space is parsed as a conversion flag, so whatever character follows it becomes the type). This can cause extremely weird bugs when the string being interpolated is also being translated. For example:

_('%(customer) shared a thing') % {'customer': 'Mr. Bean'}

If not translated, this will produce this confusing but fairly easy to debug output

'Mr. Beanhared a thing'

But if 'shared' is translated into a word beginning with, for example, d, you’ll just get an exception like TypeError: A float is required
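
Both failure modes can be reproduced in a few lines. The translated string here is made up purely so that the word after the space starts with d; the exact exception message varies between python versions, so only the exception type is checked.

```python
values = {"customer": "Mr. Bean"}

# The space is taken as a flag and the following "s" as the conversion type.
untranslated = "%(customer) shared a thing" % values

# Hypothetical translation whose first word after the space starts with "d",
# so python sees a "% d" integer conversion and rejects the string value.
translated_template = "%(customer) deildi hlut"
try:
    translated_template % values
    error = None
except TypeError as exc:
    error = exc

# With the type character in place, the text after it is left alone.
fixed = "%(customer)s shared a thing" % values
print(untranslated, type(error).__name__, fixed, sep=" | ")
```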

If you’re working with undocumented lat/long coordinate data and, when plotted, everything’s coming out sort of in the right place but a little way off, check to see whether or not what looks like decimal lat long data is actually traditional DMS data.

For example, I recently had to parse and plot a bunch of coordinates which looked like this: 6359550-2154605. I initially thought it was decimal lat/long data missing decimal points for some reason, so I plotted it as 63.59550, -21.54605. All of the coordinates were in the right place relative to each other, but about 1/3rd of a degree off. Turns out the data actually needed to be plotted as 63˚ 59' 55.0", -21˚ 54' 60.5".

Here’s the python I wrote to clumsily convert the strange original form into decimal:

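def dms_to_decimal(old):
    # old is one packed coordinate such as '6359550' or '-2154605' (DDMMSSs).
    if old[0] == '-':
        old = old[1:]
        multiplier = -1
    else:
        multiplier = 1

    degrees = int(old[0:2])
    minutes = int(old[2:4])
    seconds = int(old[4:7]) / 10.0  # the last digit is tenths of a second
    return (degrees + minutes / 60.0 + seconds / 3600.0) * multiplier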
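
To go from the full packed string to a lat/long pair, something like the following sketch could work. It assumes every coordinate is exactly seven digits (two for degrees, two for minutes, three for seconds in tenths) with an optional leading minus, which fits the example above but is a guess about the rest of the data; parse_packed_dms is a made-up name.

```python
import re


def parse_packed_dms(packed):
    """Parse e.g. '6359550-2154605' (DDMMSSs twice) into decimal (lat, lon)."""
    match = re.fullmatch(r"(-?\d{7})(-?\d{7})", packed)
    if match is None:
        raise ValueError(f"unexpected coordinate format: {packed!r}")

    def to_decimal(part):
        sign = -1 if part.startswith("-") else 1
        digits = part.lstrip("-")
        degrees = int(digits[0:2])
        minutes = int(digits[2:4])
        seconds = int(digits[4:7]) / 10  # last digit is tenths of a second
        return sign * (degrees + minutes / 60 + seconds / 3600)

    return to_decimal(match.group(1)), to_decimal(match.group(2))


lat, lon = parse_packed_dms("6359550-2154605")
print(round(lat, 5), round(lon, 5))
```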

I get a little annoyed at #python every now and again (grr package management) but then I come across things like nested tuple unpacking which are just so lovely they make up for it:

for i, (key, value) in enumerate(list_of_tuples):
    print(i, key, value)

In reply to a post on gist.github.com

My take on generic prev/next controls on keyup, using only bean for events, based on previous work by Aaron Parecki and Tantek Çelik:


// Generic prev/next navigation on arrow key press
bean.on(document.body, 'keyup', function (e) {
  var prevEl, nextEl;
  
  if (document.activeElement !== document.body) return;
  if (e.metaKey || e.ctrlKey || e.altKey || e.shiftKey) return;
  
  if (e.keyCode === 37) {
    prevEl = document.querySelector('[rel~=previous]');
    if (prevEl) bean.fire(prevEl, 'click');
  } else if (e.keyCode === 39) {
    nextEl = document.querySelector('[rel~=next]');
    if (nextEl) bean.fire(nextEl, 'click');
  }
});

Yesterday we at Vísar tested the neat SVG image element hack on all the devices and browsers we had at hand to see how it performed and whether or not it was viable to use in production.

Given this markup:

<svg>
    <image xlink:href="http://example.com/the-image.svg" src="http://example.com/the-image.png" width="100" height="100" />
</svg>

Here’s a table of what each browser+device downloaded:

Browser	Format Requested
Mob. Safari iOS 4.2.1	PNG
Mob. Safari iOS 6.1.3	SVG
Chrome 28 Mac	SVG
Safari 5.1.9 Mac	SVG
Safari 6.0.5 Mac	SVG
Firefox 26 Mac	SVG
Firefox 22 Mac	SVG
IE 8.0.6	PNG
IE 10	SVG+PNG
Kindle (3rd gen)	PNG

Note that the Kindle downloaded the PNG despite having pretty good SVG support. Tests carried out locally by watching the Django request logs.

At first, this looked perfect — browsers which supported SVG only downloaded the SVG (apart from IE 10), and other browsers just got the PNG. However, it seems that SVG image elements can’t be sized with percentages, meaning our flexible layouts were never going to work. I tried to fix it using the dreaded viewBox and user units (as I have previously to compensate for percentage-based units not being allowed in SVG paths), but that just led to everything being completely the wrong size.

So, (unless someone can show me how to fix this), whilst we think this is a great hack, it’s not going to work out for our product due to the weirdness of SVG sizing limitations.

I just faked having a task queue for #taproot #indieweb note posting tasks using Symfony HttpKernel::terminate() and it was the easiest thing ever.

Instances or subclasses of HttpKernel have a terminate($request, $response) method which, if called in the front controller after $response->send(); triggers a kernel.terminate event on the app’s event dispatcher. Listeners attached to this event carry out their work after the content has been sent to the client, making it the perfect place to put time-consuming things like POSSE and webmention sending.

Once you’ve created your new content and it’s ready to be sent to the client, create a new closure which carries out all the time-consuming stuff and attach it as a listener to your event dispatcher, like this:

$dispatcher->addListener('kernel.terminate', function() use ($note) {
    $note = sendPosse($note);
    sendWebmentions($note);
    $note->save();
});

Then, provided you’re calling $kernel->terminate($req, $res); in index.php, your callback will get executed after the response has been sent to the client.

If you’re not using HttpKernel and HttpFoundation, the exact same behaviour can of course be carried out in pure PHP — just let the client know you’ve finished sending content and execute code after that. Check out these resources to learn more about how to do this:

FPM, specifically fastcgi_finish_request()
flush()
HttpFoundation\Response::send() as a sample implementation

Further ideas: if the time consuming tasks alter the content which will be shown in any way, set a header or something to let the client side know that async stuff is happening. It could then re-fetch the content after a few seconds and update it.

Sure, this isn’t as elegant as a message queue. But as I showed, it’s super easy and portable, requiring the addition of three or four lines of code.