Phonos can generate the audio files for the <phonos> tags either immediately during wikitext parsing, or later in a job.
The job queue path was implemented so that the generation, which may use a rate-limited API, could itself be rate-limited (T318086). However, its current implementation has no way to surface errors that occur during audio generation to the user, so it is only used when the parse itself was triggered from a job. This causes problems (T325464), as there is no supported way to detect that a parse is happening inside a job (because the results are not supposed to be different).
This work should be done via the job queue unconditionally, once we figure out a way for that mode to surface the errors. That seems normal to me for a parser extension: thumbnailing images and timed media transcoding, for example, work in a similar way, where the parser doesn't wait for thumbnails/transcodes to be ready. Life would be much simpler if we could do that here.
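
To make the proposal concrete, here is a rough, language-neutral sketch in Python of one way the "always defer, but still surface errors" flow could look. It is not Phonos or MediaWiki code: `DeferredAudio`, `AudioStatus`, `render_tag`, `run_next_job` and `fake_generate` are all hypothetical names, and the in-memory dict stands in for whatever persistent store would actually hold the generation status.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Optional


@dataclass
class AudioStatus:
    state: str  # "pending", "ready", or "error"
    detail: Optional[str] = None  # file name when ready, message when failed


class DeferredAudio:
    """Hypothetical model: the parse only enqueues; a job does the work later."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self.generate = generate
        self.queue: Deque[str] = deque()
        self.status: Dict[str, AudioStatus] = {}

    def render_tag(self, text: str) -> str:
        """Called during parsing; never waits for generation."""
        current = self.status.get(text)
        if current is None:
            # First time this input is seen: record it and enqueue a job.
            self.status[text] = AudioStatus("pending")
            self.queue.append(text)
            return f"[audio pending: {text}]"
        if current.state == "ready":
            return f"[audio: {current.detail}]"
        if current.state == "error":
            # The stored failure is what gets surfaced to the user.
            return f"[audio error: {current.detail}]"
        return f"[audio pending: {text}]"

    def run_next_job(self) -> bool:
        """Run one queued job, recording a failure instead of discarding it."""
        if not self.queue:
            return False
        text = self.queue.popleft()
        try:
            self.status[text] = AudioStatus("ready", self.generate(text))
        except Exception as exc:
            self.status[text] = AudioStatus("error", str(exc))
        return True


def fake_generate(text: str) -> str:
    # Stand-in for the rate-limited API call (assumption, not a real API).
    if not text.isalpha():
        raise ValueError("unsupported input")
    return f"{text}.mp3"
```

In this sketch the parse output does not depend on whether the parse runs inside a job: it depends only on the stored status. The error becomes visible on the next render after the job has run, so a real design would still need to decide how that later render is triggered.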